NOTES: attend Systems seminar on Monday – will be about recent paper submissions from UW
PORCUPINE
Goal: build a scalable application
- to a large cluster
- to a large workload
For Comparison: Hotmail:
- 2 billion legitimate messages a day
- 4 billion spams a day
- 200 million users
- 50 million logins / day
- Heterogeneous system: front door servers, redirectors, databases + backend stores
Requirements:
- Manageability: can't manually manage load. E.g. Hotmail/Google have 1 admin per thousand machines
- Availability: want high availability. Tolerate machine/software failures well. Good service to all users all the time (e.g. no partial failures)
- Performance: want performance to scale linearly with # of machines
Solution:
- Leverage application properties
o Weak consistency semantics (e.g. internet can lose, reorder mail) (unlike a database or file system)
o Embarrassingly parallel (like the web)
o No single data item can overwhelm a machine (unlike a hot web page)
o QUESTION: other application examples?
¤ Answer: web queries. Can discard half the results and nobody notices
¤ Answer: Multimedia storage: no updates
- Functional homogeneity: all nodes can do all jobs (even those with persistent data!)
o QUESTION: what is benefit?
o ANSWER: can dynamically balance load between functions
o ANSWER: any node can fail and remaining nodes can take over too
- Dynamically scheduled: at run time choose where to send requests.
o QUESTION: what is benefit?
o ANSWER: can adapt to slow / full / fast hardware
- Replication: store data in multiple places
o Provides high availability
o With dynamic scheduling even better
- Soft / hard state difference (sketch after this list)
o Soft state: in-memory cache or index or table
o Hard state: persistent data, replicated on disk
- Automatic reconfiguration: manage node failure, addition
o Detects failures, reconfigures app to move soft state around
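A minimal sketch (mine, not from the paper) of the soft/hard-state split: the mail fragment files on disk are hard state, and the in-memory index over them is soft state that a node simply recomputes by scanning its disk after a crash or reconfiguration. The directory layout and names are assumptions.

```python
import os

# Hypothetical layout (not Porcupine's real on-disk format): one subdirectory
# per user under FRAGMENT_DIR holds that user's mail fragment files (hard state).
FRAGMENT_DIR = "/var/porcupine/fragments"

def rebuild_fragment_index(fragment_dir=FRAGMENT_DIR):
    """Recompute the soft-state index (user -> message files) by scanning disk."""
    index = {}
    if not os.path.isdir(fragment_dir):
        return index
    for user in os.listdir(fragment_dir):
        user_dir = os.path.join(fragment_dir, user)
        if os.path.isdir(user_dir):
            index[user] = sorted(os.listdir(user_dir))
    return index

# At boot or after reconfiguration: soft state is derived, never checkpointed.
fragment_index = rebuild_fragment_index()
```

Because the index can always be rebuilt, reconfiguration only has to move or recompute soft state, never repair it.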
Tasks:
- Storing mail
o Mailboxes per user stored as a set of fragments
o No hard limit on number of fragments
o The spread is a soft upper bound on the number of fragments per user (may be exceeded under failure)
o No fixed assignment of fragments to machines
- User accounts
o Partitioned user account database: each machine stores a piece of it
o Replicated for availability
- Finding mailbox
o Mail fragment list: the list of nodes that hold a user's mailbox fragments
o Location of fragments is not stored separately – it is computed by scanning the local disk at boot. It is soft state
- Finding user profile
o Kept as soft state on one node. User database accessed and updated at that node
¤ Soft state – stored in memory from disk-based user database
o User map maps users onto nodes managing the user.
¤ Soft state: computed by hashing users onto the current membership list (sketch after this list)
- Tracking membership
o Membership protocol detects failures, sets up agreement on who is alive
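A hedged sketch of how such a user map can be recomputed as soft state: every node hashes user names into a fixed number of buckets and assigns buckets to live nodes from the membership list. The bucket count and the round-robin assignment are my simplifications, not the paper's exact scheme.

```python
import hashlib

NUM_BUCKETS = 256   # assumed size of the user map

def user_bucket(user, num_buckets=NUM_BUCKETS):
    """Hash a user name into a user-map bucket (same result on every node)."""
    digest = hashlib.sha1(user.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def build_user_map(membership, num_buckets=NUM_BUCKETS):
    """Soft state: assign each bucket to a live node (simple round-robin)."""
    members = sorted(membership)            # deterministic order everywhere
    return [members[i % len(members)] for i in range(num_buckets)]

def manager_for(user, user_map):
    return user_map[user_bucket(user, len(user_map))]

# Recomputed from scratch whenever the membership list changes.
user_map = build_user_map({"nodeA", "nodeB", "nodeC"})
print(manager_for("alice", user_map))
```

Any node can recompute the same map locally from the membership list, which is what makes it safe to treat as soft state.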
Operations:
- Delivering mail
o All machines are SMTP servers and act as a proxy
o Proxy looks up user in map, contacts profile owner for fragment list
o Proxy picks a lightly loaded machine from the fragment list to store the mail. If none is suitable, it can select a new machine and tell the profile owner to update the list (sketch after this list)
o Proxy writes mail to selected machines
- Retrieving mail:
o All machines are IMAP / POP servers and act as a proxy
o Login goes to user profile machine to authenticate
o Proxy queries all fragment owners for mail, merges results and returns them
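A self-contained sketch of the delivery path above; the dicts stand in for the profile owner's fragment lists, the approximate load table, and the per-node stores, and the names and spread value are assumptions rather than Porcupine's API.

```python
SPREAD = 2   # assumed soft limit on nodes holding one user's mail

fragment_lists = {"alice": ["nodeA"]}             # profile owner's soft state
node_load = {"nodeA": 7, "nodeB": 2, "nodeC": 4}  # approximate load table
mail_store = {n: [] for n in node_load}           # hard state per node

def deliver(user, message):
    fragments = fragment_lists.setdefault(user, [])

    # Candidate nodes: existing fragment nodes, widened to other live nodes
    # while the user is still under the spread limit, so delivery always works.
    candidates = list(fragments)
    if len(candidates) < SPREAD:
        candidates += [n for n in node_load if n not in candidates]

    target = min(candidates, key=node_load.get)   # lightly loaded choice
    if target not in fragments:
        fragments.append(target)                  # tell the profile owner
    mail_store[target].append((user, message))
    node_load[target] += 1
    return target

print(deliver("alice", "hello"), fragment_lists["alice"])
```

Retrieval is the mirror image: the proxy queries every node on the fragment list and merges whatever is available.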
Benefits of this approach:
- high availability: can always deliver mail somewhere, can retrieve whatever mail is available
Replication
- store each fragment on two or more machines for availability
o QUESTION: what happens on a failure?
o ANSWER: a failed node's fragments have replicas spread across many other nodes, so the load does not shift onto a single other node
- Use a log to record info to be replicated
- Properties:
o Update anywhere: no master
o Eventual consistency: temporary inconsistency but eventually everyone agrees (faster than always consistent, which requires locks)
o Total update: the whole object is replaced on a write (like AFS), avoiding merges of partial updates
o Lock free: no transactions, just single-object updates
o Ordering by clocks: last writer wins (like AFS but with clocks)
- Leverages mail properties: email is really append-only
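A hedged sketch of those replication rules (update anywhere, total update, eventual consistency with last-writer-wins ordering by clock); the class and method names are mine, not the paper's.

```python
import time

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.objects = {}          # key -> (stamp, value)

    def local_update(self, key, value):
        """Update anywhere: any replica accepts a write, no master."""
        stamp = (time.time(), self.node_id)     # clock plus tiebreaker
        self.apply(key, stamp, value)
        return stamp   # in the real system this would also go into a log pushed to peers

    def apply(self, key, stamp, value):
        """Total update of the whole object; keep only the newest (last writer wins)."""
        current = self.objects.get(key)
        if current is None or stamp > current[0]:
            self.objects[key] = (stamp, value)

# Example: two replicas converge regardless of delivery order.
a, b = Replica("A"), Replica("B")
s1 = a.local_update("msg1", "body v1")
s2 = b.local_update("msg1", "body v2")
a.apply("msg1", s2, "body v2"); b.apply("msg1", s1, "body v1")
assert a.objects["msg1"] == b.objects["msg1"]
```

Because mail is effectively append-only, last-writer-wins rarely has to throw away anything that matters.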
Failure Detection:
- When a node detects a change in membership, it becomes a coordinator and launches a "new group" operation
- Coordinator sends out the proposed new membership; the set of nodes that reply becomes the final membership
- Coordinator sends out the final membership
- Coordinator sends heartbeat/probes periodically to detect changes
o When a coordinator receives another coordinator's probe -> end of the partition (the groups merge)
o New node acts as coordinator, sends probe packets -> brought into existing partition
- Detect failures:
o timeout of remote operation
o Ping in ring order
- Problem: partially dead machines (happens to AFS)
o Machine responds to pings or membership operations but keeps timing out remote operations
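A hedged sketch of this detection loop: each node pings its successor in ring order, and on a timeout it takes the coordinator role and runs the new-group exchange above. The node names, the ping function, and the compression into two rounds are my simplifications.

```python
def successor(me, membership):
    """Next node after me in ring order."""
    ring = sorted(membership)
    return ring[(ring.index(me) + 1) % len(ring)]

def check_ring(me, membership, ping):
    """ping(node) -> True if it answered before the timeout."""
    target = successor(me, membership)
    if not ping(target):
        # Detected a failure: act as coordinator for a new group.
        return propose_new_group(me, membership - {target}, ping)
    return membership

def propose_new_group(coordinator, proposed, ping):
    # Round 1: send the proposed membership; the repliers form the final group.
    final = {n for n in proposed if n == coordinator or ping(n)}
    # Round 2: send out the final membership (everyone then rebuilds the
    # user map and other soft state for the new configuration).
    return final

# Example: nodeB has failed and stops answering pings.
alive = {"nodeA", "nodeC"}
membership = check_ring("nodeA", {"nodeA", "nodeB", "nodeC"}, lambda n: n in alive)
print(membership)   # {'nodeA', 'nodeC'}
```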
Dynamic load balancing:
- Needed to handle failures without overloading a single other machine (a static backup scheme would have to reserve ~50% of capacity as spare)
- Goal is to avoid any need for tuning
- Approach: each node tracks approximate load of all other machines
- On delivery, chooses lightly loaded of a small number of machines (the spread)
- QUESTION: how many?
o Answer: 2
o Avoids hot spots
o Using all machines causes global hot spots: with outdated load data, every sender picks the same "least loaded" node
- How do you measure load?
o Look at queue length of requests
¤ But assumes similar processing times. For slower machines, may want to look instead at expected queue processing time
o Do you need up-to-date information?
¤ No. Proven theoretically
- Issue: lack of locality in network – need high cross-section bandwidth
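A minimal sketch of spread-based load balancing as described above: pick a small random sample of nodes (the spread, e.g. 2) and send to the least loaded of those, using only an approximate local view of load. The function names are mine.

```python
import random

def pick_target(nodes, approx_load, spread=2):
    """Choose the least loaded node out of a small random sample (the spread)."""
    sample = random.sample(nodes, min(spread, len(nodes)))
    return min(sample, key=approx_load.get)

# Example: per-node load as estimated locally (possibly stale).
approx_load = {"nodeA": 3, "nodeB": 9, "nodeC": 1, "nodeD": 4}
print(pick_target(list(approx_load), approx_load))
```

Even with stale load tables this stays close to balanced, which is the theoretical result alluded to above; always choosing the global minimum would instead dogpile one node until the tables refresh.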
Take away points:
- Functional homogeneity lets any node fail
o Can store data anywhere no matter what
o Can read mail from whatever is alive
- Soft state / hard state difference
o Can recalculate soft state after a failure
- Relaxed consistency
o Defines availability up – can return less than fully consistent results and still be working
- Replication to overlapping groups
o spreads load on failures