NOTES: attend Systems seminar on Monday – will be about recent paper submissions from UW
PORCUPINE
Goal: build a scalable application
- to a large cluster
- to a large workload
For Comparison: Hotmail:
- 2 billion legitimate messages a day
- 4 billion spams a day
- 200 million users
- 50 million logins / day
- Heterogeneous system: front door servers, redirectors, databases + backend stores
Requirements:
- Manageability: can't manually manage load. E.g. Hotmail/Google have 1 admin per thousand machines
- Availability: want high availability. Tolerate machine/software failures well. Good service to all users all the time (e.g. no partial failures)
- Performance: want performance to scale linearly with # of machines
Solution:
- Leverage application properties
o Weak consistency semantics (e.g. internet can lose, reorder mail) (unlike a database or file system)
o Embarrassingly parallel (like the web)
o No single data item can overwhelm a machine (unlike a hot web page)
o QUESTION: other application examples?
¤ Answer: web queries. Can discard half the results and nobody notices
¤ Answer: Multimedia storage: no updates
- Functional homogeneity: all nodes can do all jobs (even those with persistent data!)
o QUESTION: what is benefit?
o ANSWER: can dynamically balance load between functions
o ANSWER: any node can fail and remaining nodes can take over too
- Dynamically scheduled: at run time choose where to send requests.
o QUESTION: what is benefit?
o ANSWER: can adapt to slow / full / fast hardware
- Replication: store data in multiple places
o Provides high availability
o With dynamic scheduling even better
- Soft / hard state difference (sketch after this list)
o Soft state: in-memory cache or index or table
o Hard state: persistent data, replicated on disk
- Automatic reconfiguration: manage node failure, addition
o Detects failures, reconfigures app to move soft state around
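A minimal sketch (mine, not from the paper) of the soft/hard-state split: the mail fragment files on disk are hard state, and the in-memory index over them is soft state that a node simply recomputes by scanning its disk after a crash or reconfiguration. The directory layout and names are assumptions.

```python
import os

# Hypothetical layout (not Porcupine's real on-disk format): one subdirectory
# per user under FRAGMENT_DIR holds that user's mail fragment files (hard state).
FRAGMENT_DIR = "/var/porcupine/fragments"

def rebuild_fragment_index(fragment_dir=FRAGMENT_DIR):
    """Recompute the soft-state index (user -> message files) by scanning disk."""
    index = {}
    if not os.path.isdir(fragment_dir):
        return index
    for user in os.listdir(fragment_dir):
        user_dir = os.path.join(fragment_dir, user)
        if os.path.isdir(user_dir):
            index[user] = sorted(os.listdir(user_dir))
    return index

# At boot or after reconfiguration: soft state is derived, never checkpointed.
fragment_index = rebuild_fragment_index()
```

Because the index can always be rebuilt, reconfiguration only has to move or recompute soft state, never repair it.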
Tasks:
- Storing mail
o Mailboxes per user stored as a set of fragments
o No hard limit on number of fragments
o The spread is a soft upper bound on the number of fragments per user (may be exceeded under failure)
o No fixed assignment of fragments to machines
- User accounts
o Partitioned user account database: each machine stores a piece of it
o Replicated for availability
- Finding mailbox
o Mail fragment list: the list of nodes that hold a user's mailbox fragments
o Location of fragments is not stored separately – it is computed by scanning the local disk at boot. It is soft state
- Finding user profile
o Kept as soft state on one node. User database accessed and updated at that node
¤ Soft state – stored in memory from disk-based user database
o User map maps users onto nodes managing the user.
¤ Soft state: computed by hashing users onto the current membership list (sketch after this list)
- Tracking membership
o Membership protocol detects failures, sets up agreement on who is alive
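A hedged sketch of how such a user map can be recomputed as soft state: every node hashes user names into a fixed number of buckets and assigns buckets to live nodes from the membership list. The bucket count and the round-robin assignment are my simplifications, not the paper's exact scheme.

```python
import hashlib

NUM_BUCKETS = 256   # assumed size of the user map

def user_bucket(user, num_buckets=NUM_BUCKETS):
    """Hash a user name into a user-map bucket (same result on every node)."""
    digest = hashlib.sha1(user.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def build_user_map(membership, num_buckets=NUM_BUCKETS):
    """Soft state: assign each bucket to a live node (simple round-robin)."""
    members = sorted(membership)            # deterministic order everywhere
    return [members[i % len(members)] for i in range(num_buckets)]

def manager_for(user, user_map):
    return user_map[user_bucket(user, len(user_map))]

# Recomputed from scratch whenever the membership list changes.
user_map = build_user_map({"nodeA", "nodeB", "nodeC"})
print(manager_for("alice", user_map))
```

Any node can recompute the same map locally from the membership list, which is what makes it safe to treat as soft state.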
Operations:
- Delivering mail
o All machines are SMTP servers and act as a proxy
o Proxy looks up user in map, contacts profile owner for fragment list
o Proxy picks a lightly loaded machine from the fragment list to store the mail. If none is suitable, it can select a new machine and tell the profile owner to update the list (sketch after this list)
o Proxy writes mail to selected machines
- Retrieving mail:
o All machines are IMAP / POP servers and act as a proxy
o Login goes to user profile machine to authenticate
o Proxy queries all fragment owners for mail, merges results and returns them
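A self-contained sketch of the delivery path above; the dicts stand in for the profile owner's fragment lists, the approximate load table, and the per-node stores, and the names and spread value are assumptions rather than Porcupine's API.

```python
SPREAD = 2   # assumed soft limit on nodes holding one user's mail

fragment_lists = {"alice": ["nodeA"]}             # profile owner's soft state
node_load = {"nodeA": 7, "nodeB": 2, "nodeC": 4}  # approximate load table
mail_store = {n: [] for n in node_load}           # hard state per node

def deliver(user, message):
    fragments = fragment_lists.setdefault(user, [])

    # Candidate nodes: existing fragment nodes, widened to other live nodes
    # while the user is still under the spread limit, so delivery always works.
    candidates = list(fragments)
    if len(candidates) < SPREAD:
        candidates += [n for n in node_load if n not in candidates]

    target = min(candidates, key=node_load.get)   # lightly loaded choice
    if target not in fragments:
        fragments.append(target)                  # tell the profile owner
    mail_store[target].append((user, message))
    node_load[target] += 1
    return target

print(deliver("alice", "hello"), fragment_lists["alice"])
```

Retrieval is the mirror image: the proxy queries every node on the fragment list and merges whatever is available.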
Benefits of this approach:
- high availability: can always deliver mail somewhere, can retrieve whatever mail is available
Replication
- store each fragment on two or more machines for availability
o QUESTION: what happens on a failure?
o ANSWER: a failed node's fragments have replicas spread across many other nodes, so the load does not shift onto a single other node
- Use a log to record info to be replicated
- Properties:
o Update anywhere: no master
o Eventual consistency: temporary inconsistency but eventually everyone agrees (faster than always consistent, which requires locks)
o Total update: the whole object is replaced on a write (like AFS), avoiding merges of partial updates
o Lock free: no transactions, just single-object updates
o Ordering by clocks: last writer wins (like AFS but with clocks)
- Leverages mail properties: email is really append-only
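A hedged sketch of those replication rules (update anywhere, total update, eventual consistency with last-writer-wins ordering by clock); the class and method names are mine, not the paper's.

```python
import time

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.objects = {}          # key -> (stamp, value)

    def local_update(self, key, value):
        """Update anywhere: any replica accepts a write, no master."""
        stamp = (time.time(), self.node_id)     # clock plus tiebreaker
        self.apply(key, stamp, value)
        return stamp   # in the real system this would also go into a log pushed to peers

    def apply(self, key, stamp, value):
        """Total update of the whole object; keep only the newest (last writer wins)."""
        current = self.objects.get(key)
        if current is None or stamp > current[0]:
            self.objects[key] = (stamp, value)

# Example: two replicas converge regardless of delivery order.
a, b = Replica("A"), Replica("B")
s1 = a.local_update("msg1", "body v1")
s2 = b.local_update("msg1", "body v2")
a.apply("msg1", s2, "body v2"); b.apply("msg1", s1, "body v1")
assert a.objects["msg1"] == b.objects["msg1"]
```

Because mail is effectively append-only, last-writer-wins rarely has to throw away anything that matters.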
Failure Detection:
- When a node detects a change in membership, it becomes a coordinator and launches a "new group" operation
- Coordinator sends out the proposed new membership; the set of nodes that reply becomes the final membership
- Coordinator sends out the final membership
- Coordinator sends heartbeat/probes periodically to detect changes
o When a coordinator receives another coordinator's probe -> end of the partition (the groups merge)
o New node acts as coordinator, sends probe packets -> brought into existing partition
- Detect failures:
o timeout of remote operation
o Ping in ring order
- Problem: partially dead machines (happens to AFS)
o Machine responds to pings or membership operations but keeps timing out remote operations
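A hedged sketch of this detection loop: each node pings its successor in ring order, and on a timeout it takes the coordinator role and runs the new-group exchange above. The node names, the ping function, and the compression into two rounds are my simplifications.

```python
def successor(me, membership):
    """Next node after me in ring order."""
    ring = sorted(membership)
    return ring[(ring.index(me) + 1) % len(ring)]

def check_ring(me, membership, ping):
    """ping(node) -> True if it answered before the timeout."""
    target = successor(me, membership)
    if not ping(target):
        # Detected a failure: act as coordinator for a new group.
        return propose_new_group(me, membership - {target}, ping)
    return membership

def propose_new_group(coordinator, proposed, ping):
    # Round 1: send the proposed membership; the repliers form the final group.
    final = {n for n in proposed if n == coordinator or ping(n)}
    # Round 2: send out the final membership (everyone then rebuilds the
    # user map and other soft state for the new configuration).
    return final

# Example: nodeB has failed and stops answering pings.
alive = {"nodeA", "nodeC"}
membership = check_ring("nodeA", {"nodeA", "nodeB", "nodeC"}, lambda n: n in alive)
print(membership)   # {'nodeA', 'nodeC'}
```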
Dynamic load balancing:
- Needed to handle failures without overloading a single other machine (a static backup scheme would have to reserve ~50% of capacity as spare)
- Goal is to avoid any need for tuning
- Approach: each node tracks approximate load of all other machines
- On delivery, chooses lightly loaded of a small number of machines (the spread)
- QUESTION: how many?
o Answer: 2
o Avoids hot spots
o Using all machines causes global hot spots: with outdated load data, every sender picks the same "least loaded" node
- How do you measure load?
o Look at queue length of requests
¤ But assumes similar processing times. For slower machines, may want to look instead at expected queue processing time
o Do you need up-to-date information?
¤ No. Proven theoretically
- Issue: lack of locality in network – need high cross-section bandwidth
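A minimal sketch of spread-based load balancing as described above: pick a small random sample of nodes (the spread, e.g. 2) and send to the least loaded of those, using only an approximate local view of load. The function names are mine.

```python
import random

def pick_target(nodes, approx_load, spread=2):
    """Choose the least loaded node out of a small random sample (the spread)."""
    sample = random.sample(nodes, min(spread, len(nodes)))
    return min(sample, key=approx_load.get)

# Example: per-node load as estimated locally (possibly stale).
approx_load = {"nodeA": 3, "nodeB": 9, "nodeC": 1, "nodeD": 4}
print(pick_target(list(approx_load), approx_load))
```

Even with stale load tables this stays close to balanced, which is the theoretical result alluded to above; always choosing the global minimum would instead dogpile one node until the tables refresh.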
Take away points:
- Functional homogeneity lets any node fail
o Can store data anywhere no matter what
o Can read mail from whatever is alive
- Soft state / hard state difference
o Can recalculate soft state after a failure
- Relaxed consistency
o Defines availability up – can return less than fully consistent results and still be working
- Replication to overlapping groups
o spreads load on failures