NOTES: attend Systems seminar on Monday – will be about recent paper submissions from UW

 

PORCUPINE

 

Goal: build a scalable application

-      to a large cluster

-      to a large workload

 

For Comparison: Hotmail:

-      2 billion legitimate messages a day

-      4 billion spams a day

-      200 million users

-      50 million logins / day

-      Heterogeneous system: front door servers, redirectors, databases + backend stores

 

Requirements:

-      Manageability: can't manually manage load. E.g. Hotmail/Google have 1 admin per thousand machines

-      Availability: want the service to be highly available. Tolerate machine/software failures well. Good service to all users all the time (e.g. no partial failures)

-      Performance: want performance to scale linearly with # of machines

 

Solution:

-      Leverage application properties

o      Weak consistency semantics (e.g. internet can lose, reorder mail) (unlike a database or file system)

o      Embarrassingly parallel (like the web)

o      No single data item can overwhelm a machine (unlike a hot web page)

o      QUESTION: other application examples?     

¤       Answer: web queries. Can discard half the results and nobody notices

¤       Answer: Multimedia storage: no updates

 

-      Functional homogeneity: all nodes can do all jobs (even those with persistent data!)

o      QUESTION: what is benefit?

o      ANSWER: can dynamically balance load between functions

o      ANSWER: any node can fail and remaining nodes can take over too

-      Dynamically scheduled: at run time choose where to send requests.

o      QUESTION: what is benefit?

o      ANSWER: can adapt to slow / full / fast hardware

-      Replication: store data in multiple places

o      Provides high availability

o      With dynamic scheduling even better

-      Soft / hard state difference

o      Soft state: in-memory cache or index or table

o      Hard state: persistent data, replicated on disk

-      Automatic reconfiguration: manage node failure, addition

o      Detects failures, reconfigures app to move soft state around
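
A minimal sketch of the soft/hard-state split (assumed file layout, not Porcupine's actual code): hard state is mail sitting on the local disk; the soft-state index of which users have fragments here is simply rebuilt by scanning that disk at boot or after reconfiguration.

import os

def rebuild_fragment_index(spool_dir: str) -> dict:
    """Rebuild the in-memory (soft state) index {user: [fragment files]}
    by scanning the persistent (hard state) spool directory."""
    index = {}
    for entry in os.listdir(spool_dir):      # assumed naming: "<user>.<fragment-id>"
        user = entry.split(".", 1)[0]
        index.setdefault(user, []).append(os.path.join(spool_dir, entry))
    return index

Because the index can always be recomputed from the hard state, it never needs to be written to disk or kept consistent across crashes.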

 

 

Tasks:

-      Storing mail

o      Mailboxes per user stored as a set of fragments

o      No hard limit on number of fragments

o      Spread is a soft limit (may be exceeded under failure)

o      No fixed assignment of fragments to machines

-      User accounts

o      Partitioned user account database: each machine stores a piece of it

o      Replicated for availability

-      Finding mailbox

o      Mail fragment list: the list of fragments (and the nodes storing them) that hold a user's mail

o      Location of fragments is not stored separately – it is computed by scanning the local disk at boot. It is soft state

-      Finding user profile

o      Kept as soft state on one node. User database accessed and updated at that node

¤       Soft state – stored in memory from disk-based user database

o      User map maps each user onto the node managing that user.

¤       Soft state: computed from hashing users + membership list

-      Tracking membership

o      Membership protocol detects failures, sets up agreement on who is alive
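
A sketch of how the user map can be pure soft state, assuming a simple hash-mod scheme for illustration: given the same membership list, every node derives the same user-to-manager assignment, so nothing has to be stored or synchronized.

import hashlib

def manager_for(user: str, membership: list) -> str:
    """Soft state: the managing node is recomputable from the user name
    and the current membership list alone."""
    h = int(hashlib.sha1(user.encode()).hexdigest(), 16)
    return membership[h % len(membership)]

alive = ["A", "B", "C", "D"]
print(manager_for("alice", alive))                            # manager under full membership
print(manager_for("alice", [n for n in alive if n != "C"]))   # recomputed after C fails

A straight hash-mod reshuffles many users on each membership change; a bucket table or consistent hashing would keep most assignments stable, but either way the map stays soft state.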

 

Operations:

-      Delivering mail

o      All machines are SMTP servers and act as a proxy

o      Proxy looks up user in map, contacts profile owner for fragment list

o      Proxy picks a lightly loaded machine from the fragment list to store the mail; if none is suitable, it can select a new machine (and tell the profile owner). See the sketch after this list.

o      Proxy writes mail to selected machines

-      Retrieving mail:

o      All machines are IMAP / POP servers and act as a proxy

o      Login goes to user profile machine to authenticate

o      Proxy queries all fragment owners for mail, merges results and returns them
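
A toy, self-contained walk-through of the delivery path above; the class and function names are illustrative, not Porcupine's interfaces.

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.mail = {}        # user -> messages held in this node's fragment
        self.queue_len = 0    # stand-in load metric (see load balancing below)

    def append(self, user, msg):
        self.mail.setdefault(user, []).append(msg)

def deliver(msg, user, fragment_list, all_nodes, spread=2):
    """Any node can act as the SMTP proxy: prefer nodes that already hold a
    fragment for this user, otherwise consider `spread` random candidates."""
    candidates = fragment_list.get(user) or random.sample(all_nodes, spread)
    target = min(candidates, key=lambda n: n.queue_len)   # lightly loaded choice
    if target not in fragment_list.setdefault(user, []):
        fragment_list[user].append(target)                # tell the profile owner
    target.append(user, msg)
    return target

nodes = [Node(c) for c in "ABCD"]
fragments = {}   # held by the user's profile manager in the real system
deliver("hello", "alice", fragments, nodes)

Retrieval is the dual: the proxy authenticates against the profile owner, then queries every node in the fragment list and merges the results.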

 

Benefits of this approach:

-      high availability: can always deliver mail somewhere, can retrieve whatever mail is available

 

Replication

-      store each fragment on two or more machines for availability

o      QUESTION: what happens on a failure?

o      ANSWER: a failed node's fragments have replicas spread across many other nodes, so its load shifts to many machines instead of a single one

-      Use a log to record info to be replicated

-      Properties:

o      Update anywhere: no master

o      Eventual consistency: temporary inconsistency but eventually everyone agrees (faster than always consistent, which requires locks)

o      Total update: like AFS, avoids merging

o      Lock free: no transactions, just single-object updates

o      Ordering by clocks: last writer wins (like AFS but with clocks)

-      Leverages mail properties: email is really append-only
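
A minimal sketch of the "ordering by clocks, last writer wins" rule: each update carries the writer's timestamp and the complete new object (total update, no merging), so a replica keeps only the newest version it has seen, and applying the same updates in any order converges. Names here are illustrative.

from dataclasses import dataclass

@dataclass
class Update:
    key: str        # e.g. a user-profile entry or mailbox-fragment object
    value: object   # the complete new contents (total update, no merging)
    ts: float       # the writer's clock

def apply(replica: dict, upd: Update):
    cur = replica.get(upd.key)
    if cur is None or upd.ts > cur.ts:
        replica[upd.key] = upd    # newer write wins; older or duplicate updates are ignored

r1, r2 = {}, {}
log = [Update("frag-7", "v1", 1.0), Update("frag-7", "v2", 2.0)]
for u in log:
    apply(r1, u)
for u in reversed(log):           # different arrival order, same final state
    apply(r2, u)
assert r1 == r2

Because mail is essentially append-only (each message is its own object), losing one of these races rarely matters to the user.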

 

Failure Detection:

-      When a node detects a change in membership, it becomes a coordinator and launches a "new group" operation

-      Coordinator sends out the proposed new membership; the set of nodes that reply forms the final membership (sketched below)

-      Send out final membership

-      Coordinator sends heartbeat/probes periodically to detect changes

o      When a coordinator receives a probe packet from another partition -> end of partition (the groups merge)

o      New node acts as coordinator, sends probe packets -> brought into existing partition

-      Detect failures:

o      timeout of remote operation

o      Ping in ring order

-      Problem: partially dead machines (happens to AFS)

o      Machine responds to pings or membership operations but keeps timing out remote operations
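
A toy version of the "new group" round above; the function names are illustrative. The node that noticed the change acts as coordinator, proposes a membership, and the set of nodes that answer becomes the final membership, which is then broadcast so every node can rebuild its soft state.

def run_new_group(coordinator, proposed, responds):
    """responds(node) -> True if the node answers the proposal before a timeout."""
    replies = {n for n in proposed if responds(n)}
    final = sorted(replies | {coordinator})
    # The coordinator then broadcasts `final`; every node recomputes its user
    # map and in-memory fragment lists against the new membership.
    return final

alive = {"A", "B", "D"}   # node "C" is down, so it drops out of the final membership
print(run_new_group("A", ["A", "B", "C", "D"], lambda n: n in alive))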

 

Dynamic load balancing:

-      Needed to handle failures without overloading a single other machine (otherwise you must reserve, e.g., 50% of capacity as spare)

-      Goal is to avoid any need for tuning

-      Approach: each node tracks approximate load of all other machines

-      On delivery, chooses the most lightly loaded of a small number of candidate machines (the spread); see the sketch below

-      QUESTION: how many?

o      Answer: 2

o      Avoids hot spots

o      Considering all machines makes every node pick the same apparently idle machine when load data is stale, creating global hot spots

-      How do you measure load?

o      Look at queue length of requests

¤       But assumes similar processing times. For slower machines, may want to look instead at expected queue processing time

o      Do you need up-to-date information?

¤       No. Proven theoretically

-      Issue: lack of locality in network – need high cross-section bandwidth
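
A sketch of the spread-based choice: look at a small number of random candidates (the spread) and pick the one with the shortest queue. With a spread of 2 this is the classic "power of two choices" scheme, which is why somewhat stale load information is good enough. Names are illustrative.

import random

def choose_target(queue_len: dict, spread: int = 2) -> str:
    """queue_len: node -> approximate (possibly stale) request-queue length."""
    candidates = random.sample(list(queue_len), spread)
    return min(candidates, key=lambda n: queue_len[n])

loads = {"A": 10, "B": 3, "C": 7, "D": 1}   # each node's view of the others' load
print(choose_target(loads))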

 

Take away points:

-      Functional homogeneity lets any node fail

o      Can store data anywhere no matter what

o      Can read mail from whatever is alive

-      Soft state / hard state difference

o      Can recalculate soft state after a failure

-      Relaxed consistency

o      Raises availability: the system can be less than fully consistent and still be working

-      Replication to overlapping groups

o      spreads load on failures