Wolfpack

 

Notes: telecom is 5 nines, not 7 nines
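
As a quick check on what "5 nines" versus "7 nines" means in practice, the downtime budget per year can be worked out directly. This is a minimal sketch; the function name and the 365.25-day year are assumptions made for illustration.

```python
# Downtime budget per year for an availability of N nines
# (e.g. 5 nines = 99.999% available, so unavailability is 10**-5).
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year when unavailability is 10**-nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


for n in (3, 5, 7):
    print(f"{n} nines: {downtime_seconds_per_year(n):.1f} seconds/year")
```

Five nines leaves a little over five minutes of downtime a year; seven nines leaves only a few seconds.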

 

Microsoft's first major attempt at high availability service (not fault tolerant!)

 

Big Points:

 

-      What kinds of OS infrastructure do you need?

-      What kinds of apps / client do you need?

-      Big picture: lots of complexity in making clusters work

-      This paper is about platform services for clusters


 

Approach: clusters

-      Definition: collection of nodes that work in concert to provide a more powerful / more reliable service

-      Benefits:

o      can grow larger than a single node can

o      Can be more reliable

o      Can be built from less expensive components

o      NOTE: FT clusters built from fairly expensive parts

-      MS approach: software clusters on commodity hardware

-      All persistent state goes on a disk accessible to all cluster members

-      All private state (e.g. volatile) must be consistent

-      Clients see a single machine

 

Features:

-      Support server applications (file servers, databases, web servers, SAP R3 servers)

-      Recovery by migrating resources off a failed machine

-      Shared nothing

o      Dual ported disks, one machine uses it at a time

o      Shared disk: multiple machines access disk at same time, use lock manager to negotiate

o      Shared memory: e.g. SMP

-      Unaware clients: migrate network addresses as well as services

 

QUESTION: what kind of failures are they targeting? HW? SW? Which SW?

 

Major components:

-      membership management

o      On awaken, nodes:

¤       Check for other alive nodes

¤       If none, form new cluster

¤       If some, join existing cluster

o      QUESTION: what makes it hard?

o      ANSWER: failures during joining

¤       Partition: may have two halves of cluster, both think they are the only survivors, make independent contradictory decisions

-      Failure detection

o      Heartbeats to services to detect failures

-      Recovery

o      Migrate services from failed machine elsewhere (or restart)
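
The partition hazard noted under membership management can be made concrete with a small decision function. The sketch below encodes the three rules these notes list under Regroup (majority of old membership, half plus a tie-breaker node, one node plus the quorum resource). How the real system combines those rules is not stated here, so the `use_quorum_disk` switch and every name in the code are assumptions for illustration only.

```python
from typing import FrozenSet, Optional


def partition_survives(partition: FrozenSet[str],
                       old_members: FrozenSet[str],
                       tie_breaker: Optional[str] = None,
                       owns_quorum_disk: bool = False,
                       use_quorum_disk: bool = False) -> bool:
    """Return True if this partition may continue as the cluster."""
    # Only nodes that belonged to the old membership count.
    live = partition & old_members
    if use_quorum_disk:
        # Rule: at least one node plus ownership of the quorum resource.
        return len(live) >= 1 and owns_quorum_disk
    if 2 * len(live) > len(old_members):
        return True  # Rule: strict majority of the old membership.
    if 2 * len(live) == len(old_members):
        # Rule: exactly half, and the tie-breaker node is on this side.
        return tie_breaker in live
    return False


old = frozenset({"a", "b", "c", "d"})
left, right = frozenset({"a", "b"}), frozenset({"c", "d"})
print(partition_survives(left, old, tie_breaker="a"))   # left holds the tie-breaker
print(partition_survives(right, old, tie_breaker="a"))  # right does not
```

The point of each rule is that at most one side of a split can satisfy it, so the two halves cannot both decide they are the survivors.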

 

ABSTRACTIONS:

-      Cluster: group of nodes providing some services

-      Resource: functionality offered at a node

o      example: printer, IP address, application, server share, web site

-      Quorum resource: resource which, if owned, makes you part of the quorum so you can win elections (see membership)

-      Resource Dependencies:

o      resources may depend on each other (e.g. web site on database).

o      Can't restore a resource if dependencies not present

o      Tracking dependencies lets you know what to restart, in what order

¤       Like Apple LaunchD, MS Service Controller

-      Resource Groups: explicitly named resources treated as a unit

o      Simplifies management
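
The claim that tracking dependencies tells you what to restart, and in what order, amounts to a topological sort of the dependency graph. A minimal sketch follows, using Python's standard `graphlib` (Python 3.9+). The resource names and the dependency table are made up for illustration and do not come from the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical resource group: each resource maps to the resources it depends on.
depends_on: Dict[str, Set[str]] = {
    "web site": {"database", "IP address"},
    "database": {"disk"},
    "IP address": set(),
    "disk": set(),
}


def start_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Order in which to bring resources online: dependencies first."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Order in which to take resources offline: dependents first."""
    return list(reversed(start_order(deps)))


print(start_order(depends_on))
print(stop_order(depends_on))
```

A cycle in the table would make `static_order()` raise `graphlib.CycleError`, which matches the intuition that circular dependencies cannot be restarted in any order.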

 

 

Resource Management:

-      resources have states

-      Resources have dependencies

-      Resource information maintained in a shared database – replicated everywhere via logs

-      Resources implement a generic mgmt interface to allow them to be managed, migrated (start, stop)

-      Migrating:

o      Push: node containing a resource picks a place for it to go, pushes it there (with dependencies).

¤       QUESTION: when does this work?

¤       ANSWER: when node is healthy enough

o      Pull: other nodes in cluster pull resources from a failed node

¤       QUESTION: which node gets which resources?

¤       ANSWER: all have same shared info, up to apps to decide

-      Client access: via a single network name that is a resource that migrates

o      HOW: announce IP address via ARP
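
For the pull case, the notes say every node has the same shared information and the policy is left open. One way to see why identical information is enough: if every survivor runs the same deterministic function over the same data, they all reach the same assignment without extra negotiation. The sketch below assumes a per-group preferred-owner list, which is a hypothetical policy chosen for illustration, not something these notes attribute to the paper.

```python
from typing import Dict, List, Set


def plan_pull(groups_on_failed: List[str],
              preferred_owners: Dict[str, List[str]],
              live_nodes: Set[str]) -> Dict[str, str]:
    """Map each orphaned resource group to the surviving node that pulls it."""
    if not live_nodes:
        raise ValueError("no surviving nodes to pull resources")
    # Deterministic fallback, so every node computes the same answer.
    fallback = min(live_nodes)
    plan: Dict[str, str] = {}
    for group in sorted(groups_on_failed):
        # First live node in the group's (hypothetical) preference list wins.
        candidates = [n for n in preferred_owners.get(group, []) if n in live_nodes]
        plan[group] = candidates[0] if candidates else fallback
    return plan


plan = plan_pull(
    groups_on_failed=["sql", "web"],
    preferred_owners={"sql": ["n1", "n3", "n2"], "web": ["n1"]},
    live_nodes={"n2", "n3"},
)
print(plan)
```

Each survivor then pulls only the groups assigned to itself and brings them online in dependency order.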

 

 

Membership management:

-      Manages who is a member

-      Operations:

o      Join: 5 phase protocol

¤       tell everyone else, tell the new node, once it joins, tell everyone the join completed, ack new member

¤       WHY? must handle failures during the protocol.

o      Regroup: on node failure to establish new membership

¤       Trickiness: handling node failure, partition

¤       Partitions:

á      Must pick 1 partition to be the real one, kill others

á      QUESTION: How to decide?

o      Majority of old membership

o      Half members + tie breaker node from original cluster

o      1 node + quorum resource (a disk)

¤       Parts of protocol:

á      Test clock tick

á      Determine which partition is winner

á      Pruning: kill all nodes not connected

á      Cleanup 1: notify others, filter requests from dead nodes

á      Cleanup 2: second phase of cleanup (so have knowledge of how others have progressed)

á      Stabilized

o      HOW SLOW?

¤       Join < 1.5 seconds for 1-12 nodes

¤       Regroup: 2 seconds

-      Global update manager

o      For propagating shared state, assure everyone in same state

¤       see state machine approach?

o      Goal: atomic broadcast

¤       When a message is sent, either all alive nodes hear messages in the same order, or nobody does

 

o      Approach: lock

¤       Grab a lock from lock node

¤       Update other nodes in order

¤       release lock

¤       On failure: lock migrates to next alive node. If it doesn't have data, nobody does ...

o      HOW SLOW?

¤       32 nodes ~ 6 small updates /sec

¤       under load, 10 nodes -> 2-5 seconds to complete, breaks down with 12 nodes
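
The lock-based global update described above can be illustrated with a single-process simulation. This is only a sketch under stated assumptions: the lock node is taken to be the first alive node in a fixed cluster order, it receives the update first, and a sender crash is modelled by the hypothetical `sender_dies_after` parameter. Real message passing, timeouts, and lock hand-off are left out.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.log: List[str] = []  # updates applied, in arrival order


def global_update(nodes: List[Node], update: str,
                  sender_dies_after: Optional[int] = None) -> bool:
    """Deliver `update` to every alive node, or to none. Returns True if delivered.

    sender_dies_after simulates the sender crashing after that many nodes
    have received the update (0 means before the lock node got it).
    """
    targets = [n for n in nodes if n.alive]  # fixed cluster order
    if not targets:
        return False
    # 1. Grab the lock: targets[0] acts as the lock node and sees the update first.
    if sender_dies_after == 0:
        return False  # lock node never got the update, so nobody did
    # 2. Update nodes in order. If the sender dies part-way, the lock node
    #    already holds the update and re-sends it, so delivery still completes.
    for node in targets:
        node.log.append(update)
    # 3. Release the lock (implicit: this simulation is single-threaded).
    return True


cluster = [Node("a"), Node("b"), Node("c")]
global_update(cluster, "u1")
global_update(cluster, "u2", sender_dies_after=1)  # finished by the lock node
global_update(cluster, "u3", sender_dies_after=0)  # lost everywhere
print([n.log for n in cluster])
```

Because every update passes through one lock and is applied in one fixed node order, all nodes see the same sequence, which is also why throughput drops as the node count grows.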

 

SUPPORT SERVICES:

-      heartbeat mgmt

-      disk drivers: allows having a dual-ported disk

-      cluster event logging

-      time service

-      virtual servers: encapsulate app state that is normally tied to a specific machine in a virtual OS instead, so it can be migrated. E.g. computer name, address, registry, endpoints of other services
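
Heartbeat management is the basis of the fail-fast detection mentioned in these notes. A minimal timeout-based detector is sketched below. The class name, the timeout value, and the choice to pass timestamps in explicitly (to keep the example deterministic) are all assumptions for illustration.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Suspect a node when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspects(self, now: float) -> List[str]:
        """Nodes whose most recent heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)  # hypothetical timeout
monitor.beat("a", now=0.0)
monitor.beat("b", now=0.0)
monitor.beat("a", now=2.5)               # "b" has gone quiet
print(monitor.suspects(now=4.0))
```

A suspected node is not necessarily dead; it may just be slow or partitioned, which is why detection feeds into the regroup protocol rather than triggering recovery on its own.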

 

USES:

-      SQL server

o      Failover at machine level, not db level

o      Benefit:

¤       Don't need identical machines (cluster handles that) so app settings can migrate

¤       Handles all protocols (e.g. replication), not just client access protocol (ODBC)

 

 

KEY APPROACH:

-      Isolation (to a single machine)

-      Fail fast (detect failure quickly with heartbeats)

-      Fast recovery (restart on separate machine)

-      Persistent shared state (on disk only)

 

ISSUES:

-      only scales to two nodes

-      What kind of performance do you get? What if you have a failure?