Wolfpack
Notes: telecom is 5 nines, not 7 nines
Microsoft's first major attempt at a high-availability service (not fault tolerant!)
Big Points:
- What kinds of OS infrastructure do you need?
- What kinds of apps / clients do you need?
- Big picture: lots of complexity in making clusters work
- This paper is about platform services for clusters
Approach: clusters
- Definition: a collection of nodes working in concert to provide a more powerful / more reliable service
- Benefits:
o can grow larger than a single node can
o Can be more reliable
o Can be built from less expensive components
o NOTE: fault-tolerant (FT) clusters are built from fairly expensive parts
- MS approach: software clusters on commodity hardware
- All persistent state goes on a disk accessible to all cluster members
- All private state (e.g. volatile) must be consistent
- Clients see a single machine
Features:
- Support server applications (file servers, databases, web servers, SAP R3 servers)
- Recovery by migrating resources off a failed machine
- Shared nothing
o Dual-ported disks; only one machine uses a given disk at a time
o Shared disk: multiple machines access disk at same time, use lock manager to negotiate
o Shared memory: e.g. SMP
- Unaware clients: migrate network addresses as well as services
QUESTION: what kind of failures are they targeting? HW? SW? Which SW?
Major components:
- membership management
o On waking up, nodes:
§ Check for other alive nodes
§ If none, form a new cluster
§ If some, join the existing cluster
o QUESTION: what makes it hard?
o ANSWER: failures during joining
§ Partition: may end up with two halves of the cluster, both think they are the only survivors and make independent, contradictory decisions
- Failure detection
o Heartbeats to nodes/services to detect failures (sketch at the end of this list)
- Recovery
o Migrate services from the failed machine elsewhere (or restart them)
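
A minimal sketch of heartbeat-based failure detection as described above. The period and timeout values are made up, and the real service's transport and its distinction between node heartbeats and per-resource polling are not captured here:

import time

# Hypothetical parameters: the paper's actual heartbeat period and
# failure timeout are not reproduced here.
HEARTBEAT_PERIOD = 1.0   # seconds between heartbeats a node would send (sender loop not shown)
FAILURE_TIMEOUT = 3.0    # declare a peer suspect after this much silence

class FailureDetector:
    """Tracks the last heartbeat seen from each peer node."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_seen = {peer: now for peer in peers}

    def on_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from `peer`.
        self.last_seen[peer] = time.monotonic()

    def suspected_failures(self):
        # Peers silent longer than the timeout are suspected dead;
        # in Wolfpack a suspected node failure triggers a regroup.
        now = time.monotonic()
        return [p for p, t in self.last_seen.items()
                if now - t > FAILURE_TIMEOUT]
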
ABSTRACTIONS:
- Cluster: group of nodes providing some services
- Resource: functionality offered at a node
o example: printer, IP address, application, server share, web site
- Quorum resource: resource which, if owned, makes you part of the quorum so you can win elections (see membership)
- Resource Dependencies:
o resources may depend on each other (e.g. web site on database).
o Can't restore a resource if its dependencies are not present
o Tracking dependencies lets you know what to restart, in what order (see the sketch after this list)
§ Like Apple launchd, MS Service Controller
- Resource Groups: explicitly named resources treated as a unit
o Simplifies management
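
A small sketch of how tracked dependencies yield a start/stop order, using Python's standard-library topological sort; the resource names and the dependency map are made up for illustration:

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: each resource lists the resources it
# depends on (e.g. the web site needs a database and an IP address).
deps = {
    "web-site":   ["database", "ip-address"],
    "database":   ["disk"],
    "ip-address": [],
    "disk":       [],
}

# Bringing a resource group online: start dependencies before dependents.
start_order = list(TopologicalSorter(deps).static_order())
print(start_order)   # e.g. ['ip-address', 'disk', 'database', 'web-site']

# Taking the group offline (or failing it over): stop in reverse order.
stop_order = list(reversed(start_order))
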
Resource Management:
- Resources have states
- Resources have dependencies
- Resource information maintained in a shared database – replicated everywhere via logs
- Resources implement a generic management interface (e.g. start, stop) so they can be managed and migrated; see the sketch after this list
- Migrating:
o Push: node containing a resource picks a place for it to go, pushes it there (with dependencies).
§ QUESTION: when does this work?
§ ANSWER: when the node is healthy enough
o Pull: other nodes in cluster pull resources from a failed node
§ QUESTION: which node gets which resources?
§ ANSWER: all nodes have the same shared info; it is up to the apps to decide
- Client access: via a single network name that is itself a resource that migrates
o HOW: announce IP address via ARP
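
A rough sketch of the generic management interface and the push vs. pull migration split described above; the class and function names are illustrative, not the actual Wolfpack resource API:

from abc import ABC, abstractmethod

class Resource(ABC):
    """Illustrative generic management interface (not Wolfpack's real
    resource DLL interface): anything a node hosts, e.g. an IP address,
    a file share, or an application, exposes the same online/offline calls."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)

    @abstractmethod
    def online(self):
        """Bring the resource up on the current node."""

    @abstractmethod
    def offline(self):
        """Shut the resource down cleanly so another node can host it."""

class FileShare(Resource):
    def online(self):
        print(f"exporting share {self.name}")

    def offline(self):
        print(f"withdrawing share {self.name}")

def push_failover(resource, bring_up_on_target):
    # Push model: the still-healthy owning node takes the resource offline
    # locally, then asks its chosen target node to bring it online.
    resource.offline()
    bring_up_on_target(resource)

# Pull model (after a crash) differs in that the failed node never gets to
# call offline(); the survivors consult the shared configuration database
# and decide among themselves who calls online() for each orphaned resource.

share = FileShare("public", depends_on=["disk", "ip-address"])
push_failover(share, bring_up_on_target=lambda r: r.online())
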
Membership management:
- Manages who is a member
- Operations:
o Join: 5-phase protocol
§ tell everyone else, tell the new node, once it joins, tell everyone the join completed, ack the new member
§ WHY? must handle failures during the protocol.
o Regroup: on node failure to establish new membership
§ Trickiness: handling node failure, partition
§ Partitions:
· Must pick 1 partition to be the real one, kill the others
· QUESTION: How decide? (decision rules sketched after this section)
o Majority of the old membership
o Half the members + a tie-breaker node from the original cluster
o 1 node + the quorum resource (a disk)
§ Parts of the protocol:
· Test clock tick
· Determine which partition is the winner
· Pruning: kill all nodes not connected
· Cleanup 1: notify others, filter requests from dead nodes
· Cleanup 2: second phase of cleanup (so nodes know how far the others have progressed)
· Stabilized
o HOW SLOW?
§ Join < 1.5 seconds for 1-12 nodes
§ Regroup: 2 seconds
- Global update manager
o For propagating shared state; assures everyone ends up in the same state
§ see the state machine approach?
o Goal: atomic broadcast
§ When a message is sent, either all alive nodes hear the messages in the same order, or nobody does
o Approach: lock
§ Grab a lock from the lock node
§ Update the other nodes in order
§ Release the lock
§ On failure: the lock migrates to the next alive node. If it doesn't have the data, nobody does … (sketched after this section)
o HOW SLOW?
§ 32 nodes: ~6 small updates/sec
§ Under load, 10 nodes -> 2-5 seconds to complete; breaks down with 12 nodes
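
The regroup partition decision above (majority, half plus a tie-breaker, or a single node holding the quorum resource) can be sketched as one check. The function, its arguments, and the way the three rules are combined are illustrative only, not the actual regroup code:

def partition_wins(partition, old_membership, has_tiebreaker, owns_quorum_disk):
    """Decide whether a partition survives a regroup (illustrative only).

    partition:        set of node ids that can still talk to each other
    old_membership:   set of node ids in the previous cluster membership
    has_tiebreaker:   True if the designated tie-breaker node is in this partition
    owns_quorum_disk: True if a node here has reserved the quorum resource
    """
    n_old = len(old_membership)
    n_here = len(partition & old_membership)

    if 2 * n_here > n_old:                       # strict majority of the old membership
        return True
    if 2 * n_here == n_old and has_tiebreaker:   # exactly half, plus the tie-breaker node
        return True
    if n_here >= 1 and owns_quorum_disk:         # as little as one node, if it holds the quorum disk
        return True
    return False

# Example: in a 4-node cluster, {a, b, c} wins by majority; a lone node
# can still win if it has reserved the quorum disk.
old = {"a", "b", "c", "d"}
print(partition_wins({"a", "b", "c"}, old, has_tiebreaker=False, owns_quorum_disk=False))  # True
print(partition_wins({"a"}, old, has_tiebreaker=False, owns_quorum_disk=True))             # True
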
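And a toy single-process sketch of the lock-based global update approach (grab the lock, update every node in a fixed order, release). The real service runs this over the network; the failure case is only indicated in a comment, and all names here are made up:

import threading

class GlobalUpdateManager:
    """Toy stand-in for the lock-based global update: one lock
    (logically held at the locker node), and updates applied to
    every node in one fixed order."""

    def __init__(self, nodes):
        self.nodes = list(nodes)              # fixed update order; nodes[0] acts as the locker
        self.state = {n: [] for n in self.nodes}
        self._lock = threading.Lock()         # stands in for the lock held at the locker node

    def global_update(self, update):
        with self._lock:                      # 1. grab the global update lock
            locker = self.nodes[0]
            self.state[locker].append(update)     # 2. the locker node is updated first
            for node in self.nodes[1:]:
                self.state[node].append(update)   # 3. then every other node, in order
        # 4. lock released. If the sender crashed mid-broadcast, the lock would
        #    migrate to the next alive node; if that node did not receive the
        #    update, no later node did either (updates go out in the same fixed
        #    order), which keeps the broadcast all-or-nothing.

# Example: three nodes see updates in the same order.
gum = GlobalUpdateManager(["node-a", "node-b", "node-c"])
gum.global_update({"resource": "web-site", "new-owner": "node-b"})
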
SUPPORT SERVICES:
- heartbeat mgmt
- disk drivers: allows having a dual-ported disk
- cluster event logging
- time service
- virtual servers: encapsulate app state relative to a virtual server rather than a specific physical machine, so it can be migrated. E.g. computer name, address, registry, endpoints of other services
USES:
- SQL server
o Failover at machine level, not db level
o Benefit:
§ No need for identical machines (the cluster handles that), so app settings can migrate
§ Handles all protocols (e.g. replication), not just the client access protocol (ODBC)
KEY APPROACH:
- Isolation (to a single machine)
- Fail fast (detect failure quickly with heartbeats)
- Fast recovery (restart on separate machine)
- Persistent shared state (on disk only)
ISSUES:
- only scales to two nodes
- What kind of performance do you get? What if you have a failure?