Fault Tolerance

Disco

Disco

Context: Merge of Disco and Hive

1. Hive: OS that supports fault containment; allow some processes and processors/memory to fail (cellular approach)

2. Disco: virtualize large CC-NUMA to allow commodity OS to run

3. Done by people who created VMWare

Goals:

i. Run commodity, non-scalable OS on large CC-NUMA

1. QUESTION: Why?

ii. Efficiently manage resources

1. Multiplex CPU, Memory

2. Single machine to manage

3. Supports hot-replacement

4. Virtual machines did not support migration between nodes

5. Scalability: can scale VM up to limit of OS

iii. Contain OS crashes

iv. Contain HW failures

Approach

i. Virtual machines

1. Structure: VM / OS + VMM (hypervisor)

2. Run OS without full privilege

3. Privileged operations trap to HW, vectored to hypervisor

4. Emulate I/O devices: SCSI, Ethernet, kbd, mouse, display

a. Disco use IRIX device drivers – hosted (type 2)

b. Type 1 = hypervisor (trademarked by IBM), below all OSs, contain own drivers

c. Para-virtualization: similar interface to HW, but some changes, need to port OS

5. Emulate privileged portions of CPU

6. Requirements for full virtualization:

a. Any instruction that gives a different result between user/kernel mode must cause a trap

b. Virtualizable: Alpha

c. Not virtualizable: mips, x86

i. Work-around: binary-rewriting

ii. Paravirtualization

iii. Intel VT, AMD Pacifica

7. Benefit:

a. Can give OS fewer processors / memory than HW

b. Can relocate OS between memory / cpu

i. (avoid failed components)

Disco Approach

Dynamically split a large machine into many small ones

i. Approach used by Sun (Logical Domains), IBM (logical partitioning) but without dynamism

ii. QUESTION: what HW support required?

1. Hardware fault containment: limit fault to portion of system

2. Notify VMM what failed

Assume VMM software is trusted / reliable

i. QUESTION: is this reasonable?

ii. A: who knows? Only 50k lines of code (and growing)

Cells

i. Unit of HW failure

ii. CPU, Memory + attached devices

iii. Replicate VMM code (disco) to each cell

1. So can keep running if memory fails

iv. 4 CPU/cell in prototype

v. Cells trust each other: low overhead communication

Overall picture

i. Assign a number of VCPU’s, virtual devices, virtual memory to an OS

ii. Schedule OS when all non-idle VCPUs can run on real CPUs

iii. Allocate CPU, Memory from a single cell if possible

iv. QUESTION: what do you have to do to virtualize a multiprocessor?

1. QUESTION: can you schedule only a subset of the CPUs?

2. ANSWER: no, spinlocks don’t work well

3. SOLUTION: gang scheduling

v. QUESTION: how do you do dynamic resource sharing with cells?

vi. ANSWER: share resources within a cell, reduce reliability and share across cells

Performance

i. Scalability limited by guest OS. How get bigger applications?

ii. ANSWER: write as cluster apps (as in porcupine)

iii. SUPPORT: shared memory, fast communication (messages, RPC)

Scheduling: CPU management

i. GOAL: balance load across CPUs, ensure isolation

ii. MECHANISMS: migrate elsewhere: other CPU on node, other CPU in cell, other cell

iii. 2 level policy

1. Idle balancer: at idle time, looks for work to “steal” from other runqueues, start with closest and work further away (within a cell). Must obey gang scheduling: only schedule idelVCPU when all non-idle VCPUs of VM are runnable

a. Why does idle time exist? gang scheduling prevents VM from running if not enough CPU to run all VCPU

b. Uses local load information – can’t do a good global job of balancing

2. Periodic balancer: look at global load

a. QUESTION: how avoid polling hundreds of nodes?

b. ANSWER: use a tree data structure. Each node maintains sum of load below

c. HOW USE:

i. traverse tree look for imbalance L/R > 1 VCPU

ii. Look for VCPU to swap sides

iii. Constraint: must have CPU not running a VCPU from VM

iv. Constraint: try to avoid splitting VM across cells

3. Issue: migrating between cells leaves a VM vulnerable to faults in either one

a. Requires copying complete state of VM (e.g. megabytes of memory, I/O state) taking 10s of seconds

Memory Management: sharing memory across nodes

i. For NUMA: try to maintain locality

ii. For fault tolerance: avoid forcing VMs that fit in a cell to borrow memory

iii. Policy: when < 16MB memory, borrows 4 MB from each preferred cell for which it has < 4MB available

1. if have enough memory already, don’t seek more

iv. Policy: load freely when > 32 MB available

Paging:

i. Modify OS to identify unused pages to avoid swapping them

ii. Share pages between VMs – save on disk if swap page out

iii. Hook OS paging writes to see if VMM already swapped page; just update mapping in this case

1. QUESTION: why doesn’t VMware do this?

2. ANSWER: Windows doesn’t use a separate paging disk?

Fault tolerance

i. QUESTION: What can it tolerate?

1. HW fault to replicated HW. Maybe not disk, network

2. SW fault – just shutdown affected portion

ii. HW recovery operation:

1. QUESTION: What is goal?

2. ANSWER: identify, clear out affected state, restart processor

3. Steps:

a. Detect fault

b. Diagnose which state in error (e.g. which cache lines, memory locations).

i. NOTE: HW doesn’t cause failure until cache line accessed!

c. Notify Cellular Disco

iii. SW recovery operations

1. Agree on live set – still functioning cells (based on HW registers providing some info). Complexity: further failures during recovery

2. Unwedge communication: cancel stuck RPCs, messages

3. Terminate VM that had dependencies on failed cells

a. Scan all memory pages to look for bad cache lines

b. Choice of whether to kill early or wait for VM to access cache line

Other approaches:

Fixed partition (e.g. IBM, Sun) with HW support: good for small apps, fixed size slows dow large apps

Overview:

Fault detection: HW, OS crash
Fault isolation: disco partitions OS to only use some resources (e.g. CPU, memory)

i. Use Cell as a unit of isolation; bigger than a CPU to allow larger OS, but smaller t han whoe machine

Fault recovery: terminate affected portions