Disco
1. Hive: OS that supports fault containment; allow some processes and processors/memory to fail (cellular approach)
2. Disco: virtualize large CC-NUMA to allow commodity OS to run
3. Done by people who created VMWare
i. Run commodity, non-scalable OS on large CC-NUMA
1. QUESTION: Why?
ii. Efficiently manage resources
1. Multiplex CPU, Memory
2. Single machine to manage
3. Supports hot-replacement
4. Virtual machines did not support migration between nodes
5. Scalability: can scale VM up to limit of OS
iii. Contain OS crashes
iv. Contain HW failures
i. Virtual machines
1. Structure: VM / OS + VMM (hypervisor)
2. Run OS without full privilege
3. Privileged operations trap to HW, vectored to hypervisor
4. Emulate I/O devices: SCSI, Ethernet, kbd, mouse, display
a. Disco uses IRIX device drivers – hosted (type 2)
b. Type 1 = hypervisor (trademarked by IBM), below all OSs, contains its own drivers
c. Para-virtualization: similar interface to HW, but some changes, need to port OS
5. Emulate privileged portions of CPU
6. Requirements for full virtualization:
a. Any instruction that gives a different result between user/kernel mode must cause a trap
b. Virtualizable: Alpha
c. Not virtualizable: MIPS, x86
i. Work-around: binary-rewriting
ii. Paravirtualization
iii. Intel VT, AMD Pacifica
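A minimal trap-and-emulate sketch for items 3, 5, and 6 above (not Disco's actual code; structure and names are illustrative): a privileged instruction executed by the deprivileged guest kernel traps into the monitor, which applies the effect to the virtual CPU state instead of the real hardware and then resumes the guest.

```c
/* Hypothetical trap-and-emulate sketch; not Disco's real code. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t regs[32];   /* guest general-purpose registers    */
    uint64_t status;     /* virtual privileged status register */
    uint64_t pc;         /* guest program counter              */
} vcpu_state;

enum priv_op { OP_READ_STATUS, OP_WRITE_STATUS };

/* Entered from the hardware trap vector when the deprivileged guest
 * kernel executes a privileged instruction: emulate its effect on the
 * virtual CPU state rather than the real hardware, then resume. */
void vmm_priv_trap(vcpu_state *vcpu, enum priv_op op, int reg)
{
    switch (op) {
    case OP_READ_STATUS:  vcpu->regs[reg] = vcpu->status; break;
    case OP_WRITE_STATUS: vcpu->status = vcpu->regs[reg]; break;
    }
    vcpu->pc += 4;        /* skip past the emulated instruction */
}

int main(void)
{
    vcpu_state v = { .status = 0x1, .pc = 0x1000 };
    v.regs[5] = 0xff;
    vmm_priv_trap(&v, OP_WRITE_STATUS, 5);   /* guest "writes" its status reg */
    printf("virtual status = 0x%llx, pc = 0x%llx\n",
           (unsigned long long)v.status, (unsigned long long)v.pc);
    return 0;
}
```

The requirement in 6.a is exactly what makes this work: if a sensitive instruction silently behaved differently in user mode instead of trapping, the monitor would never get control – hence the binary-rewriting and paravirtualization work-arounds.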
7. Benefit:
a. Can give OS fewer processors / memory than HW
b. Can relocate OS between memory / cpu
i. (avoid failed components)
i. Approach used by Sun (Logical Domains) and IBM (logical partitioning), but without dynamism
ii. QUESTION: what HW support required?
1. Hardware fault containment: limit fault to portion of system
2. Notify VMM what failed
i. QUESTION: is this reasonable?
ii. A: who knows? Only 50k lines of code (and growing)
i. Unit of HW failure
ii. CPU, Memory + attached devices
iii. Replicate VMM code (disco) to each cell
1. So can keep running if memory fails
iv. 4 CPU/cell in prototype
v. Cells trust each other: low overhead communication
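A hypothetical data-structure sketch of the cell organization just described; the field names are not from the paper, only the properties (private VMM replica, 4 CPUs per cell, simple trusted inter-cell messaging) are.

```c
/* Illustrative cell layout for a Cellular Disco-style VMM; names are
 * hypothetical. */
#include <stdint.h>

#define CPUS_PER_CELL 4            /* prototype used 4 CPUs per cell */
#define MAX_CELLS     32

struct inter_cell_msg {
    int      from_cell;
    int      type;
    uint64_t payload;
};

struct cell {
    int     id;
    int     alive;                    /* cleared when this cell's HW fails    */
    int     cpu_ids[CPUS_PER_CELL];   /* CPUs owned by this cell              */
    void   *vmm_text;                 /* private replica of the VMM code, so  */
                                      /* the cell survives remote memory loss */
    void   *local_memory;             /* memory + attached devices            */
    /* cells trust each other, so messages are plain shared-memory queues
     * rather than protected channels (low overhead) */
    struct inter_cell_msg msg_queue[64];
    int     msg_head, msg_tail;
};

struct cell cells[MAX_CELLS];
```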
i. Assign a number of VCPUs, virtual devices, virtual memory to an OS
ii. Schedule OS when all non-idle VCPUs can run on real CPUs
iii. Allocate CPU, Memory from a single cell if possible
iv. QUESTION: what do you have to do to virtualize a multiprocessor?
1. QUESTION: can you schedule only a subset of the CPUs?
2. ANSWER: no, spinlocks don't work well (a preempted VCPU may hold a lock that the others spin on)
3. SOLUTION: gang scheduling
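A sketch of the gang-scheduling admission check, with simplified, assumed data structures: a VM is dispatched only when every non-idle VCPU can be given a physical CPU at the same time, so a guest spinlock holder is never preempted while its sibling VCPUs spin.

```c
/* Gang-scheduling admission check (simplified, hypothetical structures). */
#include <stdbool.h>

#define MAX_VCPUS 16

struct vm {
    int  nvcpus;
    bool vcpu_idle[MAX_VCPUS];   /* idle VCPUs need not be co-scheduled */
};

/* Dispatch the VM only if the free physical CPUs can hold all of its
 * non-idle VCPUs at once. */
bool can_gang_schedule(const struct vm *vm, int free_pcpus)
{
    int need = 0;
    for (int i = 0; i < vm->nvcpus; i++)
        if (!vm->vcpu_idle[i])
            need++;
    return need <= free_pcpus;
}
```

When the check fails the VM waits, which is exactly where the idle time exploited by the idle balancer (below) comes from.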
v. QUESTION: how do you do dynamic resource sharing with cells?
vi. ANSWER: share resources within a cell, reduce reliability and share across cells
i. Scalability limited by guest OS. How do you get bigger applications?
ii. ANSWER: write as cluster apps (as in porcupine)
iii. SUPPORT: shared memory, fast communication (messages, RPC)
i. GOAL: balance load across CPUs, ensure isolation
ii. MECHANISMS: migrate elsewhere: other CPU on node, other CPU in cell, other cell
iii. 2 level policy
1. Idle balancer: at idle time, looks for work to "steal" from other runqueues, starting with the closest and working further away (within a cell). Must obey gang scheduling: only schedule a stolen VCPU when all non-idle VCPUs of its VM are runnable
a. Why does idle time exist? gang scheduling prevents VM from running if not enough CPU to run all VCPU
b. Uses local load information – can't do a good global job of balancing
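A sketch of the idle balancer's stealing loop under the assumptions above: an idle physical CPU scans the other runqueues in its cell, nearest first, and steals a VCPU only if the gang-scheduling constraint still holds. The structures and the `gang_constraint_ok` stub are hypothetical.

```c
/* Idle balancer sketch: work stealing within a cell, nearest first.
 * Data structures and helpers are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

struct vcpu;                       /* opaque here */

struct runqueue {
    struct vcpu *entries[32];
    int          len;
};

/* Stand-in for the gang-scheduling check sketched earlier: true if all
 * non-idle VCPUs of v's VM could still run simultaneously after the steal. */
static bool gang_constraint_ok(const struct vcpu *v, int dest_pcpu)
{
    (void)v; (void)dest_pcpu;
    return true;
}

/* Called when a physical CPU has nothing to run.  `neighbors` holds the
 * other runqueues of this cell ordered by distance (same node first),
 * because only local load information is available here. */
struct vcpu *idle_steal(int my_pcpu, struct runqueue **neighbors, int n)
{
    for (int q = 0; q < n; q++) {                  /* closest first */
        struct runqueue *rq = neighbors[q];
        for (int i = 0; i < rq->len; i++) {
            struct vcpu *v = rq->entries[i];
            if (gang_constraint_ok(v, my_pcpu)) {
                rq->entries[i] = rq->entries[--rq->len];  /* unlink */
                return v;                                 /* run it locally */
            }
        }
    }
    return NULL;    /* nothing stealable: stay idle */
}
```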
2. Periodic balancer: look at global load
a. QUESTION: how avoid polling hundreds of nodes?
b. ANSWER: use a tree data structure. Each node maintains sum of load below
c. HOW USE:
i. Traverse the tree looking for a left/right imbalance > 1 VCPU
ii. Look for VCPU to swap sides
iii. Constraint: must have CPU not running a VCPU from VM
iv. Constraint: try to avoid splitting VM across cells
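A sketch of the load tree from 2.b under assumed structures: every interior node caches the total load of its subtree, so the periodic balancer can spot an imbalance without polling hundreds of CPUs, then descend into the heavier side to pick a VCPU to move (still subject to constraints iii and iv).

```c
/* Periodic balancer sketch: hierarchical load tree (hypothetical; assumes
 * a full binary tree over the physical CPUs). */
#include <stddef.h>

struct load_node {
    int               load;          /* sum of runnable VCPUs below   */
    struct load_node *left, *right;  /* both NULL at a leaf (one CPU) */
};

/* Recompute cached sums bottom-up; leaves already hold their own load. */
int refresh(struct load_node *n)
{
    if (!n->left && !n->right)
        return n->load;
    n->load = refresh(n->left) + refresh(n->right);
    return n->load;
}

/* Walk downward as long as the left/right difference exceeds one VCPU,
 * always toward the heavier side; the caller then picks a VCPU there to
 * move to the lighter side, honouring the gang-scheduling and
 * cell-splitting constraints.  Returns NULL when balanced enough. */
struct load_node *find_imbalance(struct load_node *n)
{
    while (n->left && n->right) {
        int diff = n->left->load - n->right->load;
        if (diff > 1 || diff < -1)
            n = (diff > 0) ? n->left : n->right;
        else
            return NULL;
    }
    return n;                        /* heavier leaf (a physical CPU) */
}
```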
3. Issue: migrating between cells leaves a VM vulnerable to faults in either one
a. Requires copying the complete state of the VM (e.g. megabytes of memory, I/O state), taking tens of seconds
i. For NUMA: try to maintain locality
ii. For fault tolerance: avoid forcing VMs that fit in a cell to borrow memory
iii. Policy: when a cell has < 16 MB of free memory, it borrows 4 MB from each preferred cell for which it has < 4 MB available
1. if it has enough memory already, don't seek more
iv. Policy: lend freely when > 32 MB available
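A sketch of the per-cell borrowing policy with the thresholds quoted above (< 16 MB free triggers borrowing, 4 MB chunks per preferred cell, lenders above 32 MB give freely); the structures and function are made up for illustration.

```c
/* Memory borrowing policy sketch; thresholds from the notes, names
 * hypothetical. */
#include <stdint.h>

#define MB           (1024ULL * 1024ULL)
#define LOW_WATER    (16 * MB)    /* start borrowing below this         */
#define BORROW_CHUNK (4 * MB)     /* amount requested from each lender  */
#define LEND_FREELY  (32 * MB)    /* cells above this lend without fuss */
#define MAX_CELLS    32

struct memcell {
    uint64_t free_bytes;                  /* free memory in this cell  */
    uint64_t borrowed_from[MAX_CELLS];    /* bytes on loan from cell i */
};

/* Ask each preferred cell for 4 MB if we hold < 4 MB from it already;
 * do nothing when we still have enough memory of our own. */
void maybe_borrow(struct memcell *self, struct memcell *cells,
                  const int *preferred, int npreferred)
{
    if (self->free_bytes >= LOW_WATER)
        return;                           /* enough memory: don't seek more */

    for (int i = 0; i < npreferred; i++) {
        int c = preferred[i];
        if (self->borrowed_from[c] < BORROW_CHUNK &&
            cells[c].free_bytes > LEND_FREELY) {      /* willing lender */
            cells[c].free_bytes    -= BORROW_CHUNK;
            self->free_bytes       += BORROW_CHUNK;
            self->borrowed_from[c] += BORROW_CHUNK;
        }
    }
}
```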
i. Modify OS to identify unused pages to avoid swapping them
ii. Share pages between VMs – save on disk if swap page out
iii. Hook OS paging writes to see if VMM already swapped page; just update mapping in this case
1. QUESTION: why doesn't VMware do this?
2. ANSWER: Windows doesn't use a separate paging disk?
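A sketch of the paging hook in (iii), with hypothetical structures: when the guest writes a page to its paging disk and the monitor has already swapped that page out, the monitor just records that the guest's disk block aliases its own swap slot, instead of faulting the data back in only to write it out again.

```c
/* Double-paging avoidance sketch (hypothetical structures). */
#include <stdbool.h>
#include <stdint.h>

struct vmm_page {
    bool     vmm_swapped;       /* monitor already paged this out      */
    uint64_t vmm_swap_slot;     /* where the monitor put the contents  */
    uint64_t machine_addr;      /* valid only while resident           */
};

struct guest_swap_map {         /* guest paging-disk block -> contents */
    uint64_t slot_for_block[1 << 16];
};

/* Intercepts a guest write of page `pg` to its paging disk at `block`.
 * Returns true if the physical write (and any fault-in) can be skipped
 * because only the mapping needed updating. */
bool guest_pageout_hook(struct vmm_page *pg,
                        struct guest_swap_map *map, uint32_t block)
{
    if (pg->vmm_swapped) {
        map->slot_for_block[block] = pg->vmm_swap_slot;  /* remap only */
        return true;
    }
    return false;               /* page resident: let the write proceed */
}
```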
i. QUESTION: What can it tolerate?
1. HW fault to replicated HW. Maybe not disk, network
2. SW fault – just shutdown affected portion
ii. HW recovery operation:
1. QUESTION: What is goal?
2. ANSWER: identify, clear out affected state, restart processor
3. Steps:
a. Detect fault
b. Diagnose which state is in error (e.g. which cache lines, memory locations).
i. NOTE: HW doesn't cause a failure until the cache line is accessed!
c. Notify Cellular Disco
iii. SW recovery operations
1. Agree on live set – still functioning cells (based on HW registers providing some info). Complexity: further failures during recovery
2. Unwedge communication: cancel stuck RPCs, messages
3. Terminate VMs that had dependencies on failed cells
a. Scan all memory pages to look for bad cache lines
b. Choice of whether to kill early or wait for VM to access cache line
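A sketch of the software recovery sequence in (iii), with made-up, self-contained state: agree on the live set, cancel RPCs stuck on dead cells, then terminate any VM that depended on a failed cell while the rest keep running.

```c
/* Software fault-recovery sketch (hypothetical, self-contained state). */
#include <stdbool.h>

#define MAX_CELLS 8
#define MAX_VMS   16

/* Inputs that real hardware / monitor state would supply. */
bool hw_cell_alive[MAX_CELLS];          /* from HW failure registers           */
int  pending_rpc_to[MAX_VMS];           /* cell a stuck RPC targets, -1 = none */
bool vm_uses_cell[MAX_VMS][MAX_CELLS];  /* VM has memory/VCPUs in that cell    */
bool vm_running[MAX_VMS];

void recover_from_hw_fault(void)
{
    /* 1. Agree on the live set; a further failure during recovery means
     *    rerunning this step. */
    bool live[MAX_CELLS];
    for (int c = 0; c < MAX_CELLS; c++)
        live[c] = hw_cell_alive[c];

    /* 2. Unwedge communication: cancel RPCs/messages stuck on dead cells. */
    for (int v = 0; v < MAX_VMS; v++)
        if (pending_rpc_to[v] >= 0 && !live[pending_rpc_to[v]])
            pending_rpc_to[v] = -1;

    /* 3. Terminate VMs that depended on a failed cell (after scanning
     *    their pages for bad cache lines); everything else keeps
     *    running, which is the point of fault containment. */
    for (int v = 0; v < MAX_VMS; v++)
        for (int c = 0; c < MAX_CELLS; c++)
            if (!live[c] && vm_uses_cell[v][c]) {
                vm_running[v] = false;
                break;
            }
}
```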
i. Use the cell as the unit of isolation; bigger than a CPU to allow a larger OS, but smaller than the whole machine