Disco

  1. Disco
    1. Context: Merge of Disco and Hive

1.   Hive: OS that supports fault containment; allows some processes and processors/memory to fail (cellular approach)

2.   Disco: virtualize a large CC-NUMA machine to allow a commodity OS to run

3.   Done by the people who created VMware

    2. Goals:

                                              i.     Run commodity, non-scalable OS on large CC-NUMA

1.   QUESTION: Why?

                                             ii.     Efficiently manage resources

1.   Multiplex CPU, Memory

2.   Single machine to manage

3.   Supports hot-replacement

4.   Virtual machines did not support migration between nodes

5.   Scalability: can scale VM up to limit of OS

                                           iii.     Contain OS crashes

                                           iv.     Contain HW failures

    3. Approach

                                              i.     Virtual machines

1.   Structure: VM (guest OS + applications) running on a VMM (hypervisor)

2.   Run OS without full privilege

3.   Privileged operations trap in HW and are vectored to the hypervisor (see the sketch after this list)

4.   Emulate I/O devices: SCSI, Ethernet, kbd, mouse, display

a.    Disco uses IRIX device drivers – hosted (type 2)

b.    Type 1 = hypervisor (term trademarked by IBM), sits below all OSs, contains its own drivers

c.    Para-virtualization: interface similar to the HW, but with some changes; the OS must be ported

5.   Emulate privileged portions of CPU

6.   Requirements for full virtualization:

a.    Any instruction whose behavior differs between user and kernel mode must cause a trap when executed in user mode

b.    Virtualizable: Alpha

c.    Not virtualizable: MIPS, x86

                                                                                                    i.     Work-around: binary-rewriting

                                                                                                   ii.     Paravirtualization

                                                                                                 iii.     Intel VT, AMD Pacifica

7.   Benefit:

a.    Can give the OS fewer processors / less memory than the HW has

b.    Can relocate the OS between memory regions / CPUs

                                                                                                    i.     (avoid failed components)
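
A minimal trap-and-emulate sketch of the structure described in the list above: the guest OS runs without full privilege, privileged operations trap, and the hypervisor applies them to the VM's virtual state rather than the real hardware. All names here (vcpu_t, the trap kinds, the "write status register" example) are invented for illustration; Disco's real MIPS-specific handling is different.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t guest_pc;       /* guest program counter                  */
    uint64_t virt_status;    /* the VM's virtual privileged status reg */
    uint64_t virt_tlb[64];   /* toy stand-in for a per-VM software TLB */
} vcpu_t;

enum trap_kind { TRAP_WRITE_STATUS, TRAP_TLB_WRITE };

/* Entry point a low-level trap handler would call after the deprivileged
 * guest executes a privileged instruction. The VMM applies the effect to
 * the VM's virtual state rather than to the real hardware. */
void vmm_handle_trap(vcpu_t *v, enum trap_kind kind,
                     uint64_t operand, unsigned index)
{
    switch (kind) {
    case TRAP_WRITE_STATUS:
        /* Emulate a privileged "write status register". */
        v->virt_status = operand;
        break;
    case TRAP_TLB_WRITE:
        /* Guest TLB writes land in the per-VM software TLB; the VMM
         * installs entries into the real TLB itself later. */
        v->virt_tlb[index % 64] = operand;
        break;
    }
    v->guest_pc += 4;   /* skip past the emulated instruction */
}

int main(void)
{
    vcpu_t v = { .guest_pc = 0x1000 };
    vmm_handle_trap(&v, TRAP_WRITE_STATUS, 0x1, 0);
    printf("virt_status=%llx guest_pc=%llx\n",
           (unsigned long long)v.virt_status,
           (unsigned long long)v.guest_pc);
    return 0;
}
```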

  2. Disco Approach
    1. Dynamically split a large machine into many small ones

                                              i.     Approach used by Sun (Logical Domains), IBM (logical partitioning) but without dynamism

                                             ii.     QUESTION: what HW support is required?

1.   Hardware fault containment: limit fault to portion of system

2.   Notify VMM what failed

    2. Assume VMM software is trusted / reliable

                                              i.     QUESTION: is this reasonable?

                                             ii.     A: who knows? Only 50k lines of code (and growing)

    3. Cells

                                              i.     Unit of HW failure

                                             ii.     CPU, Memory + attached devices

                                           iii.     Replicate VMM code (disco) to each cell

1.   So a cell can keep running if memory in another cell fails

                                           iv.     4 CPU/cell in prototype

                                            v.     Cells trust each other: low overhead communication

    4. Overall picture

                                              i.     Assign a number of VCPUs, virtual devices, and virtual memory to an OS

                                             ii.     Schedule OS when all non-idle VCPUs can run on real CPUs

                                           iii.     Allocate CPU, Memory from a single cell if possible

                                           iv.     QUESTION: what do you have to do to virtualize a multiprocessor?

1.   QUESTION: can you schedule only a subset of a VM's VCPUs at a time?

2.   ANSWER: no, spinlocks don't work well (a descheduled VCPU may hold a lock the others spin on)

3.   SOLUTION: gang scheduling (see the sketch after this list)

                                            v.     QUESTION: how do you do dynamic resource sharing with cells?

                                            vi.     ANSWER: share resources within a cell when possible; sharing across cells works but reduces reliability
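
A toy sketch of the gang-scheduling rule from the list above: a VM is dispatched only when all of its non-idle VCPUs can run on real CPUs at the same time, so a VCPU holding a guest spinlock is never descheduled while its siblings spin. The structure and function names are invented; the real scheduler works with per-CPU run queues rather than one global check. The all-or-nothing test is also what creates the idle time the idle balancer later tries to fill.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_VCPUS 16

struct vm {
    int  nvcpus;
    bool vcpu_idle[MAX_VCPUS];   /* idle VCPUs don't need a CPU */
};

/* All-or-nothing test: the VM runs only if every non-idle VCPU can be
 * given its own free physical CPU at the same time. */
bool can_gang_schedule(const struct vm *vm, int free_cpus)
{
    int need = 0;
    for (int i = 0; i < vm->nvcpus; i++)
        if (!vm->vcpu_idle[i])
            need++;
    return need <= free_cpus;
}

int main(void)
{
    struct vm v = { .nvcpus = 4,
                    .vcpu_idle = { false, false, true, false } };
    printf("3 free CPUs: %d\n", can_gang_schedule(&v, 3));  /* prints 1 */
    printf("2 free CPUs: %d\n", can_gang_schedule(&v, 2));  /* prints 0 */
    return 0;
}
```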

    5. Performance

                                              i.     Scalability is limited by the guest OS. How do you get bigger applications?

                                             ii.     ANSWER: write them as cluster apps (as in Porcupine)

                                           iii.     SUPPORT: shared memory, fast communication (messages, RPC)

    6. Scheduling: CPU management

                                              i.     GOAL: balance load across CPUs, ensure isolation

                                             ii.     MECHANISMS: migrate elsewhere: other CPU on node, other CPU in cell, other cell

                                           iii.     2 level policy

1.   Idle balancer: at idle time, looks for work to "steal" from other run queues, starting with the closest and working further away (within a cell). Must obey gang scheduling: only schedule a stolen VCPU when all non-idle VCPUs of its VM are runnable

a.    Why does idle time exist? Gang scheduling prevents a VM from running if there aren't enough CPUs to run all of its VCPUs

b.    Uses local load information – can't do a good global job of balancing

2.   Periodic balancer: look at global load

a.    QUESTION: how avoid polling hundreds of nodes?

b.    ANSWER: use a tree data structure; each node maintains the sum of the load below it (see the sketch after this list)

c.    HOW USE:

                                                                                                    i.     Traverse the tree looking for a left/right imbalance > 1 VCPU

                                                                                                   ii.     Look for VCPU to swap sides

                                                                                                 iii.     Constraint: the destination CPU must not already be running a VCPU from the same VM

                                                                                                 iv.     Constraint: try to avoid splitting VM across cells

3.   Issue: migrating between cells leaves a VM vulnerable to faults in either one

a.    Requires copying the complete state of the VM (e.g. megabytes of memory, I/O state), taking tens of seconds
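
A toy sketch of the load tree the periodic balancer is described as using: each node caches the sum of the load beneath it, so an imbalance of more than one VCPU can be found by walking the tree instead of polling hundreds of CPUs. Names and structure are invented for illustration, and the gang-scheduling and cell constraints from the list above are not modeled. Caching the sums at internal nodes makes the imbalance check proportional to the tree depth rather than the number of CPUs.

```c
#include <stdio.h>
#include <stdlib.h>

struct load_node {
    int load;                        /* sum of VCPU load in this subtree  */
    struct load_node *left, *right;  /* NULL at a leaf (one CPU's queue)  */
};

/* Recompute cached sums bottom-up (would be incremental in practice). */
int refresh(struct load_node *n)
{
    if (!n->left)                    /* leaf: load is the CPU's own count */
        return n->load;
    n->load = refresh(n->left) + refresh(n->right);
    return n->load;
}

/* Walk down toward the more loaded side while the left/right imbalance
 * exceeds one VCPU; returns the subtree to steal a VCPU from. */
struct load_node *find_imbalance(struct load_node *n)
{
    while (n->left) {
        int diff = n->left->load - n->right->load;
        if (diff > 1)       n = n->left;
        else if (diff < -1) n = n->right;
        else break;                  /* balanced within one VCPU */
    }
    return n;
}

int main(void)
{
    struct load_node c0 = { 3, NULL, NULL }, c1 = { 0, NULL, NULL };
    struct load_node root = { 0, &c0, &c1 };
    refresh(&root);
    printf("total load %d, steal from subtree with load %d\n",
           root.load, find_imbalance(&root)->load);
    return 0;
}
```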

    7. Memory Management: sharing memory across nodes

                                              i.     For NUMA: try to maintain locality

                                             ii.     For fault tolerance: avoid forcing VMs that fit in a cell to borrow memory

                                            iii.     Policy: when < 16 MB of memory is free, borrow 4 MB from each preferred cell for which it has < 4 MB available (see the sketch after this list)

1.   If it has enough memory already, don't seek more

                                            iv.     Policy: lend freely when > 32 MB is available
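
A sketch of the borrowing thresholds as stated above (below 16 MB free, borrow 4 MB from each preferred cell for which less than 4 MB is available; lend only while more than 32 MB is free). The cell structure, the field names, and the exact reading of "available" are assumptions made for illustration, not Cellular Disco's actual code.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MB            (1024ULL * 1024ULL)
#define LOW_WATERMARK (16 * MB)   /* start borrowing below this           */
#define BORROW_CHUNK  (4  * MB)   /* amount requested per preferred cell  */
#define LEND_RESERVE  (32 * MB)   /* lend only while free memory is above */

struct cell {
    uint64_t free_bytes;           /* memory currently free               */
    uint64_t borrowed_from[8];     /* bytes still available that were     */
                                   /* borrowed from preferred cell i      */
    struct cell *preferred[8];     /* nearby cells to borrow from         */
    int npreferred;
};

static bool willing_to_lend(const struct cell *lender)
{
    return lender->free_bytes > LEND_RESERVE;
}

/* Called periodically (or on allocation pressure) by each cell. */
void maybe_borrow(struct cell *self)
{
    if (self->free_bytes >= LOW_WATERMARK)
        return;                              /* enough memory: don't seek more */

    for (int i = 0; i < self->npreferred; i++) {
        if (self->borrowed_from[i] >= BORROW_CHUNK)
            continue;                        /* already have >= 4 MB from it   */
        if (!willing_to_lend(self->preferred[i]))
            continue;
        self->preferred[i]->free_bytes -= BORROW_CHUNK;
        self->free_bytes               += BORROW_CHUNK;
        self->borrowed_from[i]         += BORROW_CHUNK;
    }
}

int main(void)
{
    struct cell lender = { .free_bytes = 64 * MB };
    struct cell needy  = { .free_bytes = 8 * MB, .npreferred = 1,
                           .preferred  = { &lender } };
    maybe_borrow(&needy);
    printf("needy now has %llu MB free\n",
           (unsigned long long)(needy.free_bytes / MB));
    return 0;
}
```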

    8. Paging:

                                              i.     Modify OS to identify unused pages to avoid swapping them

                                             ii.     Share pages between VMs – saves disk space if a shared page is swapped out

                                            iii.     Hook the OS's paging writes to check whether the VMM already swapped the page; if so, just update the mapping (see the sketch after this list)

1.   QUESTION: why doesn't VMware do this?

2.   ANSWER: Windows doesn't use a separate paging disk?
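
A sketch of the paging hook described above: intercept the guest's write to its paging disk, and if the VMM has already swapped that frame out, just record the indirection instead of writing the data a second time. All structures and names are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* State the VMM keeps per machine frame backing a guest page. */
struct frame {
    bool     vmm_swapped;     /* the VMM already paged this frame out   */
    uint64_t vmm_swap_slot;   /* where the VMM's copy lives             */
};

/* Per guest-swap-block record: where the data really is. */
struct swap_map_entry {
    bool     redirected;      /* true: data sits in the VMM's swap slot */
    uint64_t vmm_swap_slot;
};

static void do_disk_write(uint64_t block, const void *data)
{
    (void)block; (void)data;  /* stands in for the emulated disk I/O */
}

/* Called when the guest OS writes a page to its (virtual) paging disk.
 * If the VMM already swapped the frame, avoid the double write and just
 * record the indirection, as the notes describe. */
void vmm_intercept_pageout(struct frame *f, struct swap_map_entry *e,
                           uint64_t guest_block, const void *data)
{
    if (f->vmm_swapped) {
        e->redirected    = true;
        e->vmm_swap_slot = f->vmm_swap_slot;
        return;
    }
    e->redirected = false;
    do_disk_write(guest_block, data);
}

int main(void)
{
    struct frame f = { .vmm_swapped = true, .vmm_swap_slot = 42 };
    struct swap_map_entry e = { 0 };
    vmm_intercept_pageout(&f, &e, 7, "page contents");
    printf("redirected=%d slot=%llu\n", e.redirected,
           (unsigned long long)e.vmm_swap_slot);
    return 0;
}
```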

    9. Fault tolerance

                                              i.     QUESTION: What can it tolerate?

1.   HW faults to replicated HW; maybe not disk or network

2.   SW faults – just shut down the affected portion

                                             ii.     HW recovery operation:

1.   QUESTION: What is goal?

2.   ANSWER: identify, clear out affected state, restart processor

3.   Steps:

a.    Detect fault

b.    Diagnose which state is in error (e.g. which cache lines, memory locations).

                                                                                                    i.     NOTE: the HW doesn't signal the failure until the cache line is accessed!

c.    Notify Cellular Disco

                                            iii.     SW recovery operations (see the sketch after this list)

1.   Agree on live set – the still-functioning cells (based on HW registers providing some info). Complication: further failures may occur during recovery

2.   Unwedge communication: cancel stuck RPCs, messages

3.   Terminate VMs that had dependencies on failed cells

a.    Scan all memory pages to look for bad cache lines

b.    Choice of whether to kill early or wait for the VM to access a bad cache line
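
A toy sketch of the SW recovery sequence in the order listed above: agree on the live set, unwedge communication, then terminate VMs that depended on the failed cells. The data structures and the "home cell" simplification are invented; the real recovery involves distributed agreement across cells and scanning memory for bad cache lines.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_CELLS 8
#define MAX_VMS   8

/* Toy global state for the sketch. */
static bool cell_alive[MAX_CELLS];
static int  vm_home_cell[MAX_VMS];   /* cell each VM's resources live on */
static bool vm_running[MAX_VMS];

void hw_fault_recovery(int failed_cell)
{
    /* 1. Agree on the live set (here: trivially mark the failed cell). */
    cell_alive[failed_cell] = false;

    /* 2. Unwedge communication: a real VMM would cancel RPCs/messages
     *    stuck waiting on the dead cell so surviving cells don't hang. */

    /* 3. Terminate VMs that depended on the failed cell (a real system
     *    could instead scan for bad cache lines, or kill on access). */
    for (int vm = 0; vm < MAX_VMS; vm++)
        if (vm_running[vm] && vm_home_cell[vm] == failed_cell)
            vm_running[vm] = false;
}

int main(void)
{
    for (int i = 0; i < MAX_CELLS; i++) cell_alive[i] = true;
    vm_running[0] = true; vm_home_cell[0] = 2;
    vm_running[1] = true; vm_home_cell[1] = 5;
    hw_fault_recovery(2);
    printf("vm0 running=%d vm1 running=%d\n", vm_running[0], vm_running[1]);
    return 0;
}
```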

  3. Other approaches:
    1. Fixed partition (e.g. IBM, Sun) with HW support: good for small apps, fixed size slows down large apps
  4. Overview:
    1. Fault detection: HW, OS crash
    2. Fault isolation: Disco partitions the OS so it uses only some resources (e.g. CPU, memory)

                                              i.     Use a cell as the unit of isolation; bigger than a CPU to allow a larger OS, but smaller than the whole machine

    3. Fault recovery: terminate affected portions