Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors

Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosenblum. Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors. In Proceedings of the 17th Symposium on Operating Systems Principles, 1999.

Reviews due Tuesday, 4/24.

Comments

Summary:

The paper presents Cellular Disco, a virtual machine monitor that manages large-scale shared-memory multiprocessors and exposes them as a virtual cluster. Cellular Disco provides fine-grained resource sharing, overcommitment of resources, and some degree of fault tolerance and containment. The virtual machine monitor runs commodity operating systems on top without modification.

Problem addressed:

The growing number of large-scale shared-memory multiprocessor systems was not being tapped properly because of software constraints: rewriting operating systems to accommodate a large number of processors, variable memory access latency, and hardware fault tolerance is a hard problem. Hardware partitioning, on the other hand, is rigid. A flexible software solution was required, which Cellular Disco provides.

Contributions:

Using a virtual machine monitor as a layer beneath the OS for simplification is an excellent idea. It directly provides software fault tolerance in case of an OS crash, overcommitting of resources, and tighter resource balancing and scheduling. The fundamental advantage is that it removes the need to modify bloated commodity operating system code, which would take much more time and programming effort while providing only a slight performance benefit.
All VMM advantages of heterogeneity (use of different operating systems, etc.) are available.
The concept of cells as a fault isolation unit is a nice idea because when faults occur, they are localized to one or more cells, and the rest of the system can work around the fault. Additional insights were gained from the Hive operating system, and a trusted layer of software was assumed.
Gang scheduling is well motivated.

Flaws:

The comparison of two different operating systems seems like a shortcut which probably isn't too serious but makes for bad reading.
Memory sharing across cells compromises fault tolerance as the authors point out.

Reliability:

Cellular Disco adds availability by introducing fault tolerance for both hardware and software faults. The idea is that fault containment within cells keeps the rest of the system available.

SUMMARY
In "Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors" Govil et al. propose a way of effectively utilizing shared-memory multiprocessor systems by using a VMM that creates a virtual cluster. This approach reduces development costs and offers desirable properties of clusters (e.g. fault containment and resource management).

PROBLEM
Shared-memory multiprocessors are (were) a new technology whose potential wasn't fully utilized by contemporary operating systems because they needed significant modifications to do so. In particular, existing OSes scaled poorly, could not allocate resources effectively, and could not handle fault isolation.

CONTRIBUTIONS
* "Virtual cluster" on a single machine is a clever and elegant hack to ease transition to a new technology
* Providing cluster-like functionality (e.g. fault isolation) on a single physical machine
* Not taking the cluster analogy too far (e.g. leveraging advantages of running on the same machine like use of shared memory between cells)

FLAWS
* Since Cellular Disco is not completely transparent (still requires some software modification), ultimately it seems to be a temporary fix until OSes catch up.
* Performance and reliability experiments were run on different systems :)

RELIABILITY
Fault containment on a single machine (with redundant hardware). That increases MTTF (of the entire system) in the sense that a crippled system could continue to limp along instead of going down. Similarly, availability can be improved: although faults degrade the system, they do not necessarily disable the entire system.

Summary:
This paper describes a virtual machine monitor (VMM) called Cellular Disco, which enables existing operating systems (OSes) to run efficiently on large shared-memory multiprocessor machines without modifying the OS. The two key aspects of Cellular Disco are fault containment and shared resource management, both achieved by resource clustering and smart management.

Problem Addressed:
Machines on the market were getting larger and larger in terms of processor count and memory size, but no OS could fully utilize them. Modifying an OS is the most straightforward approach, but it is too difficult because OS code was already very large and complicated. Therefore, an architecture that lets an existing OS use the resources efficiently without code changes was needed. A VMM called Disco had been proposed, but it could still be better optimized for shared-memory multiprocessor machines.

Contributions:
By simply clustering the shared memory and exploiting the multiprocessor, Cellular Disco provides fault containment and simple, flexible memory allocation without requiring OS modification. This was possible because Disco was designed to run directly on the hardware as a VMM.
A resource management mechanism that allows migration of virtualized units and flexible memory allocation through borrowing improves the flexibility and utilization of resources.
Also, by combining two types of CPU scheduling, it tries to schedule CPUs efficiently and adaptively.

Possible Improvements:
Cellular Disco still does not completely deliver the benefit of leaving existing programs unchanged. It is difficult to design a good VMM without any cooperation from the OS or applications, but it would be better if there were fewer requirements on both software and hardware.
To me, the virtual paging disk mechanism seems to complicate state management more than its benefit justifies. Of course improving performance is important too, but it might make things a little difficult for Cellular Disco to manage.
Evaluating the two aspects in two different environments makes it difficult to show that both aspects hold in Cellular Disco at the same time.

Reliability:
Cellular Disco improves reliability through fault containment based on partitioning the hardware. Since the partitioning is looser and more flexible than physically separate hardware, it also provides some flexibility in using and allocating the partitions. Fault containment also shortens recovery time.

Summary
Cellular Disco is a "large-scale shared-memory virtual cluster" service. What this means is that Cellular Disco is actually a virtual machine monitor designed for use on large systems with multiple processors. Virtual machine techniques are used to effectively create abstract machines that are organized into a virtual cluster. By choosing to go the VM route, Cellular Disco provides fault containment and efficient resource management, with little overhead.

Problem
While multiprocessor machines were available, development cost left most operating systems unable to utilize such machines. Additionally, no operating systems offered any sort of fault tolerance.

Contributions
VCPU load balancing via migration is pretty neat. The authors identify three classes of VCPU migration: moving to a different processor in the same node, moving to a different node within the same cell, and moving a VCPU across cell boundaries. Each of the movements does incur a hit due to copying over the required data, but the performance results seem to validate this design decision.
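
The three migration classes above can be sketched as a small classifier. This is only an illustration: the `CpuLocation` type, its fields, and the class names returned are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuLocation:
    """Hypothetical placement of a physical CPU: cell > node > cpu."""
    cell: int
    node: int
    cpu: int


def migration_class(src: CpuLocation, dst: CpuLocation) -> str:
    """Classify a VCPU move by the widest boundary it crosses."""
    if src == dst:
        return "none"
    if src.cell != dst.cell:
        return "inter-cell"   # crosses a fault-containment boundary
    if src.node != dst.node:
        return "inter-node"   # same cell, different node
    return "intra-node"       # same node, different processor


if __name__ == "__main__":
    a = CpuLocation(cell=0, node=0, cpu=0)
    print(migration_class(a, CpuLocation(cell=0, node=0, cpu=1)))
    print(migration_class(a, CpuLocation(cell=0, node=1, cpu=0)))
    print(migration_class(a, CpuLocation(cell=1, node=4, cpu=0)))
```

A balancer could use the returned class to prefer the cheapest move first, since the review notes that each wider move copies more data.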

The L2TLB, a second-level software TLB, allows for virtual memory to be mapped to machine memory quickly and easily.

The virtual paging disk concept works around the redundant paging problem that we saw in VMware. Cellular Disco simply gives the kernel a virtual paging disk that allows Cellular Disco to trap reads and writes, and it becomes a simple matter of handling paging within the virtual machine monitor.

Memory borrowing seems like a pretty good solution for effective use of shared memory in a partitioned system. The high-level overview is that if a cell requires more memory than it has been given, the hungry cell may borrow memory from less needy cells.
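
The high-level borrowing idea can be sketched as below. This is a hypothetical greedy policy for illustration, not the policy from the paper: the function name, the `reserve` parameter, and the "richest lender first" ordering are all assumptions.

```python
def borrow_pages(free_pages, borrower, needed, reserve=0):
    """Lend pages to `borrower` from cells with spare memory.

    free_pages: dict mapping cell name -> free page count (mutated in place).
    reserve:    pages each lender keeps for itself (assumed parameter).
    Returns a dict mapping lender -> pages lent.
    """
    loans = {}
    # Assumed ordering: ask the cells with the most free memory first.
    lenders = sorted(
        (cell for cell in free_pages if cell != borrower),
        key=lambda cell: free_pages[cell],
        reverse=True,
    )
    for lender in lenders:
        if needed <= 0:
            break
        spare = free_pages[lender] - reserve
        if spare <= 0:
            continue
        take = min(spare, needed)
        free_pages[lender] -= take
        loans[lender] = take
        needed -= take
    return loans


if __name__ == "__main__":
    free = {"A": 10, "B": 100, "C": 50}
    print(borrow_pages(free, "A", needed=120, reserve=20))
    print(free)
```

Each entry in the returned dict is also a new fault dependency for the borrower, which is why a real policy has to weigh utilization against containment.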

I like the (apparently) pervasive use of VM priorities. It would pretty obviously allow for a good amount of flexibility when creating management policies.

Flaws
I am definitely in agreement with the other people saying that Cellular Disco seems to offset the difficulty and expense of writing multiprocessor-optimized code with writing cluster-optimized code.

Reliability
Cellular Disco pretty much has the reliability gains that we saw from Nooks. Faults are tolerated and recovered from by restarts, and do not affect other parts of the physical machine. The particularly nice thing about Cellular Disco is the minimal overhead, which seems like a reasonable cost given the benefits offered.

Summary
This paper presents Cellular Disco, a system for managing resources for virtual clusters on shared-memory multiprocessors. The basic idea is that an effective way to use a shared-memory multiprocessor machine is to create a virtual cluster in which resources can be moved around as needed.

Problem
The main problem that this paper addresses is the lack of support for effectively using the resources available on shared-memory multiprocessor machines. Existing approaches either scale poorly or cannot effectively move resources around partitioned spaces.

Contributions
* Recognition of the usefulness of the virtual machine model for dividing up system resources effectively, and of how that model can allow better maximum performance than simple partitioning of hardware would, thus providing a benefit of a multiprocessor machine over a cluster of uniprocessor machines.
* Use of isolation between virtual machines to protect VMs not running on the affected hardware from hardware faults.

Flaws
While this work does allow for better scalability and flexibility from a multiprocessor machine than might otherwise be realized, a major weakness seems to be that it does not allow for exploiting parallelism in problems with fundamentally parallel characteristics, at a level that batch processing cannot exploit. With this technique you would have to build an MPI cluster among your virtual machines, and this just seems silly and unnecessary!!

Reliability
While they are providing a reliability benefit in that one VM does not take down the rest due to a hardware fault, this does not seem to provide any reliability benefit over a traditional cluster; it just keeps everything at the same level.

Summary
Cellular Disco provides fault isolation and resource sharing for systems with a large number of shared memory processors without modification to the operating system.

Problem
The number of processors was quickly increasing, but operating systems were not yet designed to handle fault isolation for large numbers of processors. Without fault isolation, somewhat common failures require a complete restart of the machine.

Contributions
Cellular Disco inserts a Virtual Machine Monitor layer just above the hardware. This layer allows Cellular Disco to dynamically change CPU and memory allocation. This allows for a borrowing system in which a cell not using all of its memory or processors can have them borrowed by another cell that is short on resources. This is the strength of the article. It allows for isolation by keeping the cells independent in the common case, but can also dynamically increase the size of a cell when it is heavily utilized.

The article points out the drawbacks of too much sharing. By increasing the number of resources allocated to a cell, the chances of a failure also increase. So resource allocation must balance the need for high utilization with isolation demands.
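
The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below assumes cell failures are independent and equally likely, which is a simplification of my own and not a claim from the paper; the 1% per-cell failure probability is likewise an assumed number.

```python
def vm_failure_probability(p_cell: float, cells_used: int) -> float:
    """Probability that at least one of the cells a VM depends on fails.

    Assumes independent cell failures, each with probability p_cell
    over the period of interest.
    """
    return 1.0 - (1.0 - p_cell) ** cells_used


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, round(vm_failure_probability(0.01, k), 4))
```

Under these assumptions, a virtual machine that uses resources from more cells is exposed to more failures, so spreading resources widely raises utilization at the cost of containment.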

Moving CPUs between tasks was not a simple process. To allow frequent movement of CPUs and achieve good balance, the process of switching tasks must be fairly lightweight. A task change for a CPU resulted in a number of cache misses, which were not acceptable. To improve this, the cache was saved and moved with the CPU. Moving the software TLB when reallocating processors made this much more practical.

Cellular Disco also provided shared memory among the cells to make applications spanning many cells efficient. By using shared memory, applications can avoid RPCs for the common interactions between the cells.

Possible Improvements
The entire article is an economic argument that modifying the OS is too costly, but Cellular Disco requires modification of the applications. This defeats the economic argument, since a great deal of effort will be required to modify each application to run on the Disco system. Not every application would need to span many cells, but enough would need to be custom written for Cellular Disco that its economic benefits are not nearly as great. The article uses an operating system designed for many shared processors as a benchmark for performance comparisons. It is unlikely that Disco will be able to outperform that operating system in the future. Without performance benefits, and with at best marginal economic benefits, it does not seem like Cellular Disco would make sense for many people.

Also, again the performance and reliability experiments were run on separate systems. This doesn't allow for a good comparison, since the systems might vary greatly.

Reliability
The reliability improvement in this article focuses on hardware failures. By isolating hardware dependencies, a failed component can be better tolerated by applications running on the surviving hardware.

Summary:
The paper describes Cellular Disco, which turns a large-scale shared-memory multiprocessor into a virtual cluster that provides fault containment and avoids OS scalability bottlenecks.

Problem:
Though shared-memory multiprocessor systems with hundreds of processors were commercially available, most existing OSes couldn't scale to them because of the complexity involved. The main problems included memory allocation (complex because of non-uniform memory access times) and fault containment (a single hardware failure meant the whole system was down). The approaches taken to solve them included hardware partitioning (which limits flexibility in allocating resources) and major changes to the OS (which require major development effort). Disco attempted to solve this by using a virtual machine monitor to run unmodified commodity OSes. Cellular Disco extends Disco by providing hardware fault containment and aggressive global resource management, and by actually running on real hardware (though the prototype didn't; it ran on top of Irix).

Contributions:
* Concept of creating a virtual cluster on shared memory multiprocessors. You get the best of both worlds here - performance+flexibility of shared-memory multiprocessors and scalability+fault-containment of clusters.
* Hardware fault containment through the use of 'cells', which are the fault-containment-units.
* Overcommitting resources - We saw this in VMware too, and it provides better resource utilization
* CPU balancing by allowing VCPUs to migrate within a node, across nodes, or even across cells. Cellular Disco also used three-level CPU-scheduling policies (Idle, Periodic and a 'local periodic')
* Concept of "Borrowing Memory" - made sure applications are not limited to initial memory allocated to a cell (a major drawback of Hardware partitioning).


Flaws:
- It is not clear from the paper whether it required changes in the OS that runs on top of the VMM. It talks about things like "tracking OS memory usage and paging I/O" - I wonder if this is done transparently to the OS.
- The virtual machine monitor allows applications running on virtual machines to bypass the OS and talk to each other. This improves performance, but makes applications directly dependent on the hardware. It would also be interesting to know how the OSes allow applications to access memory hardware directly and how the VMM handles load on that segment of memory (does its own swapping, etc).
- While measuring virtualization overhead, they compared performance in Irix 6.4 to that in Irix 6.2. They explained why they couldn't use 6.2 without the VMM on that hardware, but why couldn't they use 6.4 above the VMM as well?
- For memory borrowing, the paper said the policy is to borrow from all cells in its list from which the cell hasn't borrowed yet. Won't this complicate the dependencies and fault recovery?
- Use of a 'paging disk for virtual machines' - commodity OSes allow creation of a paging store anywhere. This restriction means that the OS will have to be modified to use only this 'special disk'. I also wonder how many of these 'special disks' would be needed on such a system to avoid contention.

Relevance:
Cellular Disco attempted to do to shared-memory multiprocessors what OSes did to hardware decades ago - virtualize it. I guess it got similar results as well - abstraction that allowed code reuse above it (in this case, OSes themselves), fault management and resource sharing (through a level of indirection), and an overhead (of the Virtual Machine Monitor). Some of the finer details of the implementation make me think it may not scale as much as they estimated (the VMM and its data structures would become a bottleneck, and use of a special paging disk to make memory decisions will soon result in contention).

Summary

In this paper, the authors describe their approach to isolating and confining hardware faults in shared-memory multiprocessor systems with a very low execution time penalty, using Cellular Disco, a virtual machine monitor layer built on a number of semi-independent cells.

Problem Description

The inability of operating systems to survive hardware or system software failures results in the loss of all applications running on the system and requires the entire machine to be rebooted. Because of the development cost and complexity of the operating system changes required to prevent these failures, the authors propose adding a virtual machine monitor layer that provides hardware fault isolation.

Summary of Contribution

There are two main contributions of this paper.
1. Cellular Disco is internally structured into a number of semi-independent cells which act as fault-containment units, and a virtual machine layer is added on top of it. As a result, the impact of most hardware failures is confined to a single cell. The main advantage of using this approach is that the operating system is not required to be modified.
2. Another important feature provided by Cellular Disco is efficient resource management. Cellular Disco allows virtual machines to over-commit the actual physical resources present in the system. This leads to significantly better utilization of the system.

Flaws

In order to fully limit impact of most faults to a small portion of the machine, Cellular Disco requires the system to have a hardware fault containment unit. As a result, Cellular Disco cannot be used on every system.

Reliability

Cellular Disco provides reliability using fault containment. Using the hardware fault confinement technique, the impact of faults is confined to a small portion of the machine.

Summary:
The paper describes the architecture of Cellular Disco, a virtual cluster service that provides high performance while retaining some degree of ability to deal with faults, and with a simpler implementation than if the traditional OS were modified to get the same benefits.

Problem:
Existing operating systems weren't really able to take advantage of highly multiprocessor systems because modifications were too expensive. At least some systems (like Irix) could use them, but did not provide fault tolerance.

Contributions:
* Taking the idea of a cluster, and more or less virtualizing one on a highly multiprocessor machine
* Figuring out when to violate the strict abstraction of a cluster -- for instance, using shared memory instead of message passing (and a bunch of nonblocking algorithms), being able to share memory at all between cells, etc. (You could probably do sharing between traditional clusters too, but they were able to take advantage of the fact that the computer is essentially built for that)
* Some of the techniques I thought were pretty neat. For instance, running Irix in the background and world swapping to it so they could use the drivers in Irix was pretty cool. It's neat enough that it's almost unfortunate that it's just an artifact of it being a prototype. Or the fact that they have deterministic scheduling algorithms they can run on each processor to select allocations without communicating is neat.
* Observing that gang scheduling is the right way to do things to avoid wasting cycles spinning; makes me wonder if something like VMware does this
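
The gang scheduling and communication-free scheduling points above can be sketched together. This is a hypothetical illustration and not the paper's algorithm: the tuple layout, the priority-then-name ordering, and the function name are assumptions. The point is only that a VM runs when all of its VCPUs can be placed at once, and that a deterministic rule lets every processor compute the same answer from the same inputs.

```python
def pick_gangs(vms, num_cpus):
    """Choose which VMs run in the next time slice.

    vms: list of (name, priority, vcpu_count) tuples.
    A VM is scheduled only if ALL of its VCPUs fit (gang scheduling),
    so no VCPU spins waiting on a descheduled sibling.
    """
    chosen = []
    free = num_cpus
    # Deterministic order: higher priority first, ties broken by name.
    # Any CPU evaluating this on the same input gets the same result.
    for name, _priority, vcpus in sorted(vms, key=lambda v: (-v[1], v[0])):
        if vcpus <= free:
            chosen.append(name)
            free -= vcpus
    return chosen


if __name__ == "__main__":
    vms = [("db", 2, 4), ("web", 3, 2), ("batch", 1, 3)]
    print(pick_gangs(vms, num_cpus=6))
```

Because the function has no hidden state or randomness, running it independently on each processor yields a consistent allocation without messages between them.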

Flaws
The evaluation... on one hand, it was pretty thorough, but at the same time, not. In particular, the comparison where they used both Irix 6.2 and 6.4 is sort of like the network card thing in Nooks, except that this one actually matters because they are directly comparing them. The redeeming thing is that it sounds like most of their benchmarks are being hurt by the differences between their versions.

Reliability
The paper seems to suggest that their fault containment idea presents a middle ground between a monolithic system and a fault tolerant one, but aside from not explicitly replicating things and operating only at the granularity/level of the OS, I'm not entirely sure how it differs from something that's actually fault tolerant. (Though now that I type that, maybe it IS the replication that distinguishes them.) But this is their approach -- give up complete fault tolerance because it is too high cost for most people, and trade it for isolating failures to their cell. Then they try to schedule resources to balance the performance need with the probability of a cell failure taking out a virtual machine.

Paper Review: Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors [Govil et al.]

Summary:

This paper describes Cellular Disco, a virtual machine monitor technique
that means to provide the advantages of hardware partitioning and fully
scalable operating systems, namely fault isolation and flexible resource management, respectively.

Problem:

The problem is that shared-memory multiprocessor machines were
available, but they were not being used to their full capability, and
none provided a fault tolerant environment - i.e. any hardware or
operating system failure could take down the entire system.

Contributions:

* The authors develop a system that has significant advantages over
fault tolerant clusters in that tasks could continue to run by migrating
fairly seamlessly to other nodes and CPUs. Also, the administrators can
flexibly configure the balance of resources, such as CPU and memory,
amongst virtual machines - rather than having to assemble a set of
appropriately sized individual real machines, and spares.

* Contrary to previous work by some of the same authors (Hive), they
find that there is negligible performance cost to providing fault
containment using a virtual machine monitor.

Flaws:

* This work touts fault containment and tolerance as one of its two main
benefits, but the implementation that they evaluated compromises in a
number of ways that prevent complete evaluation: (1) it uses a
commercial OS as the basis for its monitor (ostensibly to borrow its
device drivers) which is a single point of software failure and (2) they
run on hardware without the requisite features for the monitor to
detect hardware failures.

* The justification is lacking for their using two different versions of Irix in the performance overhead evaluations; this is especially true
given that they note significant differences (such as blocking vs. spin
locks) in the two versions of the OS.

Reliability Impact:

If this architecture can be made to work well, it could provide better
"times-to-repair" - on the order of half a second - yielding much
better availability than clusters, with the added benefit of
flexible resource management, albeit at the cost of customizing or
probing into the guest operating systems in the virtual machines.
Still, it's not clear from this work whether this sort of system can
achieve the fault isolation of a cluster of individual machines.
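
A rough way to see why a half-second repair time matters is the standard
steady-state availability formula, MTTF / (MTTF + MTTR). The sketch below
is illustrative only: the 1000-hour MTTF and the 10-minute reboot time
are assumed numbers, not measurements from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    mttf = 1000.0                 # assumed mean time to failure, in hours
    fast_repair = 0.5 / 3600.0    # half a second, expressed in hours
    full_reboot = 10.0 / 60.0     # assumed 10-minute reboot, in hours
    print(f"half-second recovery: {availability(mttf, fast_repair):.9f}")
    print(f"full reboot:          {availability(mttf, full_reboot):.9f}")
```

With the same failure rate, shrinking the repair time is what moves
availability, which is the effect the paragraph above describes.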

Summary:
A previously presented system for multi-processors is extended in this work to provide better fault resilience in an environment where any single faulting processor can crash the whole system. Ideas from work done on computational clusters are leveraged and applied to a single multi-processor system.

Problems Addressed:
A big problem that arises as the number of processors within a system increases is that, generally, a fault in any single processor can bring down the whole machine. The problem is more important in a multi-processor environment, since the likelihood of a single processor failure is greater than in a single-processor environment and grows as the number of processors increases. Commodity operating systems are not good at dealing with hardware failures and must be modified to be fault tolerant; however, this is expensive and hard to do given the large amount of code that makes up a modern OS. Operating systems also generally have an optimal set of resources that they are able to deal with efficiently, and resources beyond this point become underutilized at times.

Contributions:
To reduce the likelihood of the operating system being affected by a failing processor, the system is divided up into cells. A cell consists of a number of physical processors and forms the base of a virtual machine that supports the OS. Each cell thus has an operating system running within the cell's virtual machine. A hardware failure will only affect the cells that are using the resource that fails, not the whole machine. Unlike system partitioning strategies, which can also tolerate faults without a system-wide crash, the Cellular Disco strategy can adjust and move physical resources around among the cells depending on workload. This prevents hot spots within the system where some physical resources may be overloaded while others sit idle. Memory can also flow freely between cells depending on individual cell needs. All the techniques used facilitate efficient use of the system resources while still maintaining the nice fault isolation properties of a partitioned system.

Flaw:
It would be interesting to understand how much memory overhead is required to run an operating system within each cell. Since all the cells are the same and run the same OS, a lot of memory could probably be shared; however, the cells are not synchronized, so there will be differences. Also, the amount of processing overhead going to all the system calls in each OS seems like it could become significant.

Reliability:
The large number of processors becoming available to system architects is used in this work to basically build a cluster type environment within a single system. Many of the benefits of a cluster can be realized from this architecture to provide a more reliable system that does not completely crash at the sign of any single fault.
