
Design and Architecture of the Microsoft Cluster Service -- A Practical Approach to High-Availability and Scalability

Werner Vogels, Dan Dumitriu, Ken Birman, Rod Gamache, Mike Massa, Rob Short, John Vert, Joe Barrera. The Design and Architecture of the Microsoft Cluster Service -- A Practical Approach to High-Availability and Scalability in Proceedings of the Fault-Tolerant Computing Symposium, 1998.

Reviews due Tuesday, 4/17.

Comments

Summary:

The paper presents an overview of the Microsoft Cluster Service (MSCS), which extends the Windows NT operating system to support highly available services either transparently or with minimal changes to existing server applications.

Problem addressed:

A cluster can provide high availability and scalability if implemented correctly. MSCS was intended to give Windows NT software mechanisms for high availability through cluster services while preserving application compatibility.

Contributions:

The cluster service reuses some ideas previously implemented in Tandem and other highly available systems.
The cluster is a single highly available set of services, which the authors claim is more tightly integrated with the OS design and structure.
Virtualization of the services fits naturally into the framework, which makes encapsulation easier to provide.
More often than not, a distributed approach was preferred when designing the group dynamics.

Flaws:

The RPC model of interaction may not have been the best one. Message passing may have been more efficient.
Because the authors had to describe a fairly large system, they were not able to do justice to its components (managers and processes).

Reliability:

The primary concern of the paper is high availability, i.e., keeping the mean time to repair low through migration and redundancy.

In "The design and architecture of the Microsoft cluster service," Vogels et al. present just what the title promises. MSCS's design goals are availability, use of commodity hardware, scalability, transparency, and reliability.


PROBLEM
How should one design a highly available cluster that is easy to manage, reliable (fault-tolerant), and can be built with commodity hardware?


CONTRIBUTIONS
* Discussion of MSCS design goals (ease of use, commodity, scalability, transparency, reliability)
* Identification of relevant areas that were not included in the design to limit scope (no development support, no running application migration support, no shared state recovery)
* Thorough description of cluster abstractions and operation
* Emphasis on ease of use
* Though not an evaluation per se, the authors modified several commercial products to run on top of MSCS and describe this in an Experience section.
* Distributed-like approach to node management (nodes join cluster by themselves)
* Emphasis on logging

FLAWS
* Because this is a technical report on a first-phase design and implementation, some interesting sections are missing, for example a proper evaluation section and a related-work section.
* The statement "To be effective, the cluster must be as easy to program and manage as a single large computer" sounds like management-speak. Although it's hard to disagree that easy is good, the statement in its context implies that ease of use comparable to that of a single computer is a necessary and sufficient condition for cluster technology to be effective. That is arguable.
* The authors employ "shared nothing" clustering model but also talk about shared resources (e.g. disks). I'm a bit confused about that.
* The system appears distributed in some respect (e.g. auto-join of nodes), but also very centralized (shared registry with globally consistent view). It seems to me like a bit of design dissonance that could lead to trouble.
* I found that the paper had too few high-level and low-level details that could help me better understand the system. People like me probably weren't the target audience.
* The authors don't really offer a foothold for criticism. They basically say: here it is, and it works (based on the Experiences section). Well, if it works, who am I to say there is something wrong with it?

RELIABILITY
The system goal here is high availability based on resource replication for fail-over and fast failure detection. It also seems that the authors envision future phases of their system to be truly geographically distributed and they anticipate such problems as group membership (who actually does the work), split brain syndrome and others.

Summary
MSCS is an attempt to bring machine clustering to the world. This paper acknowledges the difficulty of running existing clustering solutions. The authors then go on to explain how the various components of MSCS will alleviate said difficulty, hopefully leading to a glorious future filled with cluster aware apps.

Problem
Apparently, administering clusters of machines was a difficult problem. The authors posit that the difficulty of administration has led to a dearth of cluster-aware applications.

Contributions
The focus on using commodity hardware with MSCS was a nice touch, and pretty future-proof too (efforts such as TerraServer, Google, etc).

Perhaps the focus on abstractions wasn't much of a contribution, but it certainly made the paper easy to grasp, and I expect the abstractions also work(ed) well to make MSCS more admin-friendly.

I thought that the juxtaposition of virtual servers and transparency was interesting, and a little strange. While I realize the benefits of virtual machines, I think (from an overhead and design point of view) that MSCS isn't really the right place to put a virtual machine implementation.

Flaws
Although I appreciated the relative simplicity of this paper, and I think MSCS is cool, this paper doesn't seem to be much more than marketing...

Much like everyone else, I'm curious as to how MSCS would scale.

I would've liked to learn what features are available to programmers to make MSCS-aware apps easier to write.

Reliability
There's not much in the way of hard numbers but intuitively it seems as though the comparatively easy ways of implementing redundancy would increase the reliability of the cluster.

Summary:
This paper introduces Microsoft Cluster Service, a Windows NT service that provides higher availability by presenting an abstracted resource formed from a group of nodes. The system also provides a simple interface so that users can actually use it.

Problem Addressed:
Even though clustering can greatly improve availability, users will not adopt it if deployment and management are difficult. A clustering mechanism that is easy to set up and manage while providing high availability was therefore needed.

Contributions:
The way they abstracted and managed the resources of the cluster is clear and straightforward, which makes it easy to construct a cluster and provides transparency to users.
The handling of nodes forming, joining, leaving, or crashing out of the cluster is automated at many stages, which makes the system easier to use. Windows NT was already a common system with a large base of potential users, which also made MSCS easier to adopt than other clustering mechanisms.
The virtual server mechanism provides a layer of indirection, so users can access a resource through a single virtual address and gain high availability without specially designed applications, because the virtual server migrates resources transparently in case of a crash.
The examples of porting well-known applications encourage users to start using MSCS and also clarify future directions.

Possible Improvements:
To keep cluster management simple, much of the state is logically centralized; if all of these modules resided physically on one node, that would limit scalability and risk failure of the whole cluster.
Since this is an introductory paper, most of the mechanisms and policies are explained only briefly. I believe the technical details exist somewhere, but it would have been better to see some real interface designs and the like.
Even though it is difficult to evaluate availability in a real environment, some evaluation based on a comparison of MTTR might be interesting.

Reliability:
MSCS provides reliability through redundancy of resources, which is abstracted at the user-interface level. The service migration mechanism supports this redundancy, and automated setup and configuration make the service easier to start using. Compared with other clustering services, it is much more of a cluster for everyone.

Summary
The Microsoft Cluster Service allows for high availability systems through reducing MTTR by automatically detecting and restarting failed applications. The main contribution is a simpler interface to use the cluster.

Problem
Clustering is a great idea, but unless clusters become simple and easy to use, they will never be adopted by the mainstream.

Contributions
Their modeling of resources seems somewhat clever. A tree is built with the application at the top and contributing resources under the application. So a whole group of needed resources is created for each application. This simplifies what needs to be moved in the failover case.
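The dependency tree described above amounts to a topological order: each resource is brought online only after the resources it depends on, and the group is taken offline in the reverse order. A minimal sketch using Python's standard `graphlib`, assuming a hypothetical resource group; it illustrates the idea and is not MSCS's actual code.

```python
from graphlib import TopologicalSorter


def online_order(dependencies):
    """Order in which a group's resources can be brought online.

    `dependencies` maps each resource to the set of resources it depends on;
    every dependency is placed before the resource that needs it.
    """
    return list(TopologicalSorter(dependencies).static_order())


def offline_order(dependencies):
    """Taking the group offline reverses the online order: dependents go first."""
    return list(reversed(online_order(dependencies)))


# Hypothetical resource group: the application sits at the top of the tree.
sql_group = {
    "SQL Server": {"Shared Disk", "Network Name"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Shared Disk": set(),
}
```

Because the whole group is described in one structure, failover can move and restart it as a unit, which is the simplification the reviewer points to.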

They present their approach to cluster membership management. Joining involves finding a neighboring node that is already a member and willing to act as a sponsor for the joining node. Heartbeats are used to detect node failures.
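A toy model of sponsored joins plus heartbeat failure detection. The `Membership` class, the heartbeat interval, and the missed-heartbeat threshold are illustrative assumptions, not values from the paper; the clock is injectable so the behavior can be checked without waiting.

```python
import time


class Membership:
    """Hypothetical membership view: sponsored joins and heartbeat timeouts."""

    def __init__(self, founder, interval=1.0, max_missed=5, clock=time.monotonic):
        self.interval = interval        # expected seconds between heartbeats (assumed)
        self.max_missed = max_missed    # silent intervals tolerated before suspicion (assumed)
        self.clock = clock
        self.last_seen = {founder: clock()}

    @property
    def members(self):
        return set(self.last_seen)

    def join(self, sponsor, newcomer):
        # Only a node that is already a member may sponsor a newcomer.
        if sponsor not in self.last_seen:
            raise ValueError(f"{sponsor} is not a member and cannot sponsor")
        self.last_seen[newcomer] = self.clock()

    def heartbeat(self, node):
        if node in self.last_seen:
            self.last_seen[node] = self.clock()

    def suspected(self):
        # A node is suspected once it has been silent longer than max_missed intervals.
        limit = self.interval * self.max_missed
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > limit)
```

For example, if B joins at t=0 via sponsor A and only A heartbeats at t=3, then at t=6 `suspected()` reports `["B"]`.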

Access paths, DNS names, and IPs are migrated with resources. This allows for transparency from the point of view of a client. The client does not need to know where resources are physically located, because it knows that the methods to access resources will move with the resource.
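The transparency argument can be sketched as a lookup from a stable virtual name to a virtual IP address, where failover changes only the hosting node. The class, the name, and the address (taken from a documentation range) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualServer:
    """A resource group's client-facing identity (hypothetical model)."""
    name: str
    ip: str
    node: str  # physical node currently hosting the group

    def fail_over(self, new_node: str) -> None:
        # The network name and IP address migrate with the group,
        # so only the hosting node changes.
        self.node = new_node


def resolve(directory: dict, name: str) -> str:
    """What a client sees: the virtual IP for a name, never the physical node."""
    return directory[name].ip
```

A client that cached `resolve(directory, "SQLVS")` before a failover still holds a valid address afterward.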

Many other ideas are presented. All updates to the cluster services are forced through a single 'locker node' in order to enforce serializability and to guarantee that all members receive the update. The registry is modified so that the registry an application sees is a subtree of the main registry. This allows for virtualization by presenting applications a consistent view of their original registry settings as they move from machine to machine. Also, a single log holds logging information from every node, allowing for easy debugging.
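A minimal sketch of the serialization idea behind the locker node, assuming a single in-process locker that stamps each update with a global sequence number and delivers it to every member. This is not the actual MSCS global update protocol, which also has to cope with sender and locker failures; neither is modeled here.

```python
class Node:
    """A cluster member that applies updates strictly in sequence order."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.applied = 0  # sequence number of the last update applied

    def apply(self, seq, key, value):
        if seq != self.applied + 1:
            raise RuntimeError(f"{self.name}: expected update {self.applied + 1}, got {seq}")
        self.state[key] = value
        self.applied = seq


class Locker:
    """Single node through which every cluster-wide update is funneled.

    One global sequence number per update serializes concurrent updaters:
    every member applies the same updates in the same order.
    """

    def __init__(self, members):
        self.members = members
        self.seq = 0

    def global_update(self, key, value):
        self.seq += 1
        # Deliver to every member; failures and retries are not modeled.
        for node in self.members:
            node.apply(self.seq, key, value)
        return self.seq
```

After any sequence of `global_update` calls, all members hold identical state, which is the consistency property the reviewer describes.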

Possible Improvements
The ideas of the paper are well explained and solve their goal of simplifying the cluster interface. Maybe it is due to such good explanations, but I felt like the article could have used a few more novel ideas. Many of the items presented such as the resource management, membership management, failover procedure were sound solutions, but seemed more like engineering solutions than research solutions. I was not wowed by any new and clever algorithms or approaches in the article. Even so I am sure the end system was a great improvement over a single PC.

The paper might have been a bit stronger if it were published after a bit more work had been completed. At the time of publication a cluster consisted of two nodes with the quorum and file locker relying on a single machine, along with a logging system that forwards log events to all other machines. Scaling beyond two nodes and more elegant distributed solutions to some of the problems would make for a more interesting paper.


Reliability
Machine failure is what is being considered here. By automatically restarting applications on a different machine with the needed resources, MTTR is decreased and availability is increased.
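The MTTR argument can be made concrete with the standard steady-state formula A = MTTF / (MTTF + MTTR). The figures below are hypothetical, since the paper reports none: a node that fails every 1000 hours, repaired manually in 4 hours versus failed over automatically in 3 minutes.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    return (1 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures: same failure rate, different repair times.
manual = downtime_hours_per_year(1000, 4)       # about 34.9 hours per year
failover = downtime_hours_per_year(1000, 0.05)  # about 0.44 hours per year
```

The failure rate is identical in both cases; availability rises only because repair time shrinks.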

Summary
This paper presents Microsoft Cluster Service (MSCS), which has the goal of supporting high-availability services.

Problem
Cluster technology promises (promised) to allow services to spread over many commodity machines and over several geographic locations. However, for this to happen, cluster technology must improve so that applications can be "automatically launched."

Contributions
* Design goals for MSCS of running on commodity hardware, scalability, transparency and reliability.
* Abstractions of node, resource, quorum resource, resource dependencies, resource groups, and a cluster database.
* Cluster operation activities and management
* Virtual server abstraction and presentation to clients as a "single stable environment"
* Application example with MS SQL Server.

Flaws
The main flaws are the lack of evaluation and justification. Though maybe these are not as necessary since they are betting with their money.

Reliability
The impact that this work has on reliability is the use of multiple instances of the same application to provide reliability.

Summary:
The paper discusses Microsoft Cluster Service, a feature in Windows NT that supports high-availability services.

Problem:
Clusters provide better fault tolerance and performance than a single fast computer. However, users often found it hard to program applications for clusters, due to interdependencies between applications (e.g., order of startup), wide-area integration issues, etc. It is also difficult to configure clusters with the desired management and security properties. MSCS attempts to solve these issues in phases, starting with high availability for some well-known services and providing a simpler user interface and a sophisticated application model.

Contributions:
- Abstraction of entities in a cluster - Node, Resources, Resource Groups, Resource Dependencies, Cluster Database.
- Clusters using commodity everything - operating system, hardware, networks etc
- Easy addition/removal of nodes and resources - node is just another computer and resource is another DLL (Resource Control Program Library)
- Virtual servers - an extension of resource groups and also of virtual machines. They allow a VM to migrate to another node in case of failures.

Flaws:
- The paper is at a very high level. It isn't clear how some of the operations really work. For example, in resource migration, how is application state recreated at the new node? Is there some kind of transaction mechanism?
- Some of the mechanisms employed, such as regroup, do not seem well designed for wide-area network usage. It would take a long time before the cluster settles again.
- Again on scalability: there are too many 'central' entities (quorum resource, lock manager, membership manager). They can all fail over, but failover for such components usually takes longer to recover from.
- It would also be interesting to see the scalability aspects even on a local area network, with contention for the quorum resource.
- It is not clear what happens to services on regroup if active nodes happen to be in a partition that doesn't survive. Do they get migrated to another node?
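The partition question above can be illustrated under a simplifying assumption: only the partition containing the node that holds the quorum resource keeps running, and resource groups owned by nodes outside it are restarted inside it. This is a hypothetical sketch of that rule, not the paper's regroup algorithm, and it does not settle what MSCS actually does.

```python
def surviving_partition(partitions, quorum_owner):
    """Return the partition that keeps running after a split.

    Assumption: survival is decided solely by who holds the quorum resource.
    """
    for part in partitions:
        if quorum_owner in part:
            return set(part)
    return set()  # nobody holds the quorum resource: no partition survives


def groups_to_fail_over(group_owner, survivors):
    """Resource groups whose owning node is outside the surviving partition."""
    return sorted(g for g, node in group_owner.items() if node not in survivors)
```

With partitions `{n1, n2}` and `{n3}` and the quorum resource held by `n3`, groups owned by `n1` or `n2` would have to move to `n3`, which is exactly where the reviewer's contention and recovery-time concerns bite.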

Relevance:
MSCS seems to be another typical Microsoft solution - easy to use, not necessarily the best of engineering. It makes it easy for programmers to use clusters. But scalability aspects and effectiveness of migration remain to be understood better.


Summary:
This paper describes an early version of MS's Cluster Service, both the architecture and a couple of programs that were made to work with it.

Problem:
Previous clusters existed, but the paper claims they were hard to administer well.

Contributions:
* A cluster that presented itself to the outside world as a single box that was highly available and high-performing.
* I don't know if the idea was around before, but the virtual servers thing was pretty nifty. It seems like a slightly different form of the idea of moving around virtual machines between physical machines.
* Real-world demos of existing servers that were adapted to work better under MSCS than they did before
* DLLs to provide code to detect when a resource went missing as opposed to some more generic solution that might not always work -- more flexible, but maybe more complex.
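The per-resource health check in the last bullet can be sketched as a two-level poll. The Windows cluster Resource API exposes `LooksAlive` (cheap, frequent) and `IsAlive` (thorough) entry points for this purpose; the Python function below only imitates that split, and the escalation rule and polling ratio are assumptions for illustration.

```python
from typing import Callable


def resource_healthy(tick: int,
                     looks_alive: Callable[[], bool],
                     is_alive: Callable[[], bool],
                     thorough_every: int = 5) -> bool:
    """Decide whether a resource is healthy at polling step `tick`.

    The cheap check runs every tick; the thorough check runs every
    `thorough_every` ticks, or immediately to confirm a failed cheap check.
    """
    if not looks_alive():
        return is_alive()  # confirm before declaring the resource failed
    if tick % thorough_every == 0:
        return is_alive()
    return True
```

The trade-off the reviewer notes shows up directly: a resource-specific check can know the application well, but a subtle fault that only the thorough check catches goes unnoticed until the next thorough poll.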

Flaws:
I don't know how scalable their system would be under that architecture. They got it working for two machines. They have a few places where there are single machine bottlenecks, and the more sharing of resources the more this is an issue. I doubt there is any way that they'd, say, scale to a Google cluster without major architecture changes. Then again, probably nothing else at the time would have either, so I don't know how big of an issue this is.

They are a little lax in explaining some things. For instance, how long would migration take? Or in the member join protocol described on p. 6 (below the table), when the sponsor sends out a notification to the other nodes, does it wait for a reply before responding? It seems like it does, but it's not clear. Or how much synchronization is there in the ordering of events in the log?

Reliability:
Basically, MS was trying to increase availability by having the system notice that services had gone down and restarting them. They also had a secondary benefit though, which was that by presenting these uniform views like virtual servers they made it easier to develop failover solutions than it was before.

Summary

This paper mainly presents the architecture of MSCS and various features supported by it.

Problem Description

People find it quite difficult to configure clusters with the desired management and security properties. It is also difficult to configure applications to be launched automatically in the appropriate order. MSCS is presented to address these problems.

Summary of Contributions

1. MSCS is designed around the abstractions of nodes, resources, resource dependencies, and resource groups. Providing these abstractions makes it much easier to configure the cluster and to transfer applications from one node to another. In addition, these abstractions hide cluster-specific details from the programmer, which makes it easier to develop applications that can efficiently utilize the cluster.
2. MSCS also provides virtual NT servers. Running in a virtual environment gives an application the illusion that it is running on a virtual node with a virtual name space, virtual services, and a virtual registry. As a result, when an application migrates from one node to another, it appears to the application as if it has restarted on the same virtual node.
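The virtual registry in point 2 can be sketched as a key mapping: the application reads and writes what it believes are ordinary registry keys, but they are stored under its resource group's subtree in a replicated cluster database, so the same values are visible on whichever node hosts the group. The classes and the path layout are hypothetical, and a shared in-memory dictionary stands in for replication.

```python
class ClusterDatabase:
    """Stands in for the cluster database replicated to every node."""

    def __init__(self):
        self.entries = {}


class VirtualRegistry:
    """One node's view of an application's registry, redirected into its group."""

    def __init__(self, db: ClusterDatabase, group: str):
        self.db = db
        self.group = group

    def _full_path(self, key: str) -> str:
        # Hypothetical layout: Groups\<group>\<application key>
        return "\\".join(("Groups", self.group, key))

    def write(self, key: str, value) -> None:
        self.db.entries[self._full_path(key)] = value

    def read(self, key: str):
        return self.db.entries[self._full_path(key)]
```

A value written through the view on one node is read back unchanged through a view on another node, which is why the migrated application "appears to have restarted on the same virtual node."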

Flaws

1. The authors do not evaluate their design at all. There is no discussion of the overheads added by the various abstraction layers.
2. The paper seemed more like a set of definitions for various terms used in cluster computing than an explanation of why they are required and what their potential benefits are.

Reliability

The paper mainly suggests redundancy to achieve reliability. If a node of the cluster fails, then the tasks running on that node are transferred to another healthy node. The only problem with this approach is that if a node controlling important resources goes down then the applications relying on those resources might not complete.

Summary:
This paper presents the Microsoft Cluster Service and discusses the primary components that make up the first release of the system. A few brief case studies of implementing commercial software on the cluster are explored to demonstrate the versatility and importance of the service.

Problems Addressed:
The primary goal of the cluster service is to allow commodity off-the-shelf hardware to be used in a cooperative and transparent manner to provide a robust and scalable server. The management of a cluster is, however, a big concern of the authors, along with the ability to match cluster resources to applications. Resource dependencies, failing nodes, and resource migration are all important aspects of a cluster system and must be handled by the service.

Contributions:
A cluster consists of a number of actively participating nodes that present a single image to their clients. Clients should not be aware that they are interacting with a cluster instead of a single machine, and when a server fails they should experience only a very brief service interruption while the resource is migrated to a new node within the cluster. Clients are not really concerned with which machine they connect to; they are only interested in the service they need. The cluster model addresses this very well by hiding server identity and exposing only services to the client. Nodes within the cluster are then free to move resources among the participating nodes to improve performance or tolerate failures.

The cluster architecture is broadly broken down into nodes and resources. A single participating server in a cluster is called a node, and a resource is something that is exported to a client, be it disks, network addresses, etc. The participating nodes maintain cluster state and replicate this state to all nodes. There are procedures to add new nodes to the cluster, take nodes offline, and pull resources from a failed node. Updates from any node to all other nodes use a locking mechanism to ensure that all active nodes receive the update even if the sending node dies before alerting all nodes to the update.

Flaw:
It was not clear to me how resources that have state attached to them are migrated between nodes. Many services have, for example, an underlying database or file system, and upon a failure that extra state must also be migrated. Replication was mentioned briefly, but the way those policies interact with the rest of the migration process was not explained.

Reliability:
Reliability is achieved by effectively providing redundancy within the system so that during a server failure the service can be provided by a different server. This approach is also efficient since during normal operation all nodes within the cluster can still be doing useful work and don't need to be sitting idle just waiting to take over in the case of a failure.

The Design and Architecture of the Microsoft Cluster Service [Vogels et al.]

Summary:

This paper presents one commercial cluster implementation for high availability: the Microsoft Cluster Service (MSCS). MSCS is a
formalization of how Microsoft NT services can be organized and run on a
small cluster of machines to achieve high availability.

Problem:

The problem was that large software developers were developing their own
HA technologies at the time (which system administrators found difficult
to learn), and various operating system vendors were competing to
provide highly available servers. Specifically, Microsoft was competing
against HP, IBM, and DEC to be the server hosting mission-critical apps
such as Oracle and SAP, but the Windows design had some characteristics
that didn't lend themselves to virtualization.

Contributions:

* The primary contribution of this work is to introduce a formal
framework for how to make a Windows service highly-available. It
borrows many techniques from prior work, and reimplements some of what
was done by other Windows software vendors, so its value is really in just
the common framework from the operating system vendor.

* MSCS introduced three key features into Windows NT, giving it
virtualization capabilities similar to what can be done on other systems
by migrating configuration data with the application. Applications
detecting the name of the [virtual] server on which they run is a perennial
problem, so explicit support for this is useful.

Flaws:

* The authors claim that one of their goals was "transparency" in that
the clients need not be modified to interact with the cluster. However,
this is really client compatibility rather than transparency because the
fail-overs are visible to the clients for some period of time, and more
so if they are session-oriented (stateful) services.

* The cluster service doesn't mention the notion of testing the services
from the client's perspective. Instead, it opts to test health from one
server to another using what appears to be an out-of-band "cluster
network". There are a few failure modes that would be better detected
using a probe from a pseudo-client.

* Even with the features described, common tasks such as managing the
operating system user database in parallel amongst nodes in a cluster
remains a problem.

Reliability Impact:

This high-availability cluster service seems concerned with all sorts
of hardware, operating system, and even server application failures.
Its goal is to reduce the downtime due to configuration confusion, such
as delayed cut-over when performed "manually" by system administrators,
and to reduce the mean-time-to-repair for all other problems by
migrating services to another node in a cluster that presents itself to
the clients as a virtual server.
