
Autopilot: Automatic Data Center Management

Autopilot: Automatic Data Center Management. Michael Isard, Operating Systems Review 41(2): 60-67 (2007).

Reviews due Thursday, 4/17.

Comments

Summary:
Autopilot is an infrastructure developed by Microsoft for automating the management of a data center, including software provisioning, deployment, monitoring, and repair. This paper presents the motivation and basic design of Autopilot, and also provides insight into how such a system should be used.

Description of problem:
Modern data centers tend to provide services over a large number of commodity computers. An automated management system is needed to provision, monitor, and deploy or roll back software on these machines, so that manual operations, and the errors those operations cause, can be reduced. In addition, such data centers require the automation to be simple enough to scale and fault tolerant enough to run on unreliable commodity computers. Overall, the problem this paper tries to solve is how to design a simple (scalable) and fault-tolerant in-house infrastructure for automatic data center management, which can reduce manual manipulation and management cost as well as improve reliability.

Summary of contribution:
As mentioned in the paper, most of the technology used in Autopilot components is similar to designs that have appeared in previously reported work, and the paper is in part a report on the work of others. However, the original conception, the vast bulk of the design, and all of the implementation of Autopilot are the main contributions. First, the design of Autopilot focuses on simplicity by adopting a simple non-Byzantine failure model, which keeps the system highly scalable. Many large-scale systems share this principle, and Autopilot confirms that it is sound. Second, a fully functional system is presented, including the device manager (central system-wide coordinator), provisioning service (network/OS boot), deployment service (replica management), watchdog service (system and component monitoring), and repair service (performing system repairs). Third, valuable lessons from Autopilot are discussed, including the weakness of TCP/IP checksums and the need to tolerate slow nodes as well as fail-stop errors.

Flaws:
One limitation is that the system seems to work only within the Microsoft ecosystem, which prevents reducing costs by using an open-source OS such as UNIX. Another limitation of Autopilot is that automatic recovery is not guaranteed to be quick, so applications with low-latency requirements must layer their own failure handling on top of Autopilot. The authors give an example of this with web indexing. It isn't a fatal flaw, but it does require extra developer effort for many common workloads.

Application to real system:
As the size of data centers continues to increase, there will be increased demand for, and reliance on, automatic tools that reduce operational cost while coping with the growing work of maintaining an expanding data center. Such a system can also be used widely: Microsoft is extending Autopilot to handle every Windows Live service, including Bing, MSN, and online advertising.

This paper describes the design of a system developed by Microsoft
for automating various management and administration tasks in the large,
relatively homogeneously-equipped datacenters in which their large-scale
applications such as Windows Live Search run. The economics of the
machines-per-human-administrator ratios in typical smaller-scale settings
become problematically expensive when dealing with a datacenter populated by
tens or hundreds of thousands of servers, necessitating vastly expanded
automation of deployment, service, monitoring, and other tasks.

Because the infrastructure providing this automation must itself be highly
reliable, Autopilot is itself a distributed system running within the
datacenter it manages. Its design consists of a number of services
communicating with the central Device Manager service, which is a
Paxos-managed distributed state machine which serves as the authoritative
source of system state. Other services perform tasks such as managing which
OS images and sets of files are deployed to each specific machine,
monitoring for unresponsive servers or other common problems, and performing
"repair" operations (such as a reboot) on failed machines. These services
communicate with the Device Manager via "pull" operations, though the Device
Manager can expedite an update by approximating a "push" operation with an
explicit request that the service perform a pull.
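
To make the kick mechanism concrete, here is a minimal sketch (my own, in Python; the Device Manager API names are assumptions, not the paper's) of a satellite service that pulls on a timer and treats a kick as nothing more than a request to pull early, so a lost kick costs nothing:

    import threading

    class SatelliteService:
        """Periodically pulls authoritative state from the Device Manager.
        A 'kick' from the DM only wakes the loop early; if the kick is lost,
        the next timer-driven pull still happens, so no update is ever lost."""

        def __init__(self, device_manager, pull_interval_sec=30):
            self.dm = device_manager
            self.interval = pull_interval_sec
            self.wakeup = threading.Event()

        def kick(self):
            # Called (e.g., via RPC) by the Device Manager to expedite a pull.
            self.wakeup.set()

        def run_forever(self):
            while True:
                desired = self.dm.get_desired_state()    # hypothetical DM call
                self.apply(desired)                       # reconcile local state
                self.wakeup.wait(timeout=self.interval)   # sleep, or wake on kick
                self.wakeup.clear()

        def apply(self, desired):
            pass  # service-specific reconciliation would go here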

One somewhat striking aspect of Autopilot's design is that it has no notion
of application-level error handling or clean shutdown -- it simply kills
processes when something acts up. While at first this seems like a fairly
drastic, heavy-handed measure, on further consideration it makes a great
deal of sense to me. By just killing things outright, the likelihood of
further trouble being caused by bugs in error-handling paths in application
code (which by their very nature are highly likely to be very lightly
tested, if at all, and thus probably buggy) is completely eliminated. As a
result, I would guess overall reliability is probably increased
substantially as compared to an approach that attempted to perform more
fine-grained error-handling and recovery.

One thing I was curious about that I wished the paper had gone into more
detail on was the ramifications on application design imposed by running in
an Autopilot-managed datacenter. The above-mentioned "there's no such thing
as a clean shutdown" design was certainly one major part of this, but the
paper made it sound like there were other significant aspects that I never
really got much of a sense of.

Large-scale distributed applications can easily be unreliable. When problems occur, diagnosing and fixing them is hard because of the large number of nodes. Autopilot is a distributed system used to automatically diagnose and fix problems in distributed applications.

Autopilot has three functions: deployment, diagnosis, and problem fixing. When an application is deployed by Autopilot, it is divided into several manifests and assigned to certain machine types. Autopilot then monitors the servers' conditions and application logs to diagnose problems. If there are problems, Autopilot takes remedies such as Reboot, ReImage, or Replace according to the problem type.

Autopilot is composed of several modules. The deployment service makes deployment decisions. The provisioning service probes the network and physical server conditions. The watchdog service monitors server and application conditions to find problems. The repair service fixes problems. All these modules are coordinated by the Device Manager.

Since Autopilot itself is a distributed system, it also has its own design principles: fault tolerance and simplicity. For fault tolerance, Autopilot uses checksums when exchanging messages, keeps the shared state small to balance between strong and weak consistency, and runs distributed replicas of each module. For simplicity, Autopilot parameters are human-readable text, etc.

The contribution of Autopilot is the design and implementation of management software that controls another distributed application automatically. But it is not perfect.
(1) Autopilot is application specific; it is only a framework. That is, for each application, the operator needs to design each module, such as the watchdogs (what to monitor? what denotes a problem?), the deployment service (what goes in each manifest?), and the message format. In the paper, they only use it for Windows Live.
(2) Problem diagnosis is a troublesome aspect, but it is not discussed much in the paper. What are the possible problems in the distributed application? To diagnose a problem, what features should be monitored? This affects the application logs and the watchdog design, and it will differ for each specific type of application.
(3) Problem fixing is too coarse-grained (maybe because this is the first version) and could be optimized. For example, if a machine is to be replaced, we could first diagnose whether the problem is caused by a bug or by overload; if the node is overloaded, we could identify the bottleneck and migrate the work to a more suitable place accordingly.

The Autopilot paper presents a system that attempts to reduce the amount of manual work needed to keep a large connected system running. The goal is to automate everything feasible, from detecting and correcting errors to automatically updating software.

Keeping a large connected system running requires a large amount of effort. Many times a machine will need to be manually reset or updated. Failures can often cause damage that is not easily corrected. These issues have become more critical since an increase in the number of components increases the probability of a failure as well. The goal with Autopilot was to create a management system that could either take care of many problems that previously needed manual interaction or eliminate those problem cases. Some problems are reduced by requirements imposed on applications that operate under Autopilot: these applications are expected to tolerate the termination of any process at any time, without warning. An application would be expected to keep multiple instances running so that several can be quickly killed by Autopilot. Autopilot itself also attempts to follow this rule, which allows for the quick shutdown of tasks in the case of errors.

Fault tolerance and simplicity are listed as the most important criteria for Autopilot. Fault tolerance is obtained through replication of all important data and the use of the Paxos algorithm. They also considered Byzantine fault tolerance but decided that the cases where it would matter were very rare. Eliminating unnecessary components was part of their plan for simplicity. The general goal was to make the system as simple as possible, both in terms of the components themselves and in terms of what manual interaction would be required. Autopilot provides several different services. It is responsible for keeping track of the files and applications running on individual machines. It uses requests for certain machine types to determine what to load onto a machine; a machine type does not typically change the type of hardware used, just the software and files. It can also update files or programs loaded on the machines. A staged roll-out is used to load newer versions of software on a few machines before applying it to the rest of them. Autopilot adds multiple monitors called watchdogs, which each check machines for certain specific problems. If a watchdog detects the problem it is designed for, it attempts certain solutions to work around the problem. When a failure is detected, a machine will usually fall into the failure state. If it can be recovered, it will move first to a probation state before being brought back into active use. If problems persist, more severe actions will be taken, such as re-imaging the machine. In the worst case, Autopilot will notify technicians of the problem.

Autopilot's requirement for applications to follow certain rules seems as though it could limit, or at least make difficult, certain tasks that require immediate action without delay, at least if Autopilot makes somewhat frequent use of its reserved ability to kill processes immediately. Near the end of the paper, the discussion transitions to monitoring services. It was not entirely clear how this data, which the paper says is for operators of the system, fits into the goal of automating tasks. Clearly a certain amount of manual interaction is still needed, but the paper did not always indicate the present limits of automation.

The idea of automatically detecting errors and updating machines can be used in many distributed systems. The limits that Autopilot imposes on applications may reduce the number of tasks that could be used with the initial design but many of the synchronization tasks could be separated from the rest of Autopilot for use in other tasks. This is certainly an area where there are many uses for tools with this type of functionality.

AutoPilot:

This paper introduces an automatic cluster management system called ‘Autopilot’. Autopilot manages software deployment, provisioning, repair, and monitoring with the aim of increasing responsiveness and reducing the need for human intervention. They claim to have been able to reduce the required operations presence from 24x7 to 8x5.

It’s obviously a successful idea, as it’s been running at Microsoft for years, but it may not be perceived well internally: even their case example of web indexing showed instances of how the applications being managed contained code to survive a failure of Autopilot. In addition, it appears to have been specifically tailored to Microsoft’s workload. For instance, if you have a specific computer cabled directly to a piece of hardware with which it is interacting, you lose the ability to replace the computer without physical intervention. Further, if there is a specific piece of hardware attached to a computer, you could imagine firmware associated with that device. An upgrade has the potential to change that firmware and thus brick the device. That problem could spread to up to a scale unit’s worth of computers (~500). The rollback mechanism then fails, and since the computers are connected to a specific device (say a sensor), replacement is not possible and the process halts.

A further topic that does not seem to have been covered is the failure of network services. Take for instance the idea that the DHCP server for a part of the cluster has failed. If that happens while Autopilot is still running, could it not get stuck in a loop rebooting machines that do not recover from a restart within some time period?

One topic brought up continually in cluster management papers is the scarcity of bandwidth. Reimaging a machine has to be a bandwidth-intensive proposition. While they do mention that the number of machines in repair is throttled, it is not mentioned whether the machines under repair are constrained to different regions of the cluster.

One thing never mentioned in this paper is security. How are Autopilot messages secured? If the Device Manager’s idea of what the ground-truth configuration of the cluster should be can be corrupted, every machine in the cluster becomes vulnerable.

Having stored time series data in an SQL database before, it would be interesting to find out how well the ‘cockpit’ implementation works. In general, past experience seems to indicate that storing large amounts of time series information in SQL does not work too well. Perhaps this is a point where ‘NoSQL’ databases, as mentioned in class, might be useful?

One thing I’ve absolutely got to know is: how weak is the TCP checksum? Is it ever to be considered reliable? Are there size constraints? What is considered a ‘strong enough’ checksum? What are good choices to make here? I did find ‘When the CRC and TCP Checksum Disagree’ from SIGCOMM 2000, but is there a better paper?
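
For reference, the TCP checksum is only a 16-bit ones'-complement sum over the segment, which is presumably why the paper's lesson calls for stronger application-level checksums on every message. A minimal sketch of that idea, assuming Python and CRC-32 from zlib (a cryptographic digest from hashlib would be stronger still):

    import json, zlib

    def frame(payload: dict) -> bytes:
        """Serialize a message and prepend a CRC-32 over the body, so corruption
        that slips past the 16-bit TCP checksum is still caught by the receiver."""
        body = json.dumps(payload).encode("utf-8")
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return crc.to_bytes(4, "big") + body

    def unframe(data: bytes) -> dict:
        crc, body = int.from_bytes(data[:4], "big"), data[4:]
        if zlib.crc32(body) & 0xFFFFFFFF != crc:
            raise ValueError("checksum mismatch: drop and re-request the message")
        return json.loads(body)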

This paper gives an abstract view of Autopilot, automatic data center management software that takes almost all the corrective actions necessary to keep a cluster of computers running in a data center. This makes it much easier for administrators to manage crashes in the cluster, and it means fewer administrative resources are needed for the company to manage its cluster.

I feel that the paper gives a very simple and easily understandable description of the high-level idea of Autopilot's architecture. It clearly says why each component is present and what kind of functionality is provided by each of them. The low-level services section talks about some features such as how deployment changes are propagated. The system is configured with information about which server to contact for installing or booting its OS; this is the provisioning service, which provides the OS for the system and is replicated for fault tolerance. One thing that is clearly stated in the paper is the intention of ignoring Byzantine faults and letting the application developer take care of them. I feel this makes sense, as Byzantine faults are mostly related to the logic the application performs.

The paper gives an overview of the failure detection and recovery semantics. It uses watchdog processes (similar to Linux) that perform some action to check the integrity of the system. Watchdog processes can also be configured by the application designers. A watchdog can do checks like verifying the OS version or checking the deployment for the presence and installation of all the necessary files. If a watchdog process detects a failure, it is eventually propagated to the Device Manager, which is the integral piece that stores the state of the cluster and can trigger actions from its end. The Device Manager works with satellite processes to store the state of the whole system. To keep things really simple, the designers of Autopilot only take corrective action by rebooting, reimaging, or marking nodes to be replaced. Wouldn't migrating the state of an already existing process to another machine be beneficial in certain cases, say for a map-reduce process that executes on a shared distributed file system? I agree that there are issues with respect to the homogeneity of computers in a cluster, but couldn't the process be moved to a machine with the same architecture to prevent loss of computation, assuming its dependencies on disk are restricted to a shared file system? At least restarting a process is a viable option that could have been triggered automatically by the watchdog itself.
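
As a rough illustration of the watchdog idea (my own sketch in Python, not the paper's interface), a watchdog is just a probe that maps a machine to OK, Warning, or Error, and the Device Manager treats any Error as grounds for marking the machine failed:

    import os

    OK, WARNING, ERROR = "ok", "warning", "error"

    def files_present_watchdog(required_paths):
        """Returns a watchdog that reports Error if any required file is missing."""
        def check(machine_root):
            missing = [p for p in required_paths
                       if not os.path.exists(os.path.join(machine_root, p))]
            return ERROR if missing else OK
        return check

    def os_version_watchdog(expected_version, read_version):
        """Reports Warning on an unexpected but still bootable OS version.
        read_version is an assumed helper that inspects the machine."""
        def check(machine_root):
            return OK if read_version(machine_root) == expected_version else WARNING
        return check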

The paper also talks about how it measures performance and stores it in logs for analysis. Does a computer get restarted if it enters a non-progressing state, say it tries to network boot but is unable to contact the DHCP server and has no disk in it? This case is handled by marking the machine to be replaced by the Device Manager, which is a simple and efficient solution. Some things that, if present in the paper, would have made it more interesting are the failures that can happen when some component of Autopilot itself fails, and how recovery would work in those cases. Since the Autopilot processes execute asynchronously and seem relatively lightweight in most cases, I assume they would not have a major impact on performance even though they are new processes running on every system. One of the main striking points for me from this paper is the stress on simplicity in design. On the whole, I think this is a good outline of ideas for implementing an automatic cluster management service.

David Capel, Seth Pollen, Victor Bittorf, Igor Canadi

Autopilot is automatic datacenter management software produced by Microsoft to scale its data centers to today’s massive number of machines while keeping steady or even reducing the amount of work required by operators. It manages provisioning, deployment, monitoring, and “repair” of systems. It has allowed them to effectively reduce the requirements of being an operator while at the same time increasing reliability and decreasing cost per computational unit, by removing much repetitive work and only requiring administrators for jobs that cannot be automated (e.g., physically removing a faulty computer from the data center and replacing it). Reliability is increased by lowering human error and by making tasks consistently take the same known actions. The paper reports that the largest cluster it has been run on is in the tens of thousands of computers, and it runs various critical Microsoft infrastructure.
Autopilot follows the normal distributed system rules: it is fault tolerant and consistent, replicating state and using distributed consensus (specifically, Paxos). Portions of the system are also decoupled to reduce the chance of a failure spreading to the entire system. Performance, by the nature of the problem, is not as much of a concern, though they consider it explicitly in some cases. Simplicity, however, is important, and wins over performance when necessary. Various trade-offs between performance, “standard” practices, and simplicity are made throughout the system.
The system is, at a high level, broken into a set of components: the Device manager, the deployment, provisioning, and repair services, and the watchdog services. These interface with the application code and “Cockpit”, a central management and monitoring console. These all interact with each other while each one takes care of its responsibilities. The central state machine is managed by Paxos with the Device Manager, and all “ground truth” emanates from this. Other “truths” are cached and may be somewhat out of date until the other services notice that they must be updated. This is an interesting contribution in that there is a strongly consistent truth and weakly consistent views of that truth which may be stale for a period of time. This fits well into the model of independent services that are centrally managed.
Another interesting facet of the system is the failure and recovery model: all significant errors are fatal, and the only exit condition is forceful termination. They make good on this promise by using reboot as the main repair action, followed by re-imaging of the machine. This model removes the dangerously untested error-recovery paths that most software has and allows the liberal use of easy-to-reason-about asserts. No warning is given before the system is rebooted, which forces the writers of the system to handle failure gracefully. This central assumption allows Autopilot to manage a set of heterogeneous systems without knowing the inner workings of each one. All that needs to be provided by a system is a manifest file that allows the system to be started. This black-box approach breaks down slightly, however, as they treat data-intensive systems specially and do give them messages notifying them that the machine will be imaged soon. This programming model is not novel, but they apply it in an interesting way to increase the reliability and simplicity of Autopilot.
The probation state machine for a failing machine allows newly repaired machines to settle for a while before they can be marked as faulty again, and limits the time a faulty machine is allowed to survive. Autopilot does not attempt Byzantine tolerance: either only a small number of machines are misbehaving in non-malicious ways, or the entire set of software is corrupted (due to a bug or a faulty deployment) and the incorrect behavior will win regardless. The watchdog system, where anyone can write very simple monitoring services for their system, is another non-novel idea applied at a significant scale.
The paper describes the system at a very high level, and makes no attempt to provide details or evaluations of the system. We especially wondered what their mean time to failure and mean time to recovery were, as well as boot times of typical machines and even general performance characteristics of the system. This paper is clearly not a “how-to” guide, but without any evaluation at all, it is difficult to take their contributions seriously.
This system is applicable in the real world inasmuch as data centers are growing massively, and without automation, human intervention is not cost-effective or even feasible. Furthermore, automated rollout and fault handling allow consistent service that suffers fewer disastrous failures, which are frequently due to human error. It would be interesting to see how other Web-scale companies, such as Amazon, Facebook, and Google, manage their similarly large data centers.
A topic we discussed is how well this system would work on small clusters of ten or a hundred computers. We noted that overhead may be significant, but it would probably function as intended, with one caveat: in a 10k-node cluster, a single machine’s failure has no impact on performance and is simply statistical noise, but on a ten-node cluster, a tenth of the capacity is lost. This may change the “cost” of failure or downtime that their system assumes.

This paper talks about Autopilot, the intelligent data center infrastructure developed for Microsoft data centers. It provides the basic mechanisms for resource provisioning, deployment, monitoring, and fault handling, on top of which distributed applications can implement high-level policies such as load balancing, scheduling, and fault tolerance.
The problem the paper is trying to solve is to smartly integrate several individual approaches for data center infrastructure design and produce a unified solution that takes care of mechanisms for end to end support for running applications. This unified solution was required to reduce human intervention, reduce costs, facilitate the paradigm shift to using many easily failing commodity computers in data centers and make the infrastructure scale with the demand of web services.
The biggest contribution of this paper is perhaps the simplicity of the design, which is an implication of the clear demarcation between mechanism and policy. The data center was basically treated as an operating system: there were a few services that needed to be provided by the core infrastructure (forming the mechanisms) and some that had to be part of the application and use the underlying mechanisms (forming the policies).
More on the ideas expressed in the paper:
- The paper focuses on decoupling the system into key services that work independently but are coupled through the Device Manager, just like the microkernel design of an OS, where work gets done by processes interacting through the microkernel.
- It was a wise decision to have the Device Manager as the only piece of code that maintains shared state and to have strict consistency guarantees over its replicas. The replicated state machine model was quite apt for the Device Manager.
- The pull model for the satellite services and the kick messages were very nice and practical for the scenario.
- The provisioning and deployment services, along with the other satellite services, coordinate with the Device Manager to get deployments out.
- One nice thing mentioned in the paper is the idea of a “scale unit” to make staged rollouts of new code very clean and organized, and the idea of giving buggy versions and misconfigured versions the same treatment (see the sketch after this list).
- The idea of having the state of a machine as a predicate expressed in terms of several watchdog services and the idea of having the state transitions for deployments seem to be interesting ones.
- The collection service is a very thoughtful addition to the system, especially given the kind of services that are to be run in data centers (search and other web services that make heavy use of trends and graphs).
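
The scale-unit rollout mentioned above might look roughly like the following sketch (my own, in Python; the soak time and the health check are assumptions): upgrade one scale unit at a time, and roll everything back if a freshly upgraded unit stops looking healthy.

    import time

    def staged_rollout(scale_units, deploy, healthy, soak_time_sec=600):
        """Roll a new manifest out one scale unit (~500 machines) at a time.
        If a just-upgraded unit fails its health check after soaking,
        roll back every unit touched so far and stop the rollout."""
        upgraded = []
        for unit in scale_units:
            deploy(unit, version="new")
            upgraded.append(unit)
            time.sleep(soak_time_sec)        # let the watchdogs observe the unit
            if not healthy(unit):
                for u in upgraded:            # buggy and misconfigured versions
                    deploy(u, version="old")  # get the same treatment: roll back
                return False
        return True
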
The paper does not have any flaws as such. There was clear thinking involved in drawing a line between mechanism and policy (which is actually a rather difficult line to draw), thus making the design simple and very clean. One thing that bugged me, though, is the fact that normal exit need not be expressed in code and that processes can only be terminated by killing them. The system might have had normally exiting processes, and the monitoring processes could have recognized clean terminations.
The paper is very applicable to today’s systems in that it provides a simple, clear design to infrastructure services that are critical to effectively maintaining data centers. Though there is nothing novel in the paper, the design in just combining the right components for supporting the infrastructure is very clean and nice.

The paper describes Autopilot, which automates routine administrative tasks such as deploying, provisioning, monitoring, and repairing data center applications.

Data centers, the number of services, and the resources required by those services are all growing rapidly. In addition, data centers are built using commodity hardware that is prone to failure. Microsoft realized that there are many routine tasks such as installing software components, restarting machines after failures, monitoring applications, and so forth. The main challenge is to build a simple, reliable, and fault-tolerant framework that can automate such tasks.

At the core of the system is a Device Manager (DM), which holds the “ground truth” state that the system should be in. It is distributed over some number of machines and provides strong consistency using Paxos. The DM does not perform actions to change the state of the system; it just knows the desired state. The satellite services (deployment service, watchdog service, repair service, provisioning service) communicate with the DM and help the system continuously converge toward the desired state. For example, the watchdog service may find a fault in the system and report it to the DM, which updates its desired state (towards fixing that issue); a repair service then pulls this new state and repairs the fault.
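
The division of labor might be sketched roughly as follows (my own illustration in Python; the DM method names are hypothetical): the watchdog only reports, the DM only records, and the repair service later pulls the failure list and acts on it.

    def watchdog_report(dm, machine, verdict):
        # The watchdog never repairs anything itself; it only reports.
        if verdict == "error":
            dm.mark_failure(machine)            # hypothetical DM call

    def repair_service_step(dm, actions):
        # Periodically pull the set of failed machines and apply whatever
        # repair action the DM has assigned to each one.
        for machine, action in dm.list_failed_with_actions():
            actions[action](machine)            # e.g. actions["reboot"](machine)
            dm.report_repair_started(machine)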

Like the other Google, Amazon, and Microsoft papers, this one cares more about the working system and the lessons learned than about the novelty of the ideas. On the other hand, the case study is limited and it’s hard to understand how to use the framework. Since it does not support legacy code, one should be able to easily plug an application into Autopilot; but how easy is that? From the lessons learned section, I had the impression that it is not that easy. I believe the system was designed specifically for the search backend, and then they tried to make it more generic. Since there is not much discussion on this, it is hard to anticipate its evolution.

I believe the system is useful and beneficial, although the paper does not contain much quantitative evaluation to reflect that. I wonder how other companies such as Google or Amazon handle automatic data-center management. I believe there are other advantages of such an automated system beyond the ones mentioned in the paper. Although the paper does not discuss it in detail, one advantage of an Autopilot-like system is that it gathers a large amount of log data. Since it provides a single framework, the quality of the logs would be uniform and better. Pattern recognition algorithms could be useful for extracting valuable information from this log data, and the learned patterns could further be useful for predicting the future behaviour of the system and the likely results of specific actions.

The problem Autopilot solves is basically managing services and the
necessary software stack on each machine in a data center in the presence of
machine failures. Instead of having administrators sit and configure each
new machine that is plugged in to scale the system, or repair an existing
one, it runs a set of services that do this. The basic task of Autopilot is
to deploy services starting from scratch, and to detect failures and repair
them.

The paper adopts a very simple approach rather than a complicated and
efficient one. The work of management is divided among different components
that communicate with a single component called the Device Manager, which
manages the global state as a distributed state machine. The Device Manager
is replicated and runs a Paxos-like consensus protocol to maintain strong
consistency. Other components make sure that the different parts of the
application stack are in place. This includes making sure that the right OS
is running, a filesync service that makes sure the right files are present
to start any particular service, and another service that makes sure the
necessary programs that make up a single service are running. Developers are
expected to write applications that are aware of machines failing and that
continue service on other running instances. Whenever the Device Manager
needs to change something, it does so by 'kicking' the remote machine to
pull and synchronize, instead of pushing the updates. The Device Manager
constantly collects the state of the computers and makes decisions. Failures
are detected at the machine level, not at the level of individual processes.
Autopilot relies to a great extent on communicating with the machine at the
BIOS level to detect hardware failures and set up the operating system.
Failures are repaired by restarting the machine, re-imaging the OS, or
marking the machine for hardware replacement. The repair service tries a set
of actions and watches for errors. Once the machine functions without errors
for a period of time, it is marked as functional; otherwise, a different set
of repair actions is performed. It is not clear from the paper, though, how
the set of actions is arrived at, or whether it is application/service
specific.
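
The escalation just described could be sketched roughly as follows (my own Python sketch; the paper does not give the exact policy, so the time window and the ordering of actions are assumptions):

    ESCALATION = ["reboot", "reimage", "replace"]   # mildest to most drastic

    def choose_repair(history, window_sec, now):
        """Pick the next repair action for a machine based on what has already
        been tried recently. history is a list of (timestamp, action) pairs."""
        recent = [a for (t, a) in history if now - t < window_sec]
        for action in ESCALATION:
            if action not in recent:
                return action
        return "replace"   # everything has been tried; hand off to a technician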

I think the paper does a good job of explaining the list of activities
involved in managing a cluster or set of machines that are running a
service. I would expect every major company to have something like Autopilot
to maintain its services and machines. We can think of using Autopilot to
provision services, monitor them for failures, and either fix faulty
machines or launch new ones to compensate for them. It would be great if a
developer could specify how many instances of each machine type are required
for a service and let an Autopilot-like system manage the physical machines
running it.

The paper presents a lot of information without going into details or
giving clear examples. For instance, they say that there can be many
watchdogs probing a machine, but failure is handled at the machine level.
Does this mean that if a machine is hosting two services and one of them
fails, Autopilot will try to restart the whole computer? Instead, I think it
should restart just the necessary application. A fine-grained
recovery/repair system would be very helpful.

Also, regarding bringing down services, the paper suggests that services
should not expect to shut down cleanly. I do not find this convincing.
Instead of sending SIGKILL to the process, they could encourage developers
to write signal handlers that shut the service down cleanly, and let the
management software first try bringing the service down with a signal that
invokes that handler, and only then SIGKILL it.
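
A sketch of what I have in mind (my own illustration in Python with POSIX signals; this is deliberately not what the paper does, since Autopilot rejects clean shutdown):

    import os, signal, time

    def stop_service(pid, grace_period_sec=10):
        """Ask politely first, then force: send SIGTERM so a handler can flush
        state, and only SIGKILL if the process is still alive afterwards."""
        os.kill(pid, signal.SIGTERM)
        deadline = time.time() + grace_period_sec
        while time.time() < deadline:
            try:
                os.kill(pid, 0)           # signal 0 only checks for existence
            except ProcessLookupError:
                return                     # the process exited cleanly
            time.sleep(0.5)
        os.kill(pid, signal.SIGKILL)       # grace period expired; force it

The paper's counterargument, of course, is that the graceful path is exactly the kind of rarely exercised error-handling code it wants to eliminate.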

The paper is about AutoPilot, an automatic data center management framework that takes care of automatic software provisioning, deployment, monitoring, and repair. There are a variety of reasons which necessitate an automatic management system: 1) the increase in large-scale web services that run in data centers with a large number of systems; 2) the large cost of maintaining these data centers, because support staff must be available around the clock; 3) the increased need for reliability and availability in web services; 4) to reduce hardware cost, data centers use a large number of low-cost commodity systems, which have a greater probability of failure, so at any point in time something is failing in a large data center. AutoPilot is Microsoft's infrastructure to address these issues.

The system assumes that applications are designed to expect their processes to be killed without prior warning. It also assumes that Byzantine faults do not arise, because applications are run in a controlled environment, and even if they do arise due to bugs, they can be detected using other mechanisms. The system also requires simplicity as a design principle for the large-scale applications that run on top of it, for the sake of maintainability. For the same reason, the AutoPilot system itself keeps its components independent and distributed.

The AutoPilot system consists of a Device Manager that maintains the state of the system in a strongly consistent manner. Each machine is given a manually configured machine type, and this list is maintained by the Device Manager along with the state each machine should be in. Satellite services pull information from the Device Manager and update individual machines periodically; this reduces the load on the Device Manager and also serves as a heartbeat to detect failures. Each machine runs a filesync service for transferring files and logging the actions. It also has an application manager that checks whether the correct processes are running on it, based on configuration files. The provisioning service probes for new computers, contacts the Device Manager, and installs the OS; the machine then independently contacts the Device Manager and fetches the binaries. Each machine type has a set of manifest files, with one of them marked as "active". These are stored in the deployment service and fetched by the machines independently, after contacting the Device Manager to learn whether there are any updates. Watchdog services are run by AutoPilot to periodically test certain attributes of the systems. They return OK, Warning, or Error. If any of the watchdogs reports an error, the machine is marked as failed and an appropriate recovery action is assigned based on the history and type of error. The repair service then periodically checks the list of failed systems, performs the repair action, and places the machine in the probation state; if it survives there long enough, it is brought back to the healthy state. AutoPilot also has the collection service and Cockpit, which help in monitoring performance counters and logs, and an alert service that informs the support staff when needed.
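
A minimal sketch of the two per-machine services just described (my own, in Python; the manifest format and the fetch/start helpers are assumptions): filesync reconciles local files against the active manifest, and the application manager starts whatever the manifest says should be running.

    import hashlib, os

    def file_digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def filesync(manifest, fetch):
        """manifest: {relative_path: expected_digest}. fetch(path) retrieves the
        file from a deployment server (assumed helper, not the paper's API)."""
        for path, digest in manifest.items():
            if not os.path.exists(path) or file_digest(path) != digest:
                fetch(path)

    def application_manager(required_processes, running, start):
        # Start anything the active manifest says should run but is not running.
        for proc in required_processes:
            if proc not in running():
                start(proc)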

The ideas in the paper are not new and are most probably used in other such automatic management systems too. However, some of the design principles, such as keeping the components distributed and independent, make the system more maintainable. Other highlights are the concept of a "scale unit" and automatic rollback during rollouts, which reduce the manual monitoring required when rolling out updates to large systems that may take a long time to complete. Another is the probation state that distinguishes recently repaired machines from stable ones, which prevents machines that are already undergoing repair from being listed as failed again before the repair stabilizes. Overall, AutoPilot seems to be an essential component of large and growing data centers which, apart from reducing the manual workload, also serves to increase reliability and availability.

This paper introduces an automatic data center management infrastructure, Autopilot. This system automates software provisioning and deployment, monitors the system, and takes repair actions when faults happen.

Autopilot tries to alleviate the cost of managing data center infrastructure. A data center usually contains a large number of commodity computers, which require a lot of operational and capital expense. Reducing the repetitive work handled by operations staff reduces the cost of operations. At the same time, it also increases reliability, since it reduces mistakes caused by human error. Autopilot is a solution to this problem: it manages the data center automatically.

The main contribution of this paper is the design of Autopilot. The system is based on the assumption that all the applications run on Autopilot are manageable. The two key design principles for Autopilot are simplicity and fault tolerance.

Generally speaking, Autopilot provides several basic services to keep a data center operational: provisioning and deployment, monitoring, and repair. It thus includes several components: the Device Manager, Watchdog Service, Deployment Service, Repair Service, and Provisioning Service. The Device Manager is a strongly consistent state machine that is typically distributed over 5-10 computers. Actions based on the state information are taken lazily by the satellite services mentioned above. All the satellite services are themselves replicated, and receive information using a “pull” model.

Low-level services include a filesync service that ensures the correct files are present, and an application manager that makes sure the correct processes are running. The Provisioning Service provides DHCP and network boot, and scans for new devices being plugged in. This service is redundant and uses a protocol to elect a leader at boot. The Deployment Service is a set of weakly consistent replicas that contain the manifest directories for the different machine types. Autopilot partitions all the computers into multiple scale units to allow staged rollouts of a new code version.

Each machine can be in one of three states: Healthy, Failure, and Probation. The Watchdog Service periodically checks the state of every computer and reports an OK/Warning/Error status. The Repair Service periodically asks the Device Manager for a list of machines in the Failure state and repairs them. The monitoring service forms a distributed collection and aggregation tree for performance counters and logs.
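
That per-machine state machine can be sketched roughly as follows (my own Python sketch; the probation timing is an assumption, since the paper only says a repaired machine must stay clean long enough):

    HEALTHY, FAILURE, PROBATION = "healthy", "failure", "probation"

    def next_state(state, watchdog_verdicts, repaired, ok_time, probation_ok_sec):
        """One step of the per-machine state machine.
        watchdog_verdicts: list of 'ok'/'warning'/'error' from all watchdogs."""
        if "error" in watchdog_verdicts:
            return FAILURE                    # any Error marks the machine failed
        if state == FAILURE and repaired:
            return PROBATION                  # repaired, but not yet trusted
        if state == PROBATION and ok_time >= probation_ok_sec:
            return HEALTHY                    # survived probation long enough
        return state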

One flaw I see in this paper is the lack of discussion of Autopilot's own cost in a data center. From the description, most of the Autopilot services are distributed or replicated onto multiple nodes. If Autopilot is used to manage a small cluster, how many machines or how much computational resource does the deployer need to allocate for Autopilot? What are the numbers when managing a cluster with thousands of nodes?

The other flaw is in the discussion of the monitoring service. The monitoring service collects information about the states of the machines. I think it is used for staff to analyze the running of clusters. However, it would also be possible to use the data for real-time decisions and scheduling. For example, one machine may become slow because of old hardware or some incorrect state; Autopilot could allocate a new node of the same type to replace it. This should not be complicated. However, the current version of Autopilot only makes such decisions based on the simple states reported by the watchdog service.

In sum, I believe this is a good idea, and also a practical approach to managing a data center automatically. It has been proven by the large services provided by Microsoft. The automatic provisioning, deployment, and error recovery relieve the operations staff of many repetitive jobs. I think it is also a good reference for other companies with data centers that want to build a management tool.
