Autopilot: Automatic Data Center Management
Autopilot: Automatic Data Center Management. Michael Isard, April 2007.
Reviews due Thursday, April 14th.
Comments
Summary: Autopilot is a framework to automatically manage the deployment and repair of computers in Microsoft's data center. A set of distributed services install system images, launch applications, detect failures, and initiate repair operations.
Problem: Managing a large data center is a complex task. Applications need to be installed, new systems need to be deployed, and failed systems need to be repaired. Hiring enough staff to perform these tasks can result in high expenditures. Furthermore, human error from manual deployment and repair efforts can result in misconfiguration that exacerbates failure situations. Ideally, many of these data center management tasks would be automated and standardized, avoiding the need for staff intervention and reducing the likelihood of misconfiguration. The challenge lies in designing a system that can withstand failures and maintain control of data center machines regardless of failure conditions.
Contributions: Autopilot's uniqueness lies in the unified control it exerts over the data center. Rather than having a mishmash of scripts for deploying applications and haphazard repair procedures, Autopilot has a central device manager that provides instructions to a set of services designed for specific tasks. The device manager maintains a strongly-consistent view of the data center to know which actions have already been applied and which actions need to be initiated. Each of the specific management services is responsible for executing actions to make computers match the state the device manager expects. For example, if the device manager expects a set of machines to contain a specific operating system image, the provisioning service will deploy the system image when it pulls the state for the machines from the device manager. If system image installation continually fails on a particular machine, the repair service may remove the machine from service and place it on a list for physical replacement.
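The pull-based interaction between the device manager and a satellite service can be sketched roughly as follows (a hypothetical Python illustration; the class names, the image name, and the `sync` method are my own inventions, not from the paper):

```python
# Hypothetical sketch of Autopilot's pull model: a satellite service
# periodically pulls the expected state from the device manager and
# acts only when the actual state differs. All names are illustrative.

class DeviceManager:
    """Strongly-consistent record of the state each machine should be in."""
    def __init__(self):
        self.expected_image = {}          # machine -> expected OS image

    def set_expected(self, machine, image):
        self.expected_image[machine] = image

    def pull_expected(self, machine):
        return self.expected_image.get(machine)


class ProvisioningService:
    """Satellite service: drives actual state toward expected state."""
    def __init__(self, dm):
        self.dm = dm
        self.installed = {}               # machine -> image actually on disk

    def sync(self, machine):
        want = self.dm.pull_expected(machine)
        have = self.installed.get(machine)
        if want is not None and want != have:
            self.installed[machine] = want   # stands in for a real reimage
            return f"reimaged {machine} with {want}"
        return "no action"


dm = DeviceManager()
dm.set_expected("node7", "search-image-v2")
prov = ProvisioningService(dm)
prov.sync("node7")   # first pull installs the image
prov.sync("node7")   # second pull finds nothing to do
```

Because the satellite only acts on the difference between expected and actual state, repeated pulls are idempotent, which is what lets the design tolerate lost or delayed messages.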
Flaws: Needing to write application specific watchdogs to detect failures makes adding new applications complex (and may make Autopilot unsuitable for some legacy applications). Furthermore, sufficient watchdogs need to be implemented to ensure all failure cases are detected, not just the common expected cases. This may mean being more aggressive than necessary and declaring failures even when a computer could continue operating as is.
Applicability: Autopilot can be applied to any data center management situation where large groups of machines share similar configurations. However, the ability to leverage the framework is limited by the need for applications to conform. Legacy applications may not work with Autopilot, requiring a mix of manual deployment and repair with automated deployment and repair: a situation that can rapidly become complex and result in a degradation of service from competition between manual and automatic processes.
Posted by: Aaron Gember | April 13, 2011 09:23 AM
Summary
AutoPilot is an infrastructure for automatic data center service deployment, monitoring, and repair. By using a centralized device manager and a variety of satellite applications, AutoPilot is able to perform many data center management tasks autonomously.
Problem
Large data centers are difficult to maintain. They require human operations staff, often around the clock. Further, many failures can be attributed to human error. Automation can solve many of these problems, but constructing an automated system that appropriately deals with the wide range of necessary tasks and does not, in fact, produce substantial additional work for the human operators is challenging.
Contributions
AutoPilot provides an infrastructure for autonomous data center management. There is a single state machine, the Device Manager (DM), which is strongly consistent and replicated. The Device Manager stores the state of the system, and provides this information to satellite processes via a pull model. A Provisioning Service is used to locate new machines, query the DM to determine the appropriate OS, and install the OS. To deploy new code, the Deployment Servers are updated (a single operator command) and the Device Manager rolls out the update appropriately.
Watchdog processes monitor machines and report results to the DM. To simplify this process, only three states exist: ok, warning, and error. A single Error message suggests that a machine is in error and needs to be addressed. Recovery actions also take one of three potential values: reboot, reimage, and replace. The Failure/Recovery state machine is used to determine the appropriate recovery action to take. Computers marked for replacement require human intervention, obviously. Using this mechanism, the DM can control when machines are rebooted, etc., which allows it to avoid a full-system reboot.
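The escalation from reboot through reimage to replacement could be sketched like this (hypothetical Python; the failure-count thresholds are my own guesses, not values from the paper):

```python
# Hypothetical sketch of escalating repair: repeated failures move a
# machine from Reboot to ReImage to Replace. Thresholds are invented.

def next_repair_action(failure_count):
    """Choose a repair action from how often this machine has failed."""
    if failure_count <= 2:
        return "Reboot"     # cheap; clears transient faults
    if failure_count <= 4:
        return "ReImage"    # reinstall OS and application
    return "Replace"        # mark for manual hardware replacement

failure_history = {}

def report_failure(machine):
    """Record one more failure and return the action to take."""
    failure_history[machine] = failure_history.get(machine, 0) + 1
    return next_repair_action(failure_history[machine])
```

A real device manager would presumably also reset the count once a machine survives its probation period, so old failures do not push a recovered machine toward replacement.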
Flaws
AutoPilot assumes that new applications are built to work with this infrastructure. While it’s certainly true that many data centers, especially within a single company, may be running new code and so this requirement makes sense, there are plenty of data centers that could benefit from this model but require adaptation of legacy software or adaptation of the infrastructure to the legacy software. Without this, the generalizability of this approach is limited.
Relevance
This paper addresses an important data center problem. Human involvement (and human error) are expensive, particularly as data centers scale up. Thus, providing an autonomous solution allows data center users to remove the human factor when possible, relying on software to solve complicated problems. AutoPilot provides a framework to address this new model; however, AutoPilot is somewhat limited by the initial assumptions made (no legacy software, etc.). Extensions to this approach would need to address these limitations in order to succeed.
Posted by: Emily Jacobson | April 13, 2011 07:40 PM
Summary:
Autopilot is a distributed system for automatically deploying software and detecting and mitigating faults in a datacenter setting. It provides these services through a strongly consistent central service that makes all management decisions, along with automatic rollback for software deployment and lazy repair for hardware failures.
Problem Description:
Autopilot addresses the problem of automatically managing a datacenter infrastructure with minimal support staff involvement. A solution to this problem will decrease the operational cost for datacenter support staff. Additionally, it has the opportunity to limit the number of errors due to human involvement. I think every previous approach to managing a datacenter includes some sort of network monitoring system similar to Autopilot’s watchdog system. However, Autopilot introduces a level of automation that wasn’t achieved by previous solutions.
Contributions Summary:
While there are a plethora of existing tools for configuration management and network monitoring, I don’t think they provide the level of automation that Autopilot provides. I think Autopilot makes three interesting contributions.
First, it provides automated rollback in the face of failures during software deployment. This is an especially useful feature in SaaS settings, where deployment is done completely in the datacenter. Second, Autopilot implements a lazy repair mechanism that attempts to limit the impact of large scale failures as well as tentatively repurposing failed nodes. Third, Autopilot provides a service for automated provisioning. That is, support staff just need to rack a new system and Autopilot will detect the new system and provision it.
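The rollback behavior described above might look roughly like this (a hypothetical Python sketch; the 90% health threshold and all names are assumptions, not from the paper):

```python
# Hypothetical sketch of transactional deployment to a scale unit:
# push the new version everywhere, then roll back the whole unit if
# too few machines come up healthy. Threshold and names are invented.

def deploy_scale_unit(machines, new_version, is_healthy, min_healthy=0.9):
    """machines: dict machine -> running version (mutated in place)."""
    old_versions = dict(machines)             # remember for rollback
    for m in machines:
        machines[m] = new_version             # stands in for a real deploy
    healthy = sum(1 for m in machines if is_healthy(m, machines[m]))
    if healthy < min_healthy * len(machines):
        machines.update(old_versions)         # transaction-style rollback
        return "rolled back"
    return "committed"
```

The all-or-nothing unit here is the scale unit, which matches the review's point that rollback behaves like a deployment transaction.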
Shortcomings:
I think Autopilot’s overall design is great. However, there are a few areas where its behavior could be improved. The largest omission is that Autopilot doesn’t manage network configuration and name services because Active Directory has built in fault tolerance and management. I think the interaction between Active Directory and Autopilot could cause problems especially if Autopilot detects problems that can only be solved by updating Active Directory’s configuration. The authors do mention that this is a target for future work. Also, Autopilot doesn’t include support for task migration and load balancing. While it makes sense for the application to implement pieces of this functionality, Autopilot has a more global view of the datacenter that could be hosting multiple applications. This global view would be unavailable to the applications. I think this functionality is particularly pertinent in a cloud setting.
Application to real systems:
The authors provide a lessons learned section where they outline a few failures they encountered while deploying Autopilot and related application services. Of the lessons presented, I found a few to be quite interesting. The first is that configuration files should be checksummed to avoid configuration drift. The next interesting lesson is that failure detectors need to distinguish between failures and overloading to avoid cascading failures triggered by overloading. I think these lessons are valuable and can be applied to any system that provides automated fault recovery.
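The checksum lesson can be illustrated with a short sketch (hypothetical Python; the file names and helper functions are my own):

```python
# Hypothetical sketch of the configuration-checksum lesson: record a
# digest of each config file at deployment time, then detect drift by
# rehashing later. Names are illustrative.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def detect_drift(deployed_digests, current_files):
    """Return names of config files whose content no longer matches."""
    return [name for name, digest in deployed_digests.items()
            if checksum(current_files.get(name, b"")) != digest]

digests = {"search.cfg": checksum(b"replicas=3\n")}
detect_drift(digests, {"search.cfg": b"replicas=3\n"})   # no drift
detect_drift(digests, {"search.cfg": b"replicas=5\n"})   # drifted
```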
Posted by: Dan McNulty | April 13, 2011 08:28 PM
Summary
Autopilot is the automatic data center management infrastructure developed within Microsoft. The purpose of this system is to automate the procedures that are usually done by the operations staff in the data center, some of which require the staff to be on call 24/7. The hope is that reducing human intervention will reduce errors and inconsistencies in how the same job is done. With fault tolerance and simplicity in mind, they designed a system that encompasses deployment, provisioning, repair, and watchdog services. All of these services are ultimately controlled by a device manager that has a global view of the system.
Problem Statement
The problem that this paper is trying to address is the fact that as data centers get larger the maintenance and troubleshooting gets much more complex. Human error or inconsistency in how different operators handle issues will make this maintenance even more complex. Moreover, labor is very expensive. Therefore, if we can have an automated system that works well, we could reduce the operations staff and the complexity of the system. This is the same trend as in the manufacturing area that prefers to use robots.
Contributions
Building a “modular” automated system for data center management, where different parts of the management system are handled by different modules while the device manager maintains a global view of the system.
Critique
I think automation is a very good idea and is already used in most situations where some job needs to be done in batch; it is only the extent of automation that differs. Here, Microsoft tries to build a highly automated system. There are some points that I think should be made about this automated system:
I particularly was not sure how well the watchdog idea would work. It puts the burden on the application developer to decide what watchdogs are needed. Moreover, it is not easy to figure out all the necessary watchdogs in advance. Therefore, I believe this is an interesting idea that will sometimes help, but I am not sure whether it will lead developers to minimize their use of the alert system.
The paper does not fully address the issue of system flexibility. It is true that an “ideal” automated system takes over many of the operations staff's tasks and can minimize human error and intervention. However, it seems that many automated systems give an abstract view and an interface to the operator that might not be flexible enough. This is because making an “ideal” automated system is very hard and complex. The problem with an abstracted and inflexible system is that soon the operations staff will build workarounds on top of it, which will be inefficient and again error prone.
The paper makes an argument that automation is very helpful. However, they do not give an idea of how hard it is to switch to an automated system. I believe that there will be a long learning curve for both the operations staff and the developers to start effective use of this system.
Applications
With the size and scale of today’s data centers it is inevitable that some amount of automation will be necessary for continued growth of data centers and their services. Therefore, this system can be a good experience and example for the other companies that will later consider such automation.
Posted by: Fatemah | April 13, 2011 08:53 PM
Summary:
The paper is about Autopilot, an infrastructure developed by Microsoft for automating the management of a data center. It covers the design of the first version of Autopilot, which concentrates on the essential services needed to keep a data center operational automatically.
Description of Problem:
The number of servers in data centers is growing due to increased demand from popular web services. The cost of operating a data center is not cheap. There are many repetitive tasks done by operators that can be automated. Data centers need to be reliable, and many failures are caused by human error. The problem is to develop an infrastructure that can automate the management of a data center in order to reduce cost and provide reliability in an efficient manner.
Summary of contributions:
The paper discusses the design of the first version of Autopilot, which consists of a Device Manager, a replicated strongly-consistent state machine system that uses Paxos. Autopilot includes a number of satellite services that rely on the Device Manager for the information they need to perform their tasks, such as deployment, provisioning, repair, monitoring, and collection. The services try to be as basic and simple as they can be.
Flaws:
I did not find any major flaw in their design besides the ones they already mentioned. The paper mentions that the design assumed non-Byzantine failures, and the authors say they haven't experienced any major faults. The lessons-learned-from-version-1 section mentions some flaws noticed in the first design, such as network hardware malfunctions, slow-running computers, and distinguishing between failure and overloading.
Application to real systems:
Data centers are so huge that it is difficult and expensive to manage all the servers manually. Autopilot provides a framework which has been used by Microsoft to automate its data centers. Since this is the first version of the system, I'm sure that many issues have been addressed in later versions. Automatic data center management is cost effective if it can perform what the data center was meant to do. There are many issues that need to be resolved in order to make the infrastructure more efficient, such as distinguishing between failure and overloading, but the idea is very appealing since it requires minimal human supervision.
Posted by: Kong Yang | April 14, 2011 12:05 AM
Summary:
This paper presents Autopilot, an automatic data center management system for software provisioning, deployment, monitoring, and repair. It was originally used to manage Microsoft's Windows Live Search backend and is intended for other forthcoming large-scale deployments inside Microsoft.
Problem:
With the growth of data center capacity, managing such large-scale systems is becoming difficult and expensive. For example, many machines run a single application, failures occur across all components, the availability requirements are very high, and there is a competing requirement of low cost. These unique aspects of modern data centers call for highly automatic, available, efficient, and reliable new infrastructures.
Contributions:
1. Simplicity is the top design principle. They adopt the simple non-Byzantine failure model and reject complex solutions even when they are more efficient. Many large-scale systems share this design principle, and Autopilot again confirms this is the right choice.
2. A fully functional system is proposed, including the device manager (central system-wide coordinator), provisioning service (network or OS boot), deployment service (replica management), watchdog service (system and component monitoring), and repair service (performing system repairs). These components collaborate with each other to automatically manage computers at large scale.
3. Valuable lessons from Autopilot are shared and discussed. For example, they found that TCP/IP checksums are weak and need additional application-level checksums; systems need to tolerate slow nodes as well as fail-stop errors; and throttling is crucial.
Flaws:
1. Autopilot uses a centralized device manager. However, it seems that there is a lot of communication and coordination between the device manager and other components. This limits the scalability of the whole system. Also, it may be a single point of failure. So, it would be better if the device manager were designed in a distributed manner.
2. The whole system is not flexible enough for generic applications. It seems hard to configure Autopilot for different types of applications. Since data centers may host various applications, the underlying infrastructure should be generic and flexible enough to host new applications.
Applicability:
Large Internet companies are building huge data centers for the growing market of cloud computing services. Obviously, data center management is a core focus for better service and lower cost. Autopilot provides a very good reference and a body of experience for building automatic management infrastructures.
Posted by: Lanyue Lu | April 14, 2011 12:42 AM
Summary:
This paper describes Microsoft's internal datacenter management infrastructure which they call Autopilot.
Problem Statement:
Large-scale datacenters are hard to manage given the huge number of moving parts: machines, network devices, a large human workforce to manage the infrastructure, etc. A lot of the 'datacenter' operations, though, are repetitive in nature and hence, left to operators, are prone to errors. If we can automate these tasks, we can achieve simplicity and better reliability. Autopilot was Microsoft's way of solving this problem.
Contributions:
The emphasis on simplicity of setup makes a lot of sense, because a datacenter is a very large-scale operation and undue complexity is not a good idea in large-scale systems. I also liked the componentization of operations such as provisioning, deployment, monitoring, and repair. The system also employs the Condor principle of well-defined and limited responsibilities for modules. The implementation of manifests (simultaneous manifests with only one of them active at a time) helps with zero-downtime upgrades. Since the machines in the datacenter are within the network DMZ, the design assumes a relaxed security and fault model. The Device Manager-Scale Unit based architecture allows for pipelined rollouts of new services while keeping other services alive.
Flaws:
The recovery model allows only a few coarse-grained actions such as Reboot, ReImage, and Replace. This model might not work on a node that hosts multiple services (even Autopilot services), because failure of one of them will cause the entire system to reboot. The DM, a cluster of 5-10 computers managing the entire datacenter, seems to play a very important role in coordinating the Autopilot nodes, which might make it a single point of failure.
Applications:
Having been an ops guy managing cloud deployments at my previous employment, I can appreciate the need for a tool/infrastructure that helps with provisioning, deployments, and monitoring. Autopilot fills all the needs at the datacenter-scale. With the increase in cloud computing and datacenters, such solutions will be even more relevant in future.
Posted by: Srinivasan T | April 14, 2011 12:52 AM
Summary: Autopilot is a system for automating the management of resources in large data centers. Features allow administrators to deploy operating systems, application software, networking configurations, and so on across a heterogeneous environment. The system monitors machines and automatically handles failures.
Problem: As the popularity of web services like Live Search and Live Mail grew, Microsoft found it needed to manage very large numbers of servers. The total cost of operating these data centers included high costs for repetitive work handled by operations staff. Automated management systems like Autopilot help to reduce these labor costs. Additionally, human workers are prone to mistakes, and automation software can reduce errors and increase service reliability.
Contribution: The central component of Autopilot is the Device Manager, a single strongly consistent state machine replicated over a small number of the servers. The Device Manager records the state that the system should be in at a given time. A number of satellite services running on some subset of the servers use the information in the Device Manager to bring the system into agreement with it. Satellite services include provisioning OS images, filesync for ensuring correct configuration files, deploying application code, and watchdogs for detecting failures. Failure and recovery are handled by modeling each server as a state machine with Healthy, Failure, and Probation states.
Flaws: One big limitation with vanilla Autopilot is that recovery, while automatically handled, is not guaranteed to be quick. Applications that have customer facing services with low-latency requirements must layer their own failure handling on top of Autopilot. The authors feature an example of this with web indexing. This isn't really a fatal flaw, but does require extra developer effort for many common workloads. Another limitation is that the system seems to work only within the Microsoft ecosystem, which prevents the ability to reduce costs by using an open source OS.
Applications: Microsoft's success in using Autopilot to support some of the largest services on the web for such a long period of time certainly is a strong argument for its effectiveness. Services from the very large to the small can benefit from automation to reduce costs and, importantly, to handle failures and reduce downtime.
Posted by: Kris Kosmatka | April 14, 2011 01:36 AM
Summary:
This paper presents Autopilot, an automatic data center management infrastructure developed by Microsoft.
Problem:
The problem this paper tries to solve is how to design an in-house infrastructure for automatic data center management that can reduce manual intervention and improve reliability.
Solution:
The key component of Autopilot is the Device Manager, which is the system-wide authority for configuration and coordination. There are several satellite services around the Device Manager. These satellite services receive messages from the Device Manager and lazily perform its commands. Satellite services include:
a. The Provisioning Service guarantees that each computer is running the correct system image.
b. The Deployment Service ensures that each computer is running the correct set of application processes.
c. The Watchdog Service detects errors and reports them to the Device Manager.
d. The Repair Service cooperates with the application and the Device Manager to recover from software and hardware failures.
e. The Collection and Cockpit services passively gather information about the running components and make it available in real time for monitoring.
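How the Device Manager might fold together the reports it receives from several watchdogs can be sketched as follows (hypothetical Python; the paper specifies the three report values, but the combining rule here is my own plausible reading, not a quote):

```python
# Hypothetical sketch of combining watchdog reports for one machine:
# any "error" marks the machine failed; "warning" alone is surfaced
# to operators but triggers no automatic repair. Rule is my reading.

def combine_watchdogs(reports):
    """reports: iterable of 'ok' | 'warning' | 'error'."""
    reports = list(reports)
    if "error" in reports:
        return "error"      # a single error is enough to mark failure
    if "warning" in reports:
        return "warning"    # visible to operators, no repair action
    return "ok"
```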
Flaw:
1. Applications built on top of Autopilot need to be manageable. This means that application developers may need to redesign their applications and do some extra programming work, like writing watchdog scripts, in order to meet Autopilot's requirements.
Application to real systems
1. Several computers constitute a scale unit, and Autopilot updates a scale unit concurrently. This is quite like a deployment transaction, which keeps programs consistent across application computers of the same type.
2. I like the idea that each application computer keeps several versions of its software and configuration files, so it can roll back to a former version or switch to a new one.
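The multiple-version idea amounts to keeping several manifests on disk with exactly one active, so a version switch or rollback is just repointing the active slot. A minimal sketch (hypothetical Python; all names are mine, not from the paper):

```python
# Hypothetical sketch of keeping several software/config versions on a
# machine with one active: switching or rolling back means repointing
# the "active" slot, not refetching bits. Names are illustrative.

class ManifestStore:
    def __init__(self):
        self.versions = {}    # version -> list of files in that manifest
        self.active = None

    def install(self, version, files):
        self.versions[version] = files    # fetched but not yet running

    def activate(self, version):
        if version not in self.versions:
            raise KeyError(f"{version} is not installed")
        self.active = version             # fast switch, also fast rollback


store = ManifestStore()
store.install("v1", ["web.exe", "web.cfg"])
store.install("v2", ["web.exe", "web.cfg"])
store.activate("v2")
store.activate("v1")   # rollback: just re-activate the old manifest
```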
Posted by: Linhai Song | April 14, 2011 01:37 AM
Summary:
The paper describes Autopilot, a distributed system used within Microsoft to automate various operations on production systems, such as deployment, monitoring, provisioning, and watching for fatal errors. Such a system is useful for reducing manual operations, especially when services run on thousands of machines that are hard to operate on manually. Autopilot automates most of this process, reducing the human effort in operating such systems.
Problem Statement:
The primary problem is that large-scale distributed systems tend to use commodity computers, so the number of machines running a particular service is extremely large. This brings the challenge of how to automatically provision and monitor these machines and deploy or roll back software on them. Automation is important to prevent human errors (which can be frequent given the number of machines), and at this scale the systems are almost impossible to monitor by hand. Also, a particular service is now monitored, developed, operated, and tested by many people rather than one single person, which requires simplicity of design.
Contributions:
1. Gives a view of the operational challenges at the scale at which big companies like Google and Microsoft are currently running. Such challenges are unique, require significant effort, and are managed much more easily by automation than by manual effort.
2. Important principles: avoiding bottleneck components and hence surviving in spite of failures, simple and easily operable design, and providing "best effort" services rather than absolutely "correct" services.
3. A small distributed management service (referred to as the device manager) that coordinates the various Autopilot services, such as the deployment service and repair service. This is the core of the system.
4. Various services like automatic provisioning of machines, application deployment over a large set of machines (which might be even a day long process) with facilities of quick rollbacks, monitoring service with alarms for human debugging, and performance statistics information.
Flaws:
No flaws as such. But it would be interesting to know how critical the Autopilot service itself was. As pointed out, it might be a single point of failure and could actually lead to a collapse of the operations of all other services. Hence, it is important to know the criticality of this service. Also, the paper talked about various improvements added later in newer versions; an interesting one would be how it monitored services distributed across various datacenters.
Application to the Real System:
Given that Microsoft ran this software in production for Windows Live Search, such a mechanism is really useful at the current scale, and such a system supports various operational processes, giving much more power and ability to the engineers involved with these systems. A small device manager service that is nevertheless capable of monitoring a huge system seems like a positive and good design.
Posted by: Ishani Ahuja | April 14, 2011 01:51 AM
Summary:
This paper presents AutoPilot, an automatic data center management infrastructure which monitors the status of the computers in the data center and automatically repairs malfunctioning ones.
Problem:
How to automate the operation of a data center (including software installation, system reboot, and OS installation) to the maximal extent, so that expensive manpower can be saved and human errors can be avoided.
Contribution:
1. The system architecture, with a centralized strongly consistent component (the Device Manager) along with a couple of weakly consistent satellite services, is both simple and flexible. It also allows modularized software development.
2. An extremely simple failure model (at the granularity of computers instead of processes) and set of remedies greatly improve the generality of the system, which should provide insights for others.
3. Software update/rollback is handled automatically.
Flaws:
1. There is no performance evaluation, so it’s hard to know the overhead of such an infrastructure. It is especially interesting for the Device manager, since it maintains strongly consistent centralized state, and is likely to be the bottleneck of the system.
2. As mentioned in the paper, AutoPilot doesn’t support legacy applications. Applications have to be AutoPilot aware to allow automatic management. The paper should explain what kind of support AutoPilot needs from the application in more detail.
Applicability:
AutoPilot is already in production use in Microsoft's data centers, and can potentially be deployed in other large-scale data centers where computers run a limited set of applications. However, it requires applications to be AutoPilot-aware, which will limit its deployment in data centers running legacy applications.
Posted by: Suli Yang | April 14, 2011 02:58 AM
Autopilot: Automatic Data Center Management
Summary
This paper describes an automatic data center management infrastructure that can manage thousands of computers for large-scale services with minimal human intervention. Autopilot tolerates faulty computers and supports automatic software provisioning, deployment, monitoring, and repair services.
Problem
Manual management of a data center requires lots of human intervention. Operations staff must be hired on 24-hour call to handle failed computers, which is inefficient. Also, humans may introduce errors that cause data center failures while trying to fix problems. Automatic management is needed to reduce the management effort of operators.
Computers may fail at any time; achieving high availability and consistency even when failures happen is a big issue in the design of the system.
Contribution
1. Autopilot uses a centralized component (the Device Manager) to keep relatively small shared state in a strongly-consistent state machine using the Paxos algorithm.
2. Other services are satellite services which communicate with the Device Manager regularly to "pull" the latest shared state. But satellite services can have their own private state and keep it only weakly consistent, which enables availability.
3. With Autopilot, deployment of new code can be started with a single operator command. When the device manager receives the command, it synchronizes the configuration with the computers in a scale unit and tries to run the new code on them automatically. And if too few computers succeed in running it, the deployment is rolled back like a transaction.
4. Autopilot provides a failure/recovery state machine and watchdogs that periodically detect failures and perform appropriate recovery actions in a simple but effective way.
Flaw
1. Autopilot needs to wait for a relatively long time before a computer is moved from probation state to healthy state, which makes recovery time large. For temporary failures, it would be better for autopilot to replace the failed machine with a spare pre-installed machine to maintain availability and same load for remaining machines during the recovery. When the machine returns to healthy state, it can return to its position again. The spare machine can be used in the future.
2. Failure detectors and recovery services could be application specific because different applications have different standard of failures and specific recovery actions. They know better than autopilot about how to handle failures. Maybe autopilot could provide interfaces to applications so that they can implement their own failure detectors and recovery services.
Application
Microsoft is extending Autopilot to handle every Windows Live service, including Bing, MSN, and its online-advertising platform. The Autopilot team is building features like dynamic resource allocation, automatic failure detection/recovery, software load balancing, and virtual drives to provide better dynamic capacity scaling and higher availability.
They also plan to scale Autopilot to manage on the order of 100,000 machines.
Posted by: Weiyan Wang | April 14, 2011 03:08 AM
SUMMARY
The paper describes Autopilot, the automatic data center management infrastructure developed within Microsoft. It gives a high-level overview of Autopilot's low-level provisioning and deployment services and of how it automatically maintains state and detects, restarts, and recovers from failures.
PROBLEM
As Microsoft expanded its web-based services to a very large scale, large clusters of server computers became absolutely necessary. As the scale grew, it became hard for operations staff to manage the computers manually, necessitating an automatic management infrastructure. With this scale in mind, fault tolerance and simplicity became important design considerations for the Autopilot management software.
CONTRIBUTION OF THE PAPER
The paper gives a rough outline of how the Autopilot data center management software is broken down into a number of components to ensure a simple design. The Device Manager is essentially a replicated state machine, run on a small number of computers, that holds the authoritative cluster state. Satellite services keep the rest of the system in line with the Device Manager; they receive information from it via a "pull" model using regular heartbeat messages. The provisioning service takes care of discovering new machines using DHCP and network boot. The application also defines a set of machine types present in the cluster, where each type pertains to a role a computer might take on. The main power of Autopilot lies in its automatic fault detection and recovery. Detection is done using watchdogs: a watchdog probes one or more machines to test some attribute and reports OK, Warning, or Error to the Device Manager. The Device Manager's error predicate is effectively a disjunction of these reports: a machine is considered faulty if any watchdog reports an error. Based on the reports, the recovery logic takes an appropriate action from the set DoNothing, Reboot, ReImage, Replace. Finally, the collection service forms a distributed collection and aggregation tree for performance logs.
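The watchdog aggregation and the escalating repair actions described above can be made concrete with a short sketch. The error-predicate behavior (a machine is faulty if any watchdog reports Error) matches the paper; the escalation thresholds in `choose_action` are illustrative assumptions, not Autopilot's actual heuristics.

```python
# Watchdog verdicts and repair actions as named in the paper.
OK, WARNING, ERROR = "ok", "warning", "error"

def machine_faulty(watchdog_reports):
    """Effectively a disjunction of watchdog results: a single Error
    verdict is enough to mark the machine as faulty."""
    return any(r == ERROR for r in watchdog_reports)

def choose_action(error_count):
    """Pick an escalating repair action based on how often the machine
    has failed. The thresholds here are illustrative only."""
    actions = ["DoNothing", "Reboot", "ReImage", "Replace"]
    return actions[min(error_count, len(actions) - 1)]

faulty = machine_faulty([OK, WARNING, ERROR])   # one Error => faulty
healthy = machine_faulty([OK, WARNING])         # warnings alone do not fail
first_action = choose_action(1)                 # mild failure: reboot
last_resort = choose_action(10)                 # repeated failures: replace
```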
CONCERNS
I personally felt the description of Autopilot was quite general and abstract. Not much explanation was given of how false positives are handled by the Device Manager software, and no information about performance or the scale of the deployed system is presented. The lack of support for legacy software is indeed a restriction. However, the team I plan to work with at Microsoft this summer builds a similar product called MDOP (Microsoft Desktop Optimization Pack), a management and virtualization suite for enterprise Windows clusters, so I hope to relate more to Autopilot then.
RELEVANCE TO CURRENT SYSTEMS
As stated above, I find MDOP (http://www.microsoft.com/windows/enterprise/products/mdop/default.aspx) very similar to Autopilot in the sense that both are manageability software for a cluster of computers (just that the former focuses on enterprise Windows and the latter on Microsoft's internal data centers).
Posted by: Karthik Narayan | April 14, 2011 04:40 AM
Summary
The paper briefly describes Microsoft's Autopilot, an automatic data center management infrastructure. It covers the implementation and design decisions of Autopilot, the services provided by the system, and the lessons learned while running an automated system.
Problem
Data centers today have shifted from high-end, expensive (and reliable) servers to commodity computers with an increased likelihood of failure. As a result, a large part of the cost of running a modern data center goes to monitoring and maintaining failed components. To reduce these operating expenses, Autopilot attempts to automate repetitive human tasks, thereby reducing the size of the operations and on-call staff.
Contributions
The paper discusses at a very high level, the motivation for Autopilot’s design principles as well as an overview of the major components of the system. It also provides a case study of Autopilot’s deployment in a production environment as well as lessons learned from this deployment.
The major components of Autopilot are the Device Manager, which stores the state the system should be in, and several satellite services, including the deployment and watchdog services. The satellites are responsible for ensuring that a client's state accurately matches the state stored at the Device Manager, for monitoring each machine's health, and, where possible, for initiating repairs.
Autopilot's deployment services bring a machine online by installing the files and applications the computer requires, as described by its application manifest. The watchdog services monitor and report the health status (Healthy, Probation, Failure) of client computers; this information, combined with several heuristics, drives the lifecycle of a computer's repair state.
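The repair lifecycle in the paragraph above can be sketched as a small state machine. This is a simplified model under stated assumptions: the transition rules and the probation period are illustrative, not Autopilot's exact heuristics, and the class name is hypothetical.

```python
class RepairStateMachine:
    """Tracks one machine through Healthy -> Failure -> Probation -> Healthy.
    A machine leaves Probation only after several error-free reports,
    mirroring Autopilot's reluctance to trust a recently repaired machine."""
    def __init__(self, probation_period=3):
        self.state = "Healthy"
        self.clean_reports = 0
        self.probation_period = probation_period

    def report(self, verdict):
        if verdict == "error":
            self.state = "Failure"        # would trigger a repair action
            self.clean_reports = 0
        elif self.state == "Failure":
            self.state = "Probation"      # repaired, but not yet trusted
        elif self.state == "Probation":
            self.clean_reports += 1
            if self.clean_reports >= self.probation_period:
                self.state = "Healthy"    # proven itself; fully back in service
        return self.state


m = RepairStateMachine(probation_period=3)
m.report("error")                         # Healthy -> Failure
after_repair = m.report("ok")             # Failure -> Probation
for _ in range(3):
    final = m.report("ok")                # three clean reports -> Healthy
```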
Flaws
One of Autopilot's design principles was to be as simple as possible. In general, simplicity improves the maintainability and growth of a system. However, as the lessons-learned section points out, this minimalist approach, coupled with Autopilot's development in concert with Windows Live Search, left several of Autopilot's services inadequate for applications other than Windows Live Search.
Applicability to real systems
As data centers continue to grow, there will be increasing demand for and reliance on automatic tools whose purpose is to reduce operational cost and to cope with the effort of maintaining an expanding data center. While the system presented in the paper still relies on some human intervention, I imagine future systems will include both software and hardware tools, such as robotic arms, in the maintenance process, further reducing the human intervention required.
Posted by: Greig Hazell | April 14, 2011 07:38 AM
Summary
Autopilot is a tool for managing large datacenters with the objective of automation to the greatest possible practical extent. This paper presents the motivation and basic design of Autopilot, and also provides insight as to how such a system should be used.
Problem
As datacenters become increasingly vast, the cost of manually performing frequently required management tasks, such as monitoring, error detection, and recovery, becomes very high. For example, rolling out a new version of software while preserving the integrity of existing installations is a tricky operation that could easily be foiled by human error. Because such tasks have a regular structure, however, it is possible to build infrastructure that manages them automatically, with less chance of irreversible failure. This is the goal that Autopilot seeks to achieve across all of its responsibilities.
Contributions
The value of the system comes from the simple and well-defined interfaces it provides. For the datacenter, the benefits are obvious: it becomes easy to tell which hardware to replace, there are automatic tools for scaling and installing new nodes, and generating system performance statistics is simple. The API exposed to client programs also confers considerable value, because common features do not need to be re-implemented by every datacenter application. Since Autopilot maintains highly replicated machine state, an application can outsource much of its error-detection logic. Conversely, if an application knows from its own information that a machine is failing, it can inform the centralized component of the failure and the requested action.
Limitations
One main limitation is that building an application for Autopilot, or any similar system, creates a dependency that may make it difficult to migrate to a different datacenter platform. This is an inherent difficulty which may hinder its acceptance and popularity.
Another challenge is that the developers of Autopilot cannot really know which features and what customizability will matter to applications until the applications are actually using the system. The examples here are fast error recovery and the throttling of re-imaging operations, neither of which existed until an application required it. In the case of Windows Live Search, the application had to re-implement many features that were supposed to be handled by Autopilot, because the requirements weren't known in time. It's hard to see this as a success; I suppose the hope is that once enough experience is gained, the system will be customizable enough for a majority of applications.
Applicability
Autopilot has already proven its usefulness as it is deployed for use with large-scale internet applications. Also, datacenter computing is probably sticking around for the time being, so I imagine that tools like autopilot are going to be extremely useful and popular in the future.
Posted by: Tony Nowatzki | April 14, 2011 07:39 AM
Summary
This paper discusses how data processing centers can be managed on the machine level. In some cases, it might make sense to monitor/manage at the process level, since processes for the same application might be run on different machines at different times, but the advantage of Autopilot working on the machine level is that it can re-image entire systems that are failing.
Problem
In large data centers, the large number of machines means that there will be frequent hardware and software failures. One possibility is to keep people on call or at work 24x7 to address these issues, but this is expensive for the company and will lead to disgruntled employees. Thus, it’s important to be able to manage distributed systems automatically in a way that deals with failure and makes large deployments easy.
Contributions
The paper made a couple of interesting suggestions for avoiding false positives when detecting failures. First, software deployments cause the machines to enter a probation mode, so that failures due to the deployment are not counted as failures indicating the machine needs to be re-imaged. Second, load shedding can often look like a failure even when it's intentional, so it's important to recognize this behavior.
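The first suggestion can be sketched as a filter that discounts watchdog errors inside a post-deployment window. Everything here is an illustrative assumption (the class name, the window length measured in report ticks); the idea, not the mechanics, is from the paper.

```python
class ProbationFilter:
    """Suppresses repair actions for errors reported shortly after a
    deployment, so rollout glitches aren't mistaken for hardware faults."""
    def __init__(self, window=2):
        self.window = window
        self.ticks_since_deploy = None    # None = no recent deployment

    def deployed(self):
        self.ticks_since_deploy = 0       # enter the probation window

    def should_repair(self, verdict):
        in_probation = (self.ticks_since_deploy is not None
                        and self.ticks_since_deploy < self.window)
        if self.ticks_since_deploy is not None:
            self.ticks_since_deploy += 1  # each report advances the clock
        return verdict == "error" and not in_probation


f = ProbationFilter(window=2)
f.deployed()
first = f.should_repair("error")    # suppressed: just deployed
second = f.should_repair("error")   # still inside the window
third = f.should_repair("error")    # window expired: treat as a real fault
```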
They contributed the Cockpit tool for exploring the data collected via Autopilot. Although such graphical tools are usually of little technical interest, the quality of a management user interface often determines whether the software is used in practice, so it's nice they didn't gloss over this detail in the paper.
The importance of being able to roll back is addressed. For instance, an audit trail is left when configurations change. Also, the paper stresses that certain machines that have been experiencing failures should have their data backed up before they are re-imaged.
Flaws
Autopilot requires applications to be Autopilot-aware for logging and other features to work. While it might be feasible for Microsoft and a few other companies to develop their software in house and integrate with Autopilot, the vast majority of companies won't use a tool like Autopilot unless it integrates nicely with existing software.
Application to Real Systems
The principles of this paper, such as the watchdogs, are definitely useful in real systems. For instance, while managing a MySQL database server last summer at Qualcomm, there were some constraints that couldn't easily be expressed in MySQL's version of SQL, so we had to write scripts that checked for well-formed data. We occasionally ran the scripts manually, but it would have been nice to have a management framework with a watchdog that ran the scripts for us, alerting us via email or other means of anomalies.
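The watchdog wished for above can be approximated with a tiny loop that runs check functions and fires an alert hook on any error. This is a hypothetical sketch: the check and the alert callback stand in for the reviewer's hand-written SQL validation scripts and an email sender (e.g. via `smtplib`), neither of which is shown here.

```python
def check_well_formed(rows):
    """Example data-validation check: every row must have a non-empty
    'name' field. Stands in for a hand-written SQL validation script."""
    bad = [r for r in rows if not r.get("name")]
    return ("error", f"{len(bad)} malformed rows") if bad else ("ok", "")


def run_watchdog(checks, data, alert):
    """Run every named check against the data; invoke the alert hook
    (e.g. an email sender) for each check that reports an error."""
    results = {}
    for name, check in checks.items():
        verdict, detail = check(data)
        results[name] = verdict
        if verdict == "error":
            alert(name, detail)
    return results


alerts = []                                   # captured instead of emailed
rows = [{"name": "a"}, {"name": ""}]          # second row is malformed
results = run_watchdog({"well_formed": check_well_formed}, rows,
                       lambda name, detail: alerts.append((name, detail)))
```

In practice the loop would be run from a scheduler (cron, a service timer) rather than by hand, which is exactly the gap the reviewer describes.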
Posted by: Tyler Harter | April 14, 2011 08:35 AM