Data Center Services
Read and review one of these papers:
* The Chubby Lock Service for Loosely-Coupled Distributed Systems
* Bigtable: A Distributed Storage System for Structured Data
* Autopilot: Automatic Data Center Management. Michael Isard. April 2007.
* X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica. NSDI '07.
Reviews due Tuesday, 4/20.
Comments
I reviewed X-Trace because I like reading Scott's papers. They generally propose bold ideas [ which are hard to realize :) ]. But I like the way the papers are written: they acknowledge their weak points and loopholes, and then try to convince readers why it might still work out.
Summary
--------
This paper describes X-Trace, a tracing framework that spans multiple applications and layers in the network stack. Reports generated by X-Trace can be easily correlated and used for debugging various application/network protocol activities.
Detailed description
--------------------
Network operations are highly causal in nature, and it is often hard to reason about the sequence of events that take place as a result of an application activity. Past work has looked at tracing a particular application, tracing at a particular layer, or tracing to identify a particular type of problem. There is no simple way of tying all these different reports together to get a full view of causally related network activity. This paper proposes a generic framework for tracing at different levels that involves multiple administrative domains. The basic idea is to assign a task ID to a high-level operation and associate all related network operations with the same ID. Stakeholders in the communication carry a small amount of X-Trace metadata and are responsible for recording it and spreading it around. Two primitives, pushDown() and pushNext(), help propagate this metadata between different layers and tracing sites. Applications and network protocol software would have to be modified to handle these recording and propagating functions.
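To make the propagation concrete, here's a minimal sketch of how a task ID plus the two primitives might flow across layers and hops. The class, field names, and report format are my own invention (and heavily simplified), not the paper's actual metadata layout:

```python
import uuid

class XTraceMetadata:
    """Hypothetical stand-in for X-Trace metadata: a task ID shared by all
    causally related operations, plus an ID for the current operation."""
    def __init__(self, task_id=None, op_id=None):
        self.task_id = task_id or uuid.uuid4().hex[:8]
        self.op_id = op_id or uuid.uuid4().hex[:8]

    def push_next(self):
        # Same layer, next hop: keep the task ID, mint a new operation ID.
        return XTraceMetadata(self.task_id, uuid.uuid4().hex[:8])

    def push_down(self):
        # Lower layer (e.g. HTTP -> TCP): the lower layer inherits the task ID
        # so its reports can later be joined with the higher layer's.
        return XTraceMetadata(self.task_id, uuid.uuid4().hex[:8])

def report(site, layer, md):
    # Each tracing site emits a report keyed by task ID; a collector can later
    # join reports from all layers and domains on that key.
    print(f"[{md.task_id}] site={site} layer={layer} op={md.op_id}")

# An HTTP request through a proxy, carried over TCP at each hop.
http_md = XTraceMetadata()
report("client", "HTTP", http_md)
report("client", "TCP", http_md.push_down())
proxy_md = http_md.push_next()
report("proxy", "HTTP", proxy_md)
report("proxy", "TCP", proxy_md.push_down())
```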
Contributions
-------------
The main contribution of this paper is in proposing a pervasive solution to the tracing problem. It does a nice job of integrating recording and metadata propagation within the datapath itself by exposing a simple interface to applications.
Applicability
--------------
It's clear that this framework needs a lot of modifications to work, which makes it hard to believe that it will be adopted. Moreover, I don't see incentives for entities to adopt tracing in general:
* Why would a router want to bear the cost of recording X-Trace reports super fast to disk just to aid tracing queries from end hosts?
* It would be very difficult for an entity to decide whether to release a report or not, based on some complex policies. There would be some security & privacy concerns here, I guess.
So, it boils down to partial adoption. Unfortunately, partial adoption does not help a lot in tracing. Worse, it could lead to more confusion and wrong inferences. For example, if an entity is missing some trace report, how would you know whether that was because of a packet loss or because of a lack of tracing support in some protocol? For these reasons, I think X-Trace will not be widely adopted.
It feels a little weird to criticize by saying 'they need a lot of change to get some small feature or benefit', when I know my research is no better :D. Still, I think it's a good thought exercise to see what the cost of solving such hard problems is.
Posted by: Chitra | April 20, 2010 04:42 PM
Chubby is sort of a best-effort distributed lock service. The goal is to provide a distributed synchronization primitive that is efficient and easy to use. Paxos would be a good fit as it provides consistency; Google instead chose moderate consistency with higher efficiency. Using locks is a better fit for most programmers, who understand locking at some level but have no familiarity with distributed consensus.
Not to say that consistency is sacrificed much. Masters (the primary node of a Chubby cell) are still chosen using Paxos. Clients communicate with the master to manipulate locks, while consensus is used to propagate updates. I think the inconsistency arises when clients attempt to use the locks, and so sequence numbers (sequencers) can be attached to many operations. Consistency then just depends on clients not timing out.
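Here's a rough sketch of how such a sequencer check might look on a server receiving requests from lock holders; the class and method names are my assumptions, not Chubby's real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    """Hypothetical sequencer handed out on lock acquisition: names the lock,
    its mode, and a generation number that bumps on every reacquisition."""
    lock_name: str
    mode: str          # "exclusive" or "shared"
    generation: int

class FileServer:
    """A server that only accepts writes from the current lock holder."""
    def __init__(self):
        self.latest_generation = {}   # lock_name -> highest generation seen

    def write(self, seq: Sequencer, data: bytes) -> bool:
        seen = self.latest_generation.get(seq.lock_name, -1)
        if seq.generation < seen:
            return False              # stale holder: lock was reacquired since
        self.latest_generation[seq.lock_name] = seq.generation
        # ... apply the write ...
        return True

server = FileServer()
old = Sequencer("ledger-lock", "exclusive", generation=6)
new = Sequencer("ledger-lock", "exclusive", generation=7)
assert server.write(new, b"update") is True
assert server.write(old, b"late update") is False   # rejected as stale
```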
Partitioning is as reliable as you could hope for in a consensus system. Immediately after a partition develops, locks and sequence numbers remain valid up to their timeout values. Since the system uses Paxos for consensus, a partition with fewer than half of the Chubby servers will not allow locking operations.
One of the interesting points is that coarse-grained locking is better in distributed systems. Fine-grained locking is supported, but the Chubby developers expect that its users will either heed their warning or determine for themselves that fine-grained locking doesn't work well. I guess what this means is that work queues are no longer realistic, and the partition-and-barrier approach makes much more sense now.
Chubby has trouble with scaling at some point, so they had to use many different approaches to reduce load on the servers. File handles contain check bits to validate them. The alternative might be to look up a file descriptor and check who owns it; instead, the check bits are hard to forge and easy to verify, so checking ownership, which may require communication, isn't necessary. Another approach to reduce communication is to piggyback event notifications on KeepAlive replies. Having less communication helps the network more than it does the servers, but it is still advantageous. Caching, mirroring, and partitioning are three other approaches. And don't forget the biggest one: it's a lock service, not a consensus service.
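A tiny sketch of the check-bits idea: an HMAC-style tag lets a server validate a handle locally instead of looking up ownership somewhere else. The key, handle layout, and helper names here are made up:

```python
import hmac, hashlib

SERVER_KEY = b"cell-secret"     # hypothetical per-cell secret

def issue_handle(path: str, client_id: str) -> str:
    payload = f"{path}|{client_id}"
    tag = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{payload}|{tag}"   # handle = payload plus unforgeable check bits

def validate_handle(handle: str) -> bool:
    payload, _, tag = handle.rpartition("|")
    expect = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(tag, expect)   # no ownership lookup needed

h = issue_handle("/ls/cell/bigtable-root", "client-42")
assert validate_handle(h)
assert not validate_handle(h[:-1] + ("0" if h[-1] != "0" else "1"))  # forged tag
```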
Posted by: Markus Peloquin | April 20, 2010 02:17 PM
AutoPilot
The paper presents a data-center management system, Autopilot, commercially used by Microsoft to manage hundreds of thousands of its computers. Its features make it robust in the face of failures, and it requires minimal human intervention. The simple design, though not optimal, is easy to debug and maintain.
The problem that the paper addresses is designing an efficient yet reliable and simple management system that can run a cluster of thousands of computers connected together in a datacenter and minimize human intervention, both to reduce costs and to avoid errors, by introducing a consistency unachievable by human operators.
The system is a fairly centralized entity, with a small set of computers running the various services and the core of Autopilot. The various smaller services keep in constant contact with the core through a set of heartbeat messages. These services take care of deployment, provisioning, and observation functionality. The system is not very different from EMULAB, which runs on our own WAIL lab. I would expect EMULAB to actually be far more complex and powerful than Autopilot, but some of the features required to configure a testbed can actually be useful in modern datacenters where owners are renting out computing power to individual clients who would prefer to have as much isolation as possible.
The system seems to have been successfully deployed, which is probably the toughest part of the process for such a distributed system where the design principles for the various components are mostly borrowed. I am curious how Autopilot compares to other enterprise management software like Oracle Enterprise Manager. I am also a bit surprised that the paper mentions that Autopilot cannot handle migration of services; I would have expected that a simple extension to the watchdog services would be to monitor the health of various applications/computers and scale or migrate according to its observations. Another issue not clear here is scheduling. Today most datacenters employ smart scheduling to minimize their power consumption and maximize utilization of running processors, but these aspects have not been introduced into this iteration of Autopilot.
Posted by: Sanjay Bharadwaj | April 20, 2010 02:09 PM
Problem Description:
How to build an infrastructure that automates several maintenance aspects of a data center, like provisioning, deployment, monitoring, repairing, and updating, thereby needing less human intervention and thus reducing the operating cost of the datacenter. This is the problem discussed in this paper.
Summary:
Autopilot is Microsoft's automatic datacenter management infrastructure. The main aim of this infrastructure is to reduce operational cost and lessen manual errors. Autopilot has several modules, each performing a specific task. The core module is the Device Manager, which supervises and controls the actions happening in the data center and is strongly consistent. Synchronization is performed by satellite services that get state information from the Device Manager and update the system. Autopilot also has a provisioning system that provisions a newly detected computer with the appropriate OS version, and an application manager for reading configuration files (manifests) and running the appropriate processes. The Deployment Service makes sure that the directories listed in the manifest are up to date. To roll out a new version of software, the system stores the new version on a deployment server and informs the computers to pull it from there. They also have watchdog services for monitoring system state. If an error is detected, the computer is moved to the 'Failure' state. Once it is repaired, it is kept under 'Probation' before moving back to 'Healthy'.
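A toy sketch of that Healthy/Failure/Probation cycle driven by watchdog reports; the state names come from the paper, but the class, the probation threshold, and the report format are invented for illustration:

```python
HEALTHY, FAILURE, PROBATION = "Healthy", "Failure", "Probation"

class MachineRecord:
    """Device-Manager-side view of one machine (toy version)."""
    def __init__(self, name, probation_period=3):
        self.name = name
        self.state = HEALTHY
        self.probation_period = probation_period
        self.clean_reports = 0

    def on_watchdog_report(self, ok: bool):
        if self.state == HEALTHY and not ok:
            self.state = FAILURE              # schedule a repair action
        elif self.state == FAILURE and ok:
            self.state = PROBATION            # repaired, but not trusted yet
            self.clean_reports = 0
        elif self.state == PROBATION:
            if not ok:
                self.state = FAILURE
            else:
                self.clean_reports += 1
                if self.clean_reports >= self.probation_period:
                    self.state = HEALTHY
        return self.state

m = MachineRecord("rack1-node07")
for ok in [True, False, True, True, True, True]:
    print(ok, "->", m.on_watchdog_report(ok))
```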
Contributions:
1) Identifying the functionality required by an infrastructure that automates data center management.
2) Having simplicity as the key aspect in building a large-scale distributed system.
Comments and applicability:
It was an interesting paper, though it did not have any new concepts. One problem I see with Autopilot is that it also depends on the applications running on it to ensure good results. Because of this, legacy applications might not work well, but for newly created applications this technique will work really well.
Posted by: Raja Ram Yadhav Ramakrishnan | April 20, 2010 02:08 PM
This paper discusses the design and implementation of Google's Bigtable. Bigtable is a distributed storage system designed to grow to a very large scale. It borrows some ideas from the relational database field, but does not fully implement the relational model.
Google needed a reliable storage system that would support both highly concurrent and data-intensive batch-processing applications. They didn't need a system that supported general transactions, but simply atomic reads and writes at the row level.
The list of contributions provided by Bigtable is relatively short. In general Bigtable is the glue between a number of other technologies and services (Chubby, GFS) to provide a solution to meet their goals. That said, there are a few things Bigtable does differently than other large scale storage systems.
1) Unlike parallel databases, Bigtable does not support general transactions, only atomic reads and writes at the row level. This, perhaps, limits the types of applications that can be built on top of Bigtable; applications that need strong consistency across rows may not work well here.
2) Bigtable allows client applications to store more than just key-value pairs: it stores semi-structured data.
3) Bigtable is a temporal store, holding on to a pre-specified number of previous versions of the data. This is made possible by the immutable properties of the SSTables.
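A toy illustration of that version retention: a cell keeps multiple timestamped values, and everything but the newest N gets garbage-collected, as a column family configured with a max-versions setting would. The function and numbers are made up (real GC happens during compactions):

```python
def gc_versions(cell_versions, max_versions=3):
    """Keep only the newest N timestamped versions of a cell (toy illustration)."""
    newest_first = sorted(cell_versions, key=lambda kv: kv[0], reverse=True)
    return newest_first[:max_versions]

# (timestamp, value) pairs for one row/column cell
history = [(100, "v1"), (250, "v2"), (300, "v3"), (410, "v4")]
print(gc_versions(history))   # [(410, 'v4'), (300, 'v3'), (250, 'v2')]
```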
Bigtable is widely applicable in today’s web-scale applications. The paper discusses the use of Bigtable in Google Earth, Writely (Google Docs), Personalized Google Search, and other applications. Both Google Docs and Google Sites keep previous versions of the data which is probably a direct result of Bigtable.
There is, however, something to be said about consistency. Most of these applications operate over inconsistent data (the web) and under optimistic concurrency (multiple people concurrently editing a Google document). For unstructured data, consistency isn't always important, especially when previous versions of the data are retained; this allows users to manually resolve conflicts. However, if pointers are maintained between rows, tighter consistency may be required. If an application requires that more than one row be updated in one atomic action, then Bigtable may not be the appropriate datastore. In short, Bigtable can only support a subset of applications.
Posted by: Sean Crosby | April 20, 2010 02:00 PM
This paper presented Autopilot, an automatic data center management system designed by Microsoft to operate its largest data centers almost completely without human intervention. As a system, Autopilot is responsible for automatically provisioning and deploying software, monitoring the system, and repairing faulty software or hardware through a number of different mechanisms. This paper presents a high-level overview of Autopilot but does not go into significant detail on anything.
Microsoft again has taken up the model of using commodity hardware in their datacenters and leaving it up to the software to be fault tolerant. The design principles of Autopilot are simplicity and fault tolerance. These two are very important since one of Autopilot's primary purposes is in fact to detect and deal with faults. Simplicity is necessary in such a large system, and the author states that they will use a less efficient solution to a problem if it is in line with their goal of simplicity. This can especially be seen in the use of "pull" for message passing instead of "push". If they were using "push", the Device Manager, the portion of the system responsible for knowing the correct state of the system, would have to keep data about each message and who had seen it. The "pull" model allows for greater simplicity of the Device Manager in exchange for additional network traffic and greater latency in the spread of new information. New software is deployed in stages; enough nodes must reach the final stage for the rollout to be declared a success, and then the nodes can begin to run the new software. Also, each node has what are called watchdog processes, which report periodically to the Device Manager about the state of the system. These are responsible for determining if a node is OK, in a warning state, or has an error. Individual Autopilot clusters are loosely coupled so that failure of one cluster does not result in failure of more clusters.
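A rough sketch of that pull model under assumed names: the Device Manager just answers "what is current?" and never tracks which satellite has seen what, which is exactly what keeps it simple:

```python
class DeviceManager:
    """Holds the ground-truth configuration; never tracks per-client delivery."""
    def __init__(self):
        self.generation = 0
        self.config = {}

    def update(self, key, value):
        self.config[key] = value
        self.generation += 1

    def pull(self, since_generation):
        # Stateless with respect to callers: just answer "what is current?"
        if since_generation >= self.generation:
            return None
        return self.generation, dict(self.config)

class Satellite:
    def __init__(self, dm, name):
        self.dm, self.name, self.seen = dm, name, -1

    def poll_once(self):
        result = self.dm.pull(self.seen)
        if result is not None:
            self.seen, config = result
            print(f"{self.name}: applying generation {self.seen}: {config}")

dm = DeviceManager()
sat = Satellite(dm, "deployment-service")
dm.update("frontend.manifest", "v12")
sat.poll_once()        # picks up the change on its next poll
sat.poll_once()        # nothing new; the DM kept no per-satellite state
```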
In the paper, Autopilot has been deployed on clusters of tens of thousands of machines. I am curious if that is about the size of current clusters and if Microsoft has encountered any problems if they have had to try to scale Autopilot to larger clusters. Also I think a "lessons learned" portion of the paper would have been useful because they discuss the current state of the system but provide no insight into how they arrived at each of their design decisions. They also don't discuss what happens if the Device Manager fails. I know the Device Manager is replicated but what if enough of them fail that the remaining Device Managers cannot handle the load? Autopilot itself does not attempt any load balancing so can it reassign machines to serve as Device Managers? I also thought the case study was weak, not detailed, and certainly not technical.
Posted by: Elizabeth Soechting | April 20, 2010 01:54 PM
Big Table
---------
Summary
--------
The paper presents the design of BigTable, a distributed storage system for storing structured data, which achieves high scalability, availability and performance, and provides flexibility of data layout to its users.
It does not follow a strict relational model like databases do, but uses a multi-dimensional sorted map as its data model, which provides dynamic control over data placement to its users. Tables store values in cells defined by row, column, and timestamp. Rows are grouped together into row ranges called tablets; each tablet is served by a tablet server. Columns are grouped into column families. This grouping of rows and columns is the way clients get control over data layout.
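Here's a minimal sketch of that (row, column, timestamp) -> value map using an in-memory structure; real Bigtable cells live in SSTables on GFS, and the column names below are just borrowed from the paper's webtable example:

```python
from bisect import insort

class TinyTable:
    """Cells addressed by (row key, 'family:qualifier', timestamp); rows kept
    in sorted order so contiguous row ranges could be split into tablets."""
    def __init__(self):
        self.row_keys = []   # sorted row keys
        self.cells = {}      # (row, column) -> {timestamp: value}

    def put(self, row, column, ts, value):
        if row not in self.row_keys:
            insort(self.row_keys, row)
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        return versions[max(versions)]    # newest timestamp wins

    def scan(self, start_row, end_row):
        for r in self.row_keys:
            if start_row <= r < end_row:
                yield r

t = TinyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.put("com.cnn.www", "contents:", 6, "<html>...v6")
t.put("com.cnn.www", "contents:", 5, "<html>...v5")
print(t.get("com.cnn.www", "contents:"))       # newest version
print(list(t.scan("com.c", "com.d")))          # rows in a lexicographic range
```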
BigTable is a set of tablet servers plus a master. Tablet locations are organized in a hierarchy similar to a distributed B+-tree for efficient lookup. The master takes care of assigning tablets to tablet servers and tracks the liveness of the tablet servers. The data in each tablet server is stored in log fashion in a set of SSTables that are compacted periodically in the background. Journalling is used to maintain consistency.
Problem Description
--------------------
The paper strives to put forth an efficient data storage model that provides control over data placement to its users.
Contributions:-
--------------
It proposes a new data storage model, based on tablets and locality groups, that provides transparency and control of data placement to clients. It groups together similar things but gives users the power to define what "similar" means. The paper also details the complexities in implementing such a solution and the various techniques used.
Relevance:-
-----------
The paper reports 388 production Bigtable clusters at Google. That is ample proof that this restricted relational model fits many different data sets.
Posted by: Laxman Visampalli | April 20, 2010 01:52 PM
Summary:
Google BigTable is a distributed system for storing large amounts of structured data in a reliable and efficient way.
Problem:
Google has tons of data and they needed a way to store it in a structured fashion. While a relational distributed database would generally fulfill this need, there are a number of issues with such an approach: 1) it would not utilize much of the existing system and infrastructure that Google has, 2) it may require using expensive proprietary systems from a third-party, and 3) it may not have the flexibility needed to efficiently fulfill the specific tasks required by Google. So, Google engineers decided to create a homegrown solution.
Contributions:
BigTable structures data in three dimensions: rows, columns, and time. The set of all rows is partitioned into tablets, which are then assigned to a cluster of distributed tablet nodes. Clients search for tablet location through a three-level lookup table (with client side caching added for speed), and the placement of nearby rows together on one server saves in lookup overhead. Within rows, columns are split into families, which store similar types of data. Column families are used for things like access control and they can be assigned into locality groups for more efficient client access. The third dimension, timestamps, allows for multiple versions of the data to be stored. The system can be tuned to store a certain number of versions and cleanup old versions. This three-dimensional basic storage model is built on top of a number of other systems: GFS for storing the actual data on the distributed nodes, SSTable as the filetype used to store data, and Chubby, a distributed lock service based on the Paxos algorithm. Using these systems, BigTable itself is a level of abstraction above most of the concerns of distributed systems, though in the paper they do discuss problems that arose when they ran into bugs in some of the lesser used parts of Chubby.
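A toy version of that three-level lookup with a client-side cache; the range encoding, server names, and per-key cache are simplifications I made up (real clients cache tablet ranges, not single keys):

```python
class TabletLocator:
    """Toy three-level lookup: Chubby file -> root tablet -> METADATA tablets
    -> user tablet locations, with a client-side cache (all data here is fake)."""
    def __init__(self, chubby_root, root_tablet, metadata_tablets):
        self.chubby_root = chubby_root            # location of the root tablet
        self.root_tablet = root_tablet            # key range -> METADATA tablet
        self.metadata_tablets = metadata_tablets  # key range -> user tablet server
        self.cache = {}

    @staticmethod
    def _lookup(index, row_key):
        # Pick the first range whose end bound is >= the row key.
        for end_key, location in index:
            if row_key <= end_key:
                return location
        return index[-1][1]

    def locate(self, table, row_key):
        key = (table, row_key)
        if key in self.cache:                     # hit: zero network round trips
            return self.cache[key]
        meta_tablet = self._lookup(self.root_tablet, f"{table}:{row_key}")
        server = self._lookup(self.metadata_tablets[meta_tablet], f"{table}:{row_key}")
        self.cache[key] = server                  # miss: up to three round trips
        return server

locator = TabletLocator(
    chubby_root="/ls/cell/bigtable-root",
    root_tablet=[("webtable:m", "meta-0"), ("zzzz", "meta-1")],
    metadata_tablets={
        "meta-0": [("webtable:f", "ts-01"), ("webtable:m", "ts-02")],
        "meta-1": [("zzzz", "ts-03")],
    },
)
print(locator.locate("webtable", "com.cnn.www"))   # ts-01
print(locator.locate("webtable", "com.cnn.www"))   # served from the cache
```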
Thoughts:
Not having any database/data storage background, I feel like it is difficult for me to give a meaningful or well informed comparison of BigTable to a relational database. The fact that this is a production system in use by many of the services I use quite often does give it substantial credibility as a good design over something purely academic. I think the biggest win for Google in having BigTable is their ability to tune it to the exact needs of their services without third party intervention or high amounts of boilerplate overhead.
Posted by: Ryan Johnson | April 20, 2010 01:47 PM
Summary:
This paper describes the design principles of Autopilot, the automatic data center management infrastructure developed by Microsoft.
Problem Description:
Autopilot is designed and developed to automatically manage Microsoft's large-scale data centers, built on top of a large number of commodity computers. The motivations for Autopilot include:
- Keep the total cost of managing the data center low.
- Make the system more reliable and maintainable by reducing human work.
This paper focuses on the first version of Autopilot and mainly gives its high-level design principles. The two most important principles followed by Autopilot are fault tolerance and simplicity.
Contributions:
Autopilot provides provisioning services, a deployment service, a repair service, monitoring services, etc. The Device Manager is the master component of Autopilot; it maintains a strongly consistent state machine of the system. Worker computers interact with the Device Manager by pulling from it.
The fault detection and recovery model of Autopilot is simple but works. It uses watchdogs to detect and report faults. Both the protocol and the error-determining logic are quite simple. It also automates failure responses, and applications can have their own customized fault-tolerance policies.
Applications:
Most of the techniques used in Autopilot's components already exist in past work. However, this paper shows the tradeoffs, design choices, configurations, etc. that come from careful examination and practical experience with Microsoft's data centers. Although the functions supported by the first version of Autopilot are limited, it turns out to be very reliable and the cost is relatively low. This tool is clearly valuable, because large-scale data centers are becoming popular.
Posted by: Chong | April 20, 2010 01:43 PM
Bigtable
Summary:
To provide a specialized DBMS for Google's applications.
Problem description:
A traditional DBMS has many features that Google doesn't need. What Google needs is a simple database system that does just the right amount of work. In this paper, the authors lay out the design points of building a database system that suits Google's needs.
Contributions/highlights:
(1) Define a model of a multi-dimensional table where each entry is indexed by three keys (row, column, and timestamp). The data is lexicographically ordered by row key, so careful users can exploit data locality by choosing row keys accordingly and issuing queries in lexicographic order (see the sketch after this list).
(2) The table is indexed by both rows and columns, different from the traditional column-indexed technique.
(3) It provides users the ability to control data locality.
(4) It uses Chubby to provide a distributed lock service and uses a single master to handle fault tolerance.
(5) It only provides single-row transactions.
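As promised above, a small sketch of how a user can exploit that lexicographic ordering when choosing row keys; the reversed-hostname trick mirrors the paper's webtable example, but the helper itself is just for illustration:

```python
def row_key_for_url(url: str) -> str:
    """Build a row key whose lexicographic order groups pages of the same
    domain together (the webtable's reversed-hostname trick)."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{reversed_host}/{path}"

urls = ["maps.google.com/index.html", "www.cnn.com/world",
        "google.com/about", "www.cnn.com/sports"]
for key in sorted(row_key_for_url(u) for u in urls):
    print(key)
# The com.cnn.www rows sort next to each other, so they land in the same
# (or adjacent) tablets and a scan over one site stays local.
```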
Applicability:
As the paper points out, many of Google's applications depend on Bigtable; together with MapReduce, it has been used in many Google projects (such as Google Maps). It fits Google's requirements very well. My sense is that Bigtable doesn't have to maintain the ACID properties of a traditional DBMS, because its results need not be perfectly accurate; thus it can strike a balance between accuracy and efficiency.
Posted by: Wei Zhang | April 20, 2010 01:38 PM
**BIG TABLE**
=============
*Summary*
---------
This paper presents the data model, design, and implementation of a distributed storage system for managing structured data that is designed to scale to thousands of machines with petabytes of data.
*Problem*
---------
The problem tackled in this paper is how to manage *structured* data in a distributed system. This is not a trivial task because a lot of things can happen: a node failure, a network partition, and so on can all cause data inconsistency. The hardest part is how to ensure data consistency in the presence of multiple readers and writers, and to guarantee atomic updates in spite of failures.
*Contribution*
--------------
I think there are two major contributions in this paper: the data model, and the design and actual implementation of BigTable.
- First is the data model. In some ways, BigTable is like a database, but it is not strictly a relational database system. It gives clients a simple data model so that they have full control over data layout and format. The data model in BigTable is a multidimensional sorted map. Each value in the map is indexed by a row key, a column key, and a timestamp. This model is good enough for a lot of Google applications.
- BigTable is implemented on top of GFS. It uses GFS to store data files and logs. To ensure atomicity and consistency, BigTable uses Chubby, a highly available and persistent distributed lock service. Rows in BigTable are partitioned into tablets, and each tablet is served by a single tablet server at a time. There is a master in BigTable to manage the mapping between tablets and tablet servers, the joining and leaving of tablet servers, changes in schema, and so on. Each client can contact tablet servers directly to read and write data.
*Flaw/Comment/Question*
-----------------------
One thing I like about this design is that, unlike GFS, the master here is not a single point of failure. A crash of the master does not affect the system seriously, since the client does not have to contact the master before every read/write operation; it can go to the tablet servers directly.
A few points I am not clear about: why does BigTable need Chubby while GFS does not? Also, when a tablet server crashes, how does the system ensure availability? Does it depend on GFS to do that?
Posted by: Thanh Do | April 20, 2010 01:15 PM
The paper "Bigtable: A Distributed Storage System for Structured Data" describes Bigtable, Google's database-like storage system. The goals of Bigtable are to provide a data storage system that is scalable, reliable, applicable to many tasks (especially ones that Google has), and high performance.
Bigtable is a map indexed by row key, column key, and timestamp. Each row is read and written atomically. Data is ordered by row, and the row ranges are dynamically partitioned; these row-range sections are called tablets. Because of this, row keys that start with the same prefix will be on the same or neighboring machines, and clients can use this knowledge to choose row keys that exploit data locality. Column keys are also grouped, into column families, which allow some access control. The timestamps can be either automatic or specified by the client, and there is garbage collection to keep the number of versions under control. Bigtable runs on top of GFS and uses Chubby for locking. To keep memory usage reasonable, the tablet servers freeze their memtable when it gets too big, convert it to an SSTable, and write it to disk. This is additionally helpful when they need to recover after a server dies. There are other refinements as well, such as Bloom filters, compression, and locality groups.
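A toy sketch of that freeze-and-flush write path (what the paper calls a minor compaction); the threshold, file format, and class are invented, and a real tablet server would write to a commit log before touching the memtable:

```python
import json, os, tempfile

class TinyTabletStore:
    """Toy write path: updates go to an in-memory memtable; when it grows past
    a threshold it is frozen, written out as an immutable sorted file (a stand-in
    for an SSTable), and a fresh memtable is started."""
    def __init__(self, directory, memtable_limit=4):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []           # paths of flushed, immutable files

    def write(self, key, value):
        self.memtable[key] = value                 # (a real system logs first)
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        frozen = dict(sorted(self.memtable.items()))
        path = os.path.join(self.directory, f"sstable-{len(self.sstables):04d}.json")
        with open(path, "w") as f:
            json.dump(frozen, f)
        self.sstables.append(path)
        self.memtable = {}                         # reads now consult files + memtable

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):       # newest flushed data first
            with open(path) as f:
                data = json.load(f)
            if key in data:
                return data[key]
        return None

with tempfile.TemporaryDirectory() as d:
    store = TinyTabletStore(d)
    for i in range(6):
        store.write(f"row{i}", f"value{i}")
    print(store.read("row1"), len(store.sstables))   # value1, 1 flushed SSTable
```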
The paper contributed the idea of storing the rows in a defined order, so that the users have the option of choosing their row keys in such a way that they can have good locality. They also mention that originally they were planning to implement general-purpose transactions, but held off on it because they did not have any immediate use for them; they later realized that they did not really need them.
Bigtable is used in many of Google's products, and the performance section of the paper shows that it scales well. However, it mentions that tablets are "typically" unavailable for under a second when they are being moved, which could limit load balancing, especially in very latency-sensitive applications. It would also have been interesting to see a comparison between Bigtable and some other database, just to see what the difference in scalability etc. really is.
Posted by: Lena Olson | April 20, 2010 01:11 PM
The paper presents Bigtable, a distributed storage system for structured data. Key challenges to building such a system are providing scalability, availability, and throughput. Bigtable identifies two broad goals for the system - catering to extremes in terms of both the data size and the latency requirement. It uses a bunch of other in-house systems namely Chubby, SSTables, and GFS to achieve its goals.
The goal of the system is to provide distributed storage for structured data. The system is particularly tailored for workloads generated within google and hence, exploits many of the workload patterns. It presents a distributed, persistent map that allows the storage of structured data. It is similar to relational databases in many ways but it provides a very simple data model giving additional client control over data layout, deployment and format.
The key contribution of the paper is identifying the various design decisions needed to cater to the specific nature of the workload at hand. The smallest unit of data is chosen to be a tablet. A master server manages the tablets and a set of tablet servers, and achieves load balancing through the same. The data is stored lexicographically, which makes data partitioning easier. Synchronisation among multiple tablet servers is provided by the underlying Chubby locking system. Reliability is provided by GFS, on which the system is built. To optimise read performance, memtables, which contain the recent updates to a tablet, are maintained at each of the tablet servers. Commit logs are maintained for recovery from failures. The SSTable abstraction is used for data storage; being immutable, it ensures that there are no concurrency issues. To bound the size and number of SSTables, the authors use a set of compaction schemes. Locality groups and other parameters are provided so the client has more control over its data.
Bigtable inherits the benefits from the underlying systems it depends on, like GFS, SSTables, and Chubby, because each of the sub-systems were designed and built in-house and are tailored for applications and workloads within Google. As the paper mentions, the system is used by a large number of applications like Earth, Search, Analytics which are pretty data-intensive. Would it be possible to build a relational database system on top of Bigtable? A discussion on whether general purpose relational database systems would be able to provide the same performance and scalability for the above workloads would be interesting.
Posted by: Deepak Ramamurthi | April 20, 2010 12:49 PM
Summarization:
This paper introduced Bigtable, a distributed storage system with high performance, availability, and scalability and low latency, able to handle petabytes of structured data. By storing data in lexicographical order of row key on different machines and partitioning it dynamically, programmers can control the locality of data. A write operation appends the update to a log file as well as a memtable, and a separate periodic process compacts old memtables into files. In this way, both read and write operations are fast, which makes the system suitable for time-sensitive jobs.
Problem Description:
The problem this paper tries to solve is building a high-performance, highly available, and highly scalable storage system for managing structured data. As the authors point out, the goal of Bigtable is to meet the specific needs of Google's internal use; therefore, it differs from other systems in its particular design choices and framework assumptions. For example, bandwidth within Google is highly available, machines are unlikely to be reconfigured often, and most of the applications do not need complicated transactions or the relational database model. The importance of this paper does not lie in its methods or algorithms, but in the successful deployment of such a system onto tens of thousands of machines handling petabytes of data.
Contributions:
This paper did not create new algorithms or methods, but is rather a summary of their existing system. However, some of its notable design choices are worth mentioning.
First, the combination of an in-memory memtable and on-disk immutable SSTable files is a clever design choice that makes both write and read operations very fast. Write results do not need to be synchronized to the on-disk SSTables immediately; a separate background task (compaction) combines the recent updates and writes them to disk in a bundle. Moreover, the immutable SSTable files make concurrent access easy, without complicated synchronization. This design makes Bigtable suitable for latency-sensitive jobs.
Second, the explicit use of lexicographic order to store rows is also a clever design. Most other systems either store data transparently, so there is no locality information available to users, or provide complicated interfaces for users to control where data is stored, sacrificing simplicity. Bigtable explicitly states its lexicographic ordering of rows, so programmers who care about locality can design their row keys in a desirable way, while programmers who do not care can easily ignore this information. Exposing a well-defined order makes the system both flexible and simple.
Applications/Problems:
Like MapReduce, Bigtable is built for specific tasks and is not a general-purpose database system. It has already been successfully used in a variety of different jobs, both latency-sensitive ones (like Google Earth) and data-size-demanding ones (like crawling). It has its own restrictions: only single-row transactions, and the schema (column families) must be predefined. However, a general system would usually perform worse than these specialized systems, and most jobs (as pointed out by this paper) do not even need those additional fancy functions.
I have, however, one question regarding this paper. It mentions load balancing for Bigtable by transferring tablets from one machine to another. But the transfer procedure itself imposes a large cost, especially if the load of a machine changes rapidly. The performance results also show that load balancing cannot help much when scaling Bigtable. So does such load balancing actually happen rarely in real systems?
Posted by: Shengqi Zhu | April 20, 2010 11:24 AM
Autopilot
Problems
Managing a large-scale cluster requires a huge amount of manpower to keep the entire system up and running, especially when the system is made of commodity hardware. The paper deals with automatic deployment, monitoring, and provisioning of a datacenter-scale cluster. The key issue is how to avoid human interaction for every possible error case, deployment scenario, and failure recovery, and instead have these handled by the rules of the system.
Summary
The paper introduces Autopilot, which does datacenter-wide system management. Autopilot provides a package of features. First, it provides a central management console: the operator does not deal with individual hosts but with system-wide configuration. Second, it tolerates faults and provides automatic recovery; Autopilot assumes any component can fail and provides a systematic approach to recovering from failure. Third, it provides provisioning: the operator can easily deploy new software or services to thousands of hosts. Fourth, it provides centralized monitoring and logging. Autopilot stores logs in a SQL database, so the operator can easily generate reports and graphs from the centralized log DB.
Contribution/Comment
The main contribution of Autopilot is that it provides a clean architecture. Rather than tackling individual problems, it integrates every problem into a single problem domain and presents a neat architecture. The Device Manager is the key design element that eases development of the other components.
Another key contribution is that it enables automated recovery without human intervention. For predictable failures, the system can recover autonomously thanks to its simple failure-recovery process, so it does not require a 24x7 monitoring person.
Many issues addressed here - deployment, monitoring, logging - are common to many datacenter-level applications. When I was working for an online game company, I faced a similar environment. There were a few hundred machines, each running a game server or a database server, and every machine was managed by human operators. Although we relied on scripts to manage these systems, there was no systematic approach to deploying, monitoring, and recovering from failure. It was error-prone because operators could make mistakes at every step. Autopilot seems like it would be an invaluable tool for them; when I read this paper, I wished I had had these tools when I was at the company.
Posted by: MinJae Hwang | April 20, 2010 10:35 AM
Goal:
Autopilot is an automation tool for administering computers in a data center environment. It can automate provisioning, service deployment, monitoring, and repairing machines.
Problem:
For Internet-scale services, commodity hardware is the most cost-efficient way to build the system. However, software fault tolerance must be implemented to handle the high failure rate of this hardware.
Contribution:
Autopilot is a collection of satellite services plus a core service called the Device Manager (DM). The DM is implemented as a replicated state machine using Paxos as the consensus protocol. It stores the configuration of the computers within its cluster. The DM adopts a strongly consistent approach for the sake of simplicity. Additionally, all configuration changes produce audit trails. The satellite services constantly pull information from the DM and take action on target machines when necessary. The satellites use a weak-consistency model because the overall system can fix inconsistency using repair operations.
In terms of provisioning, Autopilot constantly probes for new machines, on which it automatically installs an OS image and runs burn-in tests. For deployment, it deploys software binaries according to the DM and runs software that is marked as active. For repair, the only available actions are reboot, reinstall, or replace. Monitoring is done through a single service that stores all performance statistics in a SQL database.
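Since the repair vocabulary really is that small, the repair logic can be sketched as a simple escalation policy. The rule below (escalate one step per repeated fault) is my guess at a plausible policy, not the paper's exact algorithm:

```python
REPAIR_ACTIONS = ["reboot", "reimage", "replace"]   # the only tools available

def next_repair_action(history):
    """Pick the next (more drastic) repair action for a machine, given the
    actions already attempted for the current fault. Toy policy: escalate one
    step each time; 'replace' means hand the machine off to a technician."""
    attempted = len(history)
    if attempted >= len(REPAIR_ACTIONS):
        return "replace"                  # keep asking for hardware replacement
    return REPAIR_ACTIONS[attempted]

history = []
for _ in range(4):
    action = next_repair_action(history)
    print("Device Manager schedules:", action)
    history.append(action)
# reboot -> reimage -> replace -> replace
```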
Applicability:
For datacenters with a huge number of nodes, a system that provides complete functionality from provisioning through repair is important for reducing administration cost. In the Linux world, a collection of separate tools is required to achieve the same set of functionality, such as configuration management and monitoring. As a result, those tools may each require similar common features, such as an agent on the machine, which is redundant and inefficient.
Flaws:
Normally, configuration or monitoring tools have to follow the application's conventions; however, Autopilot forces the application to follow its methods in many ways, for example by providing no graceful shutdown. Additionally, many problems mentioned here are not found in the Linux world, such as requiring an explicit file-transfer service or managing a name service. This suggests that the Windows operating system lacks many features that facilitate management in a datacenter setting.
Posted by: Thawan Kooburat | April 20, 2010 09:59 AM
I chose Autopilot to read because I think management and deployment within a cluster of many computers is the crux of the availability, reliability, and serviceability of a distributed system. This paper is more of a high-level specification of the Autopilot product than a typical conference paper; there is no benchmark or data in it. I guess that is because no reasonable and widely accepted benchmark exists for a data center management system. Autopilot aims to address the problem of managing a large and ever-increasing number of computers that undergo constant software upgrades.
Problem:
Data center failures caused by human manipulation errors account for a significant fraction of errors. Even worse, new errors are often introduced while attempting to fix an earlier problem. Thus, automating the manual work lowers the variability in the response to faults and makes the entire system more reliable and maintainable.
Contribution:
- Autopilot automates the management of computers in a data center. It provides the basic features required for keeping a data center operational, such as provisioning and deployment, monitoring, and the replacement and repair of hardware. As with Dynamo, there is a trade-off between availability and consistency, and the system relies on software to achieve fault tolerance. The author believes that simplicity is as important as fault tolerance when building a large-scale, reliable, and maintainable system.
- The Autopilot system uses the Device Manager to provide the central, system-wide authority for configuration and coordination. Satellite services such as the Cockpit and the Collection Service "lazily" assist the central Device Manager by performing state updates. Three states of the hardware (Healthy, Failure, and Probation) are employed by the Device Manager to track the status of a machine. This model keeps the work simple.
- Watchdogs act as a subsystem of Autopilot to detect and report faults that happen on the monitored computers. They can run either on the monitored computers themselves or on a set of computers called the Watchdog Service.
Comment:
In fact, IBM has a similar technology named Resource Management Console (RMC). IBM uses RMC to manage IBM servers, whether they exist in clusters or as standalone servers. It seems hard to compare such management systems, since different systems are designed for different requirements. Still, I find the most useful section of this paper is "Design Principles", even though I believe good design comes more from practical experience.
Posted by: Deng Liu | April 20, 2010 07:53 AM
Comment: Did not realize until now that everyone else read Bigtable as well.
Google created Bigtable to act like a database for their structured data, but it didn't need all the relational machinery that comes with normal database systems. To optimize for the Google architecture, simplifications were made and the architecture of Bigtable was fit to that overarching architecture. The applications it is used for attest to its viability, and the scale of the system is clearly gigantic. This all leads to the perception that this is thoroughly a Google paper.
The problem presented is basically: how can we get a database-like system without paying big bucks for a system like Oracle? Utilizing the machines we have already looked at (commodity hardware in bulk), the goal is to have lots of data organized so it can be easily accessed. This is where the problem gets sticky. Lastly, the usual distributed systems problems of availability, consistency, and scalability are tackled as they come up.
I don't see a lot in the way of breakthrough contributions in this paper. Really it is just organizing a bunch of ideas that seem to work well together and showing that they can work for some of Google's services. The main take-away is that by breaking up Bigtable into tablets, which are bite-sized (for the servers at least), there is a lot of parallelism and locality that can be exploited. I found the compression section interesting, in that they stumbled upon something that seemed to do better than they expected. But this and the other ideas are not very exciting.
The real issue with Bigtable is that it really only makes sense when running on a Google sized system. There might be some companies that get close enough in size that they can gain some benefits from this, but only when data is stored in such vast quantities on so many servers does this make any sense. It is called Bigtable, so I can't fault it too much for this, but really the problems solved are ones that are already solved. The only applications presented that make this paper nice are Google-like services run on a Google-like architecture. But it works well for that.
Posted by: Jordan Walker | April 20, 2010 07:23 AM
Bigtable:
Summary:
The paper presents the complexities involved in building a general-purpose distributed storage system for managing structured data. The data is stored in a sparse, persistent, multi-dimensional sorted map that is indexed by row key, column key, and timestamp. The system is fine-tuned to satisfy the needs of different types of jobs - throughput-oriented batch processing or latency-sensitive serving of data. A library on the client side caches the locations of the tablets accessed. A single master takes responsibility for load balancing, the addition or removal of tablets, and garbage collection. It does not become a bottleneck because clients skip the master and directly access the tablet server that has the data. The system uses the Chubby distributed lock service, based on the Paxos algorithm, to ensure there is a single master, to detect failures, and to guarantee consistency. The system also uses GFS for persistence and maintains memtables in memory for better performance. Various enhancements for better performance, like using Bloom filters to reduce the number of disk accesses and compression to reduce storage space, are also implemented.
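To see why a Bloom filter cuts disk accesses, here's a tiny self-contained filter consulted before reading an SSTable; the sizes and hashing scheme are arbitrary and certainly differ from the real implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may say 'maybe present' for absent keys, but never
    says 'absent' for present keys, so a negative answer safely skips the read."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per SSTable: consult it before touching the file on GFS.
sstable_filter = BloomFilter()
for row in ["com.cnn.www", "com.google.maps"]:
    sstable_filter.add(row)

print(sstable_filter.might_contain("com.cnn.www"))   # True -> go read the SSTable
print(sstable_filter.might_contain("org.example"))   # almost surely False -> skip the I/O
```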
Contributions:
Discussion of the challenges in implementing such a storage system, and of various optimizations.
Coming up with a data structure similar to a B-tree (in terms of depth) in a distributed environment.
Applicability:
The implementation is very complicated, but it is definitely helpful to application-level programmers that reliability and scalability are taken care of. The paper says it is applicable both to throughput-oriented batch-processing jobs and to latency-sensitive jobs. A discussion of how it helps both types of jobs, the trade-offs involved, and comparisons with systems that focus on only one type of job would have been helpful.
Posted by: Satish Kotha | April 20, 2010 07:13 AM
I read the BigTable paper for my reviews.
Problem :-
Build a highly scalable distributed storage system for storing structured data. The scalable and flexible storage system is meant for providing a persistent storage facility to be used by other products.
Summary :-
BigTable is Google's in-house, highly scalable distributed storage system for structured data. It arose out of Google's need for a facility for managing and storing the huge amounts of data belonging to its various applications. The paper presents the design of the various components used by BigTable and the developers' experiences from building, deploying, and running a datastore of this scale. BigTable uses other Google components such as GFS, the SSTable file format, and the Chubby lock service. The paper also describes some examples of Google applications using BigTable and the way they use it to satisfy their different requirements in terms of storage, availability, and performance (latency, locality, etc.).
Contribution :-
As with other systems discussed in class, most components discussed in the paper were not difficult to understand, but it is the choice of these individual components that is critical for achieving the goals of the system. The rows are stored lexicographically, which makes it easier for applications to use locality for better performance. BigTable uses a master model and a hierarchy of "tablets" (containing metadata and data) to scale across multiple servers. BigTable is an example of a system that uses Chubby to keep track of the system and recover from failures. The paper discusses a bunch of other techniques, such as compaction (for resource reclamation), in-memory locality groups, Bloom filters for checking data availability before doing disk accesses, and an optimized commit-log implementation for faster recovery.
Other Comments :-
Most of Google's applications are very data-intensive and have very high storage-related requirements in terms of space, availability, and performance. So, the paper provided good insight into an important, highly scalable persistent storage mechanism used by Google to fulfil its requirements. I would like to know the performance benefits of using BigTable compared to parallel databases that expose a richer interface to clients.
Posted by: Ashish Patro | April 20, 2010 07:09 AM
Review of BigTable
BigTable is designed to be flexible and scale to very large amounts of data. Google stores data in BigTable and uses systems like GFS, MapReduce, and Chubby to build and access it. It provides low latency for data requests and the ability to handle data at large scale.
The data stored in a traditional database is highly structured, which is not flexible enough for some applications. BigTable aims to be flexible enough to handle semi-structured data with low latency and high scalability. The challenge for BigTable is how to design the structure of this data storage system so that it works well with various types of data. Google needs to handle huge amounts of data every day, and BigTable is designed to handle large files as well as small files. How to balance flexibility and efficiency is a big challenge.
BigTable uses a cluster of commodity machines to handle this huge scale of data for Google. In this paper, the authors also contribute some of the experience they gained during the design, implementation, and evaluation of BigTable. BigTable borrows many ideas from C-Store, and the discussion of how BigTable relates to other projects is very interesting; to some extent it provides hints about how to bring research projects into real applications and how the ideas in those projects may perform when handling huge amounts of data. Meanwhile, BigTable shows its advantage in handling large-scale data on commodity machines.
The BigTable system uses a shared-nothing architecture, and one problem faced by this architecture is load balancing. Their solution to load balancing raises some difficulty for new users: they require users to say which data is important and needs to be kept in memory. I think users sometimes cannot predict their data accesses correctly, and it is hard for new users to get used to this. Though I don't like their solution to the load and memory balancing problem, I like the general idea of BigTable. They successfully figured out a very good storage solution for Google, and I think BigTable also has huge potential to provide service for many other projects that need to handle data at large scale.
Posted by: Lizhu | April 20, 2010 06:57 AM
Bigtable: A Distributed Storage System for Structured Data
Summary and description:
The paper describes the design and implementation of a "sparse, distributed, persistent multi-dimensional map", although the system can be thought of more as a stripped-down version of a database system optimized for use by services inside Google. The logical abstraction provided by the system is simple: a map with values assigned to a set of three keys (row, column, and timestamp). Records whose row keys are alphabetically near each other are kept on the same machine where possible. All values with the same row key can be modified atomically. The system can be configured to automatically remove entries based on their timestamps. The system is built on top of other Google mechanisms (such as Chubby) and is optimized with various techniques such as compression.
Contributions:
Being mainly an implementation-paper, the paper does not seem to have introduced any new research concepts. However, the following can be considered contributions of the paper:
1. The basic design of a stripped down distributed map/datastore that is well optimized for services such as ones provided by Google.
2. The paper describes optimizations such as locality groups, compression and bloom filters which were introduced to achieve high performance.
3. The paper presents, in a useful manner, the experiences gained while designing the system in the "lessons" section.
Applications and thoughts on the paper:
BigTable is being used by many systems within Google (Google Analytics, Earth, and Personalized Search). However, would it have been easier to run these applications on other database systems? Funnily, the introduction and related work sections say something like "BigTable differs from these (other) systems by providing a smaller number of features than an RDBMS". Does that mean that other systems perform as well as BigTable and also provide more features? BigTable does not seem to have really been used in places other than Google, and its feature cuts seem to be designed with only Google's services in mind.
The paper seemed to be slightly verbose, and it looked like the authors could have described how the components fit together in a more brief manner. On the other hand, being mainly an "experience paper", the detailed explanations might help people to develop similar systems in the future.
Posted by: Thanumalayan Sankaranarayana Pillai | April 20, 2010 06:12 AM
The BigTable paper presents the distributed storage system developed by Google to store structured data. This system shares many characteristics with a parallel database system. The main goal of the system is to provide scalability, availability, and throughput. The data model is an extension of a basic relational model; this model serves the purpose since data constraints, data dependencies, and complex query evaluation are not needed.
The system has collated ideas from various sources and incorporated them collectively: data partitioning into tablets as in parallel DBs; a tablet-location hierarchy like page tables or an indexing structure; locality groups similar to column-oriented DBs; and a commit-log mechanism similar to a DB's write-ahead log.
A few problems tackled by the system: (i) Locality of data - classify/index the data such that related pieces of data are close to each other; this mainly depends on the row keys chosen by the application, and it can also be improved by indexing based on the required attributes. (ii) Reduce latency / increase throughput - reduce access to GFS, thereby avoiding disk accesses and network latency; this is done by caching at the tablet servers and by the locality-group refinement, wherein column groups are stored separately. (iii) Reduce the size of the stored data by using compression techniques; this can also be done at the application level, as in the case of Google Earth. (iv) Server recovery by using a logging mechanism.
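To make the locality-group idea concrete, here's a hypothetical schema declaration and routing helper showing how column families could be grouped and flagged as in-memory; the format and names are entirely made up, not Bigtable's actual configuration syntax:

```python
# Hypothetical schema illustrating locality groups: column families that are
# read together are grouped, so each group is stored (and cached) separately;
# 'in_memory' marks small, hot groups that should be served from RAM.
WEBTABLE_SCHEMA = {
    "locality_groups": {
        "metadata": {
            "families": ["language:", "checksum:"],
            "in_memory": True,       # tiny and read on almost every request
        },
        "page_contents": {
            "families": ["contents:"],
            "in_memory": False,      # large; leave on disk, fetch on demand
            "compression": "block",  # e.g. a fast block compressor
        },
    },
}

def locality_group_for(column, schema=WEBTABLE_SCHEMA):
    """Route a column key like 'contents:' to the group whose files hold it."""
    family = column.split(":", 1)[0] + ":"
    for name, group in schema["locality_groups"].items():
        if family in group["families"]:
            return name
    return "default"

print(locality_group_for("contents:"))   # page_contents
print(locality_group_for("language:"))   # metadata
```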
Contributions
The system is so scalable that machines can be added to or removed from the cluster at any point. Since the data is partitioned into tablets, it is easy to assign a tablet to any server in the cluster. The system has no knowledge of the semantics of the data, which lets it support a variety of applications on top of it.
Questions
(1) Does the tablet server provide any support for executing queries at the server's end?
(2) Is it possible for a tablet to be served by more than a single server? (The obvious challenges would come into the picture when there are updates to the tablet data.)
Posted by: Sankaralingam Panneerselvam | April 20, 2010 06:10 AM
Bigtable: A Distributed Storage System for Structured Data
Summary:
This paper introduces Google's distributed storage system, Bigtable, which provides a flexible, high-performance solution for managing large-scale structured data.
Problem:
Different applications have different requirements (file size, throughput, latency, etc.). The problem is how to design and implement a scalable, reliable, and flexible platform that can support this variety of workloads and applications.
Contributions:
Bigtable is much more flexible than traditional database system in several aspects:
1. Bigtable provides a different interface, which enables clients to dynamically adjust the data layout and format, and to get enough information to reason about the locality of their data.
2. Row and column keys, which index the data, can be arbitrary strings.
3. Bigtable gives users the ability to dynamically control whether to serve data out of memory or from disk.
Questions/Comments:
I think Bigtable is in some sense similar to a database system, but it relaxes the strict format restrictions of a DB and adds many storage-system functionalities. So I'm just wondering: with Bigtable, the SSTable format, and GFS, how do we define the different roles of the database, the file, and the storage system? What is the standard for allocating different functionalities among them? It seems there is not a clear boundary between them in Google's design.
Posted by: Ao Ma | April 20, 2010 06:08 AM
This paper gives an overview of the design of Bigtable, a distributed storage system for managing structured data. Unlike conventional databases, which provide an abstraction of relations, Bigtable provides a two-dimensional sorted map storing multi-version values as its contents. Data is internally indexed using row and column keys for speedy access. The table values are stored in sorted order of row keys, dynamically partitioned into tablets.
The paper presents a system that provides a flexible layout and format for persisting structured data to disk. The system is designed to scale to petabytes of data accessed under varying workload demands. Query and transaction processing capabilities are not supported, to reduce complexity, since they are not required by the streaming-data applications of Google's products. An interesting aspect is that the system supports executing client-supplied scripts in the servers' address space, thereby avoiding unnecessary data movement.
Bigtable is built from several of Google's fundamental blocks: the SSTable file format, the Chubby lock service, and GFS. A table is broken into a set of tablets, each persisted in a known GFS location. The internal storage format of each tablet is a set of SSTable blocks. A three-level, B+-tree-like hierarchy is used for fast lookup of the tablet pertaining to a given row key. The system elects a master that is responsible for assigning tablets to the cluster of servers. Updates to a tablet go through redo logs and are held in memory in a sorted buffer called the memtable. Compactions are performed in the background to create new SSTables from the recent updates, and the older ones are recycled through a garbage collection mechanism.
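A rough Python sketch of the write/read path just described, with heavily simplified data structures (no commit log, GFS, or locality groups; the names are mine, not the paper's): updates land in an in-memory buffer (the memtable), which is periodically frozen into an immutable "SSTable"; reads merge the memtable with the SSTables, and a major compaction merges everything back into one SSTable.

class ToyTabletStore:
    def __init__(self, memtable_limit: int = 3):
        self.memtable = {}            # recent writes; sorted when flushed
        self.sstables = []            # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        # a real tablet server appends to a redo/commit log in GFS first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self) -> None:
        # freeze the memtable into an immutable, sorted "SSTable"
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str):
        # merged view: memtable first, then SSTables from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def major_compaction(self) -> None:
        # merge all SSTables and the memtable into a single SSTable
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        merged.update(self.memtable)
        self.sstables = [dict(sorted(merged.items()))]
        self.memtable = {}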
The system, designed to be very specific to Google's demands, is not easily extensible into a generalized data storage system. Not providing the abstraction of relations means that redundant, unnormalized data often has to be maintained in the storage. Further, the lack of transactions across multiple rows makes the system unusable for OLTP. All the complexity of a database system in query optimization, query processing, and transaction processing has been pruned away to keep the system simple. However, fundamental DBMS building blocks for updates (redo logs) and consistency (two-phase commit) are still used.
Bigtable provides incremental scalability across a cluster of servers through tablet assignment. However, the system uses GFS for its underlying storage, which is itself a distributed file system providing load distribution and scalability. The dominant processing time for table updates is going to be streaming data from either SSTables or logs. It is not clear from the design why another level of load balancing needs to be implemented at the granularity of tablets among Bigtable servers when there are already mechanisms in place in GFS. It also seems that there is no location-aware placement of tablets with respect to GFS chunkservers. The tablet servers in turn would not offer any scalability if all of the tablets were clustered together on the same set of GFS replicas.
Posted by: Rajiv | April 20, 2010 06:06 AM
BIGTABLE
Summary:
This paper describes Google's scalable distributed data storage system, BigTable. The authors detail the design of the system, provide some rationale, and discuss specific implementation issues and optimizations. The paper closes with a performance evaluation and a discussion of lessons learned.
Problem:
Storing vast amounts of data that often has limited structure is a challenging problem. Traditional relational database systems are not a good fit for Google's many varied workloads. The authors describe a novel data storage system that is less flexible than fully relational databases, but is sufficiently powerful for essentially all of Google's needs and also performs and scales admirably. BigTable is a simple three-dimensional data store, with time in the third dimension. To function, it takes advantage of other pieces of Google infrastructure, namely MapReduce, GFS, and the Chubby lock service.
Contributions:
To my knowledge, this is the first paper to propose a data storage system that is not either a relational database system or a simple key/value system. The paper's proposed system, BigTable, is somewhere in the middle. The primary contributions of this paper are probably less BigTable itself, and more the explanations about why BigTable is a good fit for Google and its workloads. BigTable provides a relatively simple abstraction and interface, but the users can control key aspects. For instance, the ability for users to control locality is atypical for a storage system, but it makes sense in Google's case.
Applicability:
The impact of BigTable is not clear. I remember reading somewhere that Google is moving away from BigTable to a new storage platform, so it's not clear whether Google even still uses it. The lasting impact of BigTable is probably its description in combination with MapReduce, GFS, and Chubby. The four services complement each other very nicely and provide a clear separation of concerns, policies, and functionality.
Posted by: Marc de Kruijf | April 20, 2010 05:58 AM
Summary:
I reviewed X-Trace: A Pervasive Network Tracing Framework. I was attracted to this paper because an end-to-end trace of network activity for a causally related sequence of events is both extremely valuable and difficult to obtain. This paper proposes a framework for network tracing in a layered OSI style network stack. In addition to this framework, they discuss some of the difficulties of tracing causally related network activity through application logic. They also consider the interaction of different network layers (OS, application, library, link level).
Contributions:
The design presented in this paper is extremely ambitious. The authors describe a tracing system in which metadata crosses organizational and application domains along with the data packets themselves. Any party can initiate tracing but detailed trace data is stored within each organizational domain and subject to that organization's policies. This distributes the storage and processing load across domains as well as allowing different portions of the network path to opt out of tracing.
The supplied primitives, PushDown() and PushNext(), present a simple but effective interface for communication between different layers and nodes in the network. The required modifications to network protocols are also small, as most well-designed protocols leave provisions for arbitrary options.
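A rough sketch of how the two primitives might be used, assuming a much-simplified metadata record (the real X-Trace metadata carries more fields than this; the function names below just mirror the paper's PushDown()/PushNext()): both propagation operations keep the task ID so that reports emitted at different layers and hops can later be correlated.

import os
from dataclasses import dataclass

@dataclass
class XTraceMetadata:
    task_id: str      # identifies the whole high-level task
    op_id: str        # identifies this particular operation
    parent_op: str    # operation that caused this one (builds the task tree)

def new_op_id() -> str:
    return os.urandom(4).hex()

def push_next(md: XTraceMetadata) -> XTraceMetadata:
    # propagate to the next operation in the *same* layer (e.g., next HTTP hop)
    return XTraceMetadata(md.task_id, new_op_id(), parent_op=md.op_id)

def push_down(md: XTraceMetadata) -> XTraceMetadata:
    # propagate to the layer *below* (e.g., HTTP -> TCP)
    return XTraceMetadata(md.task_id, new_op_id(), parent_op=md.op_id)

# e.g., an HTTP proxy forwarding a request:
http_md = XTraceMetadata(task_id="a1b2c3d4", op_id=new_op_id(), parent_op="root")
tcp_md = push_down(http_md)     # metadata carried by the underlying TCP traffic
next_md = push_next(http_md)    # metadata sent along to the next HTTP node
# each node also emits a report keyed by (task_id, op_id) to its local store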
Questions/Comments:
While the goal of this paper is compelling, I found the realization in the paper extremely unrealistic and disappointing. The authors acknowledge that many have tried to achieve something like this in the past without success, but they fail to address the most difficult aspect of end-to-end network tracing. While it is possible to modify existing code with PushDown() and PushNext() calls, this requires that: a) source code is available; b) someone with access to (and an understanding of) the source code is willing to add these annotations; c) whoever makes the modifications understands the application semantics well enough to causally link network activity; d) legacy applications make a, b, and c much more difficult; and e) every level of the network stack must be modified.
In addition to the software difficulties, a huge amount of network hardware is already deployed. Much of this hardware performs packet processing either fully or partially in hardware, leaving no software upgrade path open.
Finally, the overhead as measured by the authors is huge. An increase in latency coupled with a 10% throughput decrease is a very high price to pay for this functionality. While it would be extremely useful, I don't think the overhead is acceptable. I also see no practical way to deploy such a system, for political and economic reasons alone.
The paper mentions previous work that uses statistical techniques to infer causal relationships in traffic. I believe that passive techniques like this can be extremely powerful and avoid all of the above issues. I would like to see more development in this direction.
Posted by: Joel Scherpelz | April 20, 2010 05:57 AM
I read Google Bigtable (and referenced GFS).
Summary
Bigtable is a scalable, distributed storage system targeted at a variety of workloads (small and large file sizes, throughput- and latency-sensitive). Bigtable looks something like how a database might be designed if it were written in a scripting language. That is, keys and data are essentially type-less (byte strings) and columns can be added/removed/changed dynamically.
Problem
Although not explicitly stated, it seems as though Bigtable is looking to overcome some problems and limitations of the Google File System. These include the ability to provide low latency and to handle small files effectively through semi-structured data, without hurting performance for large files or high-throughput, batch-type jobs.
Contributions
As far as I can tell, this style of storage system is unique. That is, Bigtable is a cross between a block storage system and a rigid, structured database. It is semi-structured and supports limited queries. Nonetheless, it has found seemingly great success within Google across a wide variety of applications.
Unlike certain (recent) papers we’ve read, the authors pay a good deal of attention to fault tolerance and how recovery occurs. There’s also an interesting point about compression worth mentioning. Due to an interesting interaction between data locality when storing information and the compression algorithm chosen, they get good compression/decompression throughput with respectable compression ratios for certain applications.
Applicability
Google makes this case themselves. The Bigtable service was used by over 60 projects four years ago. I would guess that Google App Engine also uses Bigtable for its Datastore. Many of the features appear similar, such as the key-value store, custom query language (called GQL for App Engine), and semi-structured table format.
We’ve learned general principles in distributed systems and learned about a number of distributed systems now. Despite some of our emphasis on availability and fault tolerance, it seems many distributed services employ something along the lines of a “single” centralized master. Is this due to simplicity? Due to the observation that any specific fault is unlikely? The success of fail-over techniques? Something else?
Posted by: Jesse Benson | April 20, 2010 05:11 AM
I read the Autopilot paper.
Summary: Autopilot is Microsoft's solution for automatic data center management. Autopilot automates most of the maintenance activities in a data center setting, namely provisioning and software deployment, monitoring the resources, and repair. Automation both decreases the cost of running a data center and increases its reliability.
Autopilot comprises the Device Manager, the provisioning and deployment services, the repair service, and the collection service. The Device Manager is the central point of information for all other services. The watchdog monitors convey system failures to the Device Manager but do not initiate any repair action themselves. Having a centralized model simplifies the system design, configuration, and maintenance of the state of the entire system - every other component needs to communicate with only one process. However, it increases latencies, especially for repair actions.
The design is based on the Device Manager maintaining a golden image of the system state in a strictly consistent data store, in line with the main design principle of simplicity. This image has detailed information for all the machines in the data center - the machine type, the OS image, the manifest the machine is required to have, and so on. All other components derive the required information from the Device Manager. They periodically reconcile the actual state of the system with the Device Manager's golden image through a pull-based mechanism. The pull model requires the least amount of state to be maintained at the Device Manager. The Device Manager can also trigger the services to initiate a 'pull' to expedite some processes.
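A sketch of that pull-based reconciliation, with invented names (the paper describes the protocol only in prose, so device_manager and local_machine here are hypothetical stand-ins): each satellite service periodically asks the Device Manager for the goal state of a machine and corrects any divergence it finds.

import time

def reconcile_forever(device_manager, local_machine, poll_interval_s=60):
    """Pull the Device Manager's goal state and converge the machine toward it."""
    while True:
        goal = device_manager.get_goal_state(local_machine.name)  # hypothetical API
        actual = local_machine.inspect()                          # OS image, manifest, ...
        if actual.os_image != goal.os_image:
            local_machine.reimage(goal.os_image)                  # heavyweight fix
        elif actual.manifest != goal.manifest:
            local_machine.sync_manifest(goal.manifest)            # e.g., via the filesync tool
        time.sleep(poll_interval_s)  # the Device Manager may also poke services to pull early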
Provisioning and deployment:
The provisioning service periodically probes the network for any new machines and, in conjunction with the Device Manager, installs the required OS image on the new system. The newly introduced system contacts the Device Manager directly for the service-specific manifest and, using the filesync tool, fetches the directory from the deployment servers. Both the deployment servers and provisioning servers follow a weakly consistent model. This works well for them, as any errors can be corrected during the periodic system inspection.
New services are deployed by first updating the Device Manager. The Device Manager then triggers all the marked machine types to initiate a fetch of the service binaries from the deployment servers. Deployment is done in scale units, each of which is a certain number of machines, and new service deployment is staggered in terms of scale units. This ensures system availability even during deployment. A roll-out is considered complete if a certain number of machines successfully complete the installation; otherwise a roll-back is initiated.
Watchdog:
Monitoring is done at the machine level instead of the process or application level, to spare Autopilot from application-specific details. Each machine runs certain watchdogs that monitor some system parameters. An interesting design element is allowing non-standard watchdogs, which lets applications define application-specific monitors. The watchdogs send periodic messages about system health. The Device Manager maintains an overall system health status, which is simply a conjunction of the individual watchdog statuses. The repair service periodically probes the Device Manager for any machines in need of repair. The repair action is simply defined as reboot, reimage, or replace. The Device Manager decides on a repair action based on the past history of the machine: any machine that has repeatedly been in the probation state is marked for removal, while other problems may be solved by reimaging or rebooting.
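A toy sketch of the aggregation and escalation logic described above (the reboot/reimage/replace actions are from the paper; the OK/Warning/Error statuses, thresholds, and function names here are illustrative): overall machine health is the worst of its watchdog verdicts, and the chosen repair escalates with the machine's recent repair history.

SEVERITY = {"OK": 0, "Warning": 1, "Error": 2}

def machine_status(watchdog_verdicts):
    # overall health is the conjunction of all watchdogs, i.e., the worst verdict
    return max(watchdog_verdicts, key=lambda v: SEVERITY[v], default="OK")

def choose_repair(recent_repairs):
    # escalate based on how often this machine recently needed repair
    if len(recent_repairs) == 0:
        return "reboot"
    if len(recent_repairs) < 3:
        return "reimage"
    return "replace"    # repeatedly in probation: mark the machine for removal

print(machine_status(["OK", "Warning", "Error"]))       # -> Error
print(choose_repair(["reboot", "reimage", "reboot"]))   # -> replace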
This paper gives fairly detailed mechanisms for data center maintenance. The main design principle of simplicity is reflected in all aspects of the system. The model leaves out policy while describing the mechanisms; automating the policies themselves would be more interesting.
One aspect of the design that weakens it is its inability to work with legacy systems. Because Autopilot follows a cooperative model between the maintenance units and the data center servers, it cannot support legacy systems.
The centralized model introduces additional latencies, which makes it unsuitable for certain customer-facing applications. However, such applications could themselves actively inform the Device Manager of any anomalies or need for repairs. In this context, perhaps the monitors could be made to send aggregate information to both the Device Manager and the repair services, which could then immediately initiate the required action.
Section 9, describing the lessons learned, was a good read. It talked about a bunch of improvements the designers made by observing the system in operation.
Posted by: Neel Kamal M | April 20, 2010 04:53 AM
I read the Bigtable paper.
The problem addressed
=================
This paper describes Google's distributed storage system for structured data, called Bigtable. It is more of an experience paper, in which the authors describe their design philosophies and choices while implementing large-scale distributed storage for structured data.
Summary:
=========
Bigtable uses a three-dimensional data model indexed by row key, column key, and timestamp. It maintains data in lexicographic order by row key, which allows users to exploit locality in range queries over rows. A range of rows is grouped together as a unit of distribution, called a tablet, so short range reads over rows that do not straddle many tablet boundaries can be completed by accessing only a few machines. Column keys are grouped into sets called column families, which are the basic unit of access control. The timestamp serves to version the entries of Bigtable. One important aspect of Bigtable is that it does not support the full relational data model; for example, it does not support transactions that operate on multiple rows. But the authors found that most of their applications do not require accessing multiple rows together. Bigtable uses the distributed lock service Chubby to keep track of tablet assignment across machines and to maintain a single master in the presence of faults. Internally, Bigtable uses GFS to store data files and logs, and uses the SSTable file format to store the maps. Bigtable also provides many optimizations, such as user-specified locality groups of columns, compression, and caching. One of the important lessons the authors discuss is to avoid implementing complicated features unless there is evidence of practical need. Moreover, the authors highlight the importance of proper monitoring mechanisms in a large distributed system for better management and administration.
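To make the tablet idea concrete, here is a toy sketch (not the paper's actual three-level METADATA scheme) of how a row key maps to a tablet: tablets cover contiguous, sorted row ranges, so locating the responsible tablet is just a binary search over the tablets' end keys, and a short row-range scan rarely crosses a tablet boundary.

import bisect

# Tablets partition the sorted row space; identify each here by its end row key.
tablet_end_keys = ["com.cnn.zzzz", "com.google.zzzz", "org.wikipedia.zzzz", "\xff"]

def tablet_for(row_key: str) -> int:
    """Index of the tablet whose row range contains row_key."""
    return bisect.bisect_left(tablet_end_keys, row_key)

print(tablet_for("com.cnn.www"))       # 0: same tablet as other cnn.com rows
print(tablet_for("com.google.maps"))   # 1
print(tablet_for("org.wikipedia.en"))  # 2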
Relevance/concerns/comments:
======================
Certainly, this work has important practical implications, given that many major tasks inside Google make use of this framework. One can view Bigtable as a customized (and restricted) implementation of a large distributed database targeted specifically at the internal needs of Google's applications. I guess the biggest motivation behind the work is saving money by not having to purchase an off-the-shelf parallel database product. It would have been nice if there were a performance comparison between a commercially available parallel database product and Bigtable for accomplishing similar jobs. One performance concern I have about Bigtable is that it uses many separate layers of software such as GFS, Chubby, etc.; this layered/modular approach helps code reuse and maintenance, but could also impact performance.
Posted by: Arkaprava Basu | April 20, 2010 04:04 AM
Traceroute just not returning enough information to troubleshoot a network issue? What you need is X-Trace! X-Trace is your one-stop shop for collecting all causally related traffic generated by a request. Simply put, it's the be-all and end-all of packet tracing because it works at multiple layers across VPNs, NATs, overlays, and other network complexities.
Tools exist for tracing particular protocols in certain situations, but users and administrators have to put all the information from different sources together to understand what's really going on. Each tool only paints part of the picture. Making matters worse are complicated network systems that obfuscate what really happens. When it's difficult to get the whole picture of the processing of a request, including each piece involved, it is difficult to find issues within the system.
X-Trace is an ambitious attempt to create a clear picture of every piece involved. It describes a set of metadata that is recorded by the protocols and applications along the path of a request, and is actually carried along with the request itself. This data can be analyzed to quickly determine the cause of any faults present. It's also nice that the authors present and address issues with X-Trace. No related work seems to quite match its scope. There is no doubt that the extra information would make it easier to diagnose problems, but at what cost?
The real sticking point is the line "X-Trace requires that network protocols be modified to propagate the X-Trace metadata..." Now, I believe it's commonly accepted that network protocols could be better, but TCP works and that's enough. However, when talking so casually about changing every network protocol, and therefore most networking hardware, you had better have a damn good reason. Projects like this would probably not get the required traction to be useful on the WAN. They need to have enormous implementation benefits and need to be deployable incrementally. I just don't see enough promise for this protocol to prove pervasive. (Points for alliteration.)
Posted by: Jeff Roller | April 20, 2010 04:01 AM
Chubby Review
Chubby is a distributed lock service built by Google to synchronize client activities. The paper explains the design of Chubby and how it is leveraged by other Google services such as the Google File System and Bigtable.
Before Chubby's development, Google had wide-ranging distributed systems with tens of thousands of clients that faced the challenge of building a robust primary-election mechanism. Besides heavy costs, these systems had low availability and were rather disorganized. Thus there was a need for a service that could provide availability, reliability, understandable semantics, good performance and throughput, and, more importantly, synchronization. These demands motivated Chubby, which implements synchronization, specifically leader election and the sharing of environmental information.
As we know, the aim of Paxos is not to consider the merits of one proposal over another, but to arrive at safe consensus. This functionality of the protocol can address the problem of electing one machine to act as master. Chubby not only implements this with Paxos but also uses it for fault tolerance by constructing a cell of five replicated machines; Paxos is used to agree on the order of client operations. The design gives the central place to a lock service implementation rather than a consensus library, provides generalized support for small files, and facilitates large-scale concurrent file viewing. The system adopts an event notification mechanism as well as consistent caching of files. Security through access control is employed, along with coarse-grained locks that reduce load on the lock server and reduce delays during server failure. The main components of the system are the server (the Chubby cell), the client library, and RPC for communication; proxy servers are optional. Inside a Chubby cell, reads are satisfied by the master alone while writes must be acknowledged by a majority quorum, and replacement machines are provisioned on replica failure. Clients ask the replicas for the master's location and then send all requests directly to the master.
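A very rough sketch of that request routing (names invented, consensus omitted entirely, so this is only the shape of the protocol, not Chubby's implementation): the client learns the master's identity from any replica, reads are answered by the master alone, and a write commits only once a majority of the five replicas acknowledge it.

class ToyChubbyCell:
    def __init__(self, num_replicas: int = 5):
        self.num_replicas = num_replicas
        self.master = 0          # index of the currently elected master
        self.store = {}          # the master's copy of the small-file namespace

    def locate_master(self) -> int:
        # clients ask any replica they can reach for the master's location
        return self.master

    def read(self, path: str):
        # reads are satisfied by the master alone
        return self.store.get(path)

    def write(self, path: str, value: bytes) -> bool:
        # a write commits only after a majority of replicas acknowledge it
        acks = 1 + sum(1 for r in range(1, self.num_replicas)
                       if self._replica_ack(r, path, value))
        if acks > self.num_replicas // 2:
            self.store[path] = value
            return True
        return False

    def _replica_ack(self, replica_id: int, path: str, value: bytes) -> bool:
        return True              # stand-in; real replicas run Paxos to agree on the order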
The system addresses a wide range of audiences, but it is best suited to designers who want locks, a small storage system, and session leases in one service. The Chubby lock service is a distributed lock service providing coarse-grained synchronization for Google's distributed services, and it offers a high-level interface. The designers clearly understood how expensive lost locks are for clients. The system also serves as a repository for files requiring high availability and as a primary naming service. Chubby is thus a relatively heavyweight system.
This is more of an engineering manuscript than a scientific paper; it shares experiences learned from the development of the Chubby lock service. Chubby works well only with whole-file read and write operations and is inapplicable in situations demanding fine-grained synchronization/locking. There is also a somewhat self-deprecating tone from Google's developers in the paper, which is quite witty. There may be brief instances of system unavailability, and hence an adequate solution is required during such outages.
Though the design is based on well-known ideas of distributed consensus, caching, notifications, and a file system interface, the paper also opens avenues for using locks and sequencers with a wide range of other services. The paper discusses a number of practical complexities encountered in building a large-scale system, making it a good one to study.
Posted by: y,Kv | April 20, 2010 01:01 AM
I read the Bigtable paper.
Summary:
This paper describes Google's experience in building a distributed storage system tailored to their specific needs: it should be simple, flexible, and should work with semi-structured data. The paper has valuable insights on the challenges of designing and building such a system. BigTable provides a sparse, distributed map to persistent data. It uses a combination of in-memory and on-disk data structures to provide high-performance, scalable storage. GFS is a building block of BigTable, while MapReduce jobs can read their input from and push their output to BigTable.
Problem:
The problem that BigTable tries to solve is: How do you provide reliable distributed storage that is flexible, scalable, and supports a wide variety of workloads and applications? How do you leverage Google's existing infrastructure while doing it?
BigTable is sort of like MapReduce, in that it aims to handle the complexities of reliability, data distribution, and management on behalf of the user. It provides flexible APIs that let the user decide a number of parameters, including how much data to preserve, when data can be classified as no longer needed, and whether the application is throughput-sensitive or latency-sensitive.
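As one concrete example of those user-decided parameters, the paper lets clients tell Bigtable, per column family, to keep only the last n versions of a cell or only versions newer than some age. Here is a toy sketch of that cleanup policy (illustrative Python, not Bigtable's implementation).

import time

def gc_versions(versions, keep_last_n=None, max_age_s=None, now=None):
    """versions: list of (timestamp, value) pairs, newest first. Returns survivors."""
    now = time.time() if now is None else now
    kept = versions
    if keep_last_n is not None:
        kept = kept[:keep_last_n]
    if max_age_s is not None:
        kept = [(ts, v) for ts, v in kept if now - ts <= max_age_s]
    return kept

versions = [(1000, b"v3"), (900, b"v2"), (100, b"v1")]   # newest first
print(gc_versions(versions, keep_last_n=2, now=1010))    # keeps v3 and v2
print(gc_versions(versions, max_age_s=200, now=1010))    # drops the 910-second-old v1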
Contributions:
Most of the ideas in Bigtable are straightforward mechanisms to deal with specific problems. The novelty is that they have built and run it for giant-scale workloads, and with thousands of machines.
The biggest contribution of the paper seems to me to be Section 9, Lessons. I wish we had read some of it before embarking on the first project. The lessons about adding new features, keeping the design simple, and the need for monitoring are relevant even at the 4-node level!
Comments/Questions:
The authors remark that it took them 7 man-years to complete the design and implementation. That says something about the difficulty of implementing such a large-scale distributed system!
The numerous copies of data in the system make me uneasy - there is an in-memory copy, the SSTables, two alternate logs, and multiple versions of the same data. It is not explained very clearly how the system can recover properly from a crash when, say, it was in the process of transferring data from one form to another.
Posted by: Vijay | April 19, 2010 07:54 PM
I read Chubby and thought it was a good, informative paper.
Summary:
This paper describes Google's Chubby lock service. The main purpose of Chubby is to elect a master for reaching asynchronous consensus. Google engineers implemented a lock service rather than a library embodying Paxos for a number of convincing reasons, such as developer needs. In addition to providing a lock service, Chubby provides a mechanism for advertising results through small amounts of consistent storage. Chubby was engineered with a focus on reliability and availability rather than performance, due to the characteristics of its use cases.
Problem:
The problem that Chubby tries to solve is reaching distributed consensus. The algorithm for reaching distributed consensus assumed by Google engineers has Paxos at its core and therefore requires the election of a master. Chubby provides a fairly transparent and easy-to-understand mechanism for electing one. It is interesting to see that Google engineers place high value on serving developer needs: at least two of the five reasons presented as rationale for Chubby are concerned with making developers' lives easier.
Contributions:
The paper explicitly states that it claims no new algorithms or techniques. Nonetheless, I thought this paper was very informative. I really liked how consistent caching was enabled by using what seemed like a coherence protocol (send invalidations to sharers) combined with a lease system. So if, for example, an invalidation does not reach a particular sharer due to a network partition, the sharer will simply time out and self-invalidate. From what I know, Chubby is also Google's only consistent service, so other services like BigTable that need to perform consistent operations rely on Chubby.
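A toy sketch of that caching behavior (much simplified; in the real system lease maintenance rides on the client's KeepAlive traffic, and the names below are invented): a cached entry is trusted only while its lease is valid, the master invalidates sharers before applying a conflicting write, and a sharer that never receives the invalidation, say because of a partition, simply stops trusting the entry once the lease expires.

class ToyClientCache:
    def __init__(self, lease_duration_s: float = 12.0):
        self.lease_duration_s = lease_duration_s
        self.entries = {}                 # path -> (value, lease_expiry)

    def fill(self, path: str, value: bytes, now: float) -> None:
        self.entries[path] = (value, now + self.lease_duration_s)

    def get(self, path: str, now: float):
        entry = self.entries.get(path)
        if entry is None:
            return None                   # miss: go to the master
        value, expiry = entry
        if now >= expiry:
            del self.entries[path]        # lease expired: self-invalidate
            return None
        return value

    def invalidate(self, path: str) -> None:
        # sent by the master to every sharer before a conflicting write
        self.entries.pop(path, None)

cache = ToyClientCache(lease_duration_s=12.0)
cache.fill("/ls/cell/master", b"replica-3", now=0.0)
print(cache.get("/ls/cell/master", now=5.0))    # b'replica-3' (lease still valid)
print(cache.get("/ls/cell/master", now=20.0))   # None (timed out and self-invalidated)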
Comments:
What is negative caching?
I am still suspicious of some corner cases in which an invalidation does not reach a particular sharer, but the sharer manages to contact the master and renew its lease, and thus retains inconsistent data. I don't know how they tested this system, because testing even a traditional coherence protocol is very hard. When you pile locks and leases on top and put it inside a distributed system, I can only imagine what kinds of corner cases come about. I also didn't think the lock-delay was elegant: everything else was engineered to ease developers' jobs, except for the lock-delay, which I am sure is hard to tune. Finally, I thought that combining a database with a lock service is not a very modular design. And, as always, the scaling story is not very believable; at the end of the day, they have a single server servicing a large number of clients. The proposed techniques, from my perspective, add too much design complexity.
Posted by: Polina | April 19, 2010 06:49 PM