
Lessons from Giant-Scale Services

Lessons from Giant-Scale Services. Eric A. Brewer. IEEE Internet Computing. Vol. 5, No. 4. pp. 46-55. July/August 2001.

Review this or another paper for Thursday, 2/1.

Comments

This paper offers general guidelines for the design of single-site, single-owner systems that serve a client base spread across the Internet. The author's methodology is to pull ideas from past experience and present them in condensed form, without providing any more support than his reputation as an expert.

The paper starts with three models that show different approaches to load balancing. The most intriguing one, which he spends the least time on, is a model where "smart" clients are enlisted in the load-balancing system. This strikes me as the best distributed approach, better than trusting the network or funneling everything through a single-point-of-failure front end. However, there are proportionally greater challenges in the design of such systems, particularly in security. I wish the author would go into more detail here or provide examples rather than treat it as a throwaway point.
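
As a rough illustration of the smart-client idea (my own sketch, not something from the paper), the client itself could carry the replica list and handle failover; the hostnames and port below are purely hypothetical:

    import random
    import socket

    # Hypothetical replica list a smart client might ship with, or fetch
    # out of band from a directory service.
    REPLICAS = ["svc1.example.com", "svc2.example.com", "svc3.example.com"]

    def connect_to_service(port=80, timeout=2.0):
        """Try replicas in random order, skipping ones that appear down."""
        for host in random.sample(REPLICAS, len(REPLICAS)):
            try:
                return socket.create_connection((host, port), timeout=timeout)
            except OSError:
                continue  # replica unreachable or slow; try the next one
        raise RuntimeError("no replica reachable")

Even this toy version hints at the questions the author skips over: the client must be trusted with the replica list and must somehow learn when it changes.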

A variety of metrics are brought up that are traditionally used to measure performance. These all in some sense measure the availability of the system, the quality designers most want to optimize. The most critical one, and the most often overlooked, is time to repair. As this involves fairly predictable behavior by the support staff, it is stable, and improvements show quick turnaround. The author also introduces a new guideline for thinking about metrics, the DQ principle: (data per query) × (queries per second) = constant. This makes sense on a simple mathematical level, but I get the feeling there is some deeper meaning intended that I am not perceiving. As he uses it extensively, I wish the author would go into more detail about DQ, as I don't fully understand it.
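
To make the DQ idea concrete for myself, here is a toy calculation (the numbers are made up, not from the paper) showing that losing one node out of n removes roughly 1/n of the DQ budget, and the designer only chooses whether the loss appears in D (harvest) or Q (yield):

    # Toy reading of the DQ principle: at saturation,
    # (data per query) * (queries per second) is roughly constant.
    nodes = 10
    dq_per_node = 1_000                 # hypothetical DQ contribution of one node
    total_dq = nodes * dq_per_node      # 10,000

    dq_after_failure = total_dq - dq_per_node   # one node lost: 9,000

    # Option 1: keep answers complete, serve fewer queries (yield drops).
    queries_per_sec = dq_after_failure / 100    # D held at 100 data units/query
    # Option 2: keep the query rate, return partial answers (harvest drops).
    data_per_query = dq_after_failure / 90      # Q held at 90 queries/sec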

On the question of replication vs. partitioning, the author comes down in favor of replication once a sufficient system size is reached, because it will not do worse than partitioning in expected behavior and offers greater control. I agree with this assessment provided our system lies within the parameters set above, namely that the data set is slow to change.

The author makes a good point that one cannot prevent burst events but must plan for them as best as possible. The only tool mentioned toward this end is a variety of admission control policies to make sure some traffic (the most important traffic, if it is divided into classes) continues to be served. I feel there is more that designers can do in these cases. For one, we have seen ways that systems can shift work to offsite machines, either through caching or by buying time in a data center. I would like to see an analysis of more alternatives.
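
A minimal sketch of the kind of class-based admission control the author gestures at; the traffic classes and capacity figure are entirely hypothetical:

    # Rough sketch of class-based admission control under overload.
    CAPACITY_QPS = 5_000                  # queries/sec the degraded system can absorb
    PRIORITY = {"paid": 0, "free": 1, "crawler": 2}   # lower number = more important

    def admit(requests):
        """Admit higher-priority requests first until capacity runs out."""
        admitted, budget = [], CAPACITY_QPS
        for req in sorted(requests, key=lambda r: PRIORITY.get(r["class"], 99)):
            if budget <= 0:
                break                     # shed the remaining lower-priority traffic
            admitted.append(req)
            budget -= 1                   # treat each request as one unit of load
        return admitted

Offloading to caches or rented capacity, as suggested above, would show up here as a larger (or elastic) CAPACITY_QPS rather than a different policy.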

Finally, the paper sets out three methods to handle the upgrade process. Each machine upgrade takes a similar amount of time, so in one sense the capacity lost during an upgrade event is the same for all methods. I would like to see more about how the methods can be ranked if the system has predictable access patterns, in particular if there are known periods of low activity. The method the author calls the big flip appeals most to my aesthetic sense; it is very clean and strikes a balance between speed and availability.
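
A back-of-the-envelope comparison (my own numbers, assuming n identical nodes, u hours to upgrade one node, and that nodes within a wave upgrade in parallel) of why the three methods lose the same total DQ but spread it differently:

    n, u = 100, 0.5      # hypothetical cluster size and per-node upgrade time (hours)

    fast_reboot = n * u             # all n nodes down for u hours
    rolling     = 1 * (n * u)       # one node down at a time, for n*u hours total
    big_flip    = (n / 2) * (2 * u) # half the cluster down, in two waves of u hours

    # All three equal n*u node-hours of lost DQ; they differ only in how the
    # loss is spread over time and whether it shows up as yield or harvest.

If there are known quiet periods, the fast reboot and the big flip can hide most of that loss inside them, which is exactly the ranking question raised above.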

The paper is about the challenges of practical giant-scale data-driven services and suggests simple ways to tackle them.

The paper addresses the general problem of satisfying a very stringent requirement for high availability in a giant-scale service in order to provide a good end-user experience. The scale of these systems is so large that it is essential to model them to understand their requirements and to devise methods to tackle the problem using this simple model.

The major contribution of the paper is the metric suggested to measure the "availability" of a system. The use of yield (the percentage of total user requests serviced) and/or harvest (the quality of the result) provides a better measurement of availability than uptime, which treats every time of day as equally important irrespective of the amount of outstanding user requests. Another major contribution is the DQ principle, which models the maximum ability (or capacity) of a system at any particular time. The DQ principle can be used to measure the impact of failures in the system and to tune the system to handle these failures gracefully. An interesting observation about the linear scaling of the DQ value with system size even helps in preparing the system for future user traffic. It is also demonstrated how this DQ value can be used to make an informed decision when choosing between replication and partitioning. The paper discusses the significance of system evolution in the fast-growing WWW and suggests several ways to perform online updates. Quantifying the trade-offs between the update techniques (fast reboot, rolling upgrade, and big flip) in terms of availability (DQ value) is a valuable contribution for the distributed systems community. Finally, the paper also classifies and surveys various possible load-balancing techniques, ranging from highly centralized to completely distributed (smart client).
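
For concreteness, the two metrics reduce to simple ratios; the numbers below are invented, not taken from the paper:

    # Yield: fraction of offered queries that complete.
    queries_offered   = 1_000_000
    queries_completed =   998_000
    yield_fraction = queries_completed / queries_offered     # 0.998

    # Harvest: fraction of the full data set reflected in each answer.
    partitions_total     = 100
    partitions_reachable =  97
    harvest = partitions_reachable / partitions_total         # 0.97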

On the other hand, it is not clear how a node failure in a replicated system differs from a node failure in a partitioned system in terms of capacity. That is, the paper suggests that a partitioned system sustains its capacity even under node failures (and only incurs a drop in harvest that is not present in a replicated system). It is also hard to agree, in general, that reducing the frequency of failures is less important than reducing the time required to repair them. This choice is highly dependent on the actual values of the two in real systems. The total time spent repairing over a period involving many failures is the actual loss, and it can be reduced only when both the time between failures and the repair time are reduced significantly.
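
Working through the two-node case the paper discusses (with a made-up per-node DQ value) shows where my confusion lies: under this toy reading, the total DQ loss is identical, and only its form differs:

    node_dq = 1_000                  # hypothetical DQ of one healthy node
    total_dq = 2 * node_dq

    # Partitioned pair after one failure: full query rate, half the data.
    partitioned = {"yield": 1.0, "harvest": 0.5}
    # Replicated pair after one failure: complete answers, but the survivor
    # saturates at half the offered load.
    replicated = {"yield": 0.5, "harvest": 1.0}

    for name, s in (("partitioned", partitioned), ("replicated", replicated)):
        print(name, s["yield"] * s["harvest"] * total_dq)   # both print 1000.0

So "sustaining capacity" in the partitioned case seems to mean only that the surviving node is not overloaded; the delivered DQ still drops by half either way.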

The paper focuses on the techniques and principles used by large-scale services to deal with issues related to availability, such as replication, failure recovery, and disaster management, as well as the upgrades and updates caused by constant growth and evolution. The paper does not introduce any novel concept but neatly lists the real-world problems in addressing scalability and availability and the solutions used in industry.
The important takeaway points from the paper are:
1) Load management was originally done using round-robin DNS, but this hurts availability when a server fails because of long DNS expiration times. Alternatively, layer-4 and layer-7 switches are employed to direct load to active servers. Besides these, special front-end systems and smart clients also help in parsing content and doing load balancing. An important point to note is that these should not create new single points of failure, which necessitates the use of failover switches.
2) In addition to the traditional metrics of MTBF, MTTR, and uptime, the paper talks about two other metrics, yield and harvest, which correspond to the fraction of queries completed and the fraction of data rendered. Failures in a replicated system affect yield, while failures in a partitioned system affect harvest. The paper introduces a metric called DQ which combines the effect of both yield and harvest in measuring workloads and comes in handy for planning and testing upgrades.
3) In order to deal with heavy loads, the data store may simply be replicated, or it may be partitioned with a load manager that analyzes the content of the request and directs the query to the corresponding server. When failures occur, a replicated system handles them more gracefully than a partitioned system, provided the remaining replicas can support the additional load.
4) Admission control is a way of dealing with the overloading that occurs during peak hours and failures. Based on the application, the admission control policy would concentrate on reducing either the queries served or the data served.
5) Software upgrades are done through a fast reboot with some downtime, through a rolling upgrade which has to tolerate version differences, or by doing a big flip which upgrades 50% of the nodes at a time.

The assumptions made in the paper are that the system works based on queries and that the majority of the queries are read-only. Though these assumptions are reasonable, it is also important to think about the applicability of these principles in systems where a significant portion of the requests involve computation and updates. In such a system, issues such as consistency would impact the creation of replicas. Also, metrics such as DQ would not suffice. Though this is not a flaw, it would have been nice if the paper had addressed those issues. A point that is not clear is why write traffic would require more DQ in a replicated system compared to a partitioned one.
Overall, the paper gives a nice and concise summary of many of the techniques that are widely used to address the practical issues related to availability and evolution.

The author discusses the design and analysis of giant-scale services in terms of availability and scalability.

The paper does not meet the requirements of an academic article (a well-defined computational problem, a related-work review, experimental evaluation, etc.); instead, the author describes it as an "experience" paper. I believe it contains invaluable information about designing giant-scale systems at a high level.

The author starts by summarizing the advantages, components, and requirements of large systems. Although these are predictable, it is important to hear from an expert that they also apply to real giant-scale systems. He then covers metrics for measuring high availability, mechanisms for graceful degradation, and ways to deal with online evolution.

I believe the best part of the paper is that it provides possible questions to pose in the design phase of a giant-scale service. For example, it is important to decide which metrics to use to measure availability, how to upgrade, how to deal with saturation, and so forth. I also like the real-world examples, especially Inktomi, about how some companies deal with such issues.

I do not like the load management discussion, since it gets too specific compared with other parts of the paper. In addition, some quantitative claims are too vague even for an experience paper. For example, it is hard to be convinced by the linear scalability of DQ or by the two-node replication/partitioning discussion.

I am surprised at how little the author talks about efficiency (in terms of query response times). I feel it would be possible, following this advice, to design a giant-scale system that is highly available, scalable, and fault-tolerant but terribly slow. I think efficiency is lurking in the background of these ideas, but it would still be better to discuss it a bit further.

I believe most of the concerns mentioned in the paper are still valid and important for current giant-scale systems. On the other hand, as regards replication/partitioning, I believe current practice goes well beyond the paper. For example, large social-media websites like Facebook or YouTube need to serve queries that require large files at very short intervals (e.g., profile home pages that contain dynamically changing photos, status updates, etc.). It is becoming more and more important to store these files using caching and MapReduce-based techniques, which also increases the importance of efficiency and disk management. Beyond availability metrics, efficiency metrics such as query response time variance are becoming more popular.

Overall, I believe the paper is a very good source for high-level design and analysis of highly available, scalable, and fault-tolerant giant-scale systems, though it covers only a small subset of all the concerns.

“Lessons from giant scale services” is an experience paper that doles out the hard-earned lessons of managing many enormous data centers and services. Its primary focus is on how a designer must look at the larger picture instead of just smaller items that may seem more important. For example, the author presents the formula: data per query * queries per second = constant, and calls it the DQ principle. With this in hand, he notes that most schemes are just trade-offs between these two quantities. Furthermore, with his harvest and yield quantities, he notes that it makes sense to optimize whatever is easiest to give an equivalent result: for example, MTBF (mean time between failures) and MTTR (mean time to recovery) have the same impact, and MTTR is easier (generally) to fix, so it makes sense to focus on that.
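
The MTBF/MTTR equivalence he refers to follows from the standard availability identity; the numbers below are illustrative, not from the paper:

    mtbf = 30 * 24.0    # hours between failures (hypothetical)
    mttr = 2.0          # hours to repair (hypothetical)

    availability = mtbf / (mtbf + mttr)                 # ~0.9972
    halve_mttr   = mtbf / (mtbf + mttr / 2)             # ~0.9986
    double_mtbf  = (2 * mtbf) / (2 * mtbf + mttr)       # ~0.9986 (the same)

    # Halving MTTR buys as much availability as doubling MTBF,
    # and repair time is usually the easier knob to turn.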
Since this paper is more of an experience paper than novel research, it is generally trying to solve the problem of a lack of knowledge by imparting what its author has learned. Issues that he believes people have are: lack of basic hardware and preparation, lack of metrics, lack of analysis, not having a disaster plan or focusing on graceful degradation, and not automating enough. He focuses especially on metrics and tolerance of failure. Various ways of trading off yield and harvest are suggested, as well as the benefits of partitioning versus replicating when failure occurs: replication is more important unless the load is very write-heavy, because disk is largely not the bottleneck. As far as failure tolerance goes, the paper argues that a small set of failures can cause a cascading effect on the entire data center as it becomes overloaded. It advocates throttling (admission control) to prevent this overload; queries can be throttled on cost, importance, or even just randomly. This is a very important point, as some systems cannot perform a cold start under heavy load.
His discussion of replication versus partitioning could be a flaw: many services these days are write-heavy, and partitioning can be a very useful tool, especially since replication can be difficult to keep in sync under heavy write loads, and DBMS should not be discounted as a cost entirely. Another flaw is that the paper’s advice is valid for the set of systems which he has worked on -- mostly web, read-intensive, and non-critical services. They would need to be modified to work on systems that do not fit those criteria. This may, however, be more of a limitation than a flaw.
This paper, while not novel research, is extremely applicable to anyone building such a large service, as some of the conclusions (e.g., optimizing MTTR) are not at all obvious at first glance. Furthermore, a quantification of yield and harvest instead of simply uptime or performance is useful in distinguishing the trade-offs that must be made in the design.

This paper gives some practical advice for large systems which must be scalable, available, and support rapid development cycles. Viewing everything through the DQ metric seems to give some insight into the system. This metric gives a way to compare replication versus partitioning and even how to roll out updates. The discussion on yield versus harvest gives some insight into failure modes.

The article focuses primarily on strong practical aspects, such as a cluster that "features extreme symmetry, no people, very few cables, almost no external disks, and no monitors." This sounds like a modern data center, and these are all good ideas.

My favorite part of the paper is the discussion of the System R optimizer. The author suggests the System R approach of minimizing I/O is not as good as a DQ focus. The author offers a very compelling reason: "because I have found that it correlates better with capacity and seems easier to affect gracefully." (Obligatory link: http://xkcd.com/285/). This statement is nebulous, because in a database you generally don't have control over D, so you try to maximize Q. Since, in a typical single-host DB, Q is inversely proportional to disk latency, a smart optimizer would try to minimize disk latency (hence minimize random I/Os).

The author follows this with the simple claim that these systems "tend to be more network-bound than disk-bound." One example of where this is utterly false is Facebook. Facebook deploys a massive memcache cluster to handle its queries. One day when the memcache cluster went down, the backing DBs tried to repopulate the cache. The problem was that at the mean workload, the DBs would oversaturate and crash before they could refill memcache. In order to repopulate the cache, Facebook had to drop the majority of queries and slowly ramp traffic back up over a few days. This would suggest that Facebook is vastly more disk-bound than network-bound.

The author also discusses the fact that services have "read-mostly" traffic and very few update queries. I would say that in many systems today, e.g. Google, there are no update queries at all; the traffic is read- and append-only. This is a different model than the one he seems to discuss, and it would be interesting to have a deeper discussion of whether and how this changes things.

Overall, this is by far the most applicable paper we've read, although it is slightly undermined by being specific to the technology of the time (which is now dated). The primary contribution is the understanding of harvest and yield as a better metric than uptime. But this depends on how you define uptime. In some systems, uptime may be defined as 100% harvest and yield (e.g., the IceCube project on campus, which collects physics data). But in the average system the distinction is worth noting.

This paper lays out relevant metrics and rules of thumb for highly available single-site clusters in the context of design, load balancing, and coping with availability issues.
In industry, several practices were being followed for load balancing, for replication and partitioning, and for ensuring availability under various failure and evolution scenarios. However, there were no standard metrics for quantifying availability. Nor was there a set of rules of thumb, carefully crafted from experience and analysis of systems, to guide the design of infrastructure for hosting giant-scale services. These issues are addressed by the paper.
The paper's contributions are not traditional in that they lay down a set of metrics and rules for designing and maintaining large-scale systems rather than introducing a new idea or concept.
The ideas put forth by the paper are:
- Instead of the classic DNS-based round-robin load balancing, systems can use either:
- A layer-4 switch that serves the dual purpose of load balancing open TCP connections and detecting failed nodes.
- A layer-7 switch that distributes load based on the content present at each server; this is really necessary for a partitioned data store to avoid request-redirect messages on the backplane.
- It is easier to develop techniques that improve MTTR (for both hardware and software) than MTBF in order to increase uptime.
- One of the very valid points put forth by the paper is that uptime is not a very good metric for quantifying the availability of a large-scale system. It says that yield and harvest are more relevant metrics because:
- Though the system is available, we might want to know how fast it is processing queries and how much data is returned for each query.
- Yield and harvest are sensitive to failures and upgrades and allow availability to be quantified during failures and upgrades.
- They quantify availability more concretely in the presence of replication and partitioning.
- The paper introduces the DQ principle, which is essentially the rather obvious fact that the system has a peak throughput (peak rate of pushing data out) bottlenecked by some factor (like the I/O bandwidth or the number of seeks per second). The peak throughput is the product of the peak yield and the average harvest.
- The DQ value is a much nicer metric for studying availability since:
- It is similar to a CPU performance metric: the absolute DQ value makes little sense, but the relative value, with respect to some change in the system (a failure or an upgrade), tells us how much availability we have sacrificed (or gained) due to that change.
- The converse is also possible: if we predict that the system must support a certain throughput in the future, we can deduce what hardware and software changes need to be made to guarantee that availability.
- It quantifies the effect of replication and partitioning on system availability.
- The DQ value tells us how to degrade gracefully during saturation. When availability must be reduced by some factor, systems can focus on reducing either the yield or the harvest by that factor to bring about graceful degradation.
- Upgrades can be thought of as controlled failures, and if we know the MTTR of a node, we can estimate the DQ lost during the upgrade. One upgrade technique used in practice is not necessarily better than another, since they all have the same DQ loss, just spread over time in different ways.
The paper might have a small flaw in its way of describing the availability metrics. The very metric of harvest may be irrelevant in certain large-scale systems that need to ensure correctness while answering queries (no query can be partly answered). It makes sense for other large-scale systems, though (like search engines). It would have been nice if the author had mentioned that the systems under discussion can tolerate partial answers to queries. Since the author does not talk about redundancy in partitioned systems, I'm guessing that certain parts of the paper apply only to systems that can tolerate partial answers. Since storage clusters tend to use some form of redundancy, that could have been incorporated into the metrics.
The paper is very much relevant to today's large-scale systems. It has laid out ground rules for designing and managing a highly available data center. Since data centers have become almost the norm in large-scale services, the metrics and rules are very much applicable today.
