Lessons from Giant-Scale Services
Lessons from Giant-Scale Services. Eric A. Brewer. IEEE Internet Computing. Vol. 5, No. 4. pp. 46-55. July/August 2001.
Review for this or other paper due Thursday 1/27.
Comments
Summary:
The paper describes techniques for providing high availability and scalability for growing and evolving Web services.
Problem Description:
This paper describes methods for providing high availability and scalability for Internet-based services such as email and instant messaging. It focuses on single-site, single-owner, well-connected clusters. It discusses techniques such as replication, graceful degradation, disaster tolerance, and online evolution for achieving high availability.
Contributions:
One of the main ideas discussed in the paper is effective load management. Traditional load management uses round-robin DNS, but this cannot hide inactive server nodes from clients. To tackle this problem, the paper suggests using layer-4 and layer-7 switches, which can interpret TCP and HTTP traffic and therefore balance load while routing around failed nodes. The other suggestion is to use smart clients, which move part of the load-management logic to the client side so that the client itself can choose among servers and fail over when one is down.
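To make the smart-client idea concrete, here is a minimal sketch in Python (my own illustration, not code from the paper); the server URLs and the /health probe are hypothetical:

    import random
    import urllib.request

    # Hypothetical replica URLs and health-check path; not from the paper.
    SERVERS = ["http://node1.example.com", "http://node2.example.com"]

    def is_healthy(base, timeout=1.0):
        # Probe a replica; any error or non-200 response counts as "down".
        try:
            with urllib.request.urlopen(base + "/health", timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def fetch(path):
        # Client-side load management: shuffle to spread load, skip dead nodes.
        candidates = list(SERVERS)
        random.shuffle(candidates)
        for base in candidates:
            if is_healthy(base):
                with urllib.request.urlopen(base + path) as resp:
                    return resp.read()
        raise RuntimeError("no replica available")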
The paper also talks about measuring availability in terms of uptime, yield, and harvest. It suggests optimizing MTTR rather than MTBF to improve uptime, since uptime is (MTBF - MTTR)/MTBF and MTTR improvements have a much faster debug-and-measure cycle than MTBF improvements. The paper also introduces the DQ metric: data per query times queries per second, i.e., the total amount of data the system can move per unit time, which is roughly constant for a given configuration. Under heavy load or failures, the remaining DQ budget can be spent on either yield (completing fewer requests) or harvest (answering every request with partial data). Using DQ, the paper explains that truly replicating a service requires not just another copy of the data but also doubling the DQ capacity, and that beyond a certain threshold replication is preferable to partitioning because the bottleneck is DQ, not storage.
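As a quick back-of-the-envelope illustration of these metrics (all numbers below are made up, not taken from the paper):

    # Made-up numbers to illustrate uptime, yield, harvest, and DQ.
    mtbf = 24 * 3600.0                 # assumed mean time between failures: one day
    mttr = 60.0                        # assumed mean time to repair: one minute
    uptime = (mtbf - mttr) / mtbf      # ~0.99931: shrinking MTTR helps uptime directly

    yield_frac = 9_900 / 10_000        # queries completed / queries offered
    harvest = 45 / 50                  # fraction of the full data reflected in each answer

    bytes_per_query = 4_000
    queries_per_sec = 2_500
    dq = bytes_per_query * queries_per_sec   # total data moved per second (10 MB/s)

    print(uptime, yield_frac, harvest, dq)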
The paper also talks about graceful degradation under heavy load, for example via cost-based admission control, as a way to preserve the necessary yield. Similarly, smart clients can help with disaster tolerance by failing over to another site. Finally, the paper talks about rolling upgrades as a way to cope with the constantly evolving software in such a system.
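A rough sketch of what cost-based admission control for graceful degradation might look like (my own Python illustration; the class, budget, and byte estimates are assumptions, not the paper's design):

    # Toy cost-based admission controller: shed expensive work once the DQ budget is spent.
    class AdmissionController:
        def __init__(self, dq_budget_bytes_per_sec):
            self.budget = dq_budget_bytes_per_sec
            self.spent = 0

        def admit(self, estimated_bytes):
            # Refuse (or serve a cheaper, partial answer for) requests that would
            # exceed this second's budget, preserving yield for the rest.
            if self.spent + estimated_bytes > self.budget:
                return False
            self.spent += estimated_bytes
            return True

        def tick(self):
            # Call once per second to start a new accounting window.
            self.spent = 0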
Flaws:
1.> The paper focuses on systems with read-mostly workloads. I would be interested in learning what similar principles should be used in a system with heavier write workloads.
Applications:
1.> The design principles could be used for any web service today.
2.> The ideas on online evolution could be used in any data center that needs upgrades with minimal impact on availability.
Posted by: Vinod Ramachandran | January 27, 2011 01:31 AM