Web Search for a Planet: The Google Cluster Architecture
Web Search for a Planet: The Google Cluster Architecture. Luiz André Barroso, Jeffrey Dean and Urs Hölzle, IEEE Micro, March-April 2003.
Review for this or other paper due Thursday, 1/27.
Comments
Summary:
In this article, the authors describe the Google cluster architecture, the challenges they face, and their solutions to those challenges. Much of the emphasis is on the nature of the application running on the cluster, which is highly parallelizable. Cost, or rather price/performance, is another factor critical to their design.
Problem statement:
The problem they are trying to solve is essentially building a distributed cluster that has the desirable properties of a distributed system that we talked about in class. Of course, these properties are defined relative to their particular application and the scale of their service.
Summary of the contributions:
This is not strictly a research paper. However, we can discuss the interesting design decisions they made and how those decisions make their cluster a desirable distributed system.
Desirable properties that they mention clearly:
Scalability: Their system is scalable in the sense that they can easily add servers to accommodate increased load. They scale horizontally, adding more nodes of the same commodity class rather than buying bigger machines.
Handling failures: This is an important part of their design, since failures in a cluster of this size are frequent and partial by nature. Because the data (both the index and the documents) is replicated, they can tolerate partial failures until the broken node comes back up.
Speed: They geographically distribute their clusters so that users around the world see low latency. They also parallelize the computation of results extensively: a query is fanned out to many index shards and the partial results are merged (see the sketch after this list).
Consistency: They can mostly ignore this factor, since the workload is essentially read-only and they assume updates are infrequent.
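To make the scatter-gather idea above concrete, here is a minimal sketch in Python. The names (Replica, search_shard, etc.), the thread-pool fan-out, and the naive term-frequency scoring are all my own illustration, not code from the paper; the point is only that a query hits every index shard in parallel, each shard falls back across replicas on failure, and the partial results are merged by score.

```python
# Hypothetical sketch of scatter-gather query processing with replication.
# Shards, replicas, and all names here are illustrative, not from the paper.
from concurrent.futures import ThreadPoolExecutor

class Replica:
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs      # doc_id -> text for this index shard
        self.alive = True

    def search(self, term):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        # Score = naive term frequency; real scoring is far richer.
        return [(text.count(term), doc_id)
                for doc_id, text in self.docs.items() if term in text]

def search_shard(replicas, term):
    """Try each replica of a shard until one answers (partial-failure tolerance)."""
    for r in replicas:
        try:
            return r.search(term)
        except ConnectionError:
            continue            # broken node: fall through to the next replica
    return []                   # whole shard unavailable: degrade, don't fail

def search(shards, term):
    """Fan the query out to every shard in parallel, then merge by score."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda reps: search_shard(reps, term), shards)
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, reverse=True)   # highest score first

# Two shards, each replicated twice; one replica has failed.
shards = [
    [Replica("s0-a", {1: "web search for a planet"}),
     Replica("s0-b", {1: "web search for a planet"})],
    [Replica("s1-a", {2: "cluster architecture for web search"}),
     Replica("s1-b", {2: "cluster architecture for web search"})],
]
shards[0][0].alive = False      # partial failure, masked by replication
print(search(shards, "search")) # still returns hits from both shards
```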
Desirable properties that are not so clear from the article:
Efficiency/power utilization: From this article, it seems they are not very power-efficient, a consequence of using commodity PCs in their clusters. However, I know that Google has a Going Green initiative; I would be interested to know how they are addressing this problem.
Testability/monitoring: The article does not make it obvious how they handle this. It seems that, since their software is homegrown and their hardware is mostly simple and homogeneous, this is not a big deal.
Security: They do not address this in the article, but I would assume some sort of firewall exists to deny unauthorized access to the cluster servers and data.
Decentralized administration: Again, the article does not mention how they manage this. However, it seems to me that their clusters around the world could each be administered independently, since each is standalone and holds all the information it needs to answer a query.
Flaws of the paper and real-world applications:
Because this article is not a research paper, it mostly describes how Google ended up succeeding in its business, with significant emphasis on cost and price/performance.
One should not conclude from this article that the same reasoning would work for every large-scale cluster. Using commodity PCs instead of higher-end machines works only if the application running on them is highly parallelizable.
Google is certainly both a role model and a competitor for search companies that exist today or will appear in the near future, so its architecture has a huge influence on how others will go about doing things. We can compare Google's role to Toyota's in the car manufacturing field.
Posted by: Fatemah | January 27, 2011 06:27 AM
Summary: In the aforementioned article, Barroso et al. present the Google cluster architecture: a highly available, low-latency, and scalable distributed web-search system built from commodity hardware, with the aim of getting better price/performance (cost per query).
Problem Statement: The fundamental problem the Google cluster tries to address is building a cost-efficient distributed system (scalable, highly available, fault tolerant, energy efficient, and whatnot!) aimed at running workloads typical of their problem space: highly parallelizable, with no private state, and mostly read-only.
Summary of the Contributions:
One of the biggest contributions of the Google architecture has been to demonstrate (beyond a reasonable doubt) that one can build a reliable distributed service, with all the desirable properties of a typical distributed system, at a very low cost using commodity hardware. One can argue that it is relatively easy to build such a system if one has enough money to throw at bigger, faster machines with more memory and disk, but to do so without having the costs go through the roof is indeed remarkable.
It would not be wrong to say that it is perhaps this architecture that sparked other groups' interest in building such cost-effective, scalable systems out of off-the-shelf components (BCube and DCell immediately come to mind, not to mention a lot of work in the networking space, e.g., routers).
Also, MapReduce is a direct extension of their aggressive exploitation of the inherent parallelism in their workloads: parallel lookups followed by merging (a toy sketch of this idiom follows).
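As a toy illustration of that lookup-and-merge pattern, here is a minimal map/reduce-style word count in plain Python. This is my own sketch of the general idiom, not Google's MapReduce API: the map phase runs independently on each document shard, and the reduce phase merges the partial results.

```python
# Hypothetical sketch of the map/merge idiom; not the actual MapReduce API.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(shard):
    """Runs independently on one document shard (embarrassingly parallel)."""
    counts = Counter()
    for doc in shard:
        counts.update(doc.split())
    return counts

def reduce_phase(a, b):
    """Merge two partial results; Counter addition sums per-word counts."""
    return a + b

if __name__ == "__main__":
    shards = [
        ["the web is big", "search the web"],
        ["index the web", "rank and search"],
    ]
    with Pool(processes=len(shards)) as pool:
        partials = pool.map(map_phase, shards)   # parallel lookups
    total = reduce(reduce_phase, partials)       # merging
    print(total.most_common(3))
```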
The other key takeaways from their architecture are software-based fault tolerance (since cheap hardware does not have a long shelf life anyway), load balancing at multiple levels of replication (sketched below), relaxing consistency for read-only workloads, and low management overhead (since their application is homogeneous).
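A rough sketch of what load balancing at multiple levels could look like in code. The two-level structure (pick a cluster, then pick a replica inside it) follows the paper's description; the round-robin policy and all names here are my own simplification, since the DNS level in reality also weighs geographic proximity and available capacity.

```python
# Hypothetical two-level load balancer; names and policy are illustrative.
import itertools

class Cluster:
    def __init__(self, name, replicas):
        self.name = name
        self._rr = itertools.cycle(replicas)   # round-robin over replicas

    def pick_replica(self):
        return next(self._rr)

class DnsBalancer:
    """First level: spread users across geographically distributed clusters."""
    def __init__(self, clusters):
        self._rr = itertools.cycle(clusters)

    def route(self):
        cluster = next(self._rr)           # level 1: choose a cluster
        replica = cluster.pick_replica()   # level 2: choose a replica in it
        return cluster.name, replica

balancer = DnsBalancer([
    Cluster("us-east", ["gws-1", "gws-2"]),
    Cluster("eu-west", ["gws-3", "gws-4"]),
])
for _ in range(4):
    print(balancer.route())
# ('us-east', 'gws-1'), ('eu-west', 'gws-3'), ('us-east', 'gws-2'), ...
```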
Flaws:
To me, there aren't any significant flaws in their architecture as described in the paper; it scales well for the workloads they seem to run. However, what worked for them might not work for another company. Also, as per the paper, it seems they were still struggling to bring their power consumption (watts per unit of performance) down. It would be interesting to find out what they did to solve this, given how much focus has shifted to power-aware data churning since then. (Trivia: most of Google's data centers are located in places with geological stability (low seismic activity, no volcanoes, etc., so that Mother Nature does not destroy a data center) and high availability of power sources.)
Posted by: Rohit | January 27, 2011 08:16 AM