The Google File System
The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung; SOSP '03.
Comments
This system is designed to handle a different request distribution than we've seen so far. It is optimized for large files where the primary operations are read and append. Any direct data modification is assumed rare or small. This lets them avoid some of the difficulties of other distributed systems.
One change is an embrace of eventual consistency. No Paxos-style consensus is performed; data is simply written to a file and a response is returned to the initiating node. Later operations bring files on different replicas into consistency by reordering as necessary.
They have introduced a new form of server hierarchy. There is not one class of server where every node handles requests and holds data, there isn't a master server supported by replicas that only vote to replace it, and there isn't a clean division into front end and back end. What does exist is a master node that handles metadata only and is responsible for matching requests from a client to a server holding the data for that object. The master sees the identity of all clients that interact with a given chunk of data, so it can easily impose an order on their collective operations. This ordering is used to achieve eventual consistency. Since data is rarely overwritten, the ordering doesn't have to be agreed on by everyone quickly - there is little cost to reordering your data compared with recovering lost data. To solve the problem of dead masters, they avoid the use of election protocols by taking advantage of the small size of the master's state: it only holds metadata and fits in main memory. This means it is easily replicated to shadow masters that can temporarily step in to replace a failed master.
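To make the master's role concrete, here is a minimal sketch (my own reading of the paper, with hypothetical names) of the two in-memory maps it keeps: the file namespace mapping each file to its chunk handles, and the chunk-to-replica map, which is rebuilt from chunkserver reports rather than persisted.

class Master:
    def __init__(self):
        self.namespace = {}        # "/logs/crawl-0001" -> [chunk handles, in order]
        self.chunk_locations = {}  # chunk handle -> set of chunkserver addresses
        self.chunk_version = {}    # chunk handle -> version number

    def lookup(self, path, chunk_index):
        """Read path: translate (file, chunk index) into a handle plus the
        replicas currently holding that chunk; the client caches the answer."""
        handle = self.namespace[path][chunk_index]
        return handle, self.chunk_locations[handle]

    def report_chunks(self, server_addr, handles):
        """Chunkservers report their chunks in heartbeats; this map is never
        persisted, it is simply reconstructed from these reports."""
        for h in handles:
            self.chunk_locations.setdefault(h, set()).add(server_addr)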
Another consequence of data reordering being much easier than data modification is that data can be appended to a chunk before the replica knows what position it should hold relative to the rest of the chunk. The authors refer to this as decoupling the data flow from the control flow. This spreads network load out over time and allows better use of slack periods of the network. I don't think I fully understand the linear chaining they describe for forwarding the data between the replicas.
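On the linear chaining: the paper's own back-of-the-envelope estimate is that B bytes reach R replicas in roughly B/T + R*L, because each chunkserver forwards bytes to the next one in the chain as soon as they arrive. A quick check of that figure, assuming the 100 Mbps links and roughly 1 ms per-hop latency the paper uses:

# Back-of-the-envelope check of the pipelining estimate B/T + R*L.
B = 1_000_000        # 1 MB payload, in bytes
T = 100e6 / 8        # 100 Mbps full-duplex link, in bytes per second
L = 1e-3             # ~1 ms latency per hop
R = 3                # default replication factor

elapsed = B / T + R * L
print(f"~{elapsed * 1000:.0f} ms to push 1 MB through a chain of {R} replicas")
# about 83 ms, close to the "roughly 80 ms" figure quoted in the paper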
Some other optimizations include replication across racks to protect against data loss from rack failures and to spread out network use, not using caches since the files are prohibitively large, and avoiding complicated agreement on data deletion by delaying garbage collection for days.
This system setup reminds me of common access control schemes and I'm curious how easy it would be to either implement one through GFS or add one on top. The amount of state the master holds might get too big for it to retain its current advantages.
I'm impressed by the design of the system; it seems well matched to their model of the operation distribution. I'm skeptical of the model. Perhaps append is much more attractive when there are copious amounts of memory available in the system and I'm simply not used to that being an accepted fact. I do find it intriguing that many of the systems we've studied of late seem to have embraced the master-servant style of server division. Has this trend won in terms of large system usage?
Posted by: Brian Nixon | March 29, 2012 08:39 AM
1. one or two sentence summary of the paper
The paper describes the design and implementation of the Google File System, a distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. GFS is optimized for Google's core data storage and usage needs.
2. A description of the problem and challenges they were trying to solve
GFS was designed to satisfy Google's requirement for a distributed file system that can scale to terabytes of data across thousands of unreliable disks. Several specific points need to be addressed. First, GFS needs to be fault tolerant, since faults are common; second, files in GFS are very big, so I/O operations need some modification; third, the observation is that reads are frequent, and for writes, large appends to the end of a file are common while random writes are rare. The read and write operations need to be optimized based on these observations.
3. A summary of the contributions of the paper
First, GFS adopts a master-slave model instead of a completely distributed system. The master maintains all the metadata and the mappings in memory, and the chunkservers store chunks (64 MB pieces of a file) as files in a normal Linux file system (with 64 KB blocks). As a result of this arrangement, the master has to store very little state per chunk (because chunks are large), which can then be kept in memory. Chunks provide advantages similar to RAID striping plus mirroring across multiple disks.
Second, the master is responsible for all namespace operations. Each node in the namespace has a read-write lock, and locks must be acquired for each operation. Serialization is achieved, yet the master does not become a bottleneck, because it is not forced to perform only one operation at a time. Modifications to the replicated chunks are performed using leases that maintain the same order across all the replicas.
Third, the garbage collector works in the background as part of the regular state exchange between the master and the chunkservers. This allows the garbage collector to work in times of low activity. Some time has to pass from when a file deletion occurs until its chunks start being freed, which guards against accidental deletion of files.
4. The one or two largest flaws in the paper
First, the master seems to be a single point of failure in the design. While replication of the master is available, the model doesn't look like a hot-standby model, so some downtime seems possible. Second, the consistency guarantees given by the system seem not bulletproof, considering that it is up to the client application to interpret the file contents while reading and to ensure logical consistency.
5. A discussion of how the ideas in the paper are applicable to real systems.
Techniques used in GFS have been known for a long time and have been used in many other places: for instance, replication in RAID, and a centralized master server for control traffic (and location independence) as in AFS. Lazy deletion and garbage collection have been used in programming languages and in many other file systems such as LFS.
Posted by: Junyan Chen | March 29, 2012 07:58 AM
The Google File System is a scalable distributed file system for large distributed data-intensive applications. It was designed based on the needs of application workloads and the technological environment at Google. Therefore, it is optimized for the multi-GB files that are common at Google. The architecture of the system follows a single-master design, with the master server storing the namespace and the mapping of files to where they are stored. The client first contacts the master to get the location of the file it wants to read, and then contacts the server that actually stores the data. The system itself is designed to be built on inexpensive commodity hardware, which is expected to fail often because of both its quality and the large number of machines in use. However, it still guarantees consistency for successful operations, and gives the client enough information to detect inconsistency in case of failures.
I think the design of the system fits well with the specific needs of Google. Because the system is expected to store multi-GB files, files are split into 64 MB chunks. Having big chunks allows the master to store metadata for large files in relatively little space (less than 64 bytes per chunk). Internal fragmentation is avoided by doing lazy space allocation. Using a large quantity of low-cost hardware makes failure the common case for the system; GFS maintains high availability by recovering quickly when servers are terminated.
GFS's single-master design simplifies the coordination that would otherwise be needed between servers, and it is done in a way that minimizes the bottleneck. The interaction between clients and the master is minimized both in terms of data per query and queries per second. Having large chunks allows the client to access files with fewer chunk handles. Clients can ask for multiple chunk handles per request, and the master includes the locations of the chunks requested as well as a few following chunks in its replies. Because sequential reads are the common case in GFS, this reduces the number of client requests. The master keeps most of its metadata in memory and uses logging to keep it persistent. The master is also replicated, with shadow masters to tolerate its failure. The shadow masters can also provide read-only access to the file system.
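A rough sketch of that client-side read path (hypothetical names; master_rpc and chunkserver_rpc stand in for whatever RPC layer a real client library uses): the client turns a byte offset into a chunk index, asks the master once, caches the reply (which may include the following chunks), and then talks to a chunkserver directly.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Client:
    def __init__(self, master_rpc):
        self.master = master_rpc
        self.cache = {}  # (path, chunk index) -> (chunk handle, [replica addresses])

    def read(self, path, offset, length, chunkserver_rpc):
        chunk_index = offset // CHUNK_SIZE
        if (path, chunk_index) not in self.cache:
            # One master round trip; the master may also return the locations
            # of the next few chunks, which we cache to save future lookups.
            for idx, handle, replicas in self.master.find_chunks(path, chunk_index):
                self.cache[(path, idx)] = (handle, replicas)
        handle, replicas = self.cache[(path, chunk_index)]
        # Read file data directly from any replica, bypassing the master.
        return chunkserver_rpc(replicas[0]).read(handle, offset % CHUNK_SIZE, length)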
GFS is a file system specifically designed for the needs of Google, which include routinely handling large files. It has been shown to meet its purpose. In particular, the architecture of the system will be useful for distributed systems that involve many large files.
Posted by: Min Thu Aung | March 29, 2012 07:27 AM
The Google File System is a distributed file system that is optimized for a completely different type of application workload (huge files, sequential reads, a low write-to-read ratio, append-style writes) than any earlier distributed file system had considered.
The problem the paper is trying to solve is how to design a file system for a workload that is different from the normal workload a local file system would need to handle, and how to handle multiple, continuous failures of various components in the system caused by the use of cheap commodity hardware.
The basic assumption and observation of the application workload at Google is the major driver for various design decisions in GFS. The workload that GFS needs to handle has the following characteristics:
1. Huge multi-GB files are common, and the system need not be optimized for small files
2. It needs to be optimized for large sequential reads while still supporting small random reads
3. Writes are mostly appends, and overwriting writes need NOT be supported
4. Concurrent appends are common and need to be synchronized (given some ordering) by design
5. Sustained bandwidth matters more than latency; higher latency is tolerable
6. Most important of all, failure is the norm rather than a rare occurrence
These workload assumptions make the problem very different and interesting, as even basic norms of local file systems, like "smaller block sizes are better", need to be revisited.
There were numerous contributions in every aspect of the design of GFS; the major ones are summarized as follows:
* Master-slave model: it makes the problem much simpler, and GFS handles the inherent possibility of the master becoming a bottleneck (the major problem in master-slave systems) with various techniques: making the master responsible only for the metadata while all data-related operations go to the chunkservers (slaves), avoiding as much traffic as possible using leases and metadata caches at the client, and using larger chunk sizes (hence less metadata per byte accessed), which reduces the number of client-master interactions.
* Identifying that the common access pattern (sequential reads) does not require caching of the data, and hence avoiding the unnecessary cache-coherence overhead that would otherwise need to be handled.
* Metadata management in the master: it is designed to persistently store only the namespace and the file-to-chunk mapping (using an operation log) and not the chunk locations. This is because chunk locations can always be gathered from the chunkservers at any time (hence using the aggregate disk capacity rather than replicating it on the master's disk), and the file-to-chunk mapping is the most critical metadata that defines the file system.
* Choice of a larger chunk size, for several reasons: reads are sequential and large in size, and there is less metadata per unit of data, hence a smaller overall metadata footprint that can fit in main memory while still supporting multi-GB files.
* The operation log is designed for faster recovery, and it is synchronously replicated to other servers since it is the most critical information (and so that the master is not a single point of failure for it).
* Support for concurrent record appends: serializing these operations through a primary chunkserver, just as for ordinary writes, is what achieves the system's consistency guarantee for appends (at least once).
* Hierarchical locking and replica placement are straightforward, intuitive design decisions among the others.
Overall, the paper has many outstanding contributions in terms of design decisions taken based on observation of their application workload. But it is hard to believe that only a handful of applications do random reads. Intuitively, an efficient search algorithm (one of the major applications at Google) and the data structures it uses would be expected to do lots of random reads. For instance, hash tables, the most efficient way (O(1)) to search through a set of key-value pairs, require random reads, and B+-trees or other tree variants (an alternative data structure for organizing data for search) require random reads (and writes). Though it would be possible to design applications to avoid random reads/writes, such a design would trade read/write performance for less efficient data structures. It is hard to quantify such a trade-off.
Though GFS is specifically designed for data-intensive workloads at Google, it is a very good example of a system design exercise that optimizes for a very specific subset of all possible workloads.
Posted by: Venkatanathan Varadarajan | March 29, 2012 07:08 AM
The Google File System was designed around specific observations of Google's application workload together with the common goals of other distributed file systems. They make some assumptions that are not well considered in other distributed file systems. For example, the system stores a modest number of large files; the workload primarily consists of large streaming reads, small random reads, and large sequential writes. Throughput is more important than latency.
A GFS cluster consists of a single master and multiple chunkservers. Files are divided into fixed-size chunks; each chunk has a unique chunk handle and is stored on several chunkservers. The master stores all metadata, such as the (filename, chunk index) -> (chunk handle, chunkservers) mapping. When a client issues a request, the master sends back a chunk handle and a chunkserver to the client, and the client then sends its request to the chunkserver directly. The master stores chunk locations only in memory: when each chunkserver starts, it sends its chunk list to the master. While handling requests from clients, the master keeps an operation log and checkpoints.
To guarantee consistency, the master selects one of the replicas as the primary; the primary orders the write sequence and all other replicas follow this order. All replicas keep a chunk version number so that stale data caused by a server crash or a lost message can be detected.
There are also some other features in GFS, such as snapshots, namespace locking, replica placement, and garbage collection. During a snapshot, the primary's lease is revoked so that all following updates have to go through the master first. The namespace is used for fast lookups and for locking. Replica placement takes load balance across machines and racks into consideration. Garbage collection renames deleted files to hidden names and removes them lazily, in the background.
GFS makes several contributions: (1) GFS introduces a new level of storage, the chunk. By chunking files, updates and replication of large files avoid large I/O traffic over the whole file. (2) Metadata is stored in the master's memory and updated from the chunkservers. In this way, the master has less work to do and it is easy to keep the metadata accurate. (3) Data flow and control flow are decoupled, which is really innovative. Data can be sent in a pipeline to maximize network utilization, while control flow is ordered through the primary to guarantee consistency. (4) Each chunkserver stores data on top of a Linux file system. This may add a bit of latency compared with operating on the disk directly, but it simplifies the chunkserver design and can make use of Linux features such as local logs.
Discussions: (1) When a system grows large, consistency becomes difficult. I see that both Dynamo and GFS relax their consistency requirements. This choice is based on the application requirements. What if consistency is the primary consideration when designing a distributed system, such as a banking system? I think in that case we may lose availability in some cases, such as concurrent writes.
(2) GFS uses a single master. Although they use some mechanisms such as redirecting reads/writes to the chunkservers, the master can still be a bottleneck. I am wondering how large an application GFS can support and what applications are running on GFS. For example, how large is Gmail? Can Gmail run on GFS? (I do not think so; what file system does Gmail run on, and especially, how do they deal with the large number of requests at the front end?)
(3) The choice of chunk size needs to be reconsidered. The authors claim that large chunks reduce client-master interaction. I do not think so: the client issues a request (filename, offset), the master returns a list of (chunkserver, chunk handle), and there is no reduction in interactions. But it is true that the master can keep less metadata with a larger chunk size. A large chunk size also reduces TCP connections. This is true, but fewer TCP connections imply less network utilization - is this a benefit?
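For what it's worth, here is a quick calculation of the two effects being debated, under the assumption (mine, not the paper's) that a client needs one master lookup per chunk it touches and that each chunk costs the master under 64 bytes of metadata:

GB, PB = 2**30, 2**50
for chunk_size_mb in (1, 64):
    chunk = chunk_size_mb * 2**20
    lookups_per_gb_read = GB // chunk      # one master lookup per chunk touched
    metadata_per_pb = (PB // chunk) * 64   # bytes of per-chunk state on the master
    print(f"{chunk_size_mb:>3} MB chunks: {lookups_per_gb_read:5d} lookups per GB read, "
          f"~{metadata_per_pb / 2**30:.0f} GiB of chunk metadata per PB stored")
# 1 MB chunks: 1024 lookups per GB read and ~64 GiB of metadata per PB stored;
# 64 MB chunks: 16 lookups and ~1 GiB. For sequential reads the reduction in
# interactions is real, and the metadata saving is what keeps the master in memory.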
Posted by: Wenfei Wu | March 29, 2012 05:37 AM
This paper presents the design, implementation, and a performance evaluation
of Google's distributed filesystem GFS. GFS is not a general-purpose
POSIX-style filesystem (though it does have some similarities to one), but
instead a large-scale storage system designed specifically to perform well
with the kinds of applications Google runs on it. It consists of a single
master node that stores filesystem metadata and coordinates client accesses,
and a large number of chunkservers that provide bulk storage and data transfer
to and from clients.
GFS is targeted at performing well for Google-style access patterns. This
means it must be able to efficiently handle very large amounts of data,
usually in relatively large files. These files are typically written in large
sequential writes and appends, though also often by multiple writers
performing concurrent appends, which must be handled in a semantically sane
and useful way. Reads are a mix of small random accesses and large streaming
ones. Additionally, applications using GFS are generally much more demanding
of high overall throughput than of low latency, so small overheads like the
additional network roundtrips to first contact the master node and then make a
request to a chunkserver are of little concern.
In order to accommodate these sorts of access patterns, GFS divides files into
large chunks (64MB) which are distributed amongst the available chunkservers,
with the master node retaining the metadata mapping a file to its component
chunks and their respective locations. To provide resiliency to chunkserver
failures, each block is replicated to multiple nodes (3 by default); data is
streamed sequentially between these replicas in a pipelined fashion in order
to maximize utilization of the available network links.
In my opinion, the most glaring problem with the GFS design is the use of a
single master node. While it is in practice apparently not a major
performance bottleneck, and GFS does keep quickly-startable master replicas at
the ready waiting to be swapped in on short notice, it strikes me as
aesthetically unappealing and seems counter to the general goals of
distributed system design. Though implementing something like this would
doubtless be non-trivial, I would find using a scheme like Chord to distribute
metadata throughout the membership of the system to be a much nicer solution
to the problem. Metadata could be replicated like data chunks are, and a
client could then contact any arbitrary server (no longer merely a
chunkserver, but a chunk-and-metadata-server) for any arbitrary file access.
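For concreteness, here is a minimal sketch of the kind of scheme being proposed here (this is the commenter's alternative, not anything in the GFS paper): hash each chunk handle onto a ring of servers and let the next few successors hold its metadata, so any node can compute where to look.

import hashlib
from bisect import bisect_right

def h(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class MetadataRing:
    """Chord-style placement: metadata for a chunk lives on the first
    `replicas` servers clockwise from the hash of its handle."""
    def __init__(self, servers, replicas=3):
        self.replicas = replicas
        self.ring = sorted((h(s), s) for s in servers)

    def owners(self, chunk_handle):
        points = [p for p, _ in self.ring]
        start = bisect_right(points, h(chunk_handle)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

ring = MetadataRing([f"server{i}" for i in range(10)])
print(ring.owners("chunk-0000017"))  # any node can compute this locally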
Also, I wonder about the real-world failure rates with kilo-node clusters and
a (default) replication factor of three. With that many nodes, it doesn't
seem at all unlikely that three or more nodes (in separate racks, not just a
group outage) could fail simultaneously. And with tens or hundreds of
thousands of chunks (perhaps even millions?), given such a failure, the odds
that there exists a chunk whose set of replicating chunkservers is a subset of
the failed nodes seems potentially non-negligible. If this were to occur,
such a chunk would obviously become unavailable. I would have liked to see
some discussion of if (and if so, how) the replication factor is adjusted
relative to cluster size in order to address this possibility, or if it has
ever actually arisen in practice.
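To put a rough number on that worry, here is a toy estimate (my assumptions, not data from the paper): if a chunk's replicas are placed on nodes chosen independently at random, the chance that a particular chunk lies entirely inside a failed set of f nodes out of N is C(f,3)/C(N,3), and a union bound over all chunks gives an upper estimate of losing at least one.

from math import comb

def p_some_chunk_lost(total_nodes, failed_nodes, num_chunks, replicas=3):
    p_one_chunk = comb(failed_nodes, replicas) / comb(total_nodes, replicas)
    return min(1.0, num_chunks * p_one_chunk)   # union bound over all chunks

# A 1000-node cluster storing a million chunks:
print(p_some_chunk_lost(1000, 3, 1_000_000))    #  3 simultaneous failures -> ~0.006
print(p_some_chunk_lost(1000, 10, 1_000_000))   # 10 simultaneous failures -> ~0.72

So the concern is not unfounded, though rack-aware placement and fast re-replication shrink the window during which a correlated failure can erase all copies.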
Posted by: Zev Weiss | March 29, 2012 05:15 AM
This paper describes the file system that underlies most of Google's web-based applications. This file system (aptly called the Google File System) is designed from the ground up specifically for their use case (long writes/long reads, infrequent modification). The system uses a modified version of the master/slave model in which a master controls the allocation of blocks, but clients connect directly to the block servers when actually transferring data. This design was chosen specifically to reduce the chance of the master node becoming a bottleneck for the network. They dealt with numerous challenges when building this system and overcame them with some interesting design choices. One issue that they were concerned about was consistency: they were worried about master server crashes causing the system to become inconsistent. To solve this they use a sort of distributed journaling-style algorithm to keep track of the states of the various block servers. There are numerous examples similar to this one listed in the paper.
This paper was interesting due to the rare amount of detail on the design choices that they made. It also shows that you can get a great mix of performance, consistency, and availability by customizing data stores to more closely fit the application that will use the system (instead of using a generic off-the-shelf storage solution).
Posted by: Benjamin Welton | March 29, 2012 03:58 AM
Google File System
The link to this paper was broken on the class website. A replacement was located here:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
Google File System (GFS) is a custom distributed file system written specifically for Google's own internal applications. It does not attempt to make use of the file system transparent, instead requiring the user to link against a custom library, and it breaks POSIX semantics by adding atomic appends. In addition to assuming specific workloads, it places several requirements on the network architecture used to support it. Using the IP addresses of machines to control data-forwarding decisions (unlike the approach attempted in the epidemic replication paper) is new. A unique idea mentioned in this paper is that the unit of failure is not just a single machine, but a whole rack.
Flaws:
This paper claims that in practice it does not cause any detriment to GFS to keep all metadata in memory. A chunk id is 64 bits in size. A modern 64-bit OS only gives you 48 bits of virtual memory space (from Wikipedia: the AMD64 architecture as of 2011 allows 52 bits for physical memory and 48 bits for virtual memory). If a single chunk requires 64 bytes of metadata, then (given that a master would store its chunk data inside a single process, which makes sense in light of all the arguments about simplifying the design) the number of chunks possible in one GFS cluster is limited to 2^(48-6) rather than 2^64. It is curious that the clusters described in section 6.1 have a number of chunks that would easily fit in a 32-bit number. A 32-bit chunk id would allow a GFS cluster to address 274 petabytes.
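Working through the arithmetic in that argument (my numbers, using the same assumptions: a 48-bit virtual address space, 64 bytes of metadata per chunk, and 64 MB chunks):

max_chunks_in_ram = 2**48 // 64        # = 2**42, about 4.4e12 chunks per master
addressable_32bit = 2**32 * 64e6       # 32-bit chunk ids at 64 MB per chunk

print(f"chunks whose metadata fits in a 48-bit address space: {max_chunks_in_ram:.1e}")
print(f"data a 32-bit chunk id could address: {addressable_32bit / 1e15:.0f} PB")
# ~4.4e12 chunks and ~275 PB respectively, so the 2^42 ceiling is far above the
# cluster sizes reported in section 6.1, and a 32-bit id alone already covers
# the ~274 PB figure mentioned above.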
Along the same lines as decreasing the mean time to recovery, consider the recovery of a dead master. Section 2.6.2 mentions that masters do not keep chunk location information in persistent storage and that they query chunkservers for location information on startup. The time to recover with a large number of allocated chunks may be significant.
While extra features may not be required to support the custom Google workload, there are several that could be simple to add. Some support for increasing a file's replication if it becomes a hot spot would be appreciated. It would not be hard to implement, as the master could increase a file's replication when the inquiry rate for a chunk's location exceeds a certain threshold.
Posted by: Matt Newcomb | March 29, 2012 02:46 AM
This is an interesting paper about Google File System. GFS is a scalable distributed file system for large distributed data-intensive applications, that provides fault tolerance and delivers high aggregate performance to a large number of clients.
The goal of the system is to provide a file system that meets today's demands in large data-intensive distributed systems. The main demands include 1) the ability to deal with the component failures that happen on large numbers of inexpensive commodity machines, 2) the ability to deal with large files, 3) being friendly to appending new data to files instead of overwriting existing data, and 4) co-designing the file system API with the applications.
The main contribution of the paper is the design and discussion of GFS. GFS consists of a single master and multiple chunkservers, and is accessed by multiple clients. The use of a single master always raises doubts about a single point of failure and a bottleneck; however, the Google engineers successfully answer those doubts. The big chunk size alleviates the load on the single master. An operation log is adopted to provide a persistent historical record of critical changes, which provides reliability together with replication. GFS adopts a relaxed consistency model that gives applications guarantees on write and append operations while keeping the system simple. All the components interact through a well-designed protocol that defines the data and control flows of operations. In this protocol, leases are introduced to maintain the mutation order.
In the master, a well-defined locking scheme enables concurrent mutations to happen in the same directory. There is also a replica placement policy to maximize data reliability and availability and to maximize network bandwidth utilization. The master also handles chunk creation, re-replication, and rebalancing. Garbage collection is used instead of eager deletion because of its simplicity, because storage reclamation becomes a regular background activity, and because it guards against accidental, irreversible deletion.
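On the locking scheme mentioned at the start of that paragraph, here is a condensed sketch of the rule as I read it (helper names are mine): an operation takes read locks on every ancestor path and a read or write lock on the full pathname, which is why two file creations in the same directory do not block each other, while a snapshot of that directory serializes against them.

def locks_needed(path, leaf_mode="write"):
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    # read locks on ancestors, read or write lock on the full path itself
    return [(p, "read") for p in prefixes[:-1]] + [(prefixes[-1], leaf_mode)]

def conflicts(op_a, op_b):
    """Two operations conflict if some path is locked by both and at least
    one of them wants a write lock on it."""
    a, b = dict(locks_needed(*op_a)), dict(locks_needed(*op_b))
    return any(p in b and "write" in (a[p], b[p]) for p in a)

print(conflicts(("/home/user/foo", "write"), ("/home/user/bar", "write")))  # False
print(conflicts(("/home/user/foo", "write"), ("/home/user", "write")))      # True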
To deal with failures, GFS provides fast recovery, so that a server can restart in seconds. Also, chunk replication provides higher availability, and master replication provides more reliability. It uses checksums to guarantee data integrity, and diagnostic logging to help with problem isolation, debugging, and analysis.
In sum, this is a well-designed system. However, there is something that might be missing in this paper. The paper discusses the aggregate performance of the whole storage system, but it doesn't discuss the performance effect on a single client. GFS introduces many optimizations for better aggregate performance, such as sorting the small random reads. However, from the application's view, we don't know whether GFS provides any guarantees on a single client's performance; unfortunately, the paper doesn't discuss this topic.
Another topic that is missing is the applicability to different kinds of workload. Nowadays, more and more services need real-time manipulation of large data, such as recommendation systems, real-time analysis of user activities, and so on. GFS provides high bandwidth but does not guarantee the latency of requests, so it looks like GFS is not quite suitable for this type of application. Also, for services like large photo services, GFS does not seem to work very well.
In sum, this is a great paper about distributed file system design. There is no doubt that it is still applicable nowadays for some types of applications. Hadoop also implements another version of GFS (HDFS), which is widely used around the world outside Google.
Posted by: Xiaoyang Gao | March 29, 2012 02:35 AM
This paper describes the Google File System, a distributed file system designed for high reliability and large files while reducing guarantees for some file writing features. In general, writes are expected to be append operations rather than ones that overwrite previous data in a file.
What is probably the most important goal of the Google File System is to be able to compensate for near-constant component failures of all kinds. There can be no guarantee that a certain machine will continue operating or not lose data it had stored. Another goal was to give better performance for large files, which led to a much larger block (chunk) size than in typical file systems.
The Google File System consists of a master server (plus a replica of it) and many "chunkservers" which are coordinated by the master. The master is responsible for keeping track of file metadata, which consists of permanent data about what files exist and which data chunks belong to which files. There is also information about which chunkservers store copies of which data chunks, but this is not permanent data and is reconstructed by contacting all the chunkservers whenever the master restarts. An operation log for important items is always written to disk before those actions on the master server can be confirmed, and allows for later recovery. The design allows for multiple clients, who must initially contact the master server to get a list of the chunkservers that have copies of the data that they need. After that, a client caches the location of this data so it can continue contacting the chunkservers directly without needing to go back to the master server, at least for a certain interval of time. The chunkservers each store copies of certain data chunks originally assigned by the master server and keep a version number with each of the data chunks. The master server sends requests for individual chunkservers to add or remove data chunks from their load, but it is not mandatory or guaranteed that these requests from the master will be followed. The goal is to have copies of each chunk on a certain minimum number of machines and also on at least two racks of machines in case a whole rack fails. Whenever an error on a specific chunkserver occurs, such as corruption of a chunk or an outdated version number, the chunk at that location is tossed away and a new copy must be sent before the data is again present on that specific chunkserver.
Most writes are expected to be append operations. In the Google File System, append operations are not guaranteed to end up at an exact location in a file, but rather at some point at or past the current end of the file at the time of the request (data from other clients might also arrive around the same time). For write operations to occur, a chunkserver must be granted a lease on that data chunk from the master server. A client is responsible for sending the new data to all chunkservers holding that chunk of data. Once it has sent the data, it then contacts the chunkserver holding the lease, telling it that the data has arrived at all the locations. That chunkserver then decides the exact ordering in which this data will be placed into the chunk and passes this ordering along to the other chunkservers holding the data. In the case of a failure, a client is expected to continue trying the same write operation until it succeeds or else inconsistencies may occur. Due to certain design choices, the same chunks can have non-matching checksums on different machines while still being correct. This requires each machine to keep track of corruption on its own and for others to trust that its data has been correctly validated.
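Here is a toy in-memory sketch of that write path (all names are mine, not the real GFS client library): data is pushed to every replica first; the primary then assigns a serial number, and the same order is applied at the secondaries.

class Replica:
    def __init__(self):
        self.buffered = {}   # data pushed but not yet applied
        self.chunk = []      # applied mutations, in serial order

    def push_data(self, data_id, data):     # step 1: data flow, any order
        self.buffered[data_id] = data

    def apply(self, serial, data_id):       # steps 3-4: control flow, fixed order
        self.chunk.append((serial, self.buffered.pop(data_id)))
        return True

class Primary(Replica):
    next_serial = 0
    def order(self, data_id):               # step 2: the primary picks the order
        Primary.next_serial += 1
        self.apply(Primary.next_serial, data_id)
        return Primary.next_serial

def write(primary, secondaries, data_id, data):
    for r in [primary] + secondaries:
        r.push_data(data_id, data)
    serial = primary.order(data_id)
    ok = all(s.apply(serial, data_id) for s in secondaries)
    return "success" if ok else "retry"     # the client retries on partial failure

p, s1, s2 = Primary(), Replica(), Replica()
print(write(p, [s1, s2], 1, b"record-1"), write(p, [s1, s2], 2, b"record-2"))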
The design of the Google File System relies heavily on its use cases within Google, such as appending writes. One potential problem seems to be the reliance on each chunkserver to verify its own data, with no way for others to check against it. This could potentially allow one malicious machine to corrupt all the data. Another potential problem seems to be that chunks are automatically deleted when errors are detected in them, such as out-of-date data. It seems like cases where a large portion of the chunks end up out of date, such as during partitions, could occur somewhat frequently and would wipe out perhaps most of the copies of the data. If the system indeed behaves this way, it leaves a certain amount of time where there are far fewer copies of the data remaining and the possibility of data being lost greatly increases.
This paper presents several different improvements to distributed file systems. Some of these changes, such as the appending writes, would be dependent upon whether a file system was expecting this type of behavior most of the time. Others, such as optimizing for larger files, will probably be more generally useful ideas. The primary goal of supporting a high number of failures due to the large number of components in the system is also applicable to many large systems. The continuing increase in hard drive capacity may cut down on the number of components required in the future, but there will always be a need for some systems with an especially large number of components and the high failure rates that go with that.
Posted by: Daniel Crowell | March 29, 2012 02:30 AM
Victor Bittorf, Igor Canadi, David Capel, Seth Pollen
Many of the previous file systems we have looked at (AFS, NFS, Coda) all try to emulate the POSIX semantics in a distributed fashion. Google diverged from this design goal to make themselves a file system specialized for: throughput, append-heavy workloads, fault tolerance, large files. In designing GFS, Google tailored the system to their specific needs (e.g. map reduce).
The paper does a great job of explaining how their specialized file storage system runs with availability using commodity hardware. The decision of a single master and relaxed consistency are, as we would put it, ‘epic.’ This decision is different than many approaches we see in research papers -- it is a very pragmatic approach. They do some clever protocol things to reduce load on the master but their approach to consistency was a source of much discussion in our group.
GFS forces more logic and error checking into the application layer simply because it can pad files, double write, etc. This draws a natural comparison to Dynamo which also supports ‘smart’ clients (more on this later). The approach in GFS seems to be very ad-hoc, however, requiring the applications to embed metadata into their files. It seems like an abstraction layer very tailored to their workflow (for example, most of their consumers are idempotent). While GFS is very successful for supporting this workload, this highly specialized design can introduce problems.
For example, now that google is rolling out Google+, everyone knows that google’s infrastructure is struggling with the new write-based workload. Now, there is speculation that Google is rolling out its own version of TAO (facebook’s graph model). Some members of our group say that this is fine, since GFS has a particular purpose (basically, mapreduce) and google can use a different tool for google+. Others in the group, however, point out that this goes against google’s philosophy which is to have all of their product use the same tools. Google has been known to twist systems to work with their existing infrastructure and oppose feature creep / new tools.
Either way, it is clear that GFS cannot support google+ effectively given that GFS is optimized for large file size and high throughput with relaxed consistency while social networking requires low latency, small update size, high consistency (users don’t like seeing old data, even if it is only minutes old).
Getting back to the original GFS workload (long scans of large files), it seems that GFS is a fairly optimal design (proof: google made it, google is awesome, QED -- except for google answers, google wave, dodgeball, google notebook, google buzz, orkut, ...). But the one thing we debated was the use of a single master. Yes, this simplifies the design, but it introduces a bottleneck. During the design, they chose to keep only 64 bytes of state per chunk so that all chunks can fit in memory. This introduces the practical problem of how to ‘ls’ a directory. They discuss how they do it, and it just isn’t pretty. This was a design trade-off which in some ways turned GFS from a file system into a key-value store with just really, really, really big values with block-device-style semantics. Because they don’t keep an inode-like structure, a lot of the file system semantics (ls, mv, ln) don’t follow. But, again, this was a design decision and they seem to be happy with how they made it.
One thing we felt was clever was the way they handled master startup. They don’t persist the chunk-to-machine mapping; they just requery it. This is a cute optimization. And again it shows why they wanted a single master (that can’t ‘ls’).
Looking to application, we pretty much assume it is mapreduce. Or hadoop -- but we all know (google wants us to think?) that google’s mapreduce implementation is a lot better than hadoop. Mapreduce is designed to work with mostly unstructured data such as what you find crawling the web -- organic data. This fact mirrors the sloppy ordering and consistency in GFS. If the specific order and structure of your data is important then GFS is not the place to store it; besides having to remove padding and duplicates, you would need chubby to just get ordering right and then your throughput would tank.
It is difficult to actually find fault with GFS since it is the foundation of mapreduce, which is hugely popular right now. Not even the members of our group who work at facebook were willing to fault Google on their design -- but they do claim google stole TAO (which is more likely than google stealing php or mysql). But then again, facebook claims to have the largest instance of hadoop currently in use (30 petabytes according to one of the fb blogs). And facebook using hadoop implies mapreduce implies gfs implies google knew what they were doing.
Posted by: Victor Bittorf | March 28, 2012 11:53 PM
The paper presents GFS, a distributed file system developed by Google for its own infrastructure. The paper explains the design choices and why they chose them, along with empirical evaluations of the system.
The main challenge is to build a distributed file system that targets availability, fault tolerance, reliability, scalability, and high performance. They manage to build such a system by making the following critical observations: random writes are practically nonexistent (in most cases writes are file appends), and files are generally large, written once, and read sequentially. These observations (along with some others) help them design a file system that is tailored and optimized for specific Google applications.
The paper summarizes an engineering effort to build a working real-world system, like Dynamo or Chubby. The system is the result of gathering up a bunch of design choices. The ideas are not novel, whereas the resulting system is unique. The key design choice is to use a single master. A single master makes it easy to deal with locks, data partitioning, replication, and so forth. The main challenge, on the other hand, is to make this single master scalable and prevent it from becoming a slow bottleneck. They achieve that by simplifying the jobs of the master. The master only knows the metadata and keeps it in memory. The client asks for a file, the master just tells it where the data is, and then the client connects to that specific chunkserver to execute read/write requests. This is quite similar to how the front end in LARD directs incoming requests. There are, of course, many other issues such as operation logs, failure recovery, garbage collection, replica placement, locking, and so forth. The authors try to explain each of these features and the design decisions behind them. Among these features, I like the idea of atomic record appends, since it is a very simple yet very effective design (again they observe what they need and keep the design as simple and as efficient as possible to meet their needs).
It seems that most of the consistency handling is left to the application side, so it needs to deal with verification of the records. In the Chubby paper, the authors mention how Google engineers have a hard time using complicated designs. I wonder how they deal with the consistency issue in GFS applications.
Another criticism is, of course, the design choice of having a single master. It looks like it works well even for a very large number of requests, and it should be easy for Google to buy very high-performance hardware for such single masters. On the other hand, I am curious about the limits of this design. If there is any chance of hitting the limit in the future, then I guess what will happen is a rewrite of the whole file system from scratch, gathering new and different design choices together.
What I understand from this paper (and from Dynamo and Chubby) is that the current large real-world distributed systems are not the result of super-innovative ideas. They result from careful observation and the adaptation of known simple designs that achieve high performance for specific needs. Generally, the goal is to achieve high performance and availability, and the main design decisions are about what type of consistency is needed and how to achieve it. I believe future improvements to GFS will be about obtaining even more performance and dealing with more real-time data processing rather than batch mode.
Posted by: Halit Erdogan | March 28, 2012 11:38 PM
This paper is about the Google File System (GFS), a distributed file system developed at Google. GFS targets specific types of workloads, namely, workloads where most of the writes are appends and most of the reads are large streaming reads or small random reads. In GFS, a single replicated master node holds all of the file system metadata, and many operations require contact with the master node (e.g., opening files, locating chunks, etc.). The data themselves are segregated into 64 MB chunks and held on chunkservers. When a client wishes to write to or record append a chunk, it contacts the primary chunkserver for that chunk and performs the operation through it. One of the most innovative ideas in the paper is that of pipelined replication of data to be written from one chunkserver to the next.
This paper is not as good as I remember it being, and I found some significant flaws. First, GFS blatantly commits one of the fallacies of distributed computing by assuming things about the network topology. For example, it relies on the IP address of the chunkservers to estimate the distance between them and try to find a short path for propagating data updates from the primary to each secondary. I don't understand why they couldn't just measure bandwidth and latency between chunkservers on the fly and have each chunkserver remember which other servers tend to be the fastest.
I also think there are some serious problems with their data concerning the performance of record appends as well as their explanation of how they work. First, consider Figure 7(c), which presents their data from their record append microbenchmark. The performance for that microbenchmark is obviously abysmal, never even getting to half the theoretical network limit. They explain that this is just because all the clients are appending to one file, whereas in practice each client will be appending to multiple files. They claim that the low throughput is due mostly to network congestion and variances in transfer times seen by each client. I don't buy this explanation. It fails to explain the basic problem of why the operations do not come anywhere near the network limit during the trials, since, according to an earlier claim, the network limit is given by the number of chunkservers in use, not the number of clients.
I think a more likely explanation concerns their implementation of record appends, described earlier in the paper. The authors somewhat disingenuously claim that record appends are very similar to writes except for a little extra logic at the primary. It is more complicated than that. When a client attempts a record append, it first pushes all the data to the replicas, then makes the request for the operation at the primary. If there is enough room left in the current chunk, the operation is accepted. Otherwise, the primary pads the current chunk out to its maximum size and tells the client to try again. This means the client must go to the master, get the next chunk's location, and try again with the next primary chunkserver, sending the data all over again. This strikes me as highly inefficient and likely to be the true cause of the poor performance of record appends.
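A small sketch of that primary-side decision as I understand it (names are placeholders): either the record fits in the current chunk and the primary chooses the offset, or the chunk gets padded to its full size and the client is told to retry on a fresh chunk, pushing the data all over again.

CHUNK_SIZE = 64 * 1024 * 1024

def record_append_at_primary(chunk_used_bytes, record_len):
    if chunk_used_bytes + record_len <= CHUNK_SIZE:
        offset = chunk_used_bytes                # primary picks the offset
        return ("appended", offset, chunk_used_bytes + record_len)
    # Pad the remainder so no record straddles a chunk boundary; the client
    # must ask the master for the next chunk and re-push the data to a new
    # set of replicas, which is where the extra cost comes from.
    return ("retry_on_next_chunk", None, CHUNK_SIZE)

print(record_append_at_primary(10 * 2**20, 1 * 2**20))        # fits in this chunk
print(record_append_at_primary(CHUNK_SIZE - 100, 1 * 2**20))  # forces a retry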
Posted by: James Paton | March 28, 2012 10:40 PM