
Bigtable: A Distributed Storage System for Structured Data.

Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.

Reviews due Tuesday, 4/24.

Comments

Although Google has GFS to store files, applications have higher requirements. GFS only provides data storage and access, but applications may need version control or access control (such as locks). GFS's master may also be too burdened to handle requests from multiple large-scale distributed systems. So Google designed a database system to manage structured data: Bigtable, which builds on GFS and Chubby and combines them with other techniques.

Bigtable's logical structure is a sparse multi-dimensional map. Rows identify the client's entities (such as URLs or sessions), and a contiguous range of rows is called a tablet. Columns describe an application's features (such as anchors, contents, and authors), and columns are grouped into column families, which are the basic unit of access control. Because of this grouping, data with locality is fetched together, improving performance. The data in each cell carries a timestamp for version control; for example, the three most recent versions of a cell can be kept. APIs are provided for applications to manipulate data in tables, and clients can also run scripts to process data.
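As a concrete picture of this data model, here is a minimal sketch (my own illustration, not Bigtable's implementation) of a sparse map keyed by (row, "family:qualifier", timestamp) that keeps only the newest few versions of each cell:

```python
from collections import defaultdict

MAX_VERSIONS = 3  # e.g. keep the last three versions of a cell, as in the webtable example

class SparseTable:
    def __init__(self):
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # newest first
        del versions[MAX_VERSIONS:]          # drop versions beyond the limit

    def get(self, row, column):
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

t = SparseTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))  # b'CNN'
```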

The logical table is divided into tablets by row range, and each tablet's data is stored as a set of SSTable files in GFS. Each SSTable has its own internal block index to speed up lookups. A distributed lock service, Chubby, coordinates the system (it is used for master election, tablet-server membership, locating the root metadata tablet, and other tasks). Recent writes are buffered in memory in a structure called the memtable before being flushed into an SSTable.

There are three components in Bigtable: a client library, a master server, and tablet servers. The master assigns tablets to tablet servers, keeps track of them, and splits or merges tablets according to their sizes. Tablet servers process read/write requests. The client library provides the API used to contact servers and caches tablet locations. Tablet locations are stored in a three-level hierarchy: the root tablet describes the locations of the METADATA tablets, and the METADATA tablets describe the locations of the user tablets. Access is coordinated through the Chubby lock service to avoid conflicting concurrent operations.
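To make the three-level hierarchy concrete, here is a minimal sketch (my own illustration, not Google's code) of how a client library might resolve a row key to a tablet server; chubby_read and lookup_in_tablet are hypothetical helpers standing in for RPCs:

```python
_location_cache = {}   # the client library caches tablet locations

def locate_tablet(table_name, row_key, chubby_read, lookup_in_tablet):
    """Return the address of the tablet server holding `row_key` of `table_name`."""
    if (table_name, row_key) in _location_cache:
        return _location_cache[(table_name, row_key)]

    # Level 0: a Chubby file points at the root tablet.
    root_location = chubby_read("/bigtable/root-tablet-location")
    # Level 1: the root tablet maps table/row ranges to METADATA tablet locations.
    meta_location = lookup_in_tablet(root_location, "METADATA", (table_name, row_key))
    # Level 2: a METADATA tablet maps (table, end-row) keys to user tablet locations.
    user_location = lookup_in_tablet(meta_location, table_name, row_key)

    _location_cache[(table_name, row_key)] = user_location
    return user_location
```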

There are several optimizations in Bigtable. The SSTables on each server are merged and compacted periodically to bound memory and disk usage. Column families can be grouped into locality groups so that data accessed together is stored together. The Bentley-McIlroy compression scheme exploits the fact that all pages from a single host are stored close to each other, giving good compression ratios. Caches hold recently read data. A Bloom filter summarizes the keys in an SSTable, so lookups can skip SSTables that cannot contain a given row/column pair. During recovery, commit log entries are sorted by table, row, and sequence number, so a tablet's mutations can be read sequentially. SSTables on disk are immutable, so reading them needs no synchronization; the mutable memtable makes rows copy-on-write so reads and writes can proceed in parallel.
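The periodic merging mentioned above can be illustrated with a small sketch (my own simplification) of a merging compaction: several sorted runs (SSTables plus the memtable) are folded into one new sorted SSTable, and older versions and deletion markers are dropped:

```python
import heapq

TOMBSTONE = object()  # hypothetical marker for a deleted cell

def merging_compaction(sorted_runs):
    """sorted_runs: iterables of (key, timestamp, value), each sorted by (key, -timestamp).
    Returns one merged run containing only the newest surviving version of each key."""
    merged, last_key = [], None
    # heapq.merge keeps the streams ordered by key, newest timestamp first on ties.
    for key, ts, value in heapq.merge(*sorted_runs, key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue                      # an older version of a key already handled
        last_key = key
        if value is TOMBSTONE:
            continue                      # a major compaction can drop deleted entries
        merged.append((key, ts, value))
    return merged
```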

Bigtable is a database designed for Google applications. It has a simple design, so it can be used in various applications and is easy to maintain. But some issues are not made clear by the paper.

Reliability is discussed relatively little in this paper, but I feel that reliability is guaranteed by GFS (there is no replication inside Bigtable itself, but there is in GFS). What happens if a server crashes? To my understanding, if the master server crashes, there should be a backup server ready to take control; the backup would have a copy of the metadata needed to keep track of the user tables. If a tablet server crashes and can recover, the master will send its tablet ranges back to it and the service continues; if it cannot recover, the master will choose another tablet server to take over the crashed server's tablet ranges, or split them across other servers. In all of these recovery processes the new tablet server needs to get the SSTables, and the relevant SSTables are stored in a distributed fashion by GFS, not locally on the crashed server. So GFS provides the reliability.

The column granularity should be considered carefully for each application. If it is too fine-grained, the table may be very sparse; one example is the imagery table in Google Maps. If it is too coarse, the cells may be too dense, which makes each tablet too large and hard to split.

Bigtable is an in-house storage system developed by Google to store data for a variety of projects with differing requirements (real-time data serving, bulk processing, etc.). Bigtable is similar to an RDBMS-style database but does not enforce the full relational data model. Bigtable supports a simplified model that stores string data (with strings for both column and row names) and allows specification of the locality of data (on disk or in memory).

There are four major components to Bigtable's data storage model:

- Rows: Table rows are labeled with arbitrary strings. All edits to cells within a single row are applied atomically (see the sketch after this list). Rows are maintained in lexicographic order, which clients can use to their advantage to get good data locality.

- Columns: Data is stored together in column families. A column family groups related columns within a row and is the unit of access control. In the paper's webtable example, "anchor" is a column family, and each individual anchor is a column within that family that can be read without fetching the entire family.

- Timestamps: The time of the cell entry. This allows a history of cell entries to be stored in the table.

- Data: Stored as uninterpreted strings (byte arrays). The use case laid out in the paper is that most applications store their data serialized.
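Here is a minimal sketch (hypothetical, not Bigtable's API) of the per-row atomicity mentioned above: a whole batch of mutations to one row is applied under that row's lock, so readers never see a half-applied batch:

```python
import threading

class Row:
    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}                      # "family:qualifier" -> value

    def apply(self, mutations):
        """mutations: list of ("set", column, value) or ("delete", column)."""
        with self._lock:                      # the whole batch is atomic w.r.t. this row
            for op, column, *rest in mutations:
                if op == "set":
                    self._cells[column] = rest[0]
                elif op == "delete":
                    self._cells.pop(column, None)

row = Row()
row.apply([("set", "anchor:www.c-span.org", b"CNN"),
           ("delete", "anchor:www.abcsports.com")])
```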

The paper goes on to discuss how this software was built (using a lot of existing Google-developed distributed systems which we discussed previously in class, such as Chubby and the Google File System). The paper also discusses other features of this distributed system (compression, Bloom filters, and locality groups, to name a few). They then give some performance benchmarks for the Bigtable system and describe how they actually use it in practice.

This paper was rather interesting. Bigtable is a novel approach to storing data that takes less of a database approach (such as supporting complex relational queries) and instead uses a much less complex method of storing and retrieving data that allows for higher performance. One interesting comparison that could be made is between Bigtable and Amazon's Dynamo. Bigtable seems to be a much more heavy-duty version of the Dynamo distributed key-value store (for instance, allowing column families and storing rows in lexicographic order). Google is likely trading performance in some cases (such as random reads or writes, which appear fairly slow in Bigtable) for performance in the cases where larger amounts of data are written or read.

Overall I thought this paper was good. They discussed an interesting new model for distributed data storage and showed its performance at some rather large scales (a trillion cells for the web crawler, for instance).

David Capel, Seth Pollen, Victor Bittorf, Igor Canadi

The paper is about Google's way of handling large structured data and building indexes on it. BigTable is a distributed system built on top of GFS that provides a subset of database semantics. Some database features, like the relational model, don't scale well, so BigTable implements a different data model and cuts down on features in order to scale to the amount of data Google has to manage.

The data model is very interesting, and we argue it is a really good trade-off between usability and scalability. Every table has a main key, the 'row name', on which a client can do range queries and even regex matching. Grouping columns into column families is a good idea, not only for managing coarse-grained Access Control Lists but also for modeling different access patterns. It's not as interesting for smaller tables, though, so it's not universal; for example, DynamoDB doesn't have column families. While full-fledged databases offer table-level atomicity with transactions, BigTable's only atomicity guarantees are on a single row.
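To illustrate why sorted row names make range queries and regex filtering cheap, here is a minimal sketch (our own illustration, not from the paper) of a scan over lexicographically ordered rows:

```python
import bisect, re

class SortedRows:
    def __init__(self):
        self._keys, self._rows = [], []       # parallel lists, sorted by key

    def put(self, key, row):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._rows[i] = row
        else:
            self._keys.insert(i, key)
            self._rows.insert(i, row)

    def scan(self, start, end, row_regex=None):
        """Yield (key, row) for start <= key < end, optionally filtered by a regex."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        pattern = re.compile(row_regex) if row_regex else None
        for key, row in zip(self._keys[lo:hi], self._rows[lo:hi]):
            if pattern is None or pattern.search(key):
                yield key, row

table = SortedRows()
table.put("com.cnn.www/index.html", {"contents": b"..."})
table.put("com.cnn.www/sports.html", {"contents": b"..."})
for key, _ in table.scan("com.cnn.", "com.cnn/", row_regex=r"sports"):
    print(key)    # only the sports page in the com.cnn.* range
```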

The main idea behind the implementation is the separation of storage and logic. Storage is handled by GFS, which provides reliability, scalability, and availability for free. This separation allows each tablet to have only one server responsible for it: if that server fails, the data is still there and a new server can take over. Having only one server per tablet enables nice caching and row-based transactions. Caching is no problem since there are no concurrent writers or readers at the storage layer - a tablet server is the only one accessing its data. A really interesting part of the transaction model is atomic read-modify-write. Also, a client can supply arbitrary scripts to execute in the server's address space. If storage and logic were not separated (and we still wanted replication), the logic would also have to be replicated, introducing lots of complexity in handling atomic and concurrent writes and reads and in doing caching properly. One bad thing that comes out of this decision is really bad performance on random reads: random reads are impossible to cache, so a logic server has to go to a GFS server on every client request, hurting both latency and throughput.

We had a short discussion about the effect of every write hitting the disk on GFS. If every write actually hits the disk before it returns, it is a serious write-latency bottleneck compared to Dynamo, which writes to disk lazily in the background. However, if the file write cache on GFS is big enough and a write returns before it is actually on disk, then performance is not sacrificed.

We liked the idea of having two log-writing threads, so that when one is flaky the system just writes to the other.

We found the recovery protocol very interesting, especially the idea of sorting the commit log and having each server read only the commits for the tablets it is now responsible for. One flaw is that they don't evaluate how long it takes for tablets to become available after their server fails. They note that transferring one tablet from one server to another takes about a second. If a server fails and all of its tablets have to be reassigned, this may hurt availability, since those tablets will not be available during the transfer. There is an easy way to address this using hot backups - when a server dies, its hot backup picks up and continues serving the data. At the cost of slightly more allocated resources, availability could be much higher.
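The recovery idea we liked can be sketched as follows (our own simplification, not the paper's code): log entries are sorted by (table, row, sequence number), and a new server replays only the range belonging to the tablets it was assigned:

```python
def recover(log_entries, assigned_tablets):
    """log_entries: list of (table, row, seqno, mutation) from the dead server's log.
    assigned_tablets: {table: (start_row, end_row)} ranges this server now owns."""
    log_entries.sort(key=lambda e: (e[0], e[1], e[2]))   # (table, row, seqno)
    replay = []
    for table, row, seqno, mutation in log_entries:
        rng = assigned_tablets.get(table)
        if rng and rng[0] <= row < rng[1]:
            replay.append(mutation)      # apply these to rebuild the tablet's memtable
    return replay
```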

They also don’t talk at all about the possibility of using GFS locality to load balance tablets across the servers. If a BigTable server would be on the same node as GFS server, the co-located data would have great random read performance.

One big problem they didn’t solve are hot spots. If a tablet grows past certain size, it is split and divided between servers. This is good for write hot spots. However, they don’t have any technique for handling very high read volume for certain row or subset of rows in one tablet.

We also had a discussion about the main differences between keeping data in GFS vs. BigTable. BigTable offers database semantics like range queries, indexed data, and atomicity guarantees. Apart from that, what BigTable does can be described as converting random writes into the sequential, append-only writes GFS is optimized for. It does that by using commit logs and SSTable files. It does so, however, at the cost of random-read performance, since a read may have to be assembled from multiple data sources in GFS and in memory.
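Here is a minimal sketch (our own illustration) of that conversion: every mutation is appended to a commit log and absorbed by an in-memory memtable, and the memtable is periodically flushed as a new sorted SSTable, so all I/O to GFS stays sequential:

```python
MEMTABLE_LIMIT = 4        # tiny threshold just for the example

class WritePath:
    def __init__(self, log_append, write_sstable):
        self.log_append = log_append        # hypothetical: append a record to the GFS log
        self.write_sstable = write_sstable  # hypothetical: write a sorted file to GFS
        self.memtable = {}

    def write(self, key, value):
        self.log_append((key, value))       # sequential append: durability first
        self.memtable[key] = value          # the random write is absorbed in memory
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # minor compaction: flush as one sorted, sequentially written SSTable
            self.write_sstable(sorted(self.memtable.items()))
            self.memtable = {}
```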

BigTable is Google's storage system for keeping petabytes of structured data distributed across thousands of servers. The paper summarizes the design choices, usage, and results obtained by using BigTable inside Google.

The problem is very natural: Google has many applications which need a system that allows them to store and retrieve structured data. The main challenge is to provide that service reliably and efficiently. The first sub-challenge is to build the system so that it suits all of these applications. The solution is, not surprisingly, a map: BigTable is basically a distributed sorted map whose key is a (row, column, timestamp) tuple and whose value is a raw byte array. Fortunately, there are other Google systems that make things easier, such as GFS to store files and Chubby to manage locks (and many other tasks).

BigTable has gotten people's attention, and several clones have been developed. The reason for that is perhaps the challenges and design choices involved in developing such a system. Among the many design choices, I liked the partitioning of tables into tablets and tablets into SSTables; this makes partitioning and caching easier and is, I believe, the key to the simplicity of the system. These choices are, of course, possible and applicable thanks to what is already available in house, i.e., GFS and Chubby.

I believe BigTable targets a more general problem and is more publicly usable compared with other internal Google services. In fact, it is one of the few such services available in Google App Engine. I wonder why, while Amazon is quickly turning many internal services into commercial tools, Google is doing this very slowly. It may, of course, be a business decision. On the other hand, I suspect that Google's services are too specific and not as reliable compared with Amazon's. It is also interesting that while most of Google's infrastructure is written in C++, the public APIs are in Java or Python. Given that Amazon has been more successful at converting internal services into commercial tools and its infrastructure is mostly written in Java, I suspect that it is a reliability issue.

There is an increasing demand for keeping large data sets, and it seems that NoSQL systems are preferred to satisfy this demand. BigTable is a very good example of a NoSQL system for keeping large amounts of data and responding to a large number of requests efficiently. I am not sure about the current research direction here, but I believe sophisticated querying mechanisms will need to be developed on top of such systems. In particular, semantic web technologies help extract invaluable information from data, and key-value stores are very suitable for semantic web graphs (as subject-relation-object triples). Therefore, it may be desirable for these systems to implement declarative querying mechanisms and help answer such sophisticated queries.

The paper discusses the design and implementation of a structured, distributed storage system that is used by many internal services at Google. The main objective is to provide a Dynamo-like system with more structure to the data, in the form of traditional database-like tables that are indexed. Data is timestamped and identified by a row key and a column key. Though the underlying system is a key-value sort of system, the structure added on top of it makes it look like a database and lets you perform scans, etc. It provides scans like a database, using regular expressions, and also has a feature to support the server executing client scripts on data. It provides atomic reads, writes, and read-modify-writes even though the backend storage is distributed. Writes can also be batched. Values can be timestamped by either the client or the server, and Bigtable supports automatic garbage collection based on timestamps when multiple versions of the same data are stored. Timestamps are used to resolve conflicts.
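As a small illustration of the timestamp-based garbage collection mentioned above, here is a minimal sketch (not from the paper's code) of the two per-column-family settings the paper describes: keep the last n versions, or keep only versions newer than a cutoff:

```python
def gc_versions(versions, keep_last_n=None, keep_newer_than=None):
    """versions: list of (timestamp, value). Returns the versions to retain."""
    versions = sorted(versions, reverse=True)            # newest first
    kept = versions
    if keep_last_n is not None:
        kept = kept[:keep_last_n]
    if keep_newer_than is not None:
        kept = [(ts, v) for ts, v in kept if ts >= keep_newer_than]
    return kept

cell = [(100, b"v1"), (200, b"v2"), (300, b"v3"), (400, b"v4")]
print(gc_versions(cell, keep_last_n=3))          # drops (100, b"v1")
print(gc_versions(cell, keep_newer_than=250))    # keeps only versions with ts >= 250
```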

Bigtable is implemented as a client library linked into client code and a set of tablet servers managed by a master server. The data is stored in the form of SSTables on GFS, and a table is partitioned into tablets. A free server is allotted a tablet by the master, depending on the current load it handles, and it then serves all client requests for that particular piece of data. The master just manages the tablet servers; the client talks directly to the tablet servers to service its requests. The metadata to locate a tablet is maintained as a three-level hierarchical tablet location structure bootstrapped through Chubby. Each tablet server also holds a lock in Chubby in order to serve the tablets assigned to it by the master, and Chubby is also used to elect a new master when an old one dies. In Bigtable, the storage and management of data are clearly separated, with the tablet servers persisting data in the form of SSTables and commit logs written to GFS. Each tablet server maintains an in-memory copy of recent operations in a memtable, which makes recovery faster when tablet servers fail. The authors also discuss several optimizations to this design, like Bloom filters, compression, compaction, and caching, to improve performance and latency.
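The per-server Chubby lock can be sketched roughly as follows (hypothetical code, not the real Chubby API; lock_service, assignments, and pick_least_loaded are assumed helpers): the master reassigns a server's tablets once it can grab that server's exclusive lock, which means the server's session has expired.

```python
def reassign_dead_servers(lock_service, assignments, pick_least_loaded):
    """assignments: {server_id: set of tablets currently assigned to that server}."""
    for server_id in list(assignments):
        lock_path = f"/bigtable/servers/{server_id}"
        # If the master can acquire the server's exclusive lock, that server's
        # session is gone, so it can no longer be serving its tablets.
        if lock_service.try_acquire(lock_path):
            lock_service.delete(lock_path)              # remove the stale server entry
            orphaned = assignments.pop(server_id)
            for tablet in orphaned:
                new_server = pick_least_loaded(assignments)
                assignments[new_server].add(tablet)     # only responsibility moves;
                                                        # the tablet data stays in GFS
    return assignments
```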

The paper highlights how Google uses Chubby and GFS internally to build services on top of them. I think Bigtable provides a nice service that has features of both the SQL and NoSQL worlds: structured data without a strict schema and without support for joins, but with scans and user scripts. Traditionally, databases do a lot of analytical processing on the stored data, while key-value stores present plain data. The master and tablet-server management is similar to the master and chunk servers in the underlying GFS. As in GFS, the clients don't rely completely on the master and interact directly with the servers managed by the master.

The idea of clearly isolating the tablet servers' tasks from the backend storage is interesting. This lets them scale the system by adding more tablet servers as the data grows. The master could also be made to take into account the location of a tablet's data in GFS when assigning tablet servers.

Bigtable -

Bigtable is a distributed database system designed to scale to many hundreds of terabytes over hundreds of machines. It was designed by Google and implemented on top of the Google File System (GFS) and the Chubby lock service. While it's obviously successful, given all of the services that run on Bigtable, there are many interesting features that could use further exploration.

One design feature that seems odd about Bigtable is mapping each tablet to a single tablet server. Since SSTables are stored on GFS, and GFS by default replicates data on three machines, it seems that serving a tablet could be distributed as well. Consider, for instance, the Google Earth imagery index for a specific image right after an event like the Christchurch earthquake or Mount St. Helens. It seems like Bigtable can't deal with such read hot spots.

Bigtable uses Bloom filters. A Bloom filter requires k hash functions. As the input to Bigtable can be any byte string, finding a good universal set of hash functions might be a challenge. It would be interesting to find out what they did here.
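One common way to get k hash functions over arbitrary byte strings is double hashing over a single strong hash; the following is a minimal sketch (my own, not necessarily what Bigtable does):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=7):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1    # make the step odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]   # k derived hashes

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain(b"com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain(b"com.example.www/contents:"))      # almost certainly False
```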

In section 4 they talk about the percentage of server hours during which Bigtable was unavailable due to a Chubby outage. The numbers are all incredibly small, up to a maximum of 0.0326% for one cluster. If Bigtable is in general very reliable, this number might be misleading: say, for instance, that the total percentage of server hours during which Bigtable was unavailable was also 0.0326%; then Chubby outages would be responsible for 100% of the outages.

On the subject of outages, in section 5.2 they mention a cluster management system that starts the Bigtable master server, and they also mention that the master can fail. Is the high availability of Bigtable due to the rarity of master crashes, or is it due to fast recovery (i.e., the cluster management service detects failures quickly and the master can recover quickly)?

The performance studies were a little disappointing. As Bigtable maintains tables in lexicographic row-key order, it appears that a worst-case scenario could be devised: say, for instance, that you write a large amount of data in lexicographic order and then go back and write one row every tablet's worth of rows. It would seem this would induce a large overhead in tablet splits, compactions, load balancing, etc.

A piece of the implementation that would have been interesting to discuss is the authentication subsystem. Tablet servers are supposed to check whether the sender of a row mutation is authorized to make the change. Are row mutations cryptographically signed? The paper only says that a list of permitted writers is stored in Chubby. Are they simply trusting the security of their data center?

Bigtable is a distributed storage system designed to support storage of petabytes of structured data of varying size (from URLs to large images) and workloads ranging from batch to real-time processing.

Organizing terabyte- or petabyte-scale data for performance (retrieval), efficiency (storage), and availability on commodity machines is the major reason a new distributed storage system is needed, since centralized database systems are not known to scale well to such huge data sets. This paper presents the design of a distributed data store that is used by many of Google's services.

The most attractive aspect of Bigtable's design is its simplicity. Unlike a traditional database, Bigtable's data model supports only a single "big table" organized in rows of attributes with a timestamp. With such a simple data model, all that is left to solve is efficiently storing and retrieving data on a distributed cluster of commodity machines. The "big table" is partitioned into independent tablets (ranges of rows), and the requests for rows within a tablet are handled by a single server in the Bigtable cluster. A single master (elected through Chubby) distributes the tablets to servers. All these design decisions make the problem simpler and easier to handle. Another important contribution is how they achieve consistency (from the description, it seems like Bigtable guarantees sequential consistency) considering the scale of data the system is handling. They achieve this, again, by keeping the design simple: 1. making SSTables immutable, and 2. doing copy-on-write on updated rows in the memtable (allowing reads and writes to be serviced simultaneously). This resolves various consistency issues during failure and recovery. Also, using logs to recover from a tablet server failure (which exploits GFS's append operation) simplifies the recovery process. Though there are some side effects of these design decisions, for example reads might translate into lots of disk accesses (when some SSTables are on disk), the authors suggest enhancements (caching, Bloom filters, etc.) to partially mitigate such problems. Another interesting contribution is the simple technique of organizing the rows in the Bigtable for high locality (and also high compression ratios) for some example use cases.
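The immutable-SSTable/mutable-memtable split mentioned above implies a merged read path; here is a minimal sketch (not the paper's code) of reading a cell from the memtable first and then from the SSTables, newest to oldest:

```python
def read_cell(key, memtable, sstables):
    """memtable: dict; sstables: list of dicts, newest first (each stands in for an
    immutable sorted file). Returns the newest value for `key`, or None."""
    if key in memtable:                       # the most recent writes win
        return memtable[key]
    for sstable in sstables:                  # then search SSTables, newest to oldest
        if key in sstable:                    # a Bloom filter would let us skip most of these
            return sstable[key]
    return None

memtable = {("com.cnn.www", "contents:"): b"<html>new</html>"}
sstables = [{("com.cnn.www", "contents:"): b"<html>old</html>"},
            {("com.cnn.www", "anchor:cnnsi.com"): b"CNN"}]
print(read_cell(("com.cnn.www", "contents:"), memtable, sstables))        # newest version
print(read_cell(("com.cnn.www", "anchor:cnnsi.com"), memtable, sstables)) # from an older SSTable
```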

Overall, the design was made very simple to reduce the problem to a manageable size. But some design choices make the recovery time after a tablet server failure quite long. For example, it is not completely clear why a single commit log is used for a whole group of tablets; not enough evidence is presented to support this trade-off between write performance and fast recovery. The co-mingling of commits in a single log file definitely complicates and slows the recovery process (as admitted in the paper). Further, considering that the tablet servers are commodity systems subject to frequent failure, and since a tablet is handled by only one server (though a single server may service more than one tablet), it becomes even more important to treat failure as the common case. It is also hard to conceptually visualize the various possible failure cases of Bigtable when it depends on several services (Chubby, GFS) that can fail independently, which in turn makes it difficult to reason about Bigtable's fault tolerance.

Bigtable stands out as an excellent example of how to design a storage system for large (and growing) petabyte-scale data that is frequently scanned or searched with range queries. It is not as well suited to databases that do many random row reads and writes.

In this paper, the authors describe the BigTable system from Google, a large-scale distributed storage system for managing structured data.

BigTable has a different data model than traditional databases. A value in the table is indexed by three keys: two strings and a timestamp. It's like a two-dimensional hash table with a timestamp to differentiate versions. This model is very good for processing web pages: the URL, tag name, and update time can be used as keys to access the content.

In Bigtable, the data is maintained in lexicographic order by row key, and each row range is called a tablet, which is the unit of distribution and load balancing. A tablet consists of a list of SSTables, each an ordered, immutable map from keys to values. Data is first held in a temporary in-memory buffer called the memtable before being converted into an SSTable, which shrinks memory usage and writes the data to persistent storage. SSTables can be compressed with different formats; because all the pages from a single host are stored close together, good compression ratios are reached easily.
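That locality argument can be illustrated with a small sketch (my own, using zlib rather than Bigtable's actual two-pass Bentley-McIlroy scheme): compressing a block of adjacent, similar pages together beats compressing them separately.

```python
import zlib

# Many near-identical pages from one host end up adjacent in row-key order.
pages = [b"<html><body>CNN sports story %d ... boilerplate ...</body></html>" % i
         for i in range(100)]

block = b"".join(pages)
together = len(zlib.compress(block))                      # compress the whole block at once
separate = sum(len(zlib.compress(p)) for p in pages)      # compress each page on its own
print(f"compressed together: {together} bytes, separately: {separate} bytes")
# Shared boilerplate across neighboring pages is encoded only once when compressed together.
```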

BigTable is built on many other Google systems, such as the Google File System and the Chubby lock service. By building on GFS, they don't need to worry much about reliability; during recovery they only need to replay the part of the commit log that has not yet been written to an SSTable. The Chubby lock service is used to manage the tablet metadata: tablet location information is stored in a three-level hierarchy analogous to a B+-tree, and the root of the tree is stored in Chubby, which ensures that only one BigTable master can change the tablet metadata at a time. Bigtable also uses Chubby for many other tasks, such as storing the bootstrap location of Bigtable data, Bigtable schema data, access control lists, and so on.

BigTable uses a lot of refinements to improve the performance of the system. First, the columns of each tablet are grouped so that columns usually accessed together fall into one locality group; this provides efficient reads and also enables customized tuning for each locality group. Second, it uses different compression algorithms to compress the data. It also has a two-level cache to improve performance: the scan cache is good for applications that read the same data repeatedly, while the block cache is good for applications that tend to read data close to what they recently read. Another important refinement is the commit log: there is only one log file per tablet server, which means all the tablets on a server share the same log file. This avoids a lot of GFS file operations and saves many disk seeks. The disadvantage is that it takes more log-analysis work during recovery.
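A minimal sketch (my own illustration) of the two-level cache idea: an LRU scan cache of decoded key/value pairs sits in front of an LRU block cache of raw SSTable blocks, so GFS is touched only on a double miss. load_block and block_of are hypothetical helpers.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)            # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)          # evict the least recently used entry

scan_cache = LRUCache(capacity=10_000)     # key/value pairs returned to clients
block_cache = LRUCache(capacity=1_000)     # raw SSTable blocks read from GFS

def read(key, load_block, block_of):
    """block_of(key) names the SSTable block holding `key`; load_block fetches it from GFS."""
    value = scan_cache.get(key)
    if value is None:
        block_id = block_of(key)
        block = block_cache.get(block_id)
        if block is None:
            block = load_block(block_id)           # a GFS read only on a double miss
            block_cache.put(block_id, block)
        value = block.get(key)
        scan_cache.put(key, value)
    return value
```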

The authors provide an evaluation of the read/write performance of the system as well as the characteristics of the tables in production use, which demonstrates that Bigtable can be used for a wide variety of tasks.
