Dynamo: Amazon's Highly Available Key-Value Store
Dynamo: Amazon's Highly Available Key-Value Store.Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels. Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007
Review this or the other paper for Thursday, 2/1.
Comments
1. one or two sentence summary of the paper
This paper introduces the design and implementation of Dynamo, a highly available key-value store used by some of Amazon's core services to provide an "always-on" experience. To achieve high availability (especially high write availability for certain Amazon services) while still providing a certain level of reliability, scalability, durability, and consistency, several techniques are applied in Dynamo. Vector clocks with reconciliation during reads allow high availability for writes; sloppy quorum with hinted handoff handles temporary failures, and anti-entropy using Merkle trees recovers from permanent failures. Consistent hashing provides incremental scalability and solves the partitioning problem, and a gossip-based protocol handles membership and failure detection.
2. A description of the problem they were trying to solve
The main problem this paper is trying to solve is how to achieve "always-on" availability while other important properties, including reliability, scalability, durability, and consistency, still meet the service level agreements (SLAs). For example, in order to achieve such high availability, consistency under certain failure scenarios is sacrificed.
3. A summary of the contributions of the paper
Many of the techniques used in Dynamo are "borrowed" and well studied in other papers. The main contribution of the paper is therefore not inventing new techniques, but showing how to combine different techniques into a single highly available system. The paper also shows how to tune those techniques to meet the requirements of production systems, such as SLAs. In addition, since high availability is achieved at the cost of strict consistency, the success of Dynamo demonstrates that an eventually consistent storage system can be used in production with demanding applications.
4. The one or two largest flaws in the paper
One limitation is that the gossip-based membership protocol may significantly limit the scalability of the whole system, since each node gossips the full routing table with other nodes. The network bandwidth and delay resulting from this communication may therefore lead to significant performance degradation once the system grows beyond a certain scale.
The paper claims that security is not a big problem because the system is used as an internal Amazon service, but there is still a risk if a security problem originates inside Amazon. Thus, some security issues may also need to be considered for this system.
5. A discussion of how the ideas in the paper are applicable to real systems.
The ideas in this paper are carefully designed and selected based on the requirements and properties of real-world applications. For example, write operations are given much higher priority in order to guarantee the "always-on" experience, and the observation that many applications need no complex queries is why Dynamo focuses only on simple key-value access.
Posted by: Junyan Chen | February 2, 2012 07:57 AM
Motivation: E-commerce serves millions of users, and different applications require an efficient, scalable, reliable distributed storage system. Dynamo uses a synthesis of techniques to achieve these requirements and provides a key-value store interface to applications.
Assumptions: Query Model, weak consistency, no isolation, Service Level Agreements
Design: (1) Interface: get(key), put(key, value); data carry a context for versioning. (2) Partitioning: all keys are hashed onto a ring, and each node takes several parts of the ring for load balance. (3) Replication: a node and its successors in the ring store the N replicas for a key interval. (4) Data versioning: eventual consistency guarantees final consistency when there are no node or link failures; in case of failure, all inconsistent versions are stored and provided to the application. (5) Read/write: a load balancer chooses a coordinator, and the coordinator reads/writes from the top-N nodes for the newest version, with W nodes required for a successful write and R for a successful read (W + R > N). (6) Failure handling: temporary failure is handled by hinted handoff, with a node periodically probing the failed node; permanent failure is dealt with by comparing the Merkle trees of related nodes; failure detection uses a gossip-style protocol. (7) Adding/removing nodes: calculate key ranges for all nodes, move data to the new key ranges, then add/remove the node.
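As a concrete illustration of steps (2), (3), and (5) above, here is a minimal Python sketch (not Dynamo's code; the node names, the MD5 ring hash, and N=3 are arbitrary choices) of a consistent-hashing ring that returns the N-node preference list for a key:

import bisect
import hashlib

def ring_hash(value):
    # map any string onto the ring's key space
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((ring_hash(node), node) for node in nodes)

    def preference_list(self, key):
        # walk clockwise from the key's position, collecting N distinct nodes
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect(positions, ring_hash(key)) % len(self.ring)
        picked = []
        while len(picked) < min(self.n, len(self.ring)):
            node = self.ring[i][1]
            if node not in picked:
                picked.append(node)
            i = (i + 1) % len(self.ring)
        return picked

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("cart:12345"))   # e.g. ['C', 'D', 'E']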
Conclusion: Dynamo is a decentralized system that provides a key-value store to applications. It is available, efficient, and resilient to failures.
Discussion: the storage is distributed, but the processing, such as load balancing, may become a bottleneck.
Posted by: Wenfei Wu | February 2, 2012 07:55 AM
This paper summarizes Amazon's highly available key-value store called "Dynamo". Dynamo deals with some failure conditions by giving up object consistency. It makes use of versioning and application-layer version resolution.
In my mind the main contribution of this paper is finding a good application for consistent hashing. The fact that it destroyed the utility of persistent connections, prefetching, etc. in the web caching paper had made me think the idea was not practical.
Reading about the design decisions made while building this system was impressive. It would be interesting to see how they decided that all service level agreements be spec'ed at the 99.9% level. They talk about a cost-benefit analysis for not going to a higher percentage, but not a lower one.
They did resolve one of my complaints about Grapevine. In Grapevine, conflicting updates were resolved by choosing whichever update came last; given any two settings that are inter-related, this method of conflict resolution can fail. The Dynamo response is to push conflict resolution up to the application. The 'get(key)' call mentioned in section 4.1 may return a list of objects with conflicting versions. Dynamo appears to be very conscious of allowing the application more control. An application can even spec the type of data store (BDB, MySQL, etc.).
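As an illustration of that application-level resolution, here is a tiny sketch of a shopping-cart merge (the data layout and merge rule are invented; the paper only says that carts are merged so that additions are not lost):

def merge_carts(versions):
    """versions: list of cart dicts (item -> quantity) returned as conflicting siblings."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)   # an "add to cart" is never lost
    return merged

siblings = [{"book": 1, "dvd": 2}, {"book": 1, "mug": 1}]
print(merge_carts(siblings))   # {'book': 1, 'dvd': 2, 'mug': 1}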
While Dynamo gives no guarantees about consistency, it is curious whether some synchronization algorithm could be implemented on top of it. It's unclear, but if you cannot, the utility of Dynamo might be lessened.
Flaws? If there are any flaws in this paper, it's that there were not enough examples of the use of Dynamo in real applications, and the details of Dynamo that are present were a little lightweight. For instance, they state that Dynamo is configured such that each object is replicated across multiple data centers, but do not state how this geographic spread is achieved.
Posted by: Matt Newcomb | February 2, 2012 07:40 AM
This paper describes the design and implementation of Amazon's Dynamo
distributed persistent key-value store, and reports on its real-world usage in
supporting Amazon's core operations. As compared to the strict ACID
properties of a traditional relational database system, Dynamo compromises
some degree of consistency to provide increased performance, availability,
and scalability.
Dynamo was created in order to satisfy Amazon's requirements for a
high-performance, high-availability storage system used by its various
internal services (shopping carts being the main example used repeatedly in
the paper) to provide its primary e-commerce services to customers in a
smooth, reliable fashion. In order to achieve this, it has to provide a high
degree of fault tolerance (to gracefully handle hardware failures) and
scalability (so that more nodes can be easily added to accommodate high-volume
services). In designing the system for these requirements, the authors employ
a number of established techniques (such as hash trees, consistent hashing,
and vector clocks) to address a variety of unavoidable problems with a
large-scale distributed system.
One of the most important attributes of the Dynamo system is its ability to be
tuned to the specific needs of the client application, primarily via
parameters governing the replication count (N) and the number of participating
nodes required to carry out a read (R) or write (W) request. By balancing the
values of N, R, and W, applications can adjust the availability and
performance characteristics of a Dynamo instance as needed, for example to
increase availability for writes at the risk of version divergence or data
loss, or to trade off performance and availability in order to achieve a
higher degree of consistency.
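A quick sketch of the arithmetic behind that tuning; the configurations below are illustrative (the paper cites (3, 2, 2) as a common choice), and the point is simply that read and write quorums are guaranteed to intersect whenever R + W > N:

def quorums_overlap(n, r, w):
    # with N replicas, every R-node read set intersects every W-node write set iff R + W > N
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1)]:
    print("N=%d R=%d W=%d -> quorums overlap: %s" % (n, r, w, quorums_overlap(n, r, w)))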
Most of the problems that arise in the paper are dealt with in a sound manner,
such as the use of vector clocks to disambiguate data inconsistency where
possible. However, some other problems were met with much less elegant
solutions, such as the clock truncation strategy used to deal with overgrowth
of vector clocks. Also, the designation of some nodes as "seed" nodes seems
to be in non-trivial conflict with the goal of symmetry put forth at the
outset as a "key principle". (Load balancers may also be a point of
asymmetry, though it wasn't entirely clear to me if those were considered part
of Dynamo itself or merely an accessory.)
Posted by: Zev Weiss | February 2, 2012 07:27 AM
Dynamo is Amazon's distributed key-value store. The main intent of the store was to be always available, so that queries execute successfully irrespective of any major crashes. It was initially meant to be an always-writable system, though thanks to its configuration parameters it is possible to make it always-readable and also more durable by sacrificing performance. Being a key-value store, it aims at providing low-latency reads and writes for single-key objects of modest size; operations read and write one keyed object at a time.
This paper explains the design and implementation of Dynamo along with some of the problems it solves. The implementation uses a consistent hashing technique in which nodes enter and leave a ring. A node is located by applying a hash function to a key and then finding the node that handles the range containing that hash. Dynamo builds on this by employing replication on top of the hash ring, with a configuration parameter N that specifies the number of replicas. The use of the hash ring makes adding and removing nodes scalable.
Dynamo sacrifices consistency for performance and availability. This is determined by the parameters R and W, which are the minimum numbers of nodes that must respond for a read or a write. A smaller value of R means weaker read consistency and a larger value means stronger consistency. Dynamo maintains versions of updates since consistency is sacrificed for availability.
One of the major problems in distributed system design is reconciliation of conflicts. In Dynamo, reconciliation is done by maintaining a vector clock of updates, a standard technique for resolving updates made at different times. Vector clocks have the disadvantage of growing without bound as the number of updates grows, so Dynamo sets an upper limit on the vector clock size and trims the oldest entry each time the size exceeds this limit. This use of vector clocks in key-value stores was later followed by other widely used stores like Voldemort, after its advantages were seen in Dynamo.
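A rough sketch of such a clock with the truncation rule described above; the entry layout, eviction-by-timestamp, and threshold are simplifications for illustration rather than Dynamo's exact scheme:

import time

MAX_ENTRIES = 10   # assumed configurable threshold

def increment(clock, node):
    """clock: dict mapping node -> (counter, last_updated); bump this node's entry."""
    counter, _ = clock.get(node, (0, 0.0))
    clock[node] = (counter + 1, time.time())
    if len(clock) > MAX_ENTRIES:
        stalest = min(clock, key=lambda n: clock[n][1])   # drop the least recently updated pair
        del clock[stalest]
    return clock

def descends(a, b):
    """True if clock a causally descends from (or equals) clock b."""
    return all(a.get(node, (0, 0.0))[0] >= counter for node, (counter, _) in b.items())

v1 = increment({}, "node-A")          # written at A
v2 = increment(dict(v1), "node-B")    # then updated at B
print(descends(v2, v1), descends(v1, v2))   # True False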
This paper also gives a good overview of the failure handling done in Dynamo. Hinted handoff is an interesting way to handle node inaccessibility. It is also an active area of development, with many enterprises building similar behavior as a configuration on top of existing RDBMSs like PostgreSQL, SQL Server, etc. Even with hinted handoff, there is a possibility that writes are lost if the replica node to which the request was handed off goes down before the actual node comes back up. Dynamo has an efficient way to synchronize the replicas using Merkle trees, which provide a hierarchical way of finding out-of-sync key ranges.
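A toy version of that hierarchical comparison; the tree construction and helper names are invented, and it assumes both replicas hash the same key set, which a real range-based Merkle tree does not require:

import hashlib

def h(data):
    return hashlib.sha1(data).digest()

class MerkleNode:
    def __init__(self, digest, keys, children=()):
        self.digest, self.keys, self.children = digest, keys, children

def build(items):
    """items: sorted list of (key, value) pairs covering one replica's key range."""
    nodes = [MerkleNode(h(("%s=%s" % (k, v)).encode()), [k]) for k, v in items]
    while len(nodes) > 1:
        pairs = [nodes[i:i + 2] for i in range(0, len(nodes), 2)]
        nodes = [MerkleNode(h(b"".join(c.digest for c in pair)),
                            [k for c in pair for k in c.keys],
                            tuple(pair))
                 for pair in pairs]
    return nodes[0]

def diff(a, b):
    """Walk two trees top-down, descending only into subtrees whose hashes differ."""
    if a.digest == b.digest:
        return set()                        # identical subtree: no keys to exchange
    if not a.children or not b.children:
        return set(a.keys) | set(b.keys)    # mismatched leaves: exchange these keys
    out = set()
    for ca, cb in zip(a.children, b.children):
        out |= diff(ca, cb)
    return out

replica_a = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
replica_b = dict(replica_a, k3="stale")
print(diff(build(sorted(replica_a.items())), build(sorted(replica_b.items()))))  # {'k3'}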
Some doubts in the implementations in the paper:
1. One small doubt in my mind was the fact that every node has to communicate with every other node to learn where to route requests. This could have been easier using a central registry with information about all the nodes, and those registry nodes could in turn be replicated. I agree this can't be done in all cases, but for something like storing the membership of the ring, it could have been a good solution.
2. Instead of sending a read request to all servers and waiting for results from the minimum number to return, a smaller healthy subset of nodes could be picked at random each time and queried. Random selection would most likely choose nodes that are going to be alive.
The paper also discusses some of the classic ways performance can be increased in a distributed system, such as client-side routing of requests instead of relying on the server to route to the correct node.
Overall, I felt Dynamo was an excellent study of distributed key-value store design. It looks evident from the performance results in the paper that this solution helped them meet demanding SLAs under such large load. It is a classic implementation and has provided ideas for the development of other key-value stores.
Posted by: Srinivas Govindan | February 2, 2012 05:54 AM
The paper discusses design of and experience with Dynamo, Amazon’s highly available key-value store. The goal of Dynamo is to be a distributed system that is always available, and one where all customers have a good experience. With this goal, it compromises strong consistency for high availability, and is designed to be an eventually-consistent storage system. The paper presents how such a system can be designed to meet production requirements with very strict performance demands, by using different techniques. It also discusses experiments done with variations of some policies in the system, and how they affect the system’s efficiency.
Dynamo does a good job of adapting existing techniques to meet its own system goals. It modified consistent hashing to use fixed-size partitions and separated the concerns of data partitioning and data placement. Moreover, in order to improve durability, it relaxed the restrictions on quorum membership and used the first N healthy nodes for read and write operations. It is a system with clear design goals and clever technique choices. In addition, the design has a great separation between policy and mechanism so that the system is applicable to a variety of services. It is customizable for the desired performance, durability, and consistency by configuring how many replicas to keep and how many of them to access/update for a successful read/write operation. It also has the option for the application to resolve conflicts on read, which happens often by the system's design choice.
It is not easy to find weaknesses in a system that is used in production with millions of users. One possible weakness is that the paper does not provide strong justification for some decisions. For resolving conflicts between different versions of objects, Dynamo truncates the number of pairs in the vector clock after it reaches a certain threshold. Although this could lead to inefficiencies, the paper only notes that they have not yet experienced such an issue.
The design and techniques presented in the paper are already focused on applicability, since they come from a system in production. One interesting idea from the paper is how they measure the system: measurements are made at the 99.9th percentile rather than for the average user. I think it is a great way to measure systems that focus on users (customers), since the values can differ by orders of magnitude, as seen in Dynamo.
Posted by: Min Thu Aung | February 2, 2012 03:32 AM
This paper discusses the Amazon.com Dynamo storage system, which is designed to maximize availability and speed. It is a key-value storage system that is simpler than typical database implementations, which means it lacks a comparable consistency model but can perform tasks more straightforwardly and with fewer chances for system failure.
The major goal of the Dynamo system was to provide data storage capable of high availability and speed even during times of server failures. Previous designs had made consistency the highest priority instead. Amazon.com wanted a flexible design that was almost never down and was expandable. They needed eventual consistency but not something immediate, so the design required that everything eventually reach a steady state of consistency given enough time. Redundancy of data was another requirement for the design of Dynamo.
A major contribution of the Dynamo storage system is to present an alternative to systems designed for high consistency. Those systems are less likely to provide good availability and speed to a user. Amazon.com designed Dynamo for simpler tasks that do not require the more complex relational model, so they were able to focus primarily on availability and speed. These are especially important for customer-oriented sites like Amazon.com, where a small amount of downtime or loss of responsiveness can have a large negative effect on a customer. As the number of servers increases, so does the chance for one of them to fail. So Amazon.com, with as many servers as it has, assumes that there will almost always be a server at the point of failure, and under this circumstance availability and speed must still be maintained for its customers. It is not enough to be able to recover from failure; the system must be able to completely hide the failure from the user. It was designed to handle the failure of individual servers, quickly redirecting traffic and preparing for the eventual return of the servers. A specific command is provided for permanently removing or adding individual servers; unless this command is issued, the rest of the system assumes that a currently unavailable server will eventually return. Updates that arrive while a server is down are saved and later delivered to the server once it becomes reachable again. Amazon.com used consistent hashing to determine the server responsible for a piece of data. This helps with accessing data directly, since there are no multiple hops between servers that would reduce responsiveness.
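A loose sketch of that save-and-forward behavior (hinted handoff); the classes and method names here are invented for illustration, not Dynamo's interfaces:

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}      # key -> value
        self.hints = []      # (intended_owner, key, value) kept on behalf of a down node

    def write(self, key, value, intended_owner=None):
        self.store[key] = value
        if intended_owner is not None:
            self.hints.append((intended_owner, key, value))

    def deliver_hints(self):
        """Periodically retry: hand hinted writes back to owners that have recovered."""
        still_pending = []
        for owner, key, value in self.hints:
            if owner.alive:
                owner.write(key, value)
            else:
                still_pending.append((owner, key, value))
        self.hints = still_pending

def coordinate_write(key, value, preference_list, fallback):
    """Send the write to each preferred replica; reroute to `fallback` with a hint if one is down."""
    for node in preference_list:
        if node.alive:
            node.write(key, value)
        else:
            fallback.write(key, value, intended_owner=node)

a, b, c, d = (Node(n) for n in "ABCD")
b.alive = False                                   # one replica is unreachable
coordinate_write("cart:7", {"book": 1}, [a, b, c], fallback=d)
b.alive = True
d.deliver_hints()                                 # the hint is replayed to B on recovery
print("cart:7" in b.store)                        # True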
One assumption that seems to be made concerns the overall availability of the system and its always-writable policy. The paper does not give specific details of what should happen if a truly extreme failure causes the always-writable policy to fail, or what sort of alert should be provided. The assumption that everything will always work may usually hold, but perhaps that assumes more about the system's stability than should be assumed.
The general Dynamo system seems useful for many web storage tasks, as seen in the many uses Amazon.com has itself found for it. Quick responsiveness and availability are desired for most website activity. This assumes that the more limited consistency and the lack of a general relational model would not be a problem for a given data store.
Posted by: Daniel Crowell | February 2, 2012 01:55 AM
This paper discusses the Amazon distributed key-value store called Dynamo, which was developed for internal use to store key-value data for various online Amazon services. Dynamo uses a combination of techniques available in distributed data storage today to create a solution that is both highly available and eventually convergent (consistent hashing, object versioning, a decentralized synchronization protocol, and a "gossip" framework for detecting node failures). The main end goal of this system is to provide a highly available and eventually consistent framework for storing simple key-value information. The reasoning behind developing this framework was the lack of existing methods suited to storing the particular data that Amazon wanted to store (in this case shopping cart information). A few existing methods were discussed in the paper with brief reasoning as to why they were not used. One alternative the authors explored was a distributed relational database, but such a solution scales poorly (due to query complexity) and values immediate consistency over availability.
Dynamo was designed to have a few specific features for the particular use case Amazon had in mind (usage in the Amazon store shopping cart). One of the more unique design goals was a system that is always available for writes. This is a pretty big departure from most systems I am personally aware of (most are always available for reads, not writes). The way they accomplish this is by allowing a write to be accepted by a node regardless of the state of the network (or of its peer nodes). The idea is that allowing the write to complete and forcing inconsistencies to be resolved later improves user experience in cases where writes are frequent. The other distinctive feature was how they kept read and write latency down by using a "zero-hop" DHT: each node is given enough information to find the requested keys on other nodes without having to search for them via requests to a centralized server.
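A very small sketch of what the zero-hop idea amounts to; the classes are invented stand-ins, and the placement rule here is a naive table lookup rather than Dynamo's consistent hashing, the point being that routing is a local computation plus at most one forwarding hop:

class FakeNode:
    def __init__(self, name):
        self.name, self.store = name, {}
    def write(self, key, value):
        self.store[key] = value
        return "stored on %s" % self.name

class FullView:
    """Stand-in for the full membership/partition map every node holds locally."""
    def __init__(self, nodes, n=3):
        self.nodes, self.n = nodes, n
    def replicas_for(self, key):
        start = hash(key) % len(self.nodes)     # naive placement for illustration only
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.n)]

def handle_put(receiving_node, key, value, view):
    replicas = view.replicas_for(key)           # purely local computation: no lookup hops
    target = receiving_node if receiving_node in replicas else replicas[0]
    return target.write(key, value)             # at most one forwarding hop in a real deployment

nodes = [FakeNode(c) for c in "ABCDE"]
print(handle_put(nodes[0], "cart:42", {"book": 1}, FullView(nodes)))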
This paper offers a pretty interesting approach to solving the key-value problem. The use of principles from consistent hashing to take care of data location and replication was a nice touch. However, this particular implementation seems useful only for special types of key-value data, specifically cases where inconsistencies in the values are easily merged. The case detailed in this paper could be merged relatively easily by an application; other uses with data that are more difficult to merge might find this method of key-value storage unusable. The other minor issue with this implementation is the use of a "zero-hop" DHT, specifically how much data each node needs in order to use this strategy. As mentioned in the discussion section of the paper, the DHT limits scalability by moving full copies of the routing table between peers. It would have been interesting if they went into more detail on the possible solution they mentioned (hierarchical extensions to Dynamo). I may be off base in my interpretation of this section, but they seem to be suggesting a multi-tiered DHT structure where some "super node" style nodes contain information for all the nodes while standard Dynamo nodes only contain routing information for the nodes around them.
Posted by: Benjamin Welton | February 2, 2012 01:46 AM
Amazon's Dynamo paper describes a key-value store designed for clusters of
hundreds of machines. The top priorities of the system are availability and the
ability to deliver on high performance guarantees.
Amazon is a large company that is heavily used by internet users for purchasing
just about anything. Thus, it makes sense that they would have a need for
incrementally scalable persistent data serving techniques. Many Amazon applications also do not need to retrieve data with complicated queries, so Dynamo was developed to provide a simple way to serve data that allows high
performance and easy scalability. The main problem that Dynamo is created to
solve is the problem of providing a highly interactive yet data-rich environment
for Amazon customers.
The paper first contributed validation of some distributed systems techniques.
Things like object versioning and consistent hashing were not presented
originally with results sections that boasted test clusters with the size and
popularity of usage of data centers owned by Amazon. The further innovation of
these techniques (such as fixed size partitions for consistent hashing) is
harder to achieve without this deployment experience. Also, it served as another
large-scale validation that relaxed/eventual consistency does not have many
detrimental effects when compared to the performance and availability benefits.
It is admittedly difficult to find glaring flaws in a system with a clear
service goal that it meets at a scale that is not trivial to reproduce. I
noticed that the number of servers that was "common" for applications to use for
a "preference list" for a read or write was 3 in the paper. However, they also
mentioned that a preference list will have members from multiple data centers to
increase availability. I wondered whether, with such a small number of possible servers to use, gaining a quorum on an operation often meant waiting on responses from datacenters that are physically far away from each other. This
would happen less often if the preference list were longer. Dynamo's
full membership model definitely did not seem scalable to greater than thousands
of nodes. It would have been nice to see more attention to this level of
scalability, but I do not think the authors were unjustified in mostly ignoring
this given the nature of experience papers.
An idea I found especially smart and applicable was planning for the system to
meet response time goals for the user in the 99.9th percentile of the response
time distribution. Almost all distributed systems (or even non-distributed
computer systems) have to deal with bursty usage patterns, and this is a good
solution to surviving heavy bursts of requests of a system. The decision to
deliberately not meet the ACID properties is an important one for all
distributed systems designers, since there is almost always a tradeoff between
high consistency or high availability or performance.
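A small illustration of that measurement style, with entirely synthetic latency numbers: the tail percentile, not the mean, is what a "99.9% of requests under X ms" SLA is judged against:

import math
import random

def percentile(samples, p):
    # nearest-rank percentile, p in (0, 100]
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

random.seed(0)
latencies_ms = [random.expovariate(1 / 15.0) for _ in range(10000)]   # synthetic workload
print("mean   %6.1f ms" % (sum(latencies_ms) / len(latencies_ms)))
print("p99.9  %6.1f ms" % percentile(latencies_ms, 99.9))             # the number an SLA would bound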
Posted by: Evan Samanas | February 2, 2012 01:24 AM
The paper talks about Dynamo, Amazon's highly available distributed key-value store. The goals of Dynamo are to provide high availability and performance by sacrificing consistency. The authors discuss Dynamo's design goals and the SLAs it has to satisfy, and they provide a nice introduction to related work in distributed storage systems. The paper continues with Dynamo's architecture, which combines known techniques (consistent hashing, vector clocks, quorum, etc.) into a system specialized for the requirements of Amazon's services. It then covers details of the implementation and operational experience of the system under a heavy load of millions of requests per second, with a good discussion of the improvements they made over time, especially toward achieving uniform load distribution.
At the time the authors started developing Dynamo, there were already a lot of distributed storage systems out there, most of them open source. However, Amazon had special use cases and requirements that would benefit a lot from a custom-designed system. At such large scale, even minor improvements lead to huge savings and higher customer satisfaction. One of the biggest differences from other systems was the emphasis on write availability instead of read availability. Performance was also a big consideration, as were the conflict resolution policies. The authors also realized that most of Amazon's services' storage needs were satisfied by a simple key-value store, and relational databases were overkill.
When building the system, the authors used known techniques to address major distributed storage issues: consistent hashing for partitioning, the quorum mechanism for failures, and so on. The real contribution of the paper is testing these techniques at huge scale, with tens of thousands of servers and millions of users' requests per second. They also adjusted some techniques to perform better, which is another big contribution. For example, they describe a new strategy for load distribution that decouples data placement from partitioning and gives more uniform load across the servers; it also makes it easier to send data to new nodes and to archive it. There are a lot of considerations that don't pop up in an academic setting but matter when you implement the ideas on real systems. The authors describe those, and this is a good paper to read if you are implementing your own distributed storage system.
I don't think there are any major flaws in this paper. When designing a system, it's all about trade-offs, and the system clearly has some weaknesses, but they are traded for benefits in aspects that are more important for its specific use case. However, I found the structure of the paper a little confusing and hard to follow.
To conclude, this is a great paper about the design, implementation, and operational experience of a distributed storage system built to run on lots of commodity hardware servers and satisfy strict availability and performance metrics.
Posted by: Igor Canadi | February 2, 2012 01:04 AM
This paper introduces the design and implementation of Dynamo, a highly available and scalable distributed data store built for Amazon's platform. The main focus of the paper is on the reliability and scalability of the system.
The storage system Amazon requires has several requirements. The first is availability, since many data store approaches choose to sacrifice availability for consistency, and lower availability degrades the user experience. The second is treating failure as the normal case, whereas many existing approaches treat the failure-free situation as normal, so handling failures is always costly. The third is efficiency for a specific query model that only needs primary-key access to the data store; more and more applications use this key-value model. The fourth is scalability: the system needs to be highly decentralized and able to add or remove storage nodes without requiring any manual partitioning or redistribution.
The paper presents the Dynamo system, which meets the requirements mentioned above. The main contribution of the paper is demonstrating that decentralized techniques can be combined to provide a single highly available system, and that the system has been successful in Amazon's services. Partitioning, replication, versioning, membership, failure handling, and scaling are discussed in detail to show how different techniques are combined to make the system work. Consistent hashing for partitioning, vector clocks for versioning, sloppy quorum, hinted handoff, and Merkle trees for failure handling, and a gossip-based membership protocol are combined into the highly available system that is Dynamo.
The main flaw of the paper is that it mentions adding/removing storage nodes and ring membership, but fails to give details on how a newly added node reconciles with the existing nodes. It also mentions an external discovery mechanism in which seeds act as fully functional nodes in the ring. Does that mean a seed is in charge of reconciliation in a centralized manner? Since seeds seem to be the solution for preventing a logically partitioned ring, it is not clear whether the reconciliation process is handled in some other way, such as a peer-to-peer protocol.
In sum, the paper is a good reference for implementing a highly available data store for today's systems. Dynamo's successful experience at Amazon makes the idea convincing and practical in real applications. The system meets the requirements of a distributed system, including reliability, scalability, and high availability, and is efficient as a key-value store, a model that is also popular in new applications. It should be highly applicable to real applications.
Posted by: Xiaoyang Gao | February 1, 2012 11:52 PM
This paper describes Dynamo, a key-value store aiming at high availability and very low latency for most requests rather than only for the average case. Dynamo uses consistent hashing for load balancing and trades data consistency for high availability.
Dynamo is designed for Amazon's own platform, which serves people all around the world and hence involves a huge number of requests. More than that, it must provide high availability even at the expense of data consistency because of Amazon's e-commerce setting. In addition to scalability and high availability, Dynamo must be customizable for different applications to achieve better performance.
One key contribution of Dynamo is that it differs from other common distributed systems, which sacrifice availability for a uniform data view. Dynamo abandons the RDBMS model based on the characteristics of its requests and adopts a key-value store, providing narrower functionality but higher performance, which makes it pioneering work in NoSQL. It also differs from other work in its model of eventual consistency rather than strong consistency, so it can continue to provide service even while a network partition is happening; this is also a key point for achieving high availability. Another important contribution is that Dynamo successfully integrates many advanced distributed systems techniques, including consistent hashing, vector clocks for versioning, and a gossip-based protocol for propagating changes. Among them, consistent hashing is improved by proposing three different partitioning strategies, giving better load balancing and less data movement when nodes are added or removed. The 99.9th-percentile SLA is another impressive part of the system. According to the paper, latency doubles when measured at the 99.9th percentile compared with the average case, which indicates that Dynamo can really achieve what it claims on real-world workloads.
One flaw of the paper is that it does not address security issues at all. Even though Dynamo is only used internally, there is a higher probability of attack because of its high-value data. Not being open to external users does reduce the risk, but the high economic value may attract sophisticated attackers. Another potential flaw of Dynamo is its pure key-value functionality: even though most of Amazon's services do not need relational queries now, more services may require such functionality in the future.
In all, Amazon’s good user experience proves the validity of the design of Dynamo. Many open sources projects get inspiration from Dynamo, including Riak and Cassandra. Dynamo also becomes solid infrastructure for several Amazon’s new system, like DynamoDB.
Posted by: Xiaozhu Meng | February 1, 2012 10:42 PM
Seth Pollen
This paper describes the high-level design of Amazon's Dynamo distributed key-value store. Numerous design features are discussed, including consistent hashing to control replication, vector clocks for versioning, quorum approaches to distributed agreement, hinted handoff for handling node failures, and gossip approaches to reaching eventual consistency. The problem being solved is to provide a highly responsive and durable datastore service. Reliable performance is of the highest importance, while the traditional ACID properties of atomicity, consistency, and isolation are neglected.
It is interesting to compare Dynamo with Grapevine. Both seek to provide an “always-writable” guarantee (in the case of Grapevine, this means a message can always be sent, no matter how few servers are active). But Grapevine allows any message to be sent through any server and guarantees message durability once it is handed off to a server. Dynamo, on the other hand, does not guarantee that a write has happened until the client has contacted the particular coordinating server for the given key and the coordinating server has received a quorum of write confirmations from the other replica servers. Dynamo does include some asynchrony in its hinted handoff mechanism for preserving updates for nodes that are currently down. It is unclear whether the loose, asynchronous approach of Grapevine affords better scalability than the more demanding Dynamo model.
It is intriguing that, though they have designed this highly decentralized, flexible system, Amazon only operates it with 100 or so nodes per installation. It would seem that a company which must have tens of thousands of server nodes would want to combine them all into one monolithic system to provide better load balancing and resilience to local failures. Instead, each business service uses its own Dynamo instance, with its own servers and replication database. They do point out that expanding to tens of thousands of nodes would require a hierarchical directory structure (as opposed to a flat, fully replicated hash ring); perhaps this added complication and overhead is not worth the benefit of building a universal system.
It is also unclear to me exactly how the node directory (that is, the hash ring) is updated across the system as nodes are added or removed for the long term. Does the whole world stop while directories are updated and replicas are transferred to or from the arriving or departing node?
This paper is relevant to modern systems because it presents the design of a large-scale system that has received much industrial use and has so far met its ambitious design goals.
Posted by: Seth Pollen | February 1, 2012 10:13 PM
This paper was about Dynamo, a distributed key-value store developed and used by Amazon.com.
The primary design constraints of Dynamo have to do with performance and reliability. Dynamo must obey a service level agreement (SLA) specifying maximum latency and minimum availability for the 99.9th percentile of requests. To this end, the creators of Dynamo set out to make a system that was incrementally scalable, reliable, and fast, exploiting features such as consistent hashing, replication, and vector clocks to accomplish this. Dynamo makes few, if any, assumptions about network reliability or the status of individual nodes. Each node is undistinguished, meaning there is no difference in responsibility between nodes, and each node can serve in any capacity needed.
Dynamo provides two operations, get() and put(), to manipulate keyed values. In order to provide higher availability, the authors designed Dynamo to be merely eventually consistent. This means it is possible for a read to encounter conflicting versions of a key-value pair; in this case, Dynamo leaves it to the client to resolve the conflict and write back the new value.
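A sketch of how that read-reconcile-write cycle looks from the client's side; the in-memory client class and the reconcile hook are invented, though get/put mirror the operation names above:

class InMemoryDynamoLikeClient:
    """Toy stand-in for the store: values for a key may have conflicting siblings."""
    def __init__(self):
        self.siblings = {}                     # key -> list of values (len > 1 means a conflict)

    def get(self, key):
        return self.siblings.get(key, []), {"key": key}   # (sibling values, opaque context)

    def put(self, key, context, value):
        self.siblings[key] = [value]           # the merged write subsumes the old versions

def read_reconcile_write(client, key, reconcile):
    values, ctx = client.get(key)
    value = values[0] if len(values) == 1 else reconcile(values)
    client.put(key, ctx, value)
    return value

client = InMemoryDynamoLikeClient()
client.siblings["cart"] = [{"book": 1}, {"book": 1, "mug": 1}]     # two conflicting versions
merged = read_reconcile_write(client, "cart",
                              lambda vs: {k: v for d in vs for k, v in d.items()})
print(merged)    # {'book': 1, 'mug': 1}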
One possible flaw with this paper is that the authors claim Dynamo is intended to take advantage of heterogeneity but, as far as I can tell, they do nothing to establish that it does. It appears from the description of the partitioning schemes that the relative capabilities of individual nodes do not affect their assignment of tokens, and further that the relative popularity of keys does not affect how many replicas there are (although this latter point is acknowledged by the authors, and I think they make a fair point that, for most of Dynamo's intended applications, it is rare for there to be large differences in key popularity). The only place I can see it mattering is with the load balancer for incoming reads. However, this strikes me as a weak case for the idea that Dynamo takes advantage of heterogeneity; a better way to do that would be to let more powerful nodes take on more tokens.
Another flaw with Dynamo is that it doesn't seem to have an easy way to garbage-collect keys that are no longer needed; it is up to the client to keep track of such keys and delete them when necessary. Also, it wasn't entirely clear whether a given instance of Dynamo is intended to be used with multiple applications, but if so, it would be nice to be able to tag keys with metadata that would allow an administrator to easily clear out all the keys associated with an application (e.g., because the application is no longer being used or has changed so that it no longer needs the old keys).
Posted by: James Paton | February 1, 2012 08:52 PM
In this paper, the authors comprehensively describe Dynamo, a highly available key-value store system. Compared with a complicated DBMS with ACID properties, it is more flexible and suitable for many services at Amazon, and it is designed to be highly available and scalable. To achieve this target, it makes use of different techniques and also sacrifices consistency temporarily in a few cases, which eventually become consistent.
To make the system incrementally scalable, Dynamo partitions the data using consistent hashing. They modified the original consistent hashing by mapping one server to many virtual nodes on the hash ring, so that the impact of adding or removing a node is spread across all the remaining nodes and won't cause a chain reaction that damages the system further. In addition, to ensure uniform load distribution, the partitioning strategy has evolved over time.
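A toy demonstration of why the virtual-node trick spreads the cost of a departure (token counts, server names, and the hash are arbitrary): the ranges owned by a removed server are inherited by many different remaining servers rather than by a single clockwise neighbor:

import hashlib
from collections import Counter

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(servers, tokens_per_server):
    # each physical server owns many token positions ("virtual nodes") on the ring
    return sorted((ring_hash("%s#%d" % (s, i)), s)
                  for s in servers for i in range(tokens_per_server))

servers = ["A", "B", "C", "D", "E"]
ring = build_ring(servers, tokens_per_server=32)

# When "C" leaves, each of its ranges is absorbed by the next different server
# clockwise -- count which servers inherit them.
inheritors = Counter()
for idx, (_, owner) in enumerate(ring):
    if owner == "C":
        wrapped = ring[idx + 1:] + ring[:idx]
        inheritors[next(o for _, o in wrapped if o != "C")] += 1
print(inheritors)   # e.g. Counter({'A': 10, 'E': 9, 'D': 7, 'B': 6}) -- the load spreads out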
Due to the requirements of Amazon's services, writes should be accepted under any circumstances. Vector clocks are used to mark different updates to data so that they can be reconciled later. Replication is used throughout Dynamo to achieve high availability during partial failures. To maintain consistency among the replicas, it uses a consistency protocol similar to those in quorum systems, with two configurable arguments, R and W, that can be used to balance read/write performance against data consistency. For example, setting R to 1 and W to N gives the best read performance.
To reduce read/write latency, they made some optimizations. First, they buffer writes in memory instead of blocking on durable storage, while one replica performs a durable write, so that data is not lost if a failure hits the buffered data. Second, they use client-driven instead of server-driven coordination, which avoids an extra network hop and reduces latency, though I think exposing the state machine to the clients may cause security problems.
Another important point is that Dynamo itself doesn't resolve divergent versions of data caused by concurrent updates. This has two advantages. On one hand, it reduces the complexity of the system and makes it easy to recover. On the other hand, it gives developers more flexibility and lets them choose the business-logic-specific reconciliation strategy that suits them best; for example, for shopping cart data, they may choose to merge two different versions.
In conclusion, Dynamo is a very good key-value storage system with high availability and scalability. If I have to find a flaw, I would say it uses many techniques to achieve its target, and these techniques themselves may limit the scalability of the system.
Posted by: Xiaoming Shi | February 1, 2012 08:03 PM
The paper talks about Amazon's data infrastructure called Dynamo which was
popularly used by a lot of services that were offered by the company. It was
aimed at services that require very high availability and are prone to regular failures of networks and machines, but can tolerate some level of inconsistency at the application level. The paper describes the system at a high level and
talks about the lessons that were learnt in building and using Dynamo.
Dynamo uses a lot of concepts to achieve its goals. It does not require the
ACID properties and hence is essentially a distributed key value store with
additional context information. Hence all operations are implemented as get
and put operations. It uses consistent hashing with replication to make it
highly available. Dynamo promises eventual consistency by using vector clocks
for each update. The application software is expected to be aware of this and
work around that. To handle failures, it uses a combination of hinted handoff
and replica synchronization depending on the scenario. To make the
synchronization between replicas efficient, it uses Merkle trees over a range
of keys. Reads or writes can be handled by any node in the system and they are
handed over to one of the available nodes marked as responsible for the data.
It uses a sloppy quorum where the set of nodes responsible can change
dynamically depending on the availability. By adjusting the parameters of the
quorum, clients can achieve varied levels of performance, availability and
reliability. It uses versioning since any node can handle a write and a
coordinator node is responsible for merging versions if possible. To account
for temporary failures, nodes are only explicitly removed from the ring by
administrators manually and it uses a gossip based protocol to propagate the
changes.
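A loose sketch of that gossip-based propagation; the round structure and the membership representation are illustrative rather than the protocol's actual message format, but it shows how an explicit change made at one node typically reaches all nodes within a few rounds:

import random

def gossip_round(views):
    """views: node name -> set of (member, status) entries that node currently knows."""
    for name in list(views):
        peer = random.choice([n for n in views if n != name])
        merged = views[name] | views[peer]
        views[name] = views[peer] = merged      # both sides reconcile to the union

views = {"n%d" % i: {("n%d" % j, "up") for j in range(5)} for i in range(5)}
views["n0"].add(("n5", "joined"))               # an explicit membership change made at n0
for _ in range(4):
    gossip_round(views)
print(sum(("n5", "joined") in v for v in views.values()), "of 5 nodes have heard about n5")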
The important contribution of the paper is describing a very successful
distributed system used in production that uses a wide range of computer
science concepts. It sort of highlights the CAP theorem and most big companies
these days have something similar to this in their repertoire. By providing a
bunch of knobs to turn, Dynamo can be used across a wide range of
client applications that can tolerate eventual consistency.
The paper, however, lacks any notion of how geographically separated these nodes were. Typically, data centers are spread across cities and some replication is done across data centers. Their measure of failure (divergence in versions) sounds like it was probably taken on a set of nodes within a single data center with a very low failure rate. This might not have completely
triggered and exposed the problems in any of the costly operations like
replica synchronization or write co-ordination.
Posted by: Sathya Gunasekar | February 1, 2012 07:40 PM