
Paxos Made Live – An Engineering Perspective

Tushar Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live – An Engineering Perspective. PODC '07: 26th ACM Symposium on Principles of Distributed Computing, 2007.

Reviews due Tuesday, 3/22.

Comments

Summary: In an effort to resolve fault-tolerance problems experienced with a third-party database, Google developed a database built on a fault-tolerant log using the Paxos replication algorithm. Several engineering problems encountered in building Paxos into a production-ready system are discussed, particularly involving testing and code maintenance.

Problem: Google has a distributed locking service used by their file and database systems which previously ran on a third-party database product. This database was not known to use a thoroughly vetted replication algorithm, and hence exhibited replication problems. To build a proprietary replacement, the Paxos algorithm was selected. Paxos allows replicas to agree on entries in a log, a primitive from which a fault-tolerant database can be built. Even though previous work had analyzed the Paxos algorithm and presented applications, new challenges arose in implementing a production system. The previously cited applications of Paxos were prototypes developed by research labs, whereas Google uses its system in the infrastructure behind one of the world's most popular websites. Achieving this level of production-readiness required a great deal of error-handling code, which distracted from the actual replication algorithm. Furthermore, verifying this system using theoretical proof techniques would be impractical, so a new testing method had to be developed.

Solution: To manage the complexity involved in creating a production-quality fault-tolerant database, the authors used a few new engineering techniques that had not been applied to previous implementations of Paxos. First, they created a specification language at a higher level than C++, which they used to encode the state machine that carries out the Multi-Paxos algorithm. This concise language allowed the entire algorithm to be displayed in relatively little space, while a translator compiled the specification into C++. The production C++ code includes numerous runtime consistency checks, which help diagnose corner-case errors but would distract from the algorithm were it not for the specification language. A second important engineering aspect of the system is the ability to re-run certain test cases when errors occur in order to diagnose the precise cause of the fault. Production-level data may trigger unlikely corner cases in the code that expose faults, so it is important to be able to re-create the fault in a controlled setting where it can be debugged.
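
To illustrate the separation the authors achieved, here is a minimal sketch of a table-driven state machine with the kind of runtime consistency checks described above. Note this is only an illustration: the paper used a custom specification language compiled to C++, and the states and events below are invented, not taken from the paper.

```python
# Hypothetical sketch: the algorithm's transitions live in one declarative
# table, separate from the surrounding error-handling and plumbing code.
TRANSITIONS = {
    # (current state, event) -> next state
    ("idle", "start_round"): "proposing",
    ("proposing", "promise_quorum"): "accepting",
    ("accepting", "ack_quorum"): "committed",
    ("proposing", "timeout"): "idle",
    ("accepting", "timeout"): "idle",
}

class PaxosRole:
    def __init__(self):
        self.state = "idle"

    def handle(self, event):
        key = (self.state, event)
        # Runtime consistency check, in the spirit of the paper's checks:
        # an event that is illegal in the current state signals a bug
        # loudly rather than being silently ignored.
        assert key in TRANSITIONS, f"illegal event {event!r} in state {self.state!r}"
        self.state = TRANSITIONS[key]
        return self.state
```

Because the table is data, the whole protocol fits on one screen and can be reviewed independently of the thousands of lines of production scaffolding around it.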

Flaws: One potential flaw in this engineering methodology is the overhead that may be caused by excessive runtime checks. In such a high-volume setting as a Google data center, even the slight performance hit from runtime checks may be a significant cost. Perhaps the system could run in one of two modes, a checked mode and a check-free mode, somewhat like what the Windows kernel provides for driver developers. However, a benefit of these checks is the ability to help find rare errors that may only appear after long production runs. So, there seems to be a trade-off between safety and speed. A second flaw, as pointed out by the authors, is the inability to re-create error conditions that occur during concurrent operations. This is another example of such a trade-off between correctness and performance, one where the gap between the two ideals seems to be widening as the system is developed and more concurrency is necessary.

Applicability: The important contributions of this paper are not so much the implementation details of a Paxos-derived system, but the engineering practices described that help provide high performance, maintainability, and correctness all at once. Readers interested in a Paxos system specifically will of course pay attention to the several tricks discussed to optimize the Paxos algorithm, but the main techniques described have a much broader applicability. Google and Facebook both use high-level specification languages to generate RPC code, for example. The ability to seed a random number generator with a fixed value is a common technique used for debugging randomized algorithms. These principles are largely common sense, but can vastly simplify engineering in the long run even though they may incur a non-trivial set-up cost.
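
The seeded-RNG technique mentioned above can be sketched in a few lines. The fault-injection probability and event names here are invented purely for illustration; the point is only that a fixed seed makes a randomized test run repeatable.

```python
import random

def run_simulation(seed):
    # A private Random instance seeded with a fixed value reproduces the
    # same pseudo-random sequence, and hence the same injected faults.
    rng = random.Random(seed)
    events = []
    for step in range(5):
        if rng.random() < 0.3:  # hypothetical fault-injection probability
            events.append(("drop_message", step))
        else:
            events.append(("deliver", step))
    return events

# The same seed always yields the identical failure schedule, so a failing
# run can be replayed under a debugger simply by recording its seed.
assert run_simulation(42) == run_simulation(42)
```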

Summary
This paper presents the experience of implementing the Paxos algorithm inside Chubby, a fault-tolerant system at Google, to ensure mutual consistency between replicas. It uses a master-lease mechanism to reduce the number of logged messages, which reduces disk-flush time. In addition, it uses a combination of checksums, snapshots, and logs to handle disk corruption as well as the slow-replica problem.

Problem
1. In reality, crashed or slow replicas may miss some Paxos instances. To ensure they end up with the same data as the other replicas, a catch-up mechanism is needed.
2. Even in close network proximity, the Paxos algorithm requires every replica to send propose, promise, accept, acknowledgment, and commit messages in each instance. Because these messages are logged to support replay and catch-up, disk-flush time can dominate the latency.

Contribution
1. To implement the catch-up mechanism, the paper uses both logging and snapshotting. Each replica logs its messages to disk before sending them; when a replica crashes and later recovers, it can replay the persistent log to regain its pre-crash state. Each replica also creates snapshots to reduce recovery time once the log becomes large. When a lagging replica wants to catch up, it can either pull a recent snapshot from a leading replica or read one from GFS.
2. To reduce the disk overhead of logging, the system modifies the Paxos algorithm to prefer the current coordinator. The current coordinator holds a lease that lets it act as coordinator for a relatively long time, and it can renew the lease as long as it remains alive. While the lease is held, other replicas cannot successfully submit values to Paxos, which eliminates most propose and promise messages; only accept and acknowledgment messages need to be logged.
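
A minimal sketch of the lease idea in point 2. The lease duration, the pluggable clock, and the message lists are illustrative assumptions, not details taken from the paper.

```python
import time

class MasterLease:
    """Hypothetical sketch: while the lease is valid, the coordinator can
    skip the propose/promise phase, because other replicas have promised
    not to start rounds of their own until the lease expires."""

    def __init__(self, duration=9.0, clock=time.monotonic):
        self.duration = duration  # assumed lease length in seconds
        self.clock = clock        # injectable clock, handy for testing
        self.expires = 0.0

    def renew(self):
        # A live coordinator keeps extending its own lease.
        self.expires = self.clock() + self.duration

    def is_valid(self):
        return self.clock() < self.expires

def messages_to_log(lease):
    # With a valid lease, only the accept/acknowledgment round-trip
    # remains; without one, a full Paxos round is required.
    if lease.is_valid():
        return ["accept", "acknowledgment"]
    return ["propose", "promise", "accept", "acknowledgment"]
```

Injecting the clock as a parameter also fits the paper's emphasis on repeatable testing: a test can drive lease expiry deterministically instead of sleeping.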

Flaw
1. The paper doesn't discuss how Chubby maintains group-membership information in Multi-Paxos. I believe this is hard to do in complicated situations, such as when a master temporarily disconnects and returns after a new master has been elected.
2. Because Chubby prefers to use the same machine as master for a long time, the master may become a bottleneck. For example, if the master is also running other CPU-intensive jobs after it acquires a lease, the whole system will be affected. It would be better to have a mechanism that reduces a master's chance of acquiring a lease when it is not performing well.

Application
Paxos is gaining more and more recognition, and Chubby is a good example application of it. Paxos is also implemented inside Google Megastore to provide highly available replication.

Summary:

This paper presents the design of a fault-tolerant database built on a log system that uses Paxos to provide fault tolerance. The design makes use of the Multi-Paxos optimization and introduces a few novel mechanisms for realizing Paxos, such as client-triggered snapshots.

Problem Description:

The presented system addresses the problem of implementing Paxos in a production system. This problem is important because Paxos provides some attractive theoretical results. However, as the authors describe, non-trivial effort is required to implement theoretical Paxos in a real system because the theoretical Paxos work doesn’t focus on practical limitations such as handling hardware disk failures, limited disk space, and system testing. Earlier work addresses the pieces required to implement a practical Paxos system, but no work has described the process of combining the pieces.

Contributions Summary:

I think the most significant contribution of this work is the software engineering principles used to build the fault-tolerant system. The authors describe that they used a state machine language and compiler to generate the replicated state machine run at each node. They comment on the benefits of separating the state machine from the rest of the system. The authors also included a fair amount of run-time consistency checks to verify that the system is actually providing the services it specifies. The authors indicate that this helps protect against bugs, which are often unaddressed in Paxos literature. Finally, the authors built the system to be tested; their results show that this design decision was clearly beneficial.

Another contribution of this paper is the demonstration that a system using Paxos doesn’t necessarily have poor performance. The authors show that their design has significant performance improvements over their previous system, Chubby + 3DB, which did not use Paxos.

Shortcomings:

I think the most glaring shortcoming of this paper is that their testing framework does not appear to have a mode to test multi-threaded operation. The authors state that repeatability was an important characteristic of their testing framework. However, after presenting the testing framework, they state that most of their components eventually needed to be multi-threaded. Despite how useful their testing framework may seem, I can’t help but wonder how effective it is given that it doesn’t run the system in a multi-threaded mode.

Additionally, the evaluation of the system presented in the paper is lacking. However, I suppose this is an industry paper so in-depth evaluation cannot be expected.

Application to real systems:

As the design presented in this paper is a real system, I think all the ideas presented by the authors are relevant when designing a real system. Particularly interesting to real systems is the discussion of the unexpected failures observed in the 100 machine years of operation. This discussion contains valuable lessons that will most definitely be applicable to other systems.

Summary: Chubby, a distributed locking mechanism used at Google, uses a Paxos implementation to provide consensus amongst replicas. The paper focuses on the challenges that arise and the techniques used to address them in an actual Paxos implementation.

Problem: Realizing an implementation of the Paxos algorithm presents a significant challenge. While the Paxos algorithm can be expressed in a single page of pseudo-code, implementing it in C++ requires thousands of lines of code. Furthermore, an implementation must deal with real-world issues such as disk corruption, serving read operations from the master, updating group membership, reclaiming log space, and providing a database transaction interface. A system used in a production environment must also meet high SLA requirements.

Contributions: Chubby includes several key designs to make an implementation of Paxos feasible and functional. Two of the important designs are master leases and snapshots. Master leases address the issue of serving read requests from the master's copy of the data structure. Original Paxos requires serializing all read operations, an approach that is too expensive for Chubby because it does not allow the parallelism required for high throughput. The solution is to grant the master a lease that guarantees other replicas cannot submit values for entry into the log; without that guarantee, the master could not be sure it holds the most up-to-date data structure.

The snapshot mechanism included in Chubby addresses the finite space limitation on log size. In practice, the log cannot grow indefinitely. Taking a snapshot of the data structure allows Chubby to include a handle to the snapshot in its log and reclaim the space from old log entries. Extra coordination and complexity are avoided by having each replica snapshot when it chooses and by keeping snapshots local.
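
The snapshot-plus-log-suffix recovery described above can be sketched as follows. The in-memory dict, the method names, and the key/value log entries are all hypothetical simplifications, not Chubby's actual interfaces.

```python
class ReplicaLog:
    """Hypothetical sketch of snapshot-based log truncation: once a
    snapshot captures the data structure up to some log index, earlier
    entries can be discarded, and recovery replays only the suffix."""

    def __init__(self):
        self.snapshot = {}       # state as of snapshot_index
        self.snapshot_index = 0  # number of entries folded into snapshot
        self.entries = []        # log entries after the snapshot point

    def append(self, key, value):
        self.entries.append((key, value))

    def take_snapshot(self):
        # Fold all logged entries into the snapshot, then reclaim the
        # log space they occupied.
        for key, value in self.entries:
            self.snapshot[key] = value
        self.snapshot_index += len(self.entries)
        self.entries = []

    def recover(self):
        # Recovery = snapshot + replay of the remaining log suffix.
        state = dict(self.snapshot)
        for key, value in self.entries:
            state[key] = value
        return state
```

Because each replica snapshots independently, nothing in this sketch requires coordination between replicas, which matches the paper's design choice of local, uncoordinated snapshots.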

Flaws: The use of a custom compiler, to enable the core algorithm to be written in a state machine specification language, is a flaw in the implementation of Chubby. While the state machine specification layer makes the algorithm easier to understand, it adds an extra layer of code to debug, namely the state machine compiler itself. A bug in the compiler is just as possible as a bug in core algorithm code written directly in C++.

Applicability: Chubby is used (or at least was used) by Google for distributed locking. However, the Paxos implementation can be separated from the rest of Chubby and applied in other production systems.

Summary: This Google-produced paper describes the challenges met when applying the Paxos consensus algorithm to improve the implementation of the Chubby distributed lock service.

Problem: The main purpose of this paper is to describe the challenges that are inherent in developing a live distributed system, but are unforeseen, ignored or otherwise glossed-over by researchers engaged in theoretical analysis. Particularly problematic is that when a physical implementation reaches a certain level of complexity, the option of conclusively proving correctness is no longer computationally feasible.

Contributions: To address the problem of unprovable correctness, the developers make use of repeatable automated testing to verify the correctness of their implementation. The explanation of the intent, design, and implementation of these tests is the most significant contribution of this paper, and they are able to show that many obscure bugs can be found through unrelenting application of randomized but repeatable tests. There is even a willingness to restrict the functionality of the system (by limiting the use of multi-threading) in order to maintain repeatability; this emphasizes the importance of such tests in the creation of a reliable distributed system.

Flaws: The narrative presented in this paper has an air of naive incredulity at the notion that a distributed system could be so difficult to implement. The authors suggest that the research community should set out to "solve" the fault tolerance problem, as has been done with language parsing in compilers. There seems to be no acknowledgment that, under general goals and failure models, the problem they are asking to be solved has been proven to be unsolvable. Distributed systems will always need to be implemented with application-specific requirements in mind, not with a one-size-fits-all theoretical model.

Applicability: One of the most valuable lessons that any software developer must learn in his or her career is that implementation of any design is inevitably way more difficult than it would seem in theory. In describing the transformation from theory to practice, this paper drives home that point as well as any other that I have read. The difficulties encountered during the development of this system highlight the universal truths that all non-trivial tasks will contain unforeseen pitfalls, and that most tasks that seem trivial probably aren't really.
