CS 739 Review Blog: Part-time Parliament/Paxos

The Part time parliament

Problem Description
Achieving consensus in a consistent manner in the presence of faulty or unresponsive processes and network links. The solution should also have the desired property of forward progress in the presence of a majority.

Summary
The paper first explains a preliminary and basic form of Synod protocol upon which the Part time parliament protocol is based. The preliminary protocol guarantees consistency and progress in passing a decree (mapped to an update operation in distributed systems). Replicas of the passed degrees are maintained in a ledger in this problem. Basic form of the same protocol reduces the state that is to be maintained compared to the preliminary version and sacrifices progress in bargain. The problem had also modeled network links in terms of messengers and faults in them is also modeled

The Part time parliament is an extension of the Synod protocol in which multiple decrees are passed. The protocol carries over the consistency and progress properties from the Synod protocol . It also has additional properties –decree ordering property and monotonicity. Minor extensions to accommodate a few situations were also presented – whom initiates the protocol, whom to contact about a decree(basically from which system do u read from), how frequently the processes update their replica, adding new machines , correcting inconsistent states

Contribution
1. An cost effective way to implement arbitrary state machine
2. The problem has modeled numerous scenarios prevalent in real time distributed systems – such as link failures, process failures, frequency of updates, choosing the initiator, whom to contact for certain read, adding new machines, correcting in consistent state. The solutions to these problems can be adopted.

Relevance to distributed systems
The problem has an elaborate model of distributed system to accommodate real time failures such as network link failures, process failures and Byzantine failures. The protocol provides a robust and less expensive (than earlier proposed state machine algorithms) way to implement an arbitrary state machine. Consistency is maintained despite failures in this protocol. It does not guarantee bounded time response- hence suits systems with modest reliability requirements. Problems of synchronization and fault tolerance are inherently handled by the protocol

Posted by: Ramya Olichandran | March 9, 2010 01:55 PM

I read and am reviewing “Paxos Made Simple”.

Summary:
The Paxos algorithm can be used to implement a fault-tolerant distributed system. This paper attempts to explain the consensus algorithm that is at the core of Paxos and then goes on to discuss the state-machine implementation.

Problem:
A fault-tolerant distributed system should be able to guarantee two things: safety and liveliness. As defined in this paper in terms of consensus, safety means that only a value that has been proposed will be chosen, only a single value is chosen, and a process will never think a different value than the actual chosen value has been chosen. Liveliness means that the state of processes running the consensus algorithm will eventually make progress. The algorithm must define a method to stop two proposing processes from bickering back and forth indefinitely over which value should be chosen.

Contributions:
The author discusses the Paxos algorithm in relatively plain English. The algorithm relies on three types of agents: proposers, acceptors, and learners. It is the role of the proposers to suggest values to be chosen. The acceptors must then come to a consensus on one of the proposed values. Finally it is the learner’s responsibility to discover which value has been chosen by the acceptors. The author also states that a process and messages may fail at any time or may be delayed for arbitrarily long, which makes the algorithm nontrivial to guarantee safety and liveliness. To implement this algorithm, the author begins moving through a set of requirements, each time expanding on previous requirements due to possible failures. The first requirement is that an acceptor must accept the first proposal sent to it. However, multiple proposers may be acting concurrently, meaning that a majority of acceptors will never agree, so the next step is to associate a number to each proposed value which increases over time. Acceptors then may accept multiple proposals of higher number as long as the value is the same and so forth. The author continues to iterate through requirements, adding steps such as the proposer asking for promises to not accept other numbers. The discussion leads to a two phase process: 1a) The proposer sends out a prepare request with a number n. 1b) If the acceptor receives a prepare request with a number n higher than any previous requests to which it has responded, it responds with a promise to not accept proposals less than n. 2a) The propose receives a promise from a majority of acceptors, then it sends out an accept request. 2b) The acceptors then receive the proposed value for the number n which they promised to accept. These steps ensure safety even with failures and arbitrary delays. The paper goes on to discuss the methods by which learners can learn of an accepted value, how liveliness can be guaranteed, and a state-machine implementation of the Paxos algorithm.

Application:
A (non-Byzantine) fault-tolerant distributed algorithm for consensus may be very useful in many settings. Anytime a particular job must be agreed upon to be executed next by a cluster of machines, this algorithm would be useful.

Posted by: Ryan Johnson | March 9, 2010 01:55 PM

Paxos Made Live

This paper takes the Paxos algorithm from theory to practice. It points out many of the engineering challenges not previously ironed out in the literature.

The problem at hand is maintaining a replicated database system on a network with both unreliable machines and network links.

The contributions are as follows:

1) Slightly modify the Paxos protocol to make it practical for actual use.

2) Define clean interfaces for interacting with the Paxos protocol (in this case Paxos is used for maintaining a distributed log which they consider to be a powerful primitive for fault-tolerant systems).

3) Reduce the number of disk writes on each iteration through master leases.

4) Code core algorithms as two state machines. This makes it easier to reason about the system.

5) Provide a heavy emphasis on testing and making tests repeatable and thorough.

6) Rip on the fault-tolerant computing community for their lack of focus on testing and real-world application. Ouch!

Paxos is a generic algorithm for maintaining a global state. It is claimed to be applicable to many applications. Would there be better, application specific mechanisms for maintaining a global state?

It is really hard to say how efficient this system is compared to other replication protocols. The statistics show that it can complete up to 640 operations per second when there are 20 workers. How many requests per second should a system like this be able to support? The other question is how does this scale? They mention that they ran this on a hundreds of Google nodes, how was the network traffic?

Posted by: Sean Crosby | March 9, 2010 01:54 PM

Paxos Made Simple:

In this paper, Paxos Made Simple, Leslie Lamport tries to explain Paxos in what he deems to be a simple manner. At least when he says many people believe the algorithm to be complicated because the original presentation was Greek to them, he is complaining about his own presentation of the algorithm. In any case, Lamport does describe the Paxos algorithm, how to reach consensus, and how to make progress in the system.

The safety requirements for consensus are only a value that has been proposed may be chosen, only a single value is chosen, a process never learns that a value has been chosen unless it actually has been. The system also assumes no byzantine faults. The goal here is to chose a single value, have the system agree on it, and make sure all nodes in the system learn the value. From these tasks, three roles in the system are created: proposers, acceptors, and learners.

In order to choose a value, some majority of the acceptor nodes must agree on the value. In order to do this, proposals are numbered with n, and nodes are sent a prepare request and they agree not to accept any requests with numbers less than n. If the proposer gets a majority to accept this, he can then broadcast the selected value to all other nodes. While this method ensures the safety of the data (based on the three conditions above), it is possible for the system to make no progress, if there are several proposers and they are all constantly proposing new values (which means the proposal number will be higher) and sending them out, no value will ever be agreed upon.

I understand the presentation of the Paxos algorithm much better after reading section 4 of the Paxos Made Live paper, in addition to this one. I think it's possible Lamport doesn't have a good grasp on what in simple. That aside, I know the Paxos algorithm was important and so I feel that this work is probably very important to distributed systems. I do think though that solution to the inability to make progress, which was suggested as having a single node be the proposer, is both good and bad. If the proposer fails, then time must be taken to elect a new one. If he doesn't fail, then all value selection is highly centralized which makes it much quicker.

Posted by: Elizabeth Soechting | March 9, 2010 01:53 PM

Paxos Made Simple

The Paxos Consensus algorithm is described for a system of multiple identical servers that must execute an identical state machinethrough identical paths.

The problem that the paper addresses is one of achieving consensus among multiple nodesthat each behave as proposers for the consensus value, acceptors of a value and learners of the value that reached consensus. In other words, the problem is of finding a distributed algorithm through which multiple nodes can agree on a common decision efficiently while ensuring the correctness of the algorithm and its resilience to faults.

In summary, the algorithm that the paper proposes requires each proposer to make a proposal with a new value only if no other value has been accepted by the acceptors that reply to its request. Only otherwise is the proposer allowed to propose a new value. Also acceptors accept values that are accompanies by proposal numbers that are strictly greater than the highest number that it has presently accepted. While these conditions ensure correctness, they do allow multiple proposers to get deadlocked. To prevent this, only one leader among the nodes is chosen as a proposer and only on its failure does some other node get to send proposals. This ensures progress in the consensus algorithm with the consensus being reached quickly and efficiently. Finally the paper describes how this algorithm would be used in implementing a state machine where the multiple servers (in the example used, the focus is on fault tolerance more than scalability so all the servers replicate the workload from each client) use Paxos to execute an identical state machine or more practically an identical set of commands.

With respect to the practical applications of any consensus algorithm, they would be huge with the growth of cloud computing. Many recent issues like those addressed by mainstream media about airline prices being deceptive to customers because of distributed databases all show the importance of consensus in such an environment. This is a good algorithm to implement wherever replication is taking place.

Posted by: Sanjay Bharadwaj | March 9, 2010 01:51 PM

Paxos made simple:

This paper explains the Paxos algorithm for achieving consensus among distributed nodes about a value in the system. This paper is a much simpler version of the original paper.

A collection of processes in the system called proposers propose values in the system. A group of acceptors accept one or more of these values, and one unique accepted value is chosen to be the outcome, and is made known to a set of interested processes called the learners. The Paxos algorithm proceeds in rounds, each consisting of two phases: The proposer sends a 'prepare' message with a number x to the acceptors. An acceptor promises that it wont accept any proposal numbered <x, and sends the largest number that it has
accepted so far. If a majority of the acceptors respond to the request, the proposer sends out an accept message with x and the highest value it received(or a random value if no acceptor accepted a value). An acceptor accepts a message with number x, unless it has promised not to do so. Acceptors inform a set of
distinguished learners about their acceptance, who then inform other learners.

The main contribution of the paper is the step by step expalanation of how the Paxos algorithm evolved. The paper did a very good job of this. They also discuss implementing this using a state machine.

Overall, I think the paper was written very well. It gave a good introduction of the well-known consensus problem. I would have liked a strong example of where this is used, although this is not the first paper on this
topic.

Posted by: Chitra | March 9, 2010 01:44 PM

Paxos Made Live

Summary and Description:
The paper describes how a complex real world system is constructed from a distributed, fault-tolerant algorithm. Paxos is an algorithm for achieving consensus among a set of unreliable nodes. In the paper, a team of engineers from Google decide to implement Paxos in order to construct a fault-tolerant replicated log, on top of which a fault-tolerant replicated database is constructed. While designing the system, the engineers discover many challenges while building the system which have not been sufficiently discussed in the literature. The paper describes how these challenges were overcome.

Contributions:
1. The set of challenges and their solutions described in the paper might be useful as a guideline in the future when a complex, distributed algorithm needs to be converted into a real-world system.
2. The paper highlights the absence of tools which can be used to easily construct real world systems from distributed algorithms, and the absense of research on testing for distributed systems.
3. The basic design of the distributed database over the fault-tolerant log provided by Paxos can be used as the basic architecture for any future systems with similar requirements. This would be beneficial, since a number of optimizations for such systems (E.g: snapshots and transactions) have already been studied and discussed in the paper.

Thoughts:
The paper does a good job of describing design choices and optimizations that were made in designing the new version of Chubby. It also proves that complex distributed algorithms can be really used efficiently in practical systems, although it would involve a lot of work. The paper also provides an insight into where research should be focused on for any real growth - on the development of tools for constructing distributed systems, not on newer/sturdier algorithms.

Posted by: Thanumalayan Sankaranarayana Pillai | March 9, 2010 01:15 PM

PAXOS MAKE SIMPLE
- Summary:
The paper present Paxos in a simple easy to understand form, together with an example showing the usage of PAXOS in implementing distributed system.

- Problem:
Consider a systems of 3 kinds of agents: proposers, accepters, and learners. Proposers propose values, accepters accept may proposed values, and learners learn the accepted value. The crux is making sure that a proposed value is eventually accepted, and the learners will eventually learn that value.

- Contribution:
I think the major contribution of Lamport in this paper is that he describe d the Consensus algorithm in a very "easy to understand" way. He also leads readers through simple proof of that algorithm. Basically, the algorithm consists of 2 phases:
* Phase 1: a proposer selects a proposal number n and send a prepare message (containing n) to acceptors. If the accepter receives a prepare message with proposal number n that is greater than any proposal numbers in its previously responded prepare message, it responses to that message guaranteeing that it won't accept message with proposal number less than n.
* Phase 2: if proposer receives prepare message from majority of the accepters, it send (n, v) accept message, with n is proposal number, v is value of the highest-numbered proposal among responses, or its own value, in case of no proposal. The accepter, once received an accept message, will decide whether accepts that proposal or not, based on the proposal number of prepare messages it has already response.

Then, in the learning phase, the accepters can send the accepted values to a set of distinguished learners, which in turn can inform all others learner about new accepted values. There is an issue of progress, which is solved by choosing a distinguished proposer. Lamport demonstrated the applicability of Paxos algorithm in building a state machine in a banking example.

Applicability:
- The algorithm guarantees that eventually a proposed value will be accepted, and that accepted value will be learnt. But I am concerned about how long will that process take. If you guarantee some thing eventually happens, but it take a long long time to go to that state, then it may not be useful in real implementation.

Posted by: Thanh Do | March 9, 2010 01:05 PM

"Paxos Made Simple"

Problem Description:-
---------------------
There are different ways to build replicated systems. We can have centralized control with a single master, but such a system is not fault tolerant. Hence, we need replicated state machine kind of model, which provides fault tolerance but makes life hard. Achieving consensus is important before performing any action is important in such a system.

Summary:-
----------
Paxos algorithm tries to achieve consensus in a distributed system by using a two-phase prepare/accept protocol.

Nodes in a distributed systems take the roles of proposers, acceptors and learners. Proposer send a prepare message numbered n to all the acceptors, and waits for a majority of them to respond. Acceptors repsond by returning the highest numbered proposal that they have accepted and a guarantee that they would not accept further proposals nubmered less than n. On a majority response, the proposer sends an accept message with a value derived from the responses of the acceptors. Learners participate in an algorithm to find out the consensus value. In the presence of multiple proposers, progress cannot be guaranteed. Hence, it is better to elect one of them as master, and to have a master re-election algorithm in the event of master failure.

The paper is again a classic Lamport style paper putting forth constraints and invariants and using them to guide the design of the solution.

Contributions:-
---------------
1) Theoretical solution to the consensus problem.
2) Master election algo as a simple use-case of the consensus problem.

Relevance:-
-----------
I believe that the consensus problem is more valid today as systems are being built truly distributed nowadays. Also, Paxos algorithm being well derived from the requirements of the problem provides correctness guarantees to rely upon.

Posted by: Laxman Visampalli | March 9, 2010 01:04 PM

Tha Paxos protocol aims at solving the problem of consensus in a distributed system. It is used for achieving fault-tolerance in a system with unreliable, heterogenous machines, and faulty networks. The problem being targeted is reaching an agreement on a value when there are multiple processes proposing multiple values in the above said environment. The "Paxos made simple" attempts to break down the algorithm into basic steps and builds on top of each step. It describes the set of requirements in such a system and finally arrives at an algorithm that satisfies all the requirements.

The consensus algorithm is split into two phases. In the first phase, the replica that wishes to act as a master sends a prepare request with its proposal number. Proposal numbers impose a total ordering among all the requests. A replica becomes the master if a majority of other replicas make a promise of ignoring requests from other proposers. After becoming the master, the proposed value is either its own value, or the one which the other replicas have already agreed on. This ensures that consensus is reached in a bounded amount of time.

Some of the points that I am still unclear about (from a practical standpoint) is how sequence numbers would be chosen and what would prevent a proposer from choosing an arbitrarily large proposal number so that it obtains the rights to be the master. It would have been interesting had some real-world implementation had been discussed along with the algorithm. The "Paxos made live" paper focused on this. The primary contribution of the paper was bridging the gap between the algorithm in theory and implementing it in a production system.

The paper discussed various implementation issues that the authors faced during the implementation of Paxos protocols while building Google Chubby, a fault tolerant database system with replication. Chubby uses the Paxos algorithm to achieve consensus on the entries to be entered into the fault-tolerant log. Some of the key ideas mentioned in the paper are master leases which attempts to maintain consistency by avoiding master changes during executions of Paxos instances. It also makes use of snapshots of the current data structures to limit the size of the log and to bound the time for recovery after a crash. Verifying the correctness and robustness of such systems is also a very challenging task given the different types of faults that can manifest. The paper discusses some approaches used like using a state machine specification language and multiple testing stages.

On the whole, the papers were very interesting to read and the algorithm seemed intuitive given the requirements. It also highlighted the difficulties in putting such methods into practice and gives an idea of the kind of problems that we could face when we go about implementing a system for the course project.

Posted by: Deepak Ramamurthi | March 9, 2010 12:14 PM

Paxos Made Live - An Engineering Perspective
Summary:
This paper describes how they built a fault-tolerent distributed database using the Paxos algorithm. It proposes solutions to various problems that they encountered in their implementation, and shows good measurement results.

Problem Description:
The Paxos consensus algorithm is one of the consensus algorithms that has been studied for long. The authors of this paper aims at building a fault-tolerance database base on the Paxos algorithm. However, there are signiﬁcant gaps between the description of the Paxos algorithm and the needs of real-world system. The goal of this paper is to give the experience of them solving the practical problems in their implementation and testing the system.

Contributions:
This paper describes various algorithmic and engineering problems, and their solutions. Because these problems are from their real experience, the paper provides very useful information for reference.
What I found interesting include: the implementation of master leases; the Epoch numbers which are used to reliably detect master turnover and abort operation; the ever-growing log problem and the snapshots to solve this problem; etc.
This paper also talks about the testing of the system, which is not fully discussed in previous works.

Applications:
This paper provides information to engineers on various issues in implementing the Paxos algorithm when building fault tolerance distributed systems. What I feel happy about this paper is that we can learn a lot of ideas for our course project, since we will implement something that is similar to Paxos, where we also have coordinators, and similar message passing.

Posted by: chong | March 9, 2010 12:04 PM

The Part-Time Parliament
Summary:
This paper tells us a great story on the functionality of a part-time parliament governed by righteous yet hedonistic people living hundred years ago, with their wise and cautious mathematicians easily solved one of the largest problems in distributed system: reaching consensus in a fault-tolerant environment, by designing a simple yet delicate Paxos protocol. The combination of the Paxos protocol and the president election protocol could guarantee both consistency and progress, which is important in lots of real world applications.

Problem Description:
Apart from the humorous language, the author is actually trying to solve a fundamental distributed system problem: reaching consensus in a fault-tolerant system. Here, consensus implies both consistency and progress. Why is this problem important? In addition to its real world application (will be discussed below), it also provides a new way to implement a consistent state machine in a fault tolerant environment. The algorithm “does not tolerate arbitrary, malicious failures… does not guarantee a bound-time”, but consistency does not rely on the 3m+1 loyal lieutenants as in the Byzantine problem. Therefore, this algorithm is suitable for a different demand and type of system.

Contributions:
The heart of the consensus problem lies in the Synod protocol, which is probably the largest contribution in this paper. In its simplest version (the preliminary protocol), the consistency is actually achieved by investigating what others have voted for before proposing a new ballot. Therefore, a minimum of 5 communications have to be made before a decree becomes known by everybody (The 5 communications are: Send NextBallot, Receive LastVote, Send BeginBallot, Receive Vote, Send Success). Although progress cannot be guaranteed in this situation, consistency is always made.

The progress could be guaranteed if there is a president who keeps proposing ballots in a maximum interval of 7 communications. Then there is a problem of electing a president. This paper also suggests a simple algorithm dealing with the election. Since the requirement of having only one president is not strict (having multiple presidents will delay but not block the pass of decrees), therefore simple algorithm will work as well.

For the sake of performance improvement, this paper also proposes other modifications, such as the multidecree protocol, and other application issues (hidden in the story), such as efficiently representing the decrees, the validity and expiration issues, etc. Although these modifications are important, they do not change the fundamental of the whole set of algorithms.

Applications:
This paper itself proposes two applications: the distributed consistent state machines and commit protocols in a fault-tolerant database. In fact, Paxos algorithm could be used in a wide variety of fault-tolerant situations. The advantage of Paxos algorithm (compared with Byzantine algorithm) is that: it does not require nodes to return something. Even if some nodes crash, the system could still progress, and whenever the crashed nodes return, they can learn the decrees. The disadvantage seems to be the unbound time of a decree to be guaranteed to be passed, hence not suitable in time-sensitive environments.

Posted by: Shengqi Zhu | March 9, 2010 11:41 AM

Paxos Made Simple:

The paper for Paxos Made Simple was indeed simple. It clearly explained the role of proposers and acceptors. There are distinct rules on how a proposer must propose, and who should propose. There are also set ways for how an acceptor may accept different proposals. In this way, that paper lays out the Paxos algorithm.

Consensus in distributed systems is a major problem. The Paxos algorithm comes up with a state machine style mechanism for achieving consensus between nodes. By proposing a "proposal number" to other nodes, and having them accept under certain conditions allows for consensus. The other issue with getting to acceptance is learning what others accepted. This is a completly different model, as it involves polling other nodes rather than making local decisions. It can simply be shown that both coming to accept a proposal and learning what was accepted is possible and progress is always made. These are the areas where the paper excels.

I did hope that the paper would try to apply the algorithm to a real system, so that the simple explanation would actually fit with something useful. It failed to compel a good use for the algorithm, rather just explained the basic points. I really couldn't figure how proposal numbers fit with anything, other than the process ordering that the paper attempted to get at. Really, agreeing on some value doesn't clearly point to anything in a system, but just some theoretical consensus of something. As such, the consensus is very possible, and happens in a given time, but it might not mean anything. So it should be shown that such a consensus on a proposal can be used to gain something useful.

I think if I had something in mind for a real system when reading this paper, it would be more helpful. But since I was just hoping learn what Paxos was all about, the simplicity of the presentation was generally lost. While simple, it was a concept that did not fit into anything, so it was hard to grasp. Fitting it to something real would have been much more useful.

Posted by: Jordan Walker | March 9, 2010 09:33 AM

Paxos Made Live
Problem
The problem described here is that how to implement Paxos for real world-app. Unlike pseudo code that less than a page, actual C++ code has several thousand lines of code. The reason is not lack of skill but missing gaps, which often ignored in academic paper. The paper comes up with missing gaps such as hard disk failures, memory failures, some of deployment failures. Also, testing of such a high availability system came into discussion.

Summary
The paper presents underlying algorithm that implements Google Chubby – distributed locking system. While Paxos is well-defined algorithm, there are missing features that should be considered in real world-app. The paper describes how to handle disk corruptions using checksums. Then, it describes master lease algorithm for optimizing normal usage. Epoch numbers are used to fill missing gap that master can transiently fail. Also, important optimization ‘snapshot’, which is basically memory dump, is presented. Other than optimization, the paper pitches on testing and deployment problems. They tried to avoid non-determinism as much as possible and devised elaborated test plan which uses failure injection.

Contribution and Applicability
While many so called ‘high availability’ paper asserts that their implementation is ‘simple’ and ‘practical’. Nevertheless, they failed in terms of practicality. However, Google Chubby at least had gain popularity inside Google. (In addition, Yahoo implemented Zoo Keeper that follows Chubby’s idea) The reason of popularity might be Google Chubby has simple interface. Although it was not described in this paper, Chubby’s interface is just plain file-system. In contrast, most of Paxos paper exposes a state machine. Converting existing algorithm into well-defined state machine is oftentimes non-trivial task. Sometimes, it is dreadful when existing system has complicated logics. In contrast, Chubby is easier to use. Rather than defining precise state machine, one can just adopt Chubby as reliable file storage. Google File System stores entire metadata, which can be seen as a set of small files, into Chubby. Also, one can easily implement various distributed locking algorithm, two-phase commit using file-system.
At the summary, the author pitches on that the academy should close the gaps between theory and practice as the compiler community did. This is nice assertion that academic person might be scared to write their own paper. In my opinion, a series of Lamport’s paper seems went too far on theoretical side without considering practice. (Please consider that I did not read ‘Paxos made simple’ yet)

Posted by: MinJae Hwang | March 9, 2010 09:18 AM

"Paxos Made Live"

Problem Description:
Replication is one of the techniques to achieve fault-tolernce which is commonly achieved using a consensus algorithm. Paxos is a family of protocols for solving consensus in a network of unreliable nodes. This paper discusses about the various practical issues in implementing a fault-tolerant replicated database (using a framework on top of Paxos algorithm) and handling faults like hard disk corruption, network failures, system crash.

Summary:
The paper explains the architecture of Chubby replica and explains and the practical issues in implementing the system. To detect hard disk failures, they propose checksum comparision technique and leaving a marker in GFS to indicate that its a corrupted one and not a new one. They also propose 'Master leases' approach to provide consistent read where they provide the master a lease for a duration which can be extended and till the master has the lease, it can service the read locally. The paper also proposes epoch number technique to reliably detect master turnover. It also proposes the snapshot technique to avoid an ever growing log. The paper also proposes the implementation algorithm for database transactions and discusses the various software methodologies implemented by them for developing and testing the fault-tolerant database.

Contributions:
1) This paper provides the practical implementation of Paxos algorithm
2) It provides insight to a lot of practical problems and scenarios like hard disk failure, inconsistency in read values, master turnover, network failures, system crash
3) Provides a consistent snapshot based algorithm to avoid huge logs.
4) Provides various test scenarios for the fault-based tolerance testing

Applicability:
Unlike the other theoritical papers, this paper discusses about the practical implementation and have explained the methodologies with proven results that the implementation is better. Also, they have tested a lot of real-time scenarios and proved that the system works fine. This paper is highly applicable in real-time and people can learn a lot out of this paper.

Posted by: Raja Ram Yadhav Ramakrishnan | March 9, 2010 08:55 AM

I read "Paxos Made Simple" because it seemed much simpler the the Part-Time Parliament paper and shorter. Given our previous experience with Lamport papers, I knew even the "simple" version would likely be complex and badly-written, and I was not disappointed :) I also read the Paxos Made Live paper because after reading the Simple paper, I couldn't get a clear idea about Paxos and wanted to see if and how it could be implemented. I got more about Paxos out of section 4 from the Live paper than the entire Simple paper!

The Simple paper is an attempt by Leslie Lamport to describe the Paxos algorithm in simple terms and to show how it derives directly from the properties it tries to enforce. The Live paper presents the experiences of three guys at Google who tried to build a service based on Paxos, and how it turned out to be much harder than originally imagined.

The Simple paper makes contributions in terms of giving a clear idea of why Paxos works the way it does, in addition to just describing the algorithm. The requirements are stated and invariants derived. To people who actually like theory, this must have been very interesting. It was still of interest to all others like me, because it showed what goals Lamport had in mind when designing Paxos.

The Live paper's biggest contribution is exposing the gap between theory and practice in distributed systems world, and calling the attention of researchers to developing testing tools for such systems. They detail how they solved a number of algorithmic and engineering problems, which will be of great interest to other people implementing similar systems.

I would say of the 2 papers, the Live paper has more applicability to real systems, because it has both the high level overview of Paxos, and the details interesting to developers. Paxos Made Simple is a good attempt to sell Paxos to more people, but ended up confusing me and making me wonder if it would actually work, given its complexity.

Posted by: Vijay Chidambaram Velayudhan Pillai | March 9, 2010 07:48 AM

Paxos Made Live:

Summary:
Paxos algorithm can be used in asynchronous systems to achieve consensus. The protocol is specifically designed to be fault-tolerant where machines can join(during recovery) or leave the group(on failures) dynamically. It guarantees safety and all the replicas have identical state, so achieves almost perfect consistency. The authors try to implement the algorithm on real system and highlight the protocol extensions and amount of work required to make it practical, although the basic algorithm is well established. The paper presents several algorithmic challenges encountered while implementing the algorithm and also presents better software engineering practices for better evaluation of the system.

The key algorithmic challenges stated are - To guarantee liveness by using master leases, periodically boosting the sequence number and making sure that coordinator change do not happen too frequently, using MultiOps to handle database transactions, using epoch numbers to handle coordinator failures while in midst of processing a request, handling disk corruption by using checksums, catch-up mechanism and efficiently handling snapshots.

Contributions:
Documenting the gaps between theoretical and practical work especially key problems faced and their solutions. Identifying the need for better testing and APIs for better ease of implementation in future work.

Designing state machine specification language and using a compiler to generate the fault tolerant algorithm.

Applicability to real systems:
This is useful in systems having strict consistency requirements. But building and testing the system seems to be a daunting task. Also, the system is evaluated purely on performance basis, and is designed to achieve close to 100% consistency. But as CAP theorem says, achieving 100% consistency should have impact on availability or tolerating partitions. So analysis with respect to these metrics for real workloads would have been helpful.

Posted by: Satish Kotha | March 9, 2010 07:38 AM

I chose "Paxos Made Live - An Engineering Perspective" becuase this engineering paper implies more on how we build a production system.

Summary:
This paper talks the experience and lessons learned from the implementation of an algorithm Paxos, which is used for addressing the fault-tolerant problem in distributed system. It actucally describes it process of converting a one-page pseudo into thousands lines of C++ code, as well as the challenges and the way they conquer them.

Problem:
The imperfections in the real world(e.g. hard disk failure, or finite resources) and the additional requirements (e.g. "master leases") make it a non-trivial endeavor to implement such a fault-tolerant log

Contribution:
- To address the hard disk failure problem: the author used checksum to detect the failure that file(s) contents may change and a new replica leave a marker in GFS to detect the failure that file(s) may become inaccessible. And then a replica with a corrupted disk can rebuild its state in a routinely way.

- To work around the risk of returning stale data involved in the read operation at the master, the author implemented master leases.

- To reliably detect master turnover and abort operations, the author introduced a global epoch number.

- To handle changes in the set of replicas, the author implemented "Group Membership" in the system.

- Since the repeated application of a consensus algorithm creates an unbounded amounts of logs, the author employed "snapshot" as an alternative way to record the sate of the data structure.

- Finally, the author tested the syatem in both safety mode and liveness mode. These tests cover the verification of the fault-tolerant log and proved to be useful verifies the fault-tolerant log.

Lessons from this paper:
- To convert a theoretical algorithm into a practical production system is difficult. You need not only throughly understand the algorithm, but also take into account much more aspects origines from the imperfection of hardware, the assumption behind the algorithm, the practical confinement with the implementation tools and the efficiency of implementation.

Posted by: Deng Liu | March 9, 2010 07:25 AM

review of “axos Made Live – An Engineering Perspective”
Summary:
The paper introduces google’s method of building fault-tolerant database using Paxos consensus algorithm. The authors guarantee the mutually consistent data on all replicas by applying the algorithm to build an identical log of input values on each replica. For the algorithm itself, the authors also adopt an increasing sequence number to assign an order to coordinators and use a catch-up mechanism to enable lagging replicas to catch up with leading replicas. When applying it to the practical system, the authors use a master leases mechanism to optimize read and guarantee consistency for write operation on master node, a global epoch number to guarantee the master status consistency and combine snapshot with log to reduce disk cost and reduce recovery time.

Problem Description:
Replication is an efficient way to achieve fault-tolerance on commodity hardware. However, due to the limitation of environment, like network delay, message loss, replicas failure and recovery etc, the replication will meet inconsistent problem in practical system.
In this project, the problem is the Chubby, a fault-tolerant system that provides distributed locking mechanism and stores small files in google, is based on a 3DB database, which shows many bugs in execution. They want to replace it with their own solution based on Paxos algorithm.

Contributions:
The main contribution is it provides experience to implement Paxos algorithm in a real distributed system. They also propose some interesting mechanisms:
1. Implementing master leases to optimize read while still guarantee write consistency for master node.
2. A global epoch number is used to guarantee the master status consistency
3. Combine snapshot with log to reduce disk cost and reduce recovery time.

Applicable to real system
I think this paper gives precious experience to build Paxos algorithm in a real distributed system. It points out those crucial problems in the implementation and their corresponding solutions.

Posted by: Ao Ma | March 9, 2010 07:23 AM

Problem addressed:
This is an experience paper where authors described the engineering effort made to realize a practical implementation of Paxos, a consensus protocol in the presence of faults. Their work makes many practical optimizations to enhance performance and also take into account challenges that arise out of non-ideal behavior of practical component of the system ( like corruption of non-volatile memory).

Summary:
This paper presents engineering aspect of implementing Paxos protocol for fault-tolerant consensus mechanism for Google's Distributed locking mechanism called Chubby. Practical implementation of the Paxos algorithm required to take into account of corruption/failure of the disk, unlike its theoretical counterpart which can assume fault free non-volatile medium to store persistent data. The authors addresses this issue by having check-sum with stored data and to identify total disk corruption, they left marker in GFS during start-up which can be later read to distinguish between disk failure and initial start-up. They use concept of Master lease to reduce the critical path for read operation. The basic idea is that if the master server is guaranteed to keep its status for at least a minimal time period (called lease), then reads can be serviced locally from master replica without consulting the other replicas. They also use the application specific snapshot of the Paxos protocol state to constrain the size of the log of the operations. This application-specific snapshot needed to be meaningful with respect to the Paxos protocol state captured by the framework.

Short summary:
This paper discusses engineering effort required to have a practical realization of Paxos, a distributed consensus protocol in presence of faults. The paper discusses issues and possible solution to practical implementation considerations like corruption of non-volatile media and several practical optimizations like Master lease to reduce critical path delay of read operations.

Relevance:
This work is relevant with respect to building large fault tolerant services and this paper explores many practical life implementation considerations and issues for realizing theoretical distributed consensus algorithm like Paxos. This paper brings forth the importance of bridging the gap between pseudo-code for theoretical algorithm for fault tolerant system and actual implementing issues and consideration, to realize the theoretical algorithm in practice. The paper also shows the importance of building algorithm and framework for testing these practical implementation of algorithms under real life constraints.

Posted by: Arkaprava Basu | March 9, 2010 06:20 AM

I read the paper "Paxos Made Live" as I wanted to know the amount of effort it takes to translate theoretical ideas into practical distributed systems. In the summary section, the authors share their concerns about the gap between the existing theoretical algorithmic work and the what is required by the developer community.

Problem:-
To design a fault tolerant database log for Chubby (a distributed locking mechanism) by creating a practical implementation of the Paxos consensus algorithm.

Summary :-
The paper presents the engineering challenges faced during the implementation of the existing Paxos consensus algorithm to create a fault tolerant log. It discusses the Paxos consensus algorithm and then presents some additional techniques used to create a functional system, such as the "catch up" mechanism, master lease and the heartbeat mechanism, the MultiOp primitive and the use of snapshots to discard the old logs. It also talks about some of the runtime problems and failures and their causes while running the system. The developers use a state machine specification language to create a C++ translation of the algorithm. The authors present some testing techniques for such systems (fault injection) to measure correctness. They also benchmark new system against the existing log implementation.

Contributions :-
It presents an actual implementation of the Paxos algorithm that is deployed in the Google data centers as part of the Chubby system. It also looks into some of the practical issues that are not considered in the algorithm. For example, the algorithm requires the entire log to be stored to replay the actions which is not possible due to limited disk space. The use of snapshots for solving this problem introduced another problem, i.e. consistency. Another contribution is the process of describing a fault tolerance algorithm using a specification language to remove any ambiguity due to imprecise definitions. The paper also states that it is very difficult to prove the correctness of a system in practice as compared to theory. It provides directions to system developers on how to test their systems, such as repeatability of faults for debugging purposes.

Applicability to real systems :-
This paper provides an example to developers that it usually takes a lot of effort to implement to an algorithm for distributed systems in practice. It shows that replication, fault tolerance, availability and consistency are non-trivial tasks and provides techniques to approach such problems by providing an engineer's perspective of such systems. Systems like distributed and parallel databases face similar problems and can gain a lot of insight through such studies.

Posted by: Ashish Patro | March 9, 2010 06:10 AM

Paxos Made Live - An Engineering Perspective
I chose to read this paper to share in the real life difficulties associated with distributed systems (and to get a break from Lamport's theoretical papers!).

Summary
The authors implemented a fault-tolerant log framework using the Paxos algorithm, which was then used to build a fault-tolerant database. The paper discusses numerous engineering challenges in implementing the Paxos algorithm, including real world failure handling, correctness, and optimizations.

Problem
A fault-tolerant system to provide distributed locking and small file writes, Chubby, was built using a questionable third party “fault-tolerant” database (3DB). 3DB had a history of bugs and did not appear to use a proven replication algorithm. The authors aimed to replace 3DB with a Paxos based system.

Contributions
The main contributions involve experiences with implementing the Paxos algorithm in a real distributed system. The authors point out several issues with the theoretical research. First, there’s often a disconnect between the faults assumed in the research and faults that occur, and need to be handled, in the real world. Second, due to extra optimizations and fault tolerance, it is difficult to verify algorithmic correctness. The authors employ a variety of practices and tests to increase confidence (none of which are new, but the attention to testing is admirable). Finally, the authors point out open problems in the fault-tolerance research community. The biggest charge is difficulty in implementing and verifying fault-tolerant systems. They suggest addressing this with better tools (pointing to the compiler community as a model).

Applicability
I think the benefit (applicability?) of this paper is helping other software engineers when they need to develop a distributed system. The paper points out numerous challenges and how they approached the problems. It can act as a lesson for future implementers. Most importantly, this paper shows that even a “simple” distributed algorithm (that is, short pseudo-code) is still highly complex when implemented in the real world.

Posted by: Jesse Benson | March 9, 2010 06:07 AM

For Part-time Parliament:

Lamport presents two algorithms in this paper. One is an algorithm to form a single consensus with unreliable nodes, the Synod protocol. The Synod protocol is equivalent to a three-phase commit protocol used in database transactions, except that it reaches an arbitrary decision.

The other algorithm achieves consensus for multiple decisions. The Paxon protocol is a simplification of what you would achieve by repeating the Synod protocol. At a high level, there are three operations: change the state, reliably (and slowly) read the current state synchronously, and to unreliably read the current state of some nodes, which may be out of date if they are not yet fully synchronized with the rest of the nodes.

There are only two places the system is not applicable. One is with respect to faults or security: the nodes are all expected to execute correctly, but at worst they may not receive or send messages. Another is with regard to topology, which is only partially addressed. If the president, the node in charge of presenting decrees, is inaccessible for some amount of time, then a new one can be chosen by some ordering of the nodes. The author also addresses how a node might be replaced. One way is to replicate the state of the old node onto the new node, which works fine. The other method requires that the nodes actually exist. If a majority of nodes are perpetually unresponsive, the system will fail to ever make progress.

The Paxon algorithms do not seem to make good progress unless a majority of nodes are in all quorums. It is certainly possible that a decision can be made with less than the majority, but condition B2 seems to make it very unlikely. If you require a majority for all quorums, then B2 is always true. If a quorum concists of just a small percentage of the nodes, then at least one of those must remain up for each ballot until a consensus is made: something unlikely. It's strange then that Lamport gives examples where a set of legislators will be in the parliament, and leave while another set of legislators enter. This violates the majority-B2 suggestion, making it potentially difficult to make headway.

I would imagine DNS systems could benefit from the Paxon algorithm. From what I remember reading, replicating DNS entries automatically cannot be done well. Most likely, this is only feasible for a gorup of (how do you say) single-site DNS servers. You may be in the business of shuffling hosts around at an alarming rate, like can be done at data centers that concist of virtual machines. I tiered hierarchy is much simpler, though. DNS is a good example, though, because it allows room for error. Data storage systems are obviously applicable, but nobody would accept an out-of-date file repository.

Μαρκος

Posted by: Markus Peloquin | March 9, 2010 06:01 AM

This paper "Paxos made Simple" presents consensus algorithm for distributed systems so that a one particular value is chosen among a set of process. The consensus is reached in the system by communication between three logical agents: - proposers, acceptors and learners.

The contributions of this paper is logical deduction of the consensus algorithm starting from a simple single acceptor agent to a set of agents employing two-phase protocol. The focus of the algorithm is not only to reach a consensus value but also a mechanism for the system to remember such a chosen value in the event of certain agent failures. The paxos algorithm has two phases: - prepare and accept. Each proposal contains a unique increasing number and a value. The proposer initiates prepare phase with number n to learn the highest numbered proposal that has been accepted as well as get a guarantee from acceptor that no proposal numbered less than n would be accepted. An acceptor might as well ignore a prepare request. In case of response from majority of acceptors the proposer then initiates accept phase with a value v which is the value of the highest-numbered proposal among the responses. The acceptor accepts request unless it has already responded to a prepare request having a number greater than n. A value is chosen at proposal number n iff majority of acceptor accept that value proposed in phase 2.

One of the applications of such a consensus, would be in the parallel database systems where a consensus for a transaction would be required to be reached among different nodes in the system as to whether a commit / abort has to be performed. In such cases the algorithm could be extended by requiring the transaction commit to be consensus value, required to be accepted by all nodes. However for systems which provide partial services, in the presence of few partitioned nodes, it seems that consensus may not be reached at all until majority of acceptor nodes are active. In such cases the total nodes in the system among which consensus have to be reached becomes variant, which is not inherent in paxos algorithm. The algorithm's effectiveness also depends on the number of nodes that are designated as acceptors and proposers. This is because of the amount of messages being exchanged between proposers and acceptors grows rapidly with increasing proposers. Hence it may be advantageous to designate some of nodes as passive acceptors rather than making every node in the system assume all the three roles. The passive acceptor could then accept the incoming proposals based on the additional information of its local state.

Posted by: Rajiv | March 9, 2010 05:16 AM

I decided to go with Paxos Made Simple because I wanted to see if Leslie could really write something simple and easy to understand.

The focus of this paper is to better explain the "simplest and most obvious of distributed algorithms." The algorithm's description, requirments, and guarentess are laid out in plain english.

The problem, of course, being that the orignial paper was written in greek, an language. This made it difficult for ordinary people to understand which in turn undermined the usefulness and impact of the paper. (jk) It boils down to coming to a consensus on a single, actually proposed value eventually. Also no value should be learned if it wasn't the subject of a consensus.

In the progress section the importance of having a distinguished proposer is addressed. There is a case where two proposers could keep incrementing the proposal number one after the other ensuring that neither will ever get a proposal excepted. I'm not a huge fan of naming unique nodes or introducing single points of failure into distributed systems. It doesn't seem to be in the right spirit to designate commanders or leaders of systems regardless of their ability simplify implementations.

While I'm sure its not directly applicable to real implementations, I'm sure with a little tweaking it could work.

Posted by: Jeff Roller | March 9, 2010 04:39 AM

The paper "Paxos Made Simple" presents the paxos algorithm, in a manner that is less cute than in "The Part-Time Parliament" but far more understandable and containing less Greek. Paxos is an algorithm for keeping machines in the same state even when finding a consensus is difficult due to machines crashing and coming back. To do this, each one must agree on a certain set of events in a specific order.

The problem that the paper addresses is that if there is a distributed system of machines, it is often desirable for them to be replicated state machines. In order for this to work, each machine must agree on the transitions and their order. The paxos algorithm allows a consensus to be reached, even though at any given moment, there may be some nodes that are not reachable. There are several safety requirements to be met: only a proposed value can be chosen, all nodes must agree on the same value, and nodes only learn that a value has been chosen if it really has been. There is also a property that the algorithm should make progress, except in some rare situations like choosing multiple new leaders after the old one has failed.

The paper contributes a practical algorithm for keeping a replicated system consistent. It consists of proposers, acceptors, and learners. One of the nodes can be the leader, who is both the distinguished proposer and the distinguished learner. The nodes can have different weights, for example if some are expected to be more reliable than others. Proposers send proposals to quorums of acceptors; quorums must be a majority of participating acceptors.

The paper does not address byzantine failures; it limits the ways that machines can behave badly to just crashing and restarting. It also assumes that information can be stored in persistent storage and that it will survive a crash. Still, this algorithm seems very applicable to actual systems, and I think there are several existing systems that actually use it.

Posted by: Lena Olson | March 9, 2010 04:21 AM

"Paxos Made Live - An Engineering Perspective"

Summary:
This paper describes the implementation of the Paxos consensus algorithm for use by a fault-tolerant database service at Google. The paper first gives background on Paxos. It then describes the numerous challenges the authors encountered in its implementatino. The authors also describe their software engineering practices and how those impacted the development of the algorithm's implementation. Finally, the authors present quantitative results of their deployed system.

Problem description:
For over a decade, Paxos has been a commonly cited algorithm for achieving consensus in a distributed environment. Yet actual, industrial-strength implementations of the algorithm are apparently few -- certainly there is no abundance of literature on the subject or the authors would have cited it. This paper attempts to fill this void by describing in detail the experiences of a team of developers at Google in developing the algorithm's implementation for use in their clusters.

Contributions:
This paper has several contributions: First, it describes the challenges of implementing Paxos in a large-scale, real-world cluster setting. Second, the paper provides a very detailed account of the authors' own experiences deploying their Paxos implementation. Such an account is exceedingly valuable to any other developer looking to deploy Paxos. Finally, the authors list a number of shortcomings in the field of distributed computing that make the development of a distributed computing protocol so challenging. The authors urge the wider community to put more effort into developing tools for implementing and testing distributed algorithms and their implementations.

Applicability:
The applicability of this paper is evidenced by its source. Google uses -- or in any case used -- Paxos internally to implement a fault-tolerant consensus algorithm in their clusters. The "lessons learned" component of the paper has clear applicability as well, to any who seek to follow in their footsteps. The likelihood that that others will follow in their footsteps seems high, as distributed computing is becoming increasingly more common and necessary for scalable internet services.

Posted by: Marc de Kruijf | March 9, 2010 04:17 AM

This paper (Paxos made simple) explains the original paxos algorithm/protocol in a much simpler terms.

Paxos protocol is for coming to a consensus among the network of unreliable nodes. The main challenge is that the nodes can experience failure at any point of time or the communication medium between the nodes are not reliable or the nodes can revive after failure and come back into the network. The state machine approach had been used along with this consensus algorithm to enhance the fault tolerance of the system. The protocol ensures progress/liveness and consistency.

The main protocol is divided in to two phases. In phase 1, the proposer tries to get the promise of the acceptors for not to accept any of the proposals whose number is less than that of the one that is being made now. This is done by sending the prepare request by the proposer to the acceptors. In phase 2, the proposer analyses the values that it obtained in response to the prepare messages and chooses the value of the highest numbered proposal less than the current proposal number in consideration. The chosen value is sent to the acceptors through accept requests if it had to not promised to any other proposal whose number is greater than the current one.

Various optimizations had been made in the original protocol like creating posts like leader, learner which helps in reducing the conflicts and thereby reducing the network contention. Also, the prepare messages can be avoided by the leader after the intial rounds.

Paxos is being used by many systems to maintain consistency across data and thereby producing fault-tolerant devices. Good number of companies/products using this protocol is given in wikipidea. The concept of majority acceptance (weighted nodes as mentioned in the original paper) has been used in many other algorithms too.

Posted by: Sankaralingam Panneerselvam | March 9, 2010 04:07 AM

I read the "Paxos Made Live - An engineering Perspective" as this describes a real world application of the consensus algorithm and the issues encountered in the implementation process

Summary :
The Paxos algorithm has been extended and improvised for a production database to acheive consensus among a set of replicas.

Problem Statement:
In a distributed system, it is necessary to enssure that all the data replicas agree on the values stored. Consensus algoritms are employed to attain this goal. While implementing these algorithms may appear simple in theory, the real world systems present a number of challenges. One of the main reasons for this is the simplifying models assumed for theoretical purposes like the fault models.

Contrubutions:
Google's Chubby provides replicated fault tolerant distributed locking and storage.

- Implementing master leases to optimise the reads as reads form a large portion of the DB operations.Thus the master has the most recent information most of the time and the reads are not serialized wrt updates.

- Using snapshots to limit the size of the replicated logs abd also the recovery time for a recovering replica

Applications:
This paper demonstrates how Paxos algorithm has been applied to many systems including the databases to implement conses and acheive consistency over all the replicas.

Posted by: Neel Kamal M | March 9, 2010 04:06 AM

Review of Paxos Made Simple

The Paxos algorithm gets a set of processes to reach consensus on a single proposed value. It can work in an asynchronous network without byzantine assumptions. It tolerates failures and handles real world problems like replication, delay, and dropping problems of message passing. It doesn’t require the guarantee that all received messages are correct.

We always use lock to manage concurrency. But in a distributed system, processes can’t issue lock directly. They have to elect a master by reaching consensus and also handle failures. This paper need to answer the question how to reach consensus in distributed system that have failure tolerance.

In the Paxos algorithm, Processes can play three roles: proposers, acceptors and learners. Processes can play one, two, or all three roles. It is easier to understand if we think these roles as being separate. Safety requirements must be met. A simple and straightforward algorithm is derived from those requirements. In the first phase, proposers send prepare requests to acceptors. A proposal consists of a unique identifier and a proposed value. When an acceptor receives a proposal, it disagrees to accept any proposal numbered less than the one it received. The acceptor notifies the proposer if it has agreed to only accept higher numbered proposals. If a proposer gets promises from most of acceptors, it will send an accept request. The acceptors will accept the proposal assuming accepting it will not lead to any violation of any previously made promises. When an acceptor accepts a value, learners are notified and keeping track of the state of the system and the chosen value. But we can still be unable to guarantee that Paxos algorithm to converge to an answer. A failure of a majority of acceptor agents will lead to diverge. Paxos handles multiple fail-stop failures correctly. It imposes only “essentially optimal” network load. It can also be used to keep a collection of state machines consistent.

Paxos aims to arrive at a consensus safely, but not to consider one proposal supervisor over another. It can be used to implement state machine and also very applicable to elect a machine to be master or take the token.

Posted by: Lizhu | March 9, 2010 03:33 AM

Goal:
This paper shows how Paxos algorithm is extended and implemented in a real production system. These extensions were made to increase system throughput and address various kind of failures in real world usage.

Problem:
Most theoretical distributed fault-tolerance algorithm is simple to describe. However, it might be complex to implement due to some constraint in production system. This raises the possibility that the implementation is not correct because it usable to use proofing method to verify the system.

Contribution:
This paper describes the reimplementation of Google’s Chubby by using Paxos algorithm to implement a fault-tolerance log which is used to provide database replication. Then, Chubby exposes its fault-tolerance capability to its client as a distributed lock service. Various contributions are made as part of the system design that made Paxos practical for production usage.

Paxos require disk write at the end of the phase of each Paxos instance. The propose phase of the algorithm is removed by using a master lease to establish a single master (proposer) in the system. Thus, client library ensure that all request go to current master only. Additionally, epoch number is associated with the request to detect master turnover during the request.

In order to limit the length of Paxos log, a database snapshot must be taken periodically. The snapshot is also use to allow other replica to perform catch-up after long failure.

Due to the complexity of the algorithm, the actual implementation is done by writing in the custom state machine specification language. This allows verification and change to the protocol to be much easier than direct implementation. Additionally, various ad-hoc fault injections are used as the testing mechanism because the system is too complex to use formal proofing techniques. Initially, a single thread version of the system is used to simplify test reproducibility and multi-threaded version is introduced later to increase system throughput.

Application:
This paper shows that a complex distributed fault-tolerance system can be implemented based on simple primitives that use proven algorithm. For example, a fault-tolerance log in this paper can be used as a building block for building any service that requires distributed replications.

However, people tend to build the system from scratch because they can further optimize the algorithm by tuning it to fit their requirements. This problem can be solved by adopting an approach adopted by parallel programming community. Various concurrent data structures are implemented by experts which are tuned toward specific usage.

Posted by: Thawan Kooburat | March 9, 2010 03:29 AM

Summary:
In an effort to read all of Lamport's most famous papers, I choose to read The Part-Time Parliament. This paper describes the Paxos algorithm, an efficient algorithm for maintaining a consistent distributed database in the presence of unreliable nodes. Paxos differs from previous consensus algorithms (like The Byzantine Generals) by trading a more limited failure model for implementation/runtime efficiency.

Problem:
Assuming that the primary mode of node failures is lack of communication (nodes can enter and leave the cluster without warning), we would like every node to agree on a consistent sequence of committed events. In other words, the goal of Paxos is to drive each node in a distributed state machine to the same state. We would also like to maintain a total order among committed events so that one update can supersede previous values at the same 'location'.

Contributions:
As noted in the paper, much of the Paxos algorithm already existed in contemporary or previous work (not necessarily known to Lamport at the time). This theoretical work describes a relatively efficient algorithm that implements a sort of temporarily ordered consensus system.

One interesting detail described in the paper is the presence of a fast-read and the slow-read. The fast-read sacrifices consistency for speed by assuming that the node contains current state. The slow-read actually performs the full Paxos algorithm to ensure consistency before returning the most up-to-date value.

The process of electing a president, as well as many issues related to a practical implementation are not discussed. For instance, network partitions and the resulting reconciliation process is not discussed at all. Assuming everyone agrees they are in the same parliament hall is perhaps unrealistic.

Practical Applications:
Paxos notably does not work in an asynchronous network. This is consistent with the real world where timeouts are common and easy to implement but data is still lost. This makes paxos practical enough that it is actually implemented in real systems.

Also, although the problem is less precisely defined than in other Lamport papers, Paxos maps almost directly to the problem of real-world distributed databases. This direct applicability is unusual for a theoretical paper like this. Even so, Lamport assumes a very limited failure model that does not include normal corruption events that occur in real systems (this can partially be solved by reliable transport, but this does not provide an end-to-end solution)

Posted by: Joel Scherpelz | March 9, 2010 03:07 AM

I read Paxos Made Live - An Engineering Perspective. I chose it because it is a practical rather than theoretical paper, and I am curious as to what exactly happens when one tries to implement Paxos.

Summary:
The goal of this paper was to describe the experience of implementing Paxos in a real-world system. The process of actually implementing Paxos rather than describing it in pseudo-code unearths several substantial problems not considered in the academic community. This paper describes just how the engineers dealt with these problems, and what is the final outcome of building Paxos.

The Problem:
The problem is building a consistent database. The details of the problem are different from the assumed problem in the academic papers on Paxos. Specifically, in academia it is customary to limit the types of failures and make other unreasonable assumptions. The high-level problem associated with this gap, is that developers augment original Paxos with necessary mechanisms to tackle the assumptions of theoretical Paxos. However, once the original Paxos is augmented, it is no longer guaranteed to be correct, as developers do not usually prove that augmentations will not affect the correctness of the protocol.

Contributions:
From my perspective, the major contributions of this paper is 1) showing just how much of a gap there is between theory in academia and practice in the industry, 2) outlining the shortcomings of academic proposals, such as disregard for the testing considerations. I can relate to the 1) consideration, just because in my experience there is an under-appreciation of just how hard cache coherence protocols can be in terms of implementation. And my experience is probably still better than that in the industry. So, under-emphasizing the implementation concerns can produce infeasible proposals.

On the more practical side of the contributions, the authors outline how they obtained reasonably-sized logs with snapshots, how disk corruption failures were handled, and how various problems associated with master migration were tackled.

I was actually not quite clear on how the leases worked. If the lease prevents update operations from completing and leases can be renewed, then when does the update actually get serviced? Even though MultiOp was emphasized, the practical application is not quite clear to me either.

Posted by: Polina | March 9, 2010 12:30 AM

I read Paxos Made Live - An Engineering Perspective. I chose it because it is a practical rather than theoretical paper, and I am curious as to what exactly happens when one tries to implement Paxos.

Summary:
The goal of this paper was to describe the experience of implementing Paxos in a real-world system. The process of actually implementing Paxos rather than describing it in pseudo-code unearths several substantial problems not considered in the academic community. This paper describes just how the engineers dealt with these problems, and what is the final outcome of building Paxos.

The Problem:
The problem is building a consistent database. The details of the problem are different from the assumed problem in the academic papers on Paxos. Specifically, in academia it is customary to limit the types of failures and make other unreasonable assumptions. The high-level problem associated with this gap, is that developers augment original Paxos with necessary mechanisms to tackle the assumptions of theoretical Paxos. However, once the original Paxos is augmented, it is no longer guaranteed to be correct, as developers do not usually prove that augmentations will not affect the correctness of the protocol.

Contributions:
From my perspective, the major contributions of this paper is 1) showing just how much of a gap there is between theory in academia and practice in the industry, 2) outlining the shortcomings of academic proposals, such as disregard for the testing considerations. I can relate to the 1) consideration, just because in my experience there is an under-appreciation of just how hard cache coherence protocols can be in terms of implementation. And my experience is probably still better than that in the industry. So, under-emphasizing the implementation concerns can produce infeasible proposals.

Posted by: Polina | March 9, 2010 12:25 AM

CS 739 Review Blog

Part-time Parliament/Paxos

Comments

Post a comment