Paxos Made Simple
Leslie Lamport. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
Review for this or other Paxos papers due Tuesday, 2/28.
« The Part-Time Parliament | Main | Paxos Made Live – An Engineering Perspective »
Leslie Lamport. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
Review for this or other Paxos papers due Tuesday, 2/28.
Comments
Review for the paper, "Paxos Made Simple"
This paper presents the theory and implementation of Paxos consensus algorithm, where a set of independent distributed processes try to agree upon a single value from a set of proposed values.
A distributed system involves many independent processes working together on a common goal to provide some service. Often the end-user is not aware of the distributed nature of the system and the system needs to provide an illusion of a single system. This in turn requires some kind of coordination between processes in many situations. One instance would be when all the processes want to arrive at a globally agreed view of the system that they want to show to the end-user (sequential consistency - global ordering of events). This problem is exacerbated by the presence of process and network failure in the system. This paper tries to solve this problem using a protocol (or algorithm).
The system model consists of many distributed processes which fall under three classes/types: proposer, acceptor and learner. The Paxos consensus algorithm has two phases.
Phase 1 - Prepare phase: In this phase the proposers send a number n to some set of acceptors and acceptors responds to a proposal n iff it is greater than the highest-numbered proposal (n-high) it has accepted, with a promise that it won't accept any proposal lesser than n and also the value of n-high. Note a proposer can send many such proposals.
Phase 2 - Accept phase: If the proposers receive response from majority of the acceptors, then it sends an accept request with one of the proposal number, n, which is the highest-numbered proposal among the responses and a value v . Acceptors accept the request only if it has not responded to prepare request with a greater proposal number.
This two phase algorithm helps for all the acceptors to arrive at a consensus value v. This value should be communicated to all the learners in the system. It can be seen that the knowledge about "phase" is known to only the proposers and which is independent of the phase that the other proposers are in. This can result in a infinite number of retries where each one trying to make others' accept request to not be accepted (since acceptors can see prepare or accept request from proposers in any order and the only constraint is that a prepare is ordered before an accept for the proposal with number n). This results in a livelock which can be solved by having only a single proposer in the system.
The author then suggests an implementation using a state machine and describes a design where multiple instances of the paxos algorithms can be used to arrive at a global ordering of a set of client commands.
It is indeed very promising that the paxos algorithm can achieve (at least theoretically) something like serializing a set of commands given to a distributed system, which is infact a fundamental and important problem in distributed systems. It should also be noted that this is achieved in presence of any non-Byzantine failures. But it should be noted that the idea presented has some loose ends which is not addressed in this paper. Firstly, the algorithm is abstract about the "set of acceptors" or "majority of acceptors" a proposer needs to send (should it be more than half or 2/3rd of the acceptors?). It is also not intuitive to involve only a subset of acceptors in case when using only one distinguished proposer. Secondly, it is not clear when the algorithm will terminate, or when the set of acceptors can safely decide to distributing the accepted value to the learners. It might also have many concurrency issues (which is hard to see without at least simulating the algorithm in a real environment) since the phases are not actually a "global" property rather a per process state. Finally, the assumption that a facility exists for choosing a unique proposal number (though the final version is using only one distinguished proposer which solves it), is in itself seems like a consensus problem.
Distributed system definitely is always in need of a consensus "protocol", where at the end of this protocol a consensus of a set of values can be reached and this paper suggests one such possibility and theoretically prove it that it indeed solves the consensus problem. Though there are some practical issues and performance issues (how fast an implementation can be in arriving at a consensus) that need to be solved to be useful in reality.
Posted by: Venkatanathan Varadarajan | February 28, 2012 08:00 AM
This paper proposes an algorithm to solve the consensus problem in which a network of systems are required to choose a single value out of the set of proposed values. The consensus problem is especially important in distributed systems because of the lack of a centralized point of control which makes it difficult to arrive at a single decision among all systems.
The Paxos algorithm works in two phases. In the first prepare phase, a proposer sends a prepare request with proposal number n to a majority of acceptors. If the acceptors have already responded to any number greater than n then it ignores it. Otherwise, it responds with the value of the highest numbered request that it has accepted, if any. In the second phase, the proposer once it receives reply from the majority of acceptors it sends an accept request to them with a proposal number n and value v that it had received. If it hadn’t received any value then it generates its own value. The acceptor then accepts the request unless it has responded to any prepare request greater than n. Once a decision has been taken by the acceptors independently they send out messages to the learners about the decision. To enable lesser traffic a distinguished learner is chosen that spreads the decision to others.
The algorithm makes the assumption that failures of nodes may happen and the network is unreliable (i.e) packets can be delayed or lost but not modified. There are a few drawbacks in the original algorithm. The algorithm will not make progress when two proposers keep proposing alternatively with increasing numbers with no decision being made. The paper argues that without the distinguished proposer they may not be progress but it is still safe and suggests that this can be solved by making the proposer distinguished. I think this is a great limitation on the algorithm as the actual consensus problem itself arises only when there is more than one proposer. So the actual essence of the distributed-ness of the algorithm comes into play only when there is a failure or when more than one distinguished proposer is elected at the same time (which again think is going to be another consensus problem). In such a case there is every possibility that the algorithm doesn’t make any progress as stated above. Also, when two or more proposer generate a prepare request with the same proposal number simultaneously, the algorithm is not going to work as two different values may get accepted. To prevent this it is suggested that a disjoint set of numbers are to be chosen among the proposers. This is again a consensus problem but since the set is not dynamically changing it is not going to be an issue, except in cases where proposers are added to the system.
I am not quite sure how the paper claims that this algorithm meets the safety requirement that only a single value is chosen when two distinguished proposers happen to exist at the same time. For example let us consider the case where there are 3 acceptors (A1,A2,A3) and 2 proposers (P1,P2). If P1 sends prepare with n to A1 and A2 and gets a reply with no value returned and P2 sends prepare with m>n to A2 and A3 and gets a reply with no value returned and sends the accept message. A2 and A3 accept the proposal m. Now P1 sends the accept message and A1 accepts it because it is unaware of the state of the other two. If a network failure happens and A2 does not receive P1 and keeps retransmitting. But by then since A1 has accepted a value it communicates to its learners about n while A2 and A3 would communicate about m. (This is based on the understanding that the acceptors do not decide on the fact that when a consensus is made and so communicate the decision to the distinguished learner as soon as every decision is made). Doesn’t this violate the safety condition?
In conclusion, the paper tries to solve the consensus problem with a phase approach. But I have my own doubts if it would solve the actual consensus problem handling all the failure cases.
Posted by: Madhu Ramanathan | February 28, 2012 02:13 AM
Paxos is a distributed systems protocol the describes how to build a system which is free from inconsistency in a network that may contain members running asynchronously. The basic layout of this protocol consists of two distinct node types. Proposers (leaders) who can propose value changes and send request's for values being changed and Acceptors who can accept or reject these proposed changes. This protocol works by having the proposers issue a request (number N with value V) to the acceptors. The acceptors can accept this proposal (which in turn they agree to reject any proposal with number less then N) or reject this proposal if the acceptor has already accepted another higher value N proposal. If a quorum is reached with a majority of the acceptors then a proposal commit (accepted) message is sent to set the value V. Since this can happen asynchronously if an acceptor receives a accepted request for a value N less then some other value it has promised to accept it can ignore that accept request.
There are a few issues I have with the algorithm as described in this paper. The issue brought up when dealing with multiple proposers getting caught in an infinite loop (where they each continually issue higher request N's without any being accepted) didn't seem like it was fully resolved in this paper. There method of selecting one distinguished proposer seems like it could cause issues if that proposer fails. They didn't seem to suggest a way for the backup proposer to know if this has occurred so there could be instances where there is no availability in the network (where the main proposer fails in some manner and its not detected by the backup). Another issue I have with this algorithm is that it can significantly delay the actual acceptance of a value due to some nodes already containing higher value N proposal's. Since this algorithm is designed to run in an asynchronous network, some slower nodes may contain proposal values of higher N then the proposal being issued by the proposer. These slow response times seem like they could be an issue in cases where a proposer fails and a backup proposer is brought online (causing significant delay in finding an N large enough for a proposal to be accepted).
Even though this paper seems to have some issues, I generally enjoyed reading this paper and it was a very good attempt at building a distributed algorithm that could operate on networks with asynchronous links. Other versions of Paxos may address my concerns about some of the features of this distributed system.
Posted by: Benjamin Welton | February 28, 2012 01:53 AM
This paper proposes a (halting) fault tolerant algorithm to solve the consensus problem. It assigns a unique number to each proposal as the proposal ID, makes use of the order of proposals’ ID and enforces that a majority of acceptors must be involved to reach agreement.
It is important to be able to reach agreement when servers can go up and down. To avoid single point failure, we may use state machine replication so that when one server goes down, other servers can keep calculating and providing service. The requirement is that different servers must execute the same input sequences as the client provides. If servers do not have agreement on what is going to be executed, a command may be executed on server A, but lost to server B. Clients then may get erroneous results.
The main contribution of this paper is it reasons the problem and then comes up with a solution. By adding assumptions and keeping strengthening conditions, it comes up with the final two-step algorithm. It is very smart to assign each proposal a global unique ID and make promise that acceptors will not accept proposal with lower ID in the future. Compared to other ways of enforcing orders, this one is easy to implement and very efficient. We can assign {1,K+1,2K+1...} for proposals from server 1; {2, K+2, 2K+2...} for proposals from server 2 and so on. It is also beneficial to consider why there must be two steps in the algorithm. Suppose the
algorithm only has one step. For an proposer, it must send the value together with the proposal ID to acceptors. Unless the proposer happens to choose the same value as the one acceptors have accepted, the proposal cannot be accepted because that would violate P2. So it is necessary to have a prepare step to collect the value that the majority acceptors have accepted if there is one.
One flaw of the paper is it does not cover the situation where distinguished proposer can fail. Suppose distinguished proposer A fails and B is elected as the new distinguished proposer. After a while, A reboots and recovers. Then the system now actually have two distinguished proposers, which violates the assumption that there can be only one distinguished proposer. Running the election algorithm again can solve the problem, but before that time point, the system may not make any progress, which causes bad performance.
Paxos now represents a family of algorithms for solving consensus problem. More optimization is considered and some version of the algorithm can even tolerate Byzantine failure. It is widely used in systems that implement state machine replication.
Posted by: Xiaozhu Meng | February 27, 2012 10:35 PM
The paper talks about a protocol for establishing a consensus in a distributed
system. The protocol works in asynchronous systems which have only
non-Byzantine failures. The aim of the protocol is to have a consensus among
all the participants and ensure that everyone knows what the consensus is when
one is reached. The protocol involves a set of acceptors, a set of proposers
and a set of learners. The role of a Proposer is to propose a value and the
role of Acceptor is to reach a consensus among the values proposed by the
proposers and convey the value on which consensus has been reached to the
Listener. It does not matter which proposer wins, but only one should win and
the value of the winner is chosen as the consensus and propagated to the
listeners. The protocol proceeds in two phases. In the first phase, the
proposers chooses a proposal number and sends it as a part of the prepare
request to a set of acceptors. The acceptor responds with a promise
guaranteeing that they will not accept any proposal numbered smaller than to
this and the proposal with the highest number that has been accepted so far.
This makes sure that the proposal numbers that are used are monotonically
increasing as long as everyone is non-malicious. In Phase 2, once the proposer
hears back from a majority of acceptors, it then sends the accept request with
the value of the highest numbered proposal among the proposals it heard back
from the acceptors or its own value if there were no other proposals less than
his proposal number. The acceptor can accept an accept request if he has not
responded to any other prepare request with higher proposal number. Once a
value has been accepted, the acceptors can let the learner's know about them
in an efficient and reliable way.
The basic paxos algorithm guarantees correctness but not progress. It is
possible for a set of non-malicious proposers to never make a proposal that
will be accepted. The paper talks about using a designated proposer to
tolerate this and having the acceptors keep track of their current state by
persisting it to disk. They also describe a state machine that implements a
distributed system. This part of the paper is non-intuitive since usage of the
terms client and server is different from the standard context of distributed
system. It was difficult to perceive this and understand the rest of the section.
I think the paxos algorithm is very useful in forming quorum like in dynamo
paper where multiple values are available and some form of consensus needs to
be acheived. However, it all relies of choosing a leader who is the designated
proposer. Does this not boil down again to the basic problem that paxos
algorithm is supposed to solve ? - to choose one value(leader) among different
proposals(participants).
Posted by: Sathya Gunasekar | February 27, 2012 08:54 PM
In this paper, the author presented the Paxos algorithm in plain English.
Paxos is a protocol to resolve the consensus among a collection of processes. It makes sure that only one value among the proposed values will be chosen and learned by all the processes. It’s very important for distributed systems to have consistent data information. The reason the consensus it’s not easy to deal with is that on one hand failures of processes and message passing can cause inconsistency, on the other hand more communication can ensure consistency but it’ more costly. Paxos resolves the consensus in a efficient way and it’s used widely by industries.
Paxos is performed by three roles: proposers, acceptors and learners. Proposers send proposals to a set of acceptors. Proposers first send a prepare message to the acceptors with a sequence number n. And acceptors will respond a promise not to receive the proposal with sequence number not larger than n along with the previous value with the largest sequence number. Proposers then send the accept request with the value that has the largest sequence number or any value if no value returned by the acceptors. Once received the accept request and no other request has larger sequence number the acceptor store the value for the sequence number, and forward the value to the learners.
Since failures may happen, and other proposers may conflict with this proposer, there should be some mechanism to deal with these problems. In Paxos they can be detected by checking the number of messages received and the sequence number. And once collision or failure is detected, the proposer needs to retransmit the request with a larger sequence number. This ensures that there is always only one value will be accepted for a sequence number.
Another problem is that two proposers may compete and none of them can make progress. In that case a distinguished proposer may be needed, and only distinguished proposer can send proposals.
There are many optimizations that can be applied to improve the performance. In practice, usually the proposers, acceptors and learners services are run on the same server. Each server plays multiple roles. Informing the proposers the failure of prepare request is good for the proposers to start retransmit with higher sequence number earlier.
In the paper, the author also shows how to use the Paxos algorithm to generate the input command for the state machines.I think Paxos is a straightforward algorithm. It can be used for distributed system to keep the data consistent.
Posted by: Xiaoming Shi | February 27, 2012 07:40 PM