
Paxos Made Live – An Engineering Perspective

Tushar Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live – An Engineering Perspective. PODC '07: 26th ACM Symposium on Principles of Distributed Computing, 2007.

Review for this or other Paxos papers due 2/28.

Comments

This paper gives an overview of the failures and implementation details involved in building a fault-tolerant distributed database using the Paxos consensus algorithm. Paxos lets a set of processes, divided into proposers, acceptors, and learners, agree on a single value even in the presence of failures; applied to a sequence of operations, it yields the consistent ordering a replicated system needs, by ensuring that earlier operations are agreed upon and executed before later ones. The protocol begins with a proposer sending a prepare message, carrying a proposal number, to a group of acceptors; an acceptor acknowledges (promises) if it is fine with that node acting as proposer, i.e. if it has not already promised a higher-numbered proposal. Once the proposer collects promises from a majority of acceptors, it can proceed to send its value, and each acceptor that has not since promised a higher number accepts it. Which value was actually committed is learnt from the acceptors, which forward the values they have accepted to the learners.
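The two-phase exchange above can be sketched in a few lines. This is a minimal single-value sketch with my own names (`Acceptor`, `run_round`), not the paper's implementation, and it elides networking and retries:

```python
class Acceptor:
    """Tracks the highest proposal number promised and the value accepted."""

    def __init__(self):
        self.promised = 0        # highest proposal number we promised
        self.accepted = None     # (proposal_number, value) we last accepted

    def on_prepare(self, n):
        # Phase 1: promise if this proposal number is the highest seen so far.
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return ("reject", self.promised, None)

    def on_accept(self, n, value):
        # Phase 2: accept the value unless we promised a higher number.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n)
        return ("reject", self.promised)


def run_round(acceptors, n, value):
    """A proposer needs promises from a majority before sending accepts."""
    promises = [a.on_prepare(n) for a in acceptors]
    if sum(1 for p in promises if p[0] == "promise") <= len(acceptors) // 2:
        return None  # no majority; retry with a higher proposal number
    # If any acceptor already accepted a value, the proposer must reuse
    # the one with the highest proposal number -- this is what makes a
    # chosen value stick across competing proposers.
    prior = [p[2] for p in promises if p[0] == "promise" and p[2]]
    if prior:
        value = max(prior)[1]
    acks = [a.on_accept(n, value) for a in acceptors]
    if sum(1 for r in acks if r[0] == "accepted") > len(acceptors) // 2:
        return value  # learners can now be told this value is chosen
    return None
```

Note that a later round with a different proposed value still returns the originally chosen value, which is exactly the agreement property the system relies on.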
The fault-tolerant system described in the paper maintains the state for Google's "Chubby" lock service. The main objective is a highly consistent system in which all operations are applied in the same order on every replica, with fault tolerance provided by the replicas themselves. The architecture is built on top of the Paxos algorithm at a low level. It is a master-slave design in which a coordinator (the master) broadcasts all requests to the other replica nodes; if the master fails, a new one is elected automatically. Running a Paxos instance for every write then gives a total ordering of requests. Reads are harder: serving a read locally from the master can be inconsistent, because the master might have changed since the last time it checked, yet running full Paxos for every read is far too expensive given the overhead of reaching consensus.
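One way to picture the "same order on all replicas" requirement is a replicated log where each write is assigned a slot and replicas apply slots strictly in order. This is a sketch under my own naming (`Replica`, `commit`); in the real system each slot's value would come out of a Paxos instance rather than being handed in directly:

```python
class Replica:
    def __init__(self):
        self.log = {}          # slot -> operation, filled by consensus
        self.applied = 0       # highest contiguous slot applied so far
        self.state = {}        # the replicated key-value state

    def commit(self, slot, op):
        # Record the consensus outcome for this slot.
        self.log[slot] = op
        # Apply operations strictly in slot order, so every replica
        # reaches the same state regardless of arrival order.
        while self.applied + 1 in self.log:
            self.applied += 1
            key, value = self.log[self.applied]
            self.state[key] = value


r = Replica()
r.commit(2, ("b", 2))   # arrives out of order; held back, nothing applied
r.commit(1, ("a", 1))   # fills the gap, so both slots now apply in order
```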
This is where I feel the paper does a good job of conveying that plain Paxos is not suited for high performance, and that some workarounds are needed to get a mix of good performance and strong consistency. The paper introduces master leases, whereby a master remains the master for some fixed amount of time and can renew the lease to extend that time as needed; while it holds the lease, it can safely serve reads locally.
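The lease idea can be sketched as a timestamp check: serve reads locally only while the lease is unexpired, and fall back to full consensus otherwise. The names and the 10-second duration here are illustrative assumptions, not values from the paper:

```python
import time

class MasterLease:
    """Illustrative master lease: valid until `expires`, renewable at will."""

    def __init__(self, duration_s=10.0):
        self.duration = duration_s
        self.expires = 0.0

    def renew(self, now=None):
        # Extend the lease from the current time.
        now = time.monotonic() if now is None else now
        self.expires = now + self.duration

    def holds(self, now=None):
        # The holder may act as master (e.g. serve reads locally)
        # only while the lease has not expired.
        now = time.monotonic() if now is None else now
        return now < self.expires


lease = MasterLease(duration_s=10.0)
lease.renew(now=0.0)
can_read_locally = lease.holds(now=5.0)     # lease valid: serve read locally
must_run_paxos = not lease.holds(now=12.0)  # lease lapsed: consensus needed
```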
I think the paper addresses a lot of aspects of fault tolerance and does a good job of combining them with performance-improvement measures. One example is snapshotting, which means the log need not be retained all the way back to when the database was first used. Once a snapshot of the current state is taken, all operations that precede it can be discarded, and recovery is faster too: the snapshot is applied as-is instead of traversing the log and reconstructing a consistent state operation by operation. On top of this, the implementation also takes care of atomicity for blocks of operations.
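Snapshot-plus-truncation can be sketched as follows; the class and method names are my own, and real snapshots would of course be persisted to disk rather than held in memory:

```python
class SnapshottingLog:
    def __init__(self):
        self.state = {}            # live key-value state
        self.log = []              # list of (index, op) since last snapshot
        self.snapshot = ({}, 0)    # (state copy, last log index it covers)

    def append(self, op):
        # Indices keep counting across snapshots.
        index = len(self.log) + self.snapshot[1] + 1
        self.log.append((index, op))
        key, value = op
        self.state[key] = value

    def take_snapshot(self):
        # Capture the current state, then discard every log entry the
        # snapshot already covers -- the log no longer grows unboundedly.
        last = self.log[-1][0] if self.log else self.snapshot[1]
        self.snapshot = (dict(self.state), last)
        self.log = []

    def recover(self):
        # Recovery starts from the snapshot and replays only the suffix,
        # instead of traversing the log from the very beginning.
        state, _ = self.snapshot
        state = dict(state)
        for _, (key, value) in self.log:
            state[key] = value
        return state
```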
Overall, I think this is a very good exercise in how to build a fault-tolerant, consistent distributed storage system. Blindly copying the principles behind the Paxos algorithm does not scale well, and deliberate design decisions have to be made to close that gap.
Some doubts:
1. Can this implementation be extended to have multiple coordinators in the system? This would of course require communication between the coordinators, to make sure they are all functioning as expected, to detect when one of them goes down, and so on. But it would make reads considerably faster.
2. I feel that a strict quorum (R + W > N) is almost an optimized version of the Paxos algorithm: quorum intersection guarantees consistency along with fault tolerance, but is a lot more performant. (A sloppy quorum, by contrast, relaxes the intersection requirement and gives up that consistency guarantee.)
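The R + W > N condition mentioned in the doubt above is simply an intersection argument: with N replicas, if every read contacts R replicas and every write contacts W, then R + W > N forces the read set to overlap the set that saw the latest write. A tiny check makes the boundary concrete:

```python
def quorums_intersect(n, r, w):
    # Any read quorum of size r and write quorum of size w drawn from
    # n replicas must share at least one replica iff r + w > n.
    return r + w > n


assert quorums_intersect(n=5, r=3, w=3)      # overlap guaranteed
assert not quorums_intersect(n=5, r=2, w=3)  # a read may miss the write
```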
