Distributed snapshots: determining global states of distributed systems
K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63-75, 1985.
Review this or logical time paper for Thursday 2/16.
Comments
Seth Pollen, David Capel, Victor Bittorf, and Igor Canadi
This paper discusses reliable multicast communication protocols that work in the presence of failures. The protocols also support high concurrency when possible.
The first communication primitive presented is Group Broadcast (GBCAST). GBCAST is used to maintain a consistent replicated process-group view: every group member keeps the whole view, and it is the same across all members. When a failure is detected, a GBCAST message is broadcast with the new group view, and when a member receives it, the member updates its local state. GBCAST messages are guaranteed to be ordered relative to other events in the same way at each member. This makes programming against it very easy, since the programmer doesn't need to worry about inconsistent states, and it also makes consensus-style decisions easy to implement - for example, electing the member with the least IP address as leader, since all members have the same information about all the others. The GBCAST protocol is designed not to carry data, but to handle membership and failures across the process group.
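The leader-election aside can be made concrete (the addresses below are invented for illustration): because the replicated view is identical at every member, each member can compute the leader locally, with no election messages at all. A minimal sketch:

```python
# With a consistent replicated group view, every member evaluates the same
# deterministic rule on the same data, so all members agree on the leader
# without exchanging any messages.  Addresses are hypothetical.

def elect_leader(view):
    """Pick the member with the least address, as suggested above."""
    return min(view)

view = ["10.0.0.7", "10.0.0.2", "10.0.0.5"]   # identical at every member
assert elect_leader(view) == "10.0.0.2"
```

Any deterministic rule works here; `min` simply matches the "least address" convention the comment mentions.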
The next protocol introduced is Atomic Broadcast (ABCAST). ABCAST provides an atomicity guarantee: a message is delivered to all nodes or to none. In addition, it provides ordering guarantees, so that two atomic broadcasts with the same label are received in the same order at all nodes.
The last protocol described is Causal Broadcast (CBCAST). The goal of CBCAST is to guarantee delivery ordering with minimal synchronization and the highest possible concurrency. The protocol is built on top of Lamport's potential causality: two events A and B are potentially causally related if A could impact B. If A is not potentially causally related to B, the two can be executed without synchronization, and their ordering is unimportant. The protocol uses clabels to further relax ordering even when A and B are potentially causally related. The authors thus let the programmer have ordering when it is absolutely needed while providing high concurrency when it is not - either because the programmer declared it unnecessary using clabels, or because the two events are not potentially causally related.
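Potential causality is often made concrete with vector clocks (a later formalization - the CBCAST machinery itself differs, so this is only an illustrative sketch with invented clock values):

```python
# Sketch of Lamport's potential-causality relation using vector clocks.
# The clock values below are hypothetical, chosen to illustrate the rules.

def happened_before(va, vb):
    """True if the event stamped va could have impacted the event stamped vb."""
    return all(a <= b for a, b in zip(va, vb)) and va != vb

def concurrent(va, vb):
    """Neither event could impact the other, so no delivery order is required."""
    return not happened_before(va, vb) and not happened_before(vb, va)

a = [2, 0, 0]   # an event on process 0
b = [2, 1, 0]   # an event on process 1, after hearing from process 0
c = [0, 0, 1]   # an independent event on process 2

assert happened_before(a, b)   # a could impact b: deliveries must be ordered
assert concurrent(a, c)        # safe to deliver in either order
```

Whenever `concurrent` holds, the broadcast layer is free to deliver the two messages in either order, which is exactly where the extra concurrency comes from.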
At our reading group, we discussed the main contributions of the paper and concluded that the two biggest are asynchrony and the powerful primitives to program against. However, we argued that in systems that communicate heavily with the outside world, asynchrony is not that useful, since we still need to wait until all updates propagate if we want to guarantee consistency. When the problem is naturally synchronous, this approach only degrades performance. As an example, we designed a replicated queue (the example used in the paper) without using the described protocols. Our queue had one master and a set of slaves replicating the queue. All requests are synchronous and the queue is strongly consistent. Since the protocols guarantee event ordering, a system built on top of them can accept requests at every replica, which is one advantage. However, we argued that a system built using the protocols adds considerable extra communication and would perform worse in both latency and throughput. Our system would need a separate protocol for failure detection, which could introduce small downtimes, since it takes time for a failure to be detected.
One class of systems we identified that would benefit from these multicast protocols is systems that do not communicate with the outside world very frequently. In that case, where performance does not degrade linearly with the number of clients' requests, internal asynchrony and concurrency together with the powerful ordering guarantees would both improve performance and make systems easier to develop. For example, we discussed scientific computing, which we believe could make use of the provided guarantees.
However, when discussing the usefulness of the approach, we could not name any application built on top of these protocols. Our conclusion is that primitives which guarantee customizable event ordering and provide failure handling for free are nice to have, but we believe that, for outside-facing services, there are simpler implementations for specific situations that perform as well as or better than systems built on top of them.
Posted by: Igor Canadi | February 20, 2012 06:47 PM
Distributed systems by their nature lack an easy method of central coordination and state measurement. If you want to observe the global state, one method would be to elect a single leader, stop everything while that leader sends out individual state requests, then resume once it collects and announces the results. This is a highly intrusive process. The authors of this paper introduce an algorithm for measuring the global state of a distributed computation that can run in the background and won't disturb that computation.
Their method requires a node to record its current state and send a marker to its neighbors while recording the messages it receives. Upon receiving a marker, a node is triggered to start this process if it hasn't already participated in this particular measurement. The result is not a precise global snapshot in the sense of capturing exactly the state of the system at a particular physical time. What it provides is enough for our purposes, as it gives a snapshot of a valid computation state that could have occurred during the measurement.
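The marker rules can be sketched as a small simulation (the process states and message contents are invented; the paper's model assumes channels are FIFO and error-free, which the `Channel` class mirrors):

```python
from collections import deque

MARKER = object()   # the distinguished marker message

class Channel:
    """FIFO, error-free channel, as the paper's model assumes."""
    def __init__(self):
        self.queue = deque()

class Process:
    def __init__(self, state, inbound, outbound):
        self.state = state            # local application state
        self.inbound = inbound        # incoming Channel objects
        self.outbound = outbound      # outgoing Channel objects
        self.recorded_state = None
        self.channel_states = {}      # channel -> messages recorded in flight
        self.recording = set()        # channels still being recorded

    def _record_and_send_markers(self):
        self.recorded_state = self.state
        for ch in self.outbound:
            ch.queue.append(MARKER)

    def start_snapshot(self):
        """Initiator: record own state, then record every incoming channel."""
        self._record_and_send_markers()
        for ch in self.inbound:
            self.recording.add(ch)
            self.channel_states[ch] = []

    def receive(self, ch):
        msg = ch.queue.popleft()
        if msg is MARKER:
            if self.recorded_state is None:
                # First marker seen: this channel's recorded state is empty;
                # keep recording the other channels until their markers arrive.
                self._record_and_send_markers()
                self.channel_states[ch] = []
                for c in self.inbound:
                    if c is not ch:
                        self.recording.add(c)
                        self.channel_states[c] = []
            else:
                self.recording.discard(ch)  # this channel's state is complete
        elif ch in self.recording:
            # A message sent before the sender's marker belongs to the
            # recorded channel state, not to the post-snapshot computation.
            self.channel_states[ch].append(msg)

# Two processes, one channel each way.
c_pq, c_qp = Channel(), Channel()
p = Process("state-P", inbound=[c_qp], outbound=[c_pq])
q = Process("state-Q", inbound=[c_pq], outbound=[c_qp])
p.start_snapshot()
c_qp.queue.append("m")   # Q sends "m" before it has seen the marker
q.receive(c_pq)          # Q handles the marker and relays it
p.receive(c_qp)          # "m" is captured as in-flight channel state
p.receive(c_qp)          # Q's marker closes the channel recording
assert p.channel_states[c_qp] == ["m"] and q.recorded_state == "state-Q"
```

The union of the recorded process states and channel states forms the global snapshot; how those records get collected afterwards is outside this sketch, as it is largely outside the paper.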
The method is simple, non-intrusive, and doesn't require a single source but can have multiple independent trigger sites. I like it.
One major fault of the paper is that it doesn't account for failures, either of system nodes or in the communication network. It is certainly not wait-free. These aren't fatal concerns, as the paper was innovative for its time in introducing a new state-measurement technique, but they are significant issues in the wider literature. Producing such a system in the presence of faults is a natural follow-up.
A parallel topic would be self-synchronizing systems. There, one does not need to measure global state to test for some property as that property is guaranteed to restore itself in some amount of time if it is lost through a fault.
Posted by: Brian Nixon | February 16, 2012 08:41 AM
This paper proposes a way to determine a distributed system's global state during a computation. It collects the states during the computation, then uses a user-designed predicate to draw conclusions about the final or initial global state.
The distributed system is described as a directed graph in which nodes are processes and edges are communication channels; messages are delivered over the channels between nodes. Both nodes and channels have states. Each node periodically records its own state (not necessarily synchronized with the others), while a marker is passed among the nodes. When a node receives a marker, it records the messages that arrived between the time of its last state recording and the arrival of the marker as the state of the incoming channel. Putting all these records together yields a global state S of the whole system. Then, by the proof of Theorem 1, there exists a sequence of events that takes the system from the initial state to the final state via the state S. The user designs a predicate y that guarantees the following property: for all reachable pairs S1 -> S2, y(S1) implies y(S2). We can then evaluate y on S. If y(S) is true, then since y(S) implies y(final), the property holds at the end, though we do not know y(initial); if y(S) is false, then since y(initial) would imply y(S), the property did not hold initially, and we do not know y(final).
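The inference pattern around y can be illustrated with a toy stable predicate (the state encoding below is invented): if y is true at the recorded state S, it stays true through the final state; if it is false at S, it was already false initially.

```python
# Toy stable predicate: "every process is done and no messages are in flight."
# Once true, no further event can make it false, so evaluating it on the
# recorded snapshot S is enough to conclude it also holds at the final state.

def y(global_state):
    process_states, channel_states = global_state
    return (all(s == "done" for s in process_states)
            and all(len(msgs) == 0 for msgs in channel_states))

snapshot = (["done", "done"], [[], []])        # hypothetical recorded state S
earlier = (["running", "done"], [["msg"], []])

assert y(snapshot)      # true at S, hence true at the final state
assert not y(earlier)   # false here, hence it was also false initially
```

Note that the conclusion only runs one way in each case, exactly as described above: a true y(S) says nothing about y(initial), and a false y(S) says nothing about y(final).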
Discussion: (1) This paper only discusses the two-process scenario by example. In the case of multiple processes, the topology is not a simple chain and there may be multiple message generators, so it is hard to generate a uniform marker for all messages. This algorithm only works for a chain topology, where the marker and the messages are sent from one end to the other. If A->B->C and D->E->C, with A and D as message generators, then no matter whether A or D generates the marker, for the marker to reach the whole system there must be a link on which the marker and the messages travel in different directions, and in that case the recorded state on that link will be wrong.
(2) The predicate needs to be designed by the user, which seems to be a strong assumption. One approach I can think of is to let the messages on the links flow to the head of the link and then simulate the nodes' behavior. Then we get a sub-final state on the nodes, on which we can judge the predicate. But in fact the nodes' behavior is not always predictable, because the information on a node is not fully collected. I hope the instructor can give a practical example of this algorithm.
Posted by: Wenfei Wu | February 16, 2012 07:23 AM
The paper describes an algorithm to determine the global state of a distributed system during computation. The algorithm is based on an insight into the relationship between the local states of the processes, the states of the communication channels, and the global state. It guarantees to capture a "meaningful" snapshot of the global state of the system (i.e., the captured state is one that is reachable from the initial state and from which the final state is reachable). The algorithm can also be used to detect stability conditions, which is useful for detecting properties of the system such as termination and deadlock.
The algorithm is simple, and it is presented using a simple model which makes it easy to understand. Yet it has some good properties. It does not require any kind of central control or global timing to capture the global state. Recording the state is done asynchronously at each local process with the help of marker messages. Upon receiving a marker, a process records its local state if it has not already done so. It also records the state of the incoming channel the marker arrived on before sending markers out to other processes.
Although the algorithm is simple, there could be difficulties in using it directly in practice. The model of the distributed system on which the algorithm is described assumes that the channels have infinite buffers, are error-free, and deliver messages in the order sent. This is not a practical assumption, because packets in the network can get lost. This must have been an even bigger problem at the time the paper was published, when networks were less reliable. The paper also does not provide an upper bound on the time taken for the algorithm to terminate. Therefore, it could happen that the captured global state represents outdated data by the time the algorithm completes.
Despite these possible problems, the basic idea of the algorithm --- using marker messages to asynchronously coordinate the actions --- could still be useful in practice. We would need to add some techniques to ensure that markers (and other messages) get delivered correctly. The model described in the paper only allows a single event to occur at a time, but in practice multiple events can happen at the same time; however, because the algorithm works with nondeterministic computations, this should not be a problem. In addition, the algorithm can be initiated by any one of the processes, so there should be some mechanism to coordinate initiations to avoid unnecessary computation and network traffic.
Posted by: Min Thu Aung | February 16, 2012 05:29 AM
The process of capturing a consistent global state of a distributed system is discussed in this paper. It describes how an undelayed view of the state of the whole system can be read. Rather than having every node broadcast its state to construct a global picture, the approach achieves quick capture of state information at the individual nodes by frequently prompting the nodes to record their states. The problem discussed here is detecting stable states of a system, such as deadlock or the termination of all processes of an application. If a system is in a stable state, it will remain in that state for a sufficient amount of time into the future, so assembling the recorded states of all the nodes depicts the state of the whole system.
The paper models state change in the system with a state machine, which I feel makes for easier understanding. There is a clear explanation of what kind of inconsistency can happen in the system. The solution seems to be to send markers after a fixed number of packets have been sent; once a marker is received at the other end, the receiving process is expected to record its own state as well as the state of the channel connecting the two processes. I also feel that the paper is too abstract in describing how it detects whether a state is stable; it would have been better to have an example.
Some of the problems I feel with the paper are:
1. The process of frequently recording state at the nodes, as well as broadcasting the data (which I think happens on each state change), might induce an overhead in the system. Global state detection shouldn't degrade the computation capability of the system.
2. Does the implementation do a good job of handling failures? What happens when there are failed nodes in the system? Does the system know a node is actually down, or is it in some stable state and merely unable to respond?
3. I don't know if the paper was meant to address this, but there is no clear explanation of whether the type of stable state can be determined by analyzing the nodes through this process.
The process of detecting the global state of a system is a tricky and interesting problem. The paper gives some good approaches for addressing this problem but I feel that the paper is incomplete on some levels as explained above.
Posted by: Srinivas Govindan | February 16, 2012 03:06 AM
Distributed Snapshots is a paper that proposes an algorithm for obtaining a full
"snapshot" of the state of all nodes and links in a distributed system.
The algorithm described is said to be useful to solve problems about the state
of the system which have a "true" or "false" answer and are stable properties.
This is not a trivial problem, since it is very challenging to achieve a full
picture of the state of a distributed system of computers. This is because each
computer needs to exchange messages with any other computer it wishes to know
state information for, and in the time it takes to gather all of this
information (especially in a system where there is no master machine), the local
states of machines could have changed significantly. Also, the fact that any
machine could fail before or after reporting local state adds complexity to the
problem, since it is impossible for another machine to know if a machine is not
running or just very slow. This full picture of state in a system is useful for
monitoring system health and debugging distributed algorithms.
The contributions of this paper are the "snapshot algorithm" and the proof that
the snapshot will return a valid, accurate state. The snapshot algorithm is
different since it uses "marker messages" to signify to other hosts to start
recording their own local state, with each process that receives a marker
recording the state of the connection it receives the marker on. This ensures
that the first message to be included in the snapshot does not get duplicated or
lost, as it could if the initiating process simply recorded the state of its
communication links.
As with the other paper we were assigned, the paper does not discuss failure of
any of the nodes, and it appears the algorithm would hang if it waits for a
response of global state from all nodes. The biggest flaw in my opinion is the
limited usefulness of the algorithm. It is certainly useful to get an accurate
global state from a distributed system for the types of problems mentioned
(checking if a computation has terminated, the state of tokens in a system,
etc.). However, these types of operations are used infrequently, and there is
no guarantee on the performance of the algorithm. Distributed systems used in
practice, whose performance determines the customer experience, cannot afford
operations that need to know a complete global state, so they usually have no
need for an algorithm like this.
Even though this algorithm is not heavily used by algorithms that value
availability more than consistency, checking for a deadlock or whether a
computation has terminated is useful for lengthy scientific computations on a
supercomputer or other distributed computer system. Developers of such a
large-scale algorithm may need to use this snapshot algorithm or one like it to
debug problems when scaling up, to see if a shared resource is staying in one
place or if a computation is making any progress.
Posted by: Evan Samanas | February 16, 2012 01:22 AM
Seth Pollen, David Capel, Victor Bittorf, and Igor Canadi
The paper presents a decentralized algorithm for taking a meaningful snapshot of the global state of a distributed system. This solves a significant problem, since taking a snapshot of a geographically distributed system will require some time to complete, and during this time the nodes in the system will continue to operate, changing state and sending new messages.
The snapshot produced by Chandy’s and Lamport’s algorithm can be collected without interrupting or rescheduling the system’s normal operation. Although the resulting snapshot may not indicate a state that was ever present in the system at any point in physical time, it will represent a state that could have come about in the course of the computation, and there will be no way for an external observer, using only the axioms of the system itself, to show that the snapshot state did not actually exist at some point in physical time. The nature of this snapshot is shown to be useful in determining the existence of stable properties in the system, since the system cannot (under normal conditions) exit a stable property once it has entered it.
The key contribution which makes the algorithm possible is the idea that messages can, in a way, be pushed back into a communication channel for the purpose of preparing the snapshot. When a snapshot is being taken of the system, each node records its state exactly once (when it first receives a snapshot marker from any other node). The node will subsequently receive markers on all its incoming channels, but it may receive normal messages on any channel before the marker arrives. For the purpose of the snapshot, these normal messages must be considered to be still in transit in the channel, since they did not have the opportunity to affect the node before it recorded its own snapshot state.
It hardly need be pointed out that this paper makes no provision for node or network failures. However, the algorithm given may still be able to produce correct results in the event of single-node failures. If a node fails, for example, after receiving its first marker but before the algorithm completes, any messages sent to it preceding a marker will be lost from the snapshot. But this may be the desired behavior, since those messages may have indeed been lost if they were sent to an unresponsive node. The algorithm also degrades gracefully in the presence of partitioned systems. If a node on one side of a partition initiates a snapshot, the resulting snapshot will represent only the nodes and connections on that side of the partition, and this is probably as good a definition of snapshot as we could ask for in a partitioned network.
This algorithm does seem over-designed, however, for many of the stable properties it offers to detect. Take, for example, detection of program termination. Termination is not only a globally stable property, but it is also locally stable: once a node’s computation has terminated, computation cannot resume on that node without some new administrative directive. Thus, global program termination can be detected easily by having each node, when it reaches the end of its computation, either report its termination to a central server or broadcast its termination to all other nodes. In this way, either the central server or all nodes in the system will eventually receive enough information to know that the global computation has terminated.
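The active-reporting scheme described above can be sketched as a central collector (node names are invented for illustration; the "new administrative directive" caveat would have to be handled out of band):

```python
# Sketch of termination detection by active reporting, which suffices when
# termination is locally stable: a terminated node never resumes on its own.

class TerminationServer:
    def __init__(self, nodes):
        self.pending = set(nodes)   # nodes that have not yet reported

    def report_done(self, node):
        self.pending.discard(node)

    def globally_terminated(self):
        return not self.pending

server = TerminationServer({"n1", "n2", "n3"})
server.report_done("n1")
server.report_done("n2")
assert not server.globally_terminated()   # n3 may still be computing
server.report_done("n3")
assert server.globally_terminated()       # every node has reported
```

For properties that are only globally stable, such as the token-ring example, this shortcut is unsound and something like the full snapshot algorithm is needed.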
Some stable properties may not be as easily fitted to this active-reporting model and may require more complex algorithms such as Chandy’s and Lamport’s. For example, the disappearance of tokens from a token ring is not a locally stable property. Just because a node has lost all its tokens does not mean it will not, in the course of normal computation, receive tokens again from other nodes. Thus, a central node trying to track the existence of tokens in the system cannot always be certain where the tokens are without resorting to the kind of snapshot algorithm proposed in this paper.
We wonder if this algorithm could find good use in connection with epidemic algorithms. If periodic anti-entropy is not used, then the cessation of a rumor’s spread through a system may be considered a stable property. Thus, a snapshot taken of the entire system could reveal which updates, if any, have ceased to be infections before reaching all nodes. It is possible, though, that this problem has enough local stability to make the full snapshot algorithm unnecessary.
To conclude, the algorithm in this paper is very well-reasoned and is highly scalable, since each node is responsible only for its own state and the states of its incoming connections. It is likely to be applicable to the detection of dynamic global properties such as token-ring invariants, but it seems over-designed for a large population of simpler distributed properties.
Posted by: Seth Pollen | February 15, 2012 08:27 PM
This paper describes an algorithm for determining the global state of a distributed system during computation with an eye toward detecting stable properties of the system. A stable property is one such that, once the system has the property, it continues to have the property forever after. A distributed system, in this paper, is defined as a set of nodes together with a set of unidirectional channels connecting them, such that each node and each channel is in some state at any given time, and each channel has associated with it a queue of messages.
The algorithm basically works by following a marker-sending rule and a marker-receiving rule. The paper shows that, using the distributed system model described, if the processes each follow these rules, then the states recorded by each process will together constitute a global state which is consistent with the computation of the system -- i.e., the global state recorded will be one that is a legal configuration of the system. The paper does not show that the global state recorded is "correct" in that it corresponds with the actual global state because the notion of an "actual global state" apart from a legal global state is undefined for the model of distributed systems that the authors describe.
Two flaws seem apparent to me. The first concerns equations (1)-(4) under section 3.1. In presenting the justification for these equations, the authors motivate them by giving examples of situations in which the recorded global state could be inconsistent such that, in those examples, if the processes obeyed the equations, the recorded global state would not have been inconsistent. However, this is not proof that (1)-(4) are sufficient or necessary for consistency.
The second flaw concerns the model of distributed systems that the authors use. It appears to me that, in the model, the processes are modeled as finite state machines. However, real computers are better modeled as linear bounded automata. This casts doubt on the proof of Theorem 1 later, because the authors seem to rely on the FSM interpretation of the processes. It is possible that the proof does not rely on this, but I will have to think more about it. Linear bounded automata are considerably more powerful than finite state machines, so if the authors really do rely on that aspect of their model, it strongly decreases the generality of their results. However, there would still be plenty of applications left in other areas (for example, deadlock detection) as long as it is understood that the process states being recorded constitute a small subset of all the properties of the process (e.g., whether it possesses a certain token or not). It is not at all clear to me, however, that this algorithm would work for, say, a distributed database.
Posted by: James Paton | February 15, 2012 07:02 PM
In this paper, the authors propose a way to determine the global state of a distributed system by recording only the state of each process and the messages between them, without affecting the computation.
Determining the global state of a distributed system is very important; for example, it can be used for termination detection and deadlock detection. Achieving this without a central coordinator is not easy. The distributed system discussed in this paper consists of a set of processes which communicate through message channels. The processes cannot all record their local states at the same instant, because there is no central clock. The authors devised an algorithm in which local states are recorded separately and collected later to form a global state.
In the paper, each process records its own state and the states of its incoming channels, and later all the records can be collected to generate a global state. However, on examining this global state, we may find that it never actually existed during the computation. The authors prove that this global state can be reached from the initial global state and can lead to the final global state. They also prove that it can be obtained by rearranging the events that really happened: if we move all the prerecording events before all the postrecording events, the recorded global state is the one between the prerecording and the postrecording events.
The major contribution of this paper is that it determines the global state of a distributed system using a very simple mechanism. Even though the global state obtained is not a real one, the authors have a clever way of showing its equivalence to a real computation sequence, which makes it just as useful for understanding the system's state.
The algorithm is very simple and easy to implement, and it does not interfere with the computational work of the distributed system. But the authors made some optimistic assumptions about the system. First, they assume that the message buffers between processes are infinite and that every message sent is eventually received, which in practice is not the case; we would have to devise a complicated mechanism, if one is even available, to handle message failures and make the original algorithm work. The authors also do not discuss collecting the records from all the processes in much detail. This can be a problem once node failures are taken into consideration, since they can lead to incomplete information about the global state.
In conclusion, the authors devised a very good algorithm that determines the global state of a distributed system in an elegant way. It is simple to implement, and each process only needs to record its own state and the states of its incoming channels.
Posted by: Xiaoming Shi | February 15, 2012 06:36 PM