
Reliable communication in the presence of failures

Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems (TOCS), Volume 5, Issue 1, Feb. 1987. You can skip section 4 (Implementation).

Review due Tuesday, 2/21.

Comments

Seth Pollen, David Capel, Victor Bittorf, and Igor Canadi

Review
This paper discusses reliable multicast communication protocols that work in the presence of failures. The protocols also support high concurrency when possible.

The first communication primitive presented is Group Broadcast (GBCAST), which is used to maintain a consistent replicated process-group view. Every group member keeps the full process view, and it is identical across all members. When a failure is detected, a GBCAST message is broadcast carrying the new group view, and each member updates its local state upon receiving it. GBCAST messages are guaranteed to be ordered relative to other events in the same way at every member. This makes programming against it easy, since the programmer doesn't need to worry about inconsistent states, and it also makes consensus-style decisions trivial: for example, electing as leader the member with the lowest address, since all members have information about all others. The GBCAST protocol is designed not to carry data but to handle membership and failures across the process group.
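A minimal sketch of the leader-election trick this enables (illustrative only, not ISIS code; integer ids stand in for process addresses): because GBCAST installs the same view at every member in the same relative order, each member can apply the same deterministic rule locally with no extra messages.

```python
# Sketch: leader election over a GBCAST-maintained group view.
# Assumption: every member holds an identical view, so min() agrees everywhere.

def elect_leader(view):
    """Every member independently picks the member with the lowest address."""
    return min(view)

class GroupMember:
    def __init__(self, view):
        self.view = set(view)

    def on_gbcast_failure(self, failed):
        """A GBCAST announcing a failure installs the new view at every member,
        ordered the same way relative to other events; re-elect locally."""
        self.view.discard(failed)
        return elect_leader(self.view)
```

Since every member runs the same rule on the same view, all members agree on the leader without communicating.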

The next protocol introduced is Atomic Broadcast (ABCAST). ABCAST provides an atomicity guarantee: a message is delivered either to all operational nodes or to none. In addition, it provides ordering guarantees, so that two atomic broadcasts with the same label are received in the same order on all nodes.

The last protocol described is Causal Broadcast (CBCAST). The goal of CBCAST is to guarantee delivery ordering with minimal synchronization and maximal concurrency. The protocol builds on Lamport's potential causality: two events A and B are potentially causally related if A could have impacted B. If A is not potentially causally related to B, the two can be executed without synchronization, and their ordering is unimportant. The protocol uses clabels to further relax ordering even when A and B are potentially causally related. This lets the programmer request ordering when it is absolutely needed, while getting high concurrency when it is not, either because the programmer declared the ordering unnecessary using clabels or because the two events are not potentially causally related.
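One common way to realize causal delivery is with vector clocks, shown below as an illustration of potential causality (this is not the paper's clabel mechanism, and all names are mine): a message is buffered until every message that causally precedes it has been delivered.

```python
# Sketch: causal delivery via vector clocks.
# vc[i] counts messages delivered from node i; a message is deliverable when
# it is the next one from its sender and we have seen all it depends on.

class CausalNode:
    def __init__(self, node_id, n):
        self.id = node_id
        self.vc = [0] * n
        self.buffer = []

    def send(self):
        self.vc[self.id] += 1
        return (self.id, list(self.vc))   # message = (sender, timestamp)

    def can_deliver(self, sender, vc):
        return (vc[sender] == self.vc[sender] + 1 and
                all(vc[k] <= self.vc[k]
                    for k in range(len(vc)) if k != sender))

    def receive(self, msg):
        """Buffer msg, then deliver everything that has become deliverable."""
        self.buffer.append(msg)
        delivered, progress = [], True
        while progress:
            progress = False
            for m in list(self.buffer):
                sender, vc = m
                if self.can_deliver(sender, vc):
                    self.vc[sender] = vc[sender]
                    self.buffer.remove(m)
                    delivered.append(m)
                    progress = True
        return delivered
```

Messages with no causal relation are delivered immediately in any order, which is exactly where the concurrency comes from.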

At our reading group, we discussed the main contributions of the paper and concluded that the two biggest are asynchrony and powerful primitives to program against. However, we argued that in systems that communicate a lot with the outside world, asynchrony is not that useful, since we still need to wait until all updates propagate if we want to guarantee consistency. When the problem is naturally synchronous, this approach only degrades performance. As an example, we designed a replicated queue (the example used in the paper) without using the described protocols. Our queue had one master and a set of slaves replicating it; all requests are synchronous, and the queue is strongly consistent. Since the protocols guarantee event ordering, a system built on top of them can accept requests at every replica, which is one advantage. However, we argued that such a system adds considerable extra communication overhead and would perform worse in both latency and throughput. Our design would need a separate protocol for failure detection, which could introduce small downtimes, since a failure takes time to detect.
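The alternative design we discussed might look like this minimal sketch (all names are hypothetical, and replication here is a direct in-process call standing in for a synchronous RPC):

```python
# Sketch: primary-backup replicated queue with synchronous replication.
# Every operation is applied at all backups before the primary acknowledges,
# so any backup can take over with a consistent copy of the queue.

from collections import deque

class Replica:
    def __init__(self):
        self.q = deque()

    def apply(self, op, item=None):
        if op == "enq":
            self.q.append(item)
        elif op == "deq":
            return self.q.popleft() if self.q else None

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def enqueue(self, item):
        for b in self.backups:     # synchronous replication before the ack
            b.apply("enq", item)
        self.apply("enq", item)

    def dequeue(self):
        for b in self.backups:
            b.apply("deq")
        return self.apply("deq")
```

The trade-off is the one we argued about: this is simple and strongly consistent, but all requests funnel through the primary, and failure detection and failover need a separate mechanism.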

One field we identified that would benefit from these multicast protocols is systems that don't communicate with the outside world very frequently. In that case, since performance doesn't degrade linearly with the number of client requests, the internal asynchrony, concurrency, and powerful ordering guarantees would both improve performance and make systems easier to develop. For example, we discussed scientific computing, which we believe could make good use of the provided guarantees.

However, when discussing the usefulness of the approach, we couldn't name any application built on top of these protocols. Our conclusion is that primitives which guarantee customizable event ordering and provide failure handling for free are nice to have, but we believe that, for outside-facing services, there are simpler implementations for specific situations that perform as well as or better than systems built on top of them.

The paper discusses different approaches to establishing reliable communication in a distributed system that is prone to failures. It presents some interesting approaches to broadcasting messages to all nodes in the presence of failures while also guaranteeing some ordering between the messages and failure events. The distributed system is assumed to consist of processes cooperating with each other, working on local data and communicating with others using messages. Processes are grouped, and each process in a group needs to know about every other process. Failures are assumed to be halting and are detected by timeouts, followed by a consensus protocol to avoid confusing a failure with messages simply being dropped. The authors claim that by guaranteeing some ordering as part of the communication layer of a distributed system, the software and applications built on top of it become less complicated. Previously existing broadcast protocols either have high latency or assume a complete broadcast to all nodes, rather than to some specific subset that might vary with time.

To achieve high concurrency, asynchronous broadcast is preferred: it is not guaranteed that all sites receive a message at the same time, but some ordering is guaranteed to make the outcome deterministic. The paper presents three broadcast protocols which can be used either synchronously or asynchronously. GBCAST guarantees that, among all destinations that receive the same set of messages, the ordering is always the same relative to other broadcasts. ABCAST guarantees that all messages are delivered to all destinations in the same order, which need not be predetermined. CBCAST guarantees that all causally related broadcasts are delivered in the same order, which can be predetermined. This is achieved using labels with a partial ordering similar to the logical clocks in Lamport's paper. To account for group membership varying with time, the membership info can be cached, and a number representing the current state is sent as part of each message. Assuming all processes have the same view of the group membership, messages can be either dropped or accepted depending on whether the membership info has changed since the message was sent.
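The view-number check described above might be sketched as follows (illustrative only; the class and method names are my own, not from the paper):

```python
# Sketch: each message carries the sender's cached view id; a recipient
# accepts it only if the membership has not changed since it was sent.

class Member:
    def __init__(self, view_id, members):
        self.view_id = view_id
        self.members = set(members)

    def on_gbcast(self, new_view_id, new_members):
        """A GBCAST installs the new view everywhere in the same relative
        order, so all members switch view ids consistently."""
        self.view_id = new_view_id
        self.members = set(new_members)

    def on_message(self, sender_view_id, payload):
        """Accept messages sent in the current view; drop stale ones."""
        if sender_view_id != self.view_id:
            return None            # stale view: sender must resend in new view
        return payload
```

The key point is that the GBCAST ordering guarantee makes "has the membership changed?" a question every member answers identically.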

The paper highlights the tradeoffs between broadcasting synchronously and broadcasting asynchronously for higher concurrency, while at the same time guaranteeing some ordering. The example on read and write locks provides good intuition about the impact of the different forms of broadcast.

The protocols are suitable when there are no network partitions. In case of partitions, the authors argue that communication will be blocked but not erroneous. This does not seem acceptable in scenarios like Dynamo, since the whole system would suffer, and typically in replicated-data scenarios a copy is maintained outside the data center. I am not completely sure whether the messages can be ordered once the network partition disappears. Also, the authors argue that ABCAST is stronger than CBCAST since it guarantees total ordering of all messages. Since ABCAST does not guarantee a predetermined order, CBCAST can be considered stronger from that perspective.

Reliable Communication In The Presence Of Failures

Birman and Joseph introduce a communications toolkit for distributed systems in this paper. The toolkit is mainly concerned with complete (halting) failures, where a process simply stops responding. Communication happens inside process groups using atomic broadcasts, group broadcasts, and causal broadcasts.

Flaws:

This paper seems to be missing some primitives: broadcast is nice, but what about unicast? Say, for instance, you were performing a computation on a dataset large enough that broadcast would be impractical; instead, you might want to send parts of it to individual members of a process group. Their system does allow the programmer to specify the destination processes of a broadcast message, and that set could presumably contain just one destination (would that work?), but this would add overhead not required in this case. The existence of unicast communication is at least implied: section 3.5 mentions that ISIS uses remote procedure calls as its lowest primitive. If unicast did not exist, the only option would be to broadcast any return code to every machine.

Allowing a member of a process group to learn its relative rank in the group would be useful. The i-th process could be configured to process the i-th chunk of data, similar to the MPI API.
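The rank idea could be as simple as this sketch (illustrative, not an existing API): since every member holds the same membership view, each can derive its own rank deterministically and claim the matching chunk.

```python
# Sketch: MPI-style rank derived from a shared membership view.
# Assumption: all members hold an identical view, so sorting it gives
# every process the same ranking without any communication.

def my_rank(view, my_id):
    """Rank = position of my_id in the sorted membership view."""
    return sorted(view).index(my_id)

def my_chunk(data, view, my_id):
    """Split data into len(view) contiguous chunks; take the rank-th one."""
    n = len(view)
    size = (len(data) + n - 1) // n          # ceiling division
    rank = my_rank(view, my_id)
    return data[rank * size : (rank + 1) * size]
```

When a GBCAST changes the view, every member recomputes its rank consistently, which is exactly what the MPI comparison suggests.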

There are many missing pieces in this work. For instance, they do not mention any particular value for a timeout; it is always just "reasonable." There should be a whole body of work on calculating a timeout that scales from multiple processes running on the same machine to long-distance inter-site communication.

Timeouts would be especially important for the ABCAST primitive. The implementation described in 4.3.1 states that a recipient keeps a copy of a message in a priority queue until it hears back from the sender. The time to clear that queue (via a TCP connection) would be on the order of 3 * max(RTT from sender to a member of the process group). One misconfigured network card causing slow traffic can add load to the entire system. It seems resistant to complete failures, but not to partial failures.
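One standard answer to the "reasonable timeout" objection, borrowed from TCP's retransmission-timer estimation (RFC 6298), not from this paper, is to adapt the timeout to measured round-trip times instead of fixing a constant:

```python
# Sketch: adaptive timeout in the style of TCP's RTO (RFC 6298).
# srtt is a smoothed RTT estimate, rttvar its smoothed deviation;
# the timeout tracks both, so slow links get longer timeouts automatically.

class RtoEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt):
        """Fold one measured round-trip time into the estimate."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        return self.timeout()

    def timeout(self):
        return self.srtt + 4 * self.rttvar
```

An estimator like this would let the same failure detector work for co-located processes and inter-site links, which is exactly the range the review complains about.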

Would the CBCAST primitive as described solve the vector clock example from Thursday? It does not appear so.

The performance section was abysmal. It is totally missing information about the inter-site version of their protocols. As the reliability of hardware has improved since the Sun-2 days, it would be interesting to know whether the benefits of replication still outweigh the overhead of these communication primitives.

It was disappointing that more time was not spent on inter-site communication. It has been a lacking feature since the beginning, with Grapevine.

This paper talks about implementing fault-tolerant systems using process groups. Process groups are groups of processes jointly participating in a distributed computation. In a process group, it is essential that each process knows about any events regarding the termination or addition of processes. This way, the failure of any process can be compensated for by other processes in the group (mostly on a different machine). The processes in a group communicate by broadcasting messages to all other members. The paper focuses mainly on three broadcast primitives: ABCAST, GBCAST, and CBCAST. The degree of serialization needed by an operation determines which broadcast method to use.
ABCAST, atomic broadcast, makes sure that updates reach all the nodes in the same order. CBCAST, causal broadcast, enforces ordering among causally related messages, making sure their predetermined (causal) order is followed on all nodes. Both methods, when used synchronously, enforce strong consistency but cause bad performance. To avoid this, the paper discusses asynchronous broadcasting; but purely asynchronous messaging would make the system inconsistent. So the paper takes a hybrid approach: asynchronous message delivery, with ordering enforced by flushing pending messages using the flush primitive. This provides a balance between good consistency and performance. The paper also gives a clear description of the checkpointing and recovery process: once a node comes back up, it sends a recovery broadcast message, which is received by a coordinator in the group that takes care of transferring state to the node and notifying the other nodes after recovery has completed.
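The asynchronous-send-plus-flush pattern described above can be sketched as follows (delivery is simulated with plain lists; this is not the ISIS API, and all names are mine):

```python
# Sketch: asynchronous broadcast with an explicit flush barrier.
# bcast() returns immediately; flush() blocks until every pending
# broadcast is known to have reached all members.

class AsyncBroadcaster:
    def __init__(self, members):
        self.members = members     # stand-ins for remote inboxes
        self.pending = []

    def bcast(self, msg):
        self.pending.append(msg)   # queued; the caller continues at once

    def flush(self):
        """Drain the pending queue to every member (simulated delivery)."""
        for msg in self.pending:
            for inbox in self.members:
                inbox.append(msg)
        self.pending.clear()
```

The usage pattern matches the hybrid approach: issue many cheap asynchronous broadcasts, and pay the synchronization cost only at the point where consistency must be externally visible.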
Overall, I feel that the paper gives a good overview of the intricacies involved in designing a fault-tolerant distributed system. The issues with the different broadcast methods are clearly listed, along with the need for asynchronous communication to improve performance. It is difficult to point out a limitation after reading this paper, as improving any one aspect, like consistency, clearly leads to degradation in related aspects, like performance.

The paper assigned for Tuesday by Birman and Joseph describes some low-level
communication protocols for fault-tolerant computation. These strategies
increase concurrency by ensuring that every host will apply updates in a logical
order, with failure notifications being included in that order.

One of the problems motivating this work is that, after another machine fails, a machine cannot determine in finite time whether a message the failed machine sent before dying is still on its way or was never sent at all. Failing to distinguish between these two cases could cause an update to be applied by some hosts but not by others. Therefore, for every failure that occurs, each host would need to learn what happened and agree with all other hosts on how to continue. This synchronous step is expensive.

To avoid running a synchronous protocol for every failure detected, events are ordered relative to failures and recoveries. Since this ordering is followed by all hosts in the system, no synchronous actions are needed to keep machines from arriving at an inconsistent state through an incorrect ordering. Thus, no waiting is needed to act on the receipt of an update or a failure notification. Other contributions of this paper are a systematic description and proof of atomic broadcast to a subset of hosts, rather than all of them, and an asynchronous protocol that increases concurrency by implementing Lamport's causal partial ordering to aid distributed computation and data management.

The first pass of this paper was not fun, since I read section 4 before the memo came out. One small flaw to point out is the lack of brevity, as nearly 30 pages probably were not needed to describe this idea, in my opinion. Another critique is that performance numbers are very sparse, but perhaps this was done for brevity's sake. A scalability flaw is the synchronous nature of these protocols. Even the asynchronous CBCAST needs a synchronous flush periodically to avoid inconsistencies. This is a necessary evil for complete consistency in distributed systems, but a programmer implementing a service on top of these broadcasts would have to be quite knowledgeable to know when a flush is needed. I am also confused as to how a GBCAST differs from a synchronous protocol that reaches agreement on every failure, which was the expensive approach that partially motivated this work (p. 48).

These techniques are applicable to achieving near-perfect consistency in
distributed systems, which is a desired trait of any distributed system today.
The synchronous GBCAST and ABCAST are suitable only for small systems, but
CBCAST is a good building block for a scalable and consistent distributed
solution that is tolerant of halting failures. It also further drives home the
CAP theorem, as a completely consistent system in this case is not completely
asynchronous, and it also blocks and loses availability in the face of network
partitions.