Reliable communication in the presence of failures
Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems (TOCS), Volume 5 Issue 1, Feb. 1987.
Review due Thursday, 2/17.
« Practical Byzantine Fault Tolerance | Main | Epidemic algorithms for replicated database maintenance »
Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems (TOCS), Volume 5 Issue 1, Feb. 1987.
Review due Thursday, 2/17.
Comments
Summary: Birman and Joseph present a set of three protocols--GBCAST, ABCAST, and CBCAST--designed to maintain consistency in distributed systems in the presence of failures while allowing for asynchronism and improved performance.
Problem: Communicating information between groups of processes require communication protocols that tolerate process halts, i.e. the stopping of a process without any faulty actions taken by the failed process. Enforcing an ordering of messages received by each process in a group is important to ensure that all processes see the same ordering relative to the failure of a process and perform consistent actions. For example, an update operation issued by process A should be received and handled by all other processes in the group before a message indicating the failure of process A is handled by the processes in the group. Without such an ordering, some processes may recognize process A as failed and not commit the update, while others would commit the update and then recognize process A as failed. Birman and Joseph make the problem even more complex by desiring that communication be asynchronous to allow processes to benefit from increased communication parallelism.
Contributions: To address the problem, Birman and Joseph present the design and implementation details for a trio of protocols: ABCAST, CBCAST, and GBCAST. All three protocols are broadcast--messages are delivered to all processes in a group--and atomic--either all process in a group receive the message or no process receives the message. ABCAST is designed for situations in which data must be received in the same order at all destinations, even though the order is not predetermined. ABCAST is useful when a process group maintains replicated copies of a queue data structure. CBCAST is used to enforce a partial ordering on messages where causal relationships are significant and important for message delivery. Its enforced ordering is stronger than ABCAST, which only enforces an ordering on messages from the same sender, but CBCAST allows more asynchronism than ABCAST. Lastly, GBCAST is used to maintain a common view of which processes are in a group. GBCAST is used to send process failure messages or process recovery messages.
The contributions by Birman and Joseph differ from other works in that the protocols allow significant asynchronism. The CBCAST protocol ensures that once a message has been sent and the operation has performed locally, the process can continue, as the CBCAST protocol guarantees that all other processes will receive the message correctly and process it in the correct order.
Flaws: Implementing three different protocols is quite complex, especially when dealing with the interactions between them. GBCAST, for example, must ensure ABCAST and CBCAST message orderings are correctly preserved when GBCAST messages are delivered. There is also redundancy between ABCAST and CBCAST. It seems that ABCAST could be implemented using CBCAST by setting the clabel such that only ordering between two messages from the same source matters, and the ordering relative to all other message senders is irrelevant.
Applicability: Consistent message orderings within a process group is important for any distributed system which maintains some replicated state or data. For example, a distributed shopping cart system must ensure that two people can both purchase the last item; some global ordering needs to be enforced to ensure that only one purchase is "committed" on all shared processes. Furthermore, performance is important, and often the reason for pursuing a distributed design, for most distributed systems. Therefore, synchronous communication between processes needs to not be the bottleneck. Real world systems need to be able to send a message to other processes and move on.
Posted by: Aaron Gember | February 20, 2011 02:19 PM
Summary: Birman and Joseph describe a set of multicast protocols that have different trade-offs between performance and correct ordering that can be selected depending on the particular needs of a distributed system. These algorithms simplify the development of fault-tolerant systems, as demonstrated by the distributed objects and bulletin boards in the ISIS system.
Problem: Much of the potential performance in a distributed system comes from concurrent computations and communications between sites, but the results of concurrent communications may be worthless if inconsistencies are introduced between different nodes in the system. Users typically want high performance and require some level of consistency, so balancing message concurrency with delivery ordering is important. Furthermore, ensuring correct delivery order of messages can be highly error-prone, so it is desirable to have a communications layer that incorporates the necessary correctness properties on which developers can build higher-level applications. Previous work developed communications systems that reliably ordered broadcast messages, but they did not consider asynchronous algorithms that are necessary for high performance through concurrency.
Contributions: A key observation is that not all message orderings may be important to a particular application. Three broadcast protocols are introduced with different ordering guarantees. The most innovative of these protocols is CBCAST, based on the premise that only messages that may causally related need to be delivered in order - and then only when the ordering is feasible and the application deems it necessary. This notion of partial causality is familiar from Lamport's clocks, and users can further specify a partial ordering on messages to indicate which causal relationships are important.
These protocols are implemented on the ISIS distributed system, which the authors used to measure the performance improvements of their approach. Significant improvements are reported moving from synchronous to concurrent operations, as well as due to combining multiple concurrent messages in a single network packet. However, only a few measurements are presented and a sensitivity analysis is not performed.
Flaws: The correctness of the causal broadcast relies on the ability of the developer to determine the causal relationships between events. If message labels are used that violate causality, the CBCAST protocol decays into a system where no ordering constraints are observed. The authors acknowledge that determining these relationships concisely may be difficult. While these protocols hide some communications complexity from the developer, the key insight in this works still largely depends on the developer to be effective.
The protocol for ensuring consistent views of group membership between nodes assumes that nodes fail and recover relatively infrequently. With no measurements of the cost of this protocol, it is difficult to gage the failure rate that would be acceptable. But in a large datacenter environment where nodes may fail quite frequently, this could potentially be a problem if the algorithms have limited scalability.
Applicability: The ISIS system presents a few general applications of these protocols that would be highly relevant in many distributed systems, including managing data replication sites and choosing a node to direct a group of nodes performing a computation.
Posted by: Todd Frederick | February 21, 2011 10:30 AM
Summary: This paper proposes a few primitives for communication within fault-tolerant process groups which achieves both high level of concurrency and delivery order guarantees.
Motivation: Communication between interacting components in a distributed system can be a bottleneck. Unless well designed, we will run into problems of resource underutilization because of lack of concurrency among operations. For reliable operation, delivery order guarantees are important too. This paper attacks both the problems.
Summary of contributions: The paper proposes three broadcast primitives – the ABCAST , GBCAST and CBCAST. GBCAST (group broadcast) is for a process group and guarantees same delivery order on all sites wrt other broadcasts and failure of sender. The Atomic Broadcast ABCAST is necessarily a multicast. The CBCAST models causal communications and guarantees delivery order among other CBCAST messages. CBCAST messages a clabel, which embody causality information. All three broadcasts are atomic. CBCASTS, by nature of being (almost always) asynchronous, are the source of concurrency in the system.
A second contribution would be the thorough layer-by-layer description of how a distributed system can be realized using these primitives. Implementations for the primitives are discussed.
Flaws: The paper assumes only one operational cluster is present i.e. Does not deal with network partitions.
Evaluation criteria of the concurrent version of the protocol (of observing a calendar screen refreshing ‘essentially’ simultaneously, placing the blame on OS) leaves a lot more to be desired.
The primitives seem too specific to purpose of the system and is likely hard to generalize for other distributed computation models.
Discussion: These primitives provide a common language for distributed systems implementers. As the paper discusses a few use cases, depending on the various situations, we might go for different primitives. For cluster management, one could use GBCASTS, for DB updates CBCASTs are well suited, for distributed lock management ABCASTs could be used etc.
However, as I have stated earlier, it appears like these primitives have to be built into the underlying layer than at the application itself, which might prevent its general application.
Posted by: Srinivasan | February 21, 2011 06:46 PM
Summary: This paper develops a collection of communication protocols designed to enforce different invariants with regards to message ordering in the presence of halting failures. The authors first explain their conception of a distributed system as a hierarchical collection of clusters, then justify the goals and the mechanics of the protocols, and finally describe an implementation called ISIS.
Problem: While Byzantine fault tolerance would be nice, the authors of this paper recognize the need for a more practical and performant approach to fault tolerance. To achieve this, they allow themselves the leeway to make assumptions about network reliability and message correctness where justified and to employ the fact that a strict global ordering of messages is not always necessary.
Contributions: Birman and Joseph move the conversation beyond a homogeneous group of nodes with homogeneous messages. They realize that the reliability of communication amongst a tightly located cluster is likely to be vastly better than the communication between widely separated clusters. This leads them to view the system in a hierarchical manner, with differing techniques applied to each level to achieve and acceptable balance between performance and reliability. Likewise, they acknowledge that not all messages need all the same guarantees, and thus they present ABCAST to enforce a globally consistent ordering amongst a group of messages, CBCAST to maintain a partial ordering of all messages, and GBCAST to inform nodes about changes to the system topology.
Flaws: Though the authors claim there there exists a proof that deadlock and livelock are avoided in their system, it is not entirely clear what conditions must be true for this to hold. For example, if a network slowdown pushes the latency beyond the expected maximum levels, it would seem that the system could devolve into a swarm of continual "you are dead" messages being sent all around. Relatedly, is it possible that such a system could result in cascading failures in which the failure of some number of processes or sites can effectively doom the remaining processes? This issue does not seem to be directly addressed.
Applicability: Most complex systems that we employ today require some amount of tuning to realize near-optimal behavior, and this paper seems to be a pioneer in the acknowledgment that the nature of our protocols should be heavily based on the structure of the system and the nature of the communication between nodes. That is, a one-size-fits-all approach to fault tolerance (as we saw in the approach towards the Byzantine generals) will inevitably result in an impractically and unnecessarily slow system. This research demonstrates that we are better served to employ reasonable assumptions and application-specific requirements in order to reduce the burden of accounting for every possible circumstance.
Posted by: Rich Joiner | February 21, 2011 07:11 PM
Summary: This paper introduces us to the concept of fault tolerant process groups which are very similar to replicated state machines. The idea discussed throughout the paper is about the design of a set of communication protocols which allows the system to remain in a consistent state even in presence of halting faults, which are process terminations without introduction of incorrect behavior. The three protocols described are GBCAST, ABCAST and CBCAST. They allow ordering of messages relative to failures or recoveries and also relative to other messages flying around in the system.
Problem: In a distributed environment, a set of replicated processes exchange messages to keep the system state consistent. But due to independent process failures, a portion of the processes can become inconsistent. This can be solved using synchronous agreement protocols in events of failures but this can be very slow. Other means could be to discard messages from failed processes, but there is no guarantee that all the recipient processes would be aware of the failure and would discard any received message.
Contribution: The approach described in the paper includes use of broadcast mechanisms that order messages relative to failure or recovery events such that the all the processes have a consistent view of the system state and would perform same operations in same sequence in logical time. This order need not be total order, only partial order is sufficient. Each process maintains a local copy of the clusters view. Among the operational members there is one view manager who is responsible for sending GBCASTs in event of process failures/withdrawal and addition of new processes to the other operational members to update their view of the system. ABCASTs can be sent by any process to other processes when it wants to make sure that all the other processes perform the required updates in the same order. CBCASTs like ABCASTs are sent to maintain order, but in this case the order is maintained across different update operations which could be causally related. Unlike GBCASTs and ABCASTs, CBCASTs are always invoked asynchronously thus increasing the concurrency in the system.
Flaws:
Applicability: This work is highly useful for design on fault tolerant distributed systems that try to achieve high levels of consistency in system state without compromising concurrency using asynchronous means of communication. Systems like Amazon Simple DB which is a highly available key-value store can be implemented using protocols like the ones described in the paper.
Posted by: Paras | February 21, 2011 08:11 PM
Summary
In this paper, the authors present a set of broadcast primitives for fault-tolerant communication in a distributed system. Using three primitives, they are able to have asynchrous communication and high concurrency.
Problem
Communication in a distributed system must be robust to failures and changing membership. It is necessary for nodes to have some protocol to allow concurrency while maintaining consistency. In particular, maintaining consistency in the face of failures (in this paper, halting failures), can be challenging. Synchronous agreement protocols exist, but limit concurrency and can be expensive in practice.
Contributions
The authors use the model of process groups, rather than broadcasting to the entire set of processes. They introduce 3 broadcast primitives. Group broadcast (GBCAST) guarantees that all GBCASTs are delivered in the same relative order at all destinations and that failure GBCASTs be delivered after all other messages from that sender. These are used for messages that convey changes to the global property of the group, like membership changes or failures. Atomic broadcast (ABCAST) guarantees that all destinations receive the message or none do, and that messages will be delivered in the same order to all destinations. This behavior is useful, for instance, for a replicated queue, where the order must be the same for consistency. Causal broadcast (CBCAST) further guarantees that broadcasts are delivered in the same (pre-determined) order at overlapping destinations. The authors demonstrate how a computer cluster can implement communication using these abstractions in a way that allows for concurrency in most cases and still guarantees consistency.
Flaws
The systems described in this paper are restricted such that in the case of failure, there is only a single set of operational sites. However, this precludes the possibility for two partitioned but internally consistent sub-clusters. While this case is certainly more difficult to handle, it also seems likely that distributed systems may want a mechanism to keep more than one active partition in the case of failures. Also, in light of the Byzantine failures we have read about previously, it seems limiting that the paper is restricted to halting failures only. Certainly dealing with other failures is complicated and perhaps completely beyond the scope of this work, but a mention or reference to work that addresses these other failure models would have made for a more complete argument.
Relevance
The systems for which such an approach applies, specifically replicated state machines, are of great importance in distributed computing. This paper supplies a set of protocols that can be used to maintain consistency and concurrency in a straightforward model. The use of asynchronous communication, in particular, can be useful in these systems.
Posted by: Emily Jacobson | February 21, 2011 09:11 PM
Summary:
This paper designs and implements a set of communication primitives for supporting distributed computations in an environment where failures could occur.
Problem
Since increasing concurrency generally improves performance in distributed system, authors try to propose a set of communication primitives to fulfill the following requirements:
a. respecting event-ordering constraints specified by the computations in each process;
b. simplifying higher level code;
c. improving the efficiency of distributed system.
Solution
Authors consider distributed system is composed by a collection of processes possessing local states and communicating by messages. The communication networks are structured hierarchically into clusters of local sites that do not experience internal partitioning, interconnected by long-haul communication links, which may fail but can be reestablished rapidly. The failure authors consider are halting failures, which means that a process stops executing without performing any incorrect actions. Processes use time out to detect the failure of other processes.
Under the above assumption, authors propose definition of fault-tolerant processes group, which consist of a collection of processes that are cooperating to perform a distributed computation. Authors formalize the behavior of fault-tolerant process group by defining three broadcast primitives:
Group Broadcast(GBCAST): GBCAST satisfies the following ordering constraints: First, the order in which GBCASTs are delivered relative to the delivery of all other sorts of broadcasts is the same at all overlapping destinations. Second, a failure GBCAST is required to be delivered after any other messages sent by the failed process.
Atomic Broadcas(ABCAST): the order in which data are received at a destination must be the same as the order at other destination, even though this order is not determined in advance.
Causal Broadcast(CBCAST): Besides order in which broadcasts are received in the same order at overlapping destinations, CBCAST guarantees that the order is the same as some predetermined one. The ordering properties of CBCASTs can be defined as: a) if B precedes B’, the same process p sends B before it send B’; b) if B precdeds B’, B is delivered at SENDER(B’) before B’ is sent.
Flaw:
1. authors only consider halting failures. But in distributed system, a abnormal or malicious process may send messages to other processes. A robust methods should be proposed to handle these malicious processes.
2. time-out is used to detect the failure of processes. When status of network changes, time-out may report failure wrongly. And time-out cannot detect the emergence of non-halting failures.
Applicability:
Primitives proposed in this paper focus on how to keep consistency of message received by processes in a group. Consistency is reflected in three aspects: 1) messages are received by all processes in the group; 2) messages are received in the same order; 3) orders are equal to the predetermined order. Consistency is key property to keep a distributed system running correctly.
Primitives proposed in this paper are abstract of consistency requirement in distributed system. These primitives isolate consistency implementation from specific communication protocol, and simplify higher level codes.
Posted by: Linhai Song | February 21, 2011 09:43 PM
Summary
This paper deals with communication between processes in a fault-tolerant process group. 3 types of broadcast protocols are proposed and implemented which allow for globally consistent message delivery and ordering, even when individual processes may exhibit halting failures and the underlying network is asynchronous and may drop or reorder messages.
Problem
In many distributed systems, it is desirable that nodes agree on a consistent global state of the system, and that changes initiated at one node are reflected at all other nodes in exactly the same way. In a real-world system one often encounters halting failures where a node or a network link fails. Furthermore, real-world networks are asynchronous and may drop messages or reorder them, and in general provide no inherent guarantees about the reciept or ordering of sent messages. It is desirable to have messaging protocols which provide certain guarantees about ordering and atomicity of broadcasts (i.e. a message is recieved in the same order at all live nodes, or it is not received at any). In this formulation, the failure of a node is also a sort of message, and all remaining nodes should "notice" the failure in the same order relative to messages from that node.
Contributions
The paper introduces 3 types of broadcast messages: GBCAST (group broadcast), ABCAST (atomic broadcast), and CBCAST (causal broadcast). Note that all protocols broadcast to a group, and all are atomic (they all deliver messages to all processes in a group or none of them). GBCAST is used to maintain views of process groups, which encapsulate membership in the group and the group's global properties at a particular logical time (process failure and recovery messages are sent via GBCAST). ABCAST is used in situations where the ordering of recieved messages must be the same at all destinations, but the order need not be determined in advance. CBCAST is used where an application's semantics require some delivery ordering to be enforced (although it need not be a complete ordering; CBCAST allows the clabels used for ordering to be comparable or non-comparable at the application's discretion). Synchronous and asynchronous uses of the broadcast mechanisms are allowed.
Implementations of the protocols, including the management of site views and a process monitoring service, are detailed. The correctness of the protocols in the face of halting failures is shown. Some optimizations for performance, which nonetheless maintain correctness, are given. Additional optimizations for multiple sites connected over a WAN are also given.
Applications using the broadcast protocols, including replicated fault-tolerant program objects called resilient objects, updates of replicated data, lock management, and coordinator-cohort computations on the ISIS system at Cornell are discussed.
Flaws
The authors note that their protocols have a tendancy to block in the event of a network partition. They also note that if failures are very frequent, it is possible that the site view protocol (implemented with GBCASTs) never terminates.
Another important consideration, not mentioned in the paper, is that the protocols are only tolerant of halting failures. They assume that all nodes implement the protocol correctly and simply stop responding in case of failure. Certain Byzantine failures could, however, break the protocol and introduce inconsistencies. For example, in the view management protocol, a site experiencing a Byzantine failure could acknowledge two contradictory views from two different view managers and prevent the other nodes from ever agreeing on a view, which in turn would prevent the normal operation of any of the messaging protocols.
Finally, as a practical consideration, efficient use of CBCAST requires that an application developer carefully think about what the actual potential causal relationships between messages are. It is easy to make a mistake and under-constrain ordering (possibly leading to errors), or conversely to over-constrain it (leading to suboptimal performance).
Applicability
Consistent, atomic messaging protocols have many useful applications in distributed systems, as the authors demonstrated in ISIS. In particular, it is very useful to have such protocols which can be used to keep a shared state consistent across many nodes, even in the presence of failures and an unreliable network. It is important to realize the limitations of the presented protocols with regard to Byzantine failures, however.
Posted by: Craig Chasseur | February 21, 2011 10:01 PM
Summary:
This paper explains a few techniques to provide for fault tolerant communication in a distributed system using a family of reliable multicast protocols.
Problem:
The paper aims at solving the problem of providing a communication mechanism in a distributed computing environment where there is a possibility of failure. The paper mainly aims at handling halting failures. The communication mechanism that is recommended is designed with the goal of providing an event ordering of the system to a group of processes, which form a fault tolerant process group. Such an event ordering could help in making immediate updates and in quick recovery from failures.
Contribution:
One of the primary contributions provided by this paper is a logical approach to failure detection. Instead of detecting failures using timeouts, the system uses a protocol by which a process reaches an agreement with other processes regarding the occurrence of a failure event. One of the benefits of this approach is that once a process detects failure it will not communicate with any other process that is not consistent with this failure information.
The provision of broadcasts within fault tolerant process groups is another important contribution. This helps in ordering changes to the group properties with respect to the current broadcasts in the group. The broadcast primitives are highly concurrent and work in local and wide area networks. The broadcast primitives that are considered are group broadcasts, atomic broadcasts and causal broadcast. The group broadcast protocol is useful when it is needed to inform members of the group regarding the failure, recovery or withdrawal of another member. It helps in maintaining the global snapshot of membership and the global properties. Atomic broadcasts are useful for applications where the order of arrival of data at a destination must be the same as the order in other destinations. Causal broadcasts are used in those applications where the outcome of one broadcast affects the outcome of the other, in other words they have a causal relationship.
The other important contribution is that the failure detection techniques provided for local area networks can be implemented for wide area networks. They do it by letting a process that monitors an external process to get failure information from a monitoring unit in the external process’s’ cluster. They also make sure that any local process knows about the broadcast protocol before any external process. This is optimization which helps in picking a new coordinator in the same cluster as the old one is case of a failure.
Applications:
1.> This method of fault analysis is very useful, the ordering of events is very important in updation of replicated data. One must make sure that the updates occur in the same order across all the copies.
2.> Broadcasts with a fault tolerant process group is like multicasts, which is again useful in today’s networks.
Flaws:
1.> The system tends to block in the case of partition failures.
2.> The paper assumes the frequency of failures and recoveries will be restricted and that stability could be achieved after sometime. However this may not be the case.
Posted by: Vinod Ramachandran | February 22, 2011 01:20 AM
Summary:
This paper explains a few techniques to provide for fault tolerant communication in a distributed system using a family of reliable multicast protocols.
Problem:
The paper aims at solving the problem of providing a communication mechanism in a distributed computing environment where there is a possibility of failure. The paper mainly aims at handling halting failures. The communication mechanism that is recommended is designed with the goal of providing an event ordering of the system to a group of processes, which form a fault tolerant process group. Such an event ordering could help in making immediate updates and in quick recovery from failures.
Contribution:
One of the primary contributions provided by this paper is a logical approach to failure detection. Instead of detecting failures using timeouts, the system uses a protocol by which a process reaches an agreement with other processes regarding the occurrence of a failure event. One of the benefits of this approach is that once a process detects failure it will not communicate with any other process that is not consistent with this failure information.
The provision of broadcasts within fault tolerant process groups is another important contribution. This helps in ordering changes to the group properties with respect to the current broadcasts in the group. The broadcast primitives are highly concurrent and work in local and wide area networks. The broadcast primitives that are considered are group broadcasts, atomic broadcasts and causal broadcast. The group broadcast protocol is useful when it is needed to inform members of the group regarding the failure, recovery or withdrawal of another member. It helps in maintaining the global snapshot of membership and the global properties. Atomic broadcasts are useful for applications where the order of arrival of data at a destination must be the same as the order in other destinations. Causal broadcasts are used in those applications where the outcome of one broadcast affects the outcome of the other, in other words they have a causal relationship.
The other important contribution is that the failure detection techniques provided for local area networks can be implemented for wide area networks. They do it by letting a process that monitors an external process to get failure information from a monitoring unit in the external process’s’ cluster. They also make sure that any local process knows about the broadcast protocol before any external process. This is optimization which helps in picking a new coordinator in the same cluster as the old one is case of a failure.
Applications:
1.> This method of fault analysis is very useful, the ordering of events is very important in updation of replicated data. One must make sure that the updates occur in the same order across all the copies.
2.> Broadcasts with a fault tolerant process group is like multicasts, which is again useful in today’s networks.
Flaws:
1.> The system tends to block in the case of partition failures.
2.> The paper assumes the frequency of failures and recoveries will be restricted and that stability could be achieved after sometime. However this may not be the case.
Posted by: Vinod Ramachandran | February 22, 2011 01:22 AM
Summary
This paper discusses communication primitives,their working and correctness to provide consistency in a fault tolerant distributed system where failure is a norm.
Approach:
The main idea of the paper is that - every distributed computation can be represented as set of events and there exists a partial order between these events. By enabling every process to deduce the event ordering that will be observed by other processes permits to carry out distributed computation in a consistent way.
Description of the problem.
In a distributed system components fail. There may be a possibility of different systems detecting different order of events performed by the failed components, given that failing is also considered to be an event. For suppose a failed event of a component may reach another component before any other messages sent by the failed component before it failed. Two approaches can be followed here 1) Operational components can run a agreement protocol and decide on the sequence of the events to be carried on - but this approach is very slow due to its synchronous nature and that every component has to run this protocol. 2) Discard the messages once a component learns about failure of a sender - but this can introduce inconsistency in the system since different components see different ordering of the events. In this paper Birman and Joseph talk about a broadcast protocol that order messages relative to the failure and recovery events (using the causal relationship idea) such that all the replicated systems see the same sequence of events. Also previous approaches didnot consider fault tolerant applications that are highly asynchronous and concurrent in a systematic way.
Assumptions
1) Failures are halting. They have fail stop behaviour and all the data is lost at the failed site 2) Failure of a process at a component is generally detected using timeouts or even if the component is operational after a process fails - it is assumed that all the interested parties are notified about this failure. 3) Protocol blocks in situations of partitions 4) Network is generally hierarchical where clusters which are local do not experience any partitions and communication links between remote may go down but can be re-established easily. 5) All the broadcast messages are given a unique ID.
Contributions of the paper
1) I guess this is first paper among the papers we read till now to deduce a solution to a distributed system’s problem using layers of networking protocol stack. It provides entire communication subsystem - message transport layer , site view management, broadcast primitives, data structures and facilities for communication among process groups.
2) It takes a logical approach towards a failure - when a process detects failure rather than acting independently it tries to reach a agreement with other processes to order the failure event with other events ( - it gives a freedom to order events before and after the failure)
3) GBCAST : A broadcast protocol to maintain process group view by updating all the members about the group changes. Two properties of this primitive are : a) order in which GBCAST’s are delivered relative to other broadcasts is same at all overlapping destinations. b) failure GBCAST’s are delivered after all the messages of the failed process are delivered.
4) ABCAST : A broadcast protocol to deliver the messages having same label in same order at all the overlapping destinations. This protocol is mainly used to maintain a replicated data structure.
5) CBCAST: Above two primitives provide consistent ordering of events at all the destinations but some times this ordering has to match with a predetermined ordering. This primitive tries to order events with a causal relationship. This primitive is almost invoked asynchronously.
6) Coordinator/Cohort scheme to update state at recovering sites.
7) Provides a communication subsystem : a) Inter-site communication layer ; It converts halting failures and admissible communication failures into site view abstractions by having ack’s for every message sent, having a process monitoring system to detects process failures and periodic hello messages to detect site failures. b)site view management protocol follows a two phase protocol to to maintain a consistent membership information among all the processes.
Flaws in the paper
I dont see any major flaws in the paper. There is so much of information passing around because of this protocol. but they are required for the correctness of the of the protocol. Also they tried to optimise in most of the cases - they maintain SEND_TO flags to make sure messages are sent only once, pointers of the messages are sent rather than messages itself, clubbing all the messages in a single large message to reduce traffic.
The primitives presented here block in case of network partitions which may not be proper in some cases. Also malicious machines may some time take position of view manager by assigning themselves highest id and can disrupt the system. Also since this paper takes different approach for solving the problem adapting existing systems to this new protocol may take time or may even be unwelcoming.
Applications:
As the paper suggests these protocols can be used to for updation of replicated data and managing locks on replicated data. Also depending upon the application requirement only required broadcast primitives can be used.
This paper has got 1038 citation which shows the importance of the concepts in the paper.
Posted by: Pratima Kolan | February 22, 2011 01:54 AM
Summary:
The paper introduces a family of fault tolerant broadcast communication primitives namely A(tomic)BCAST, G(lobal)BCAST and C(ausal)BCAST for distributed system. These primitives obey desired ordering constraints among them and guarantee atomic delivery of messages to multiple processes. Paper describes how distributed system design can be simplified by providing a communication interface that provides synchronized communication in the system.
Problem:
Distributed systems use message communication to achieve fault tolerance, atomicity of operations and logical ordering of events in the system. These have been traditionally achieved by synchronously performing the operations since processes have little control over communication channel. This makes the system inefficient and also overly complicates the system design. The communication interface presented in the paper subsumes all the desired properties of message communication and thus simplifies the system design.
Contributions:
Paper describes how synchronization of operations in distributed systems can be achieved by underlying communication interface which provides synchronized message communication among participating processes. The communication guarantees provided by the interface allows processes to perform operations asynchronously as the messages are guaranteed to be reach all operational processes atomically and in strict order dictated by the primitives.
The paper also shows how integration of process monitoring service, which detects failures in the system, with the communication interface allows all the processes to have consistent view of the system. The notion of fault tolerant process group helps to keep message overhead in the system under check.
The communication interface uses multiple priority queues for ABCAST, and uses causality to constrain CBCAST delivery. This isolation allows these protocols to achieve high levels of concurrency and hence better performance. The notion of 'clabel' for partial ordering of related events at any process permits greater control to the programmer over message synchronization and can exploit greater concurrency.
Since the communication layer takes care of atomicity, synchronization and fault tolerance, system design becomes very simple.
Flaws:
The communication facility presented does not work in presence of network partition.
The system requires elaborate state information at each site e.g. PBUF buffers, view members of all views for garbage collection. Besides, the computationally expensive operations carried out, before a message can be put into a process' delivery queue, might not be appropriate for distributed systems where message traffic is too high due to greater synchronization operations.
Applications:
The communication primitives introduced in the paper can be used to implement any distributed system where process synchronization requirement is considerable. Distributed databases, replicated data servers, distributed resource sharing are few examples of distributed systems where such a communication facility is much desired.
Posted by: Sandeep Dhoot | February 22, 2011 02:31 AM
Summary:
This paper presents three main primitives (Group Broadcast, Atomic Broadcast, and Casual Broadcast) for clusters of local sites to both tolerate halting failures and attain high levels of concurrency and consistency.
Problem:
Consider a process broadcasts its update of replicated data to all data managers and fails immediately. Since the delay of messages, some data managers may receive the update and detect the failure, while others may not receive the update before detecting the failure. To address this problem, synchronous methods guarantee the consistency but performance is bad. Discarding messages received after failure detection is efficient but inconsistencies arise. We require an efficient way to make sure the update is performed on all data managers atomically to maintain a consistent view of data.
Contributions:
This paper uses three broadcast primitives to ensure that each process in the process group has the consistent order of events such that data managers can take appropriate actions immediately without coming to an agreement with others.
1. GBCAST is used to inform group members whenever group status is changed. Usually, every process caches group membership and updates it when it receives GBCAST messages. Synchronization mechanisms such as read lock and flush primitive are used to guarantee GBCASTs are ordered relative to other broadcast messages and failure GBCASTs are delivered after every message from the failed process.
2. ABCAST maintains priority queues for each process to buffer messages before delivering them. For every message, through two broadcasts from the sender to recipients, all recipients assign the same priority to the message so that messages have the same order at each recipient.
3. CBCAST is used to ensure the casual related messages are delivered in the desired order: A buffer is utilized in each process to record the copies of messages processed by it. Before transmitting a message, all messages in the buffer that precedes it are sorted and formed a transfer packet to transmit to the recipient.
Flaws:
1. It may not scale well in wide area network because the co-operations among three broadcast primitives are too complex and produce lots of copies of messages. Congestion control and flow control in the wide area network require more optimizations for these primitives.
2. To further tolerate Byzantine errors, it seems hard to extend this paper to support that. For example, for site view management, the view manager requires all the acknowledgments from sites are positive to commit the current site view. If one site is a liar and replies negative acknowledgments with wrong change, view manager will behave incorrectly.
Applicable:
This paper can be well applicable to replicated state machines in local cluster because it avoids expensive synchronizations for fault tolerance by enabling the same total ordering and casual ordering of events in each process.
The CBCAST and ABCAST are extended in the ISIS toolkit, which is used in many applications like New York Stock Exchange.
Posted by: Weiyan Wang | February 22, 2011 02:38 AM
Summary:
The paper extends the notion of Causal and Total Ordering as applicable to synchronous systems over asynchronous communication to achieve fault-tolerance, high consistency and concurrency.The paper achieves this by introducing three fault-tolerant reliable multicast primitives viz CBCAST (Causal multicast), GBCAST and ABCAST (totally ordered multicast) that guarantee atomic delivery of messages to all the processes in a process group.
Problem:
To me, The problems that the paper tries to address are two-fold
1) The notion of process-groups (introduced by Cheriton et al, 85) as a basic building block for distributed systems lacked ordering or reliability properties. (as seen in V). V wasn't fault-tolerant as well.This paper aimed at extending this notion of process-group by adding communication reliability and ordering properties and demonstrate the ability to support fault-tolerant groups.
2)The existing systems tried to achieve some degree of fault tolerance and consistency by performing the operations synchronously - which was highly inefficient and complex. This paper aimed at simplifying such a design without compromising on consistency.
Contributions:
One of the fundamental contributions of this paper (and ISIS) which have had a huge impact on the virtual consistency model is its logical approach to failure detection (as opposed to the timeout mechanisms). The system uses a failure detecting service. Many systems developed during that period had some sort of failure detection modules but ISIS used its membership module throughout and one can argue this membership protocol to be some form of a fault-tolerant agreement (consensus) solution.
This paper also demonstrated that CATOCS systems with high degree of fault-tolerance are practical.
Flaws:
> The failure detection approach (later ported to ISIS)isn't tolerant to malicious behavior, any mistaken failure detection could force a node to dropout of the system and then rejoin. ISIS even allowed any process to eject any other process suspected as faulty.
Guess, malicious behavior wasn't their primary concern
>The paper just consider Halting failures and blocks for partitioning failures.
Applications:
> Was used in ISIS, the first view-synchronous group communication system.
>Can be used to implement consensus protocols
Posted by: Rohit | February 22, 2011 07:47 AM