« Distributed snapshots: determining global states of distributed systems | Main | Practical Byzantine Fault Tolerance »

The Byzantine Generals Problem

L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401

Review due Tuesday, 2/15.

Comments

Summary: The Byzantine Generals Problem involves a hypothetical set of geographically-separated army generals who must decide on a common battle plan, with the caveat that some generals are traitors who will actively try to sabotage the plan. Lamport et al. formalize the parameters of this problem, show the circumstances under which it can be solved, and relate it to the creation of reliable distributed systems.

Problem: Understanding the properties of the Byzantine Generals Problem is important because the problem reflects a very pessimistic view of failures in a distributed system. Rather than consider the myriad ways that communications may fail in such a system and possibly corrupt other nodes, we assume that some nodes are actively trying to take the worst possible actions with respect to the goals of the system. These considerations are important in a datacenter environment, particularly if inexpensive and failure-prone equipment is used, because many bizarre types of failure will occur in a large system simply because of the scale. Approaching the problem in a formal, abstract way is necessary because of the ease of being mislead by a casual analysis.

Contributions: Lamport et al. describe an algorithm for reaching a consensus among generals in the face of traitors where each general sends a message he receives to each other general and a majority vote is taken among all received messages. The problem can be solved if less than a third of the generals are traitors. This method assumes that generals can modify any message they pass along, which is a realistic assumption for a datacenter environment where nodes are only indirectly connected through network hardware that may itself fail or corrupt data. If verifiable signatures are used in this scheme, any number of traitors can be handled because loyal nodes will detect that a traitor has sent conflicting orders.

Flaws: The benefit of robustness against Byzantine failures with this algorithm comes at a high cost, as each node must send a large number of messages to every other node that in turn must be sent to other nodes in a highly recursive fashion. Although the authors claim that solving this problem naturally requires high cost, this reasoning may be somewhat premature given future work that suggests more efficient methods. Of course the authors could not predict the future, but their claim is still very loosely justified.

The method for handling arbitrarily many traitors using message signatures involves a still higher cost, particularly if malicious nodes are assumed so that cryptographically secure signatures must be used. This scheme would likely not be deployed in a datacenter environment due to the expense of performing public-key operations to generate and verify signatures, especially for the large number of messages in the current algorithm.

Applicability: For distributed system designers that can anticipate failure rates in their systems, this work provides a theoretical guideline for the maximum failure rate permissible in a system intended to withstand Byzantine failures. Besides homogeneous datacenters, another context where this guideline would be appropriate is in a widely distributed system such as Condor or volunteer computing where a small number of malicious users may be expected. Given that security is always a trade-off, this work would assist in choosing a trade-off in protection mechanisms meant to guard against malicious users.

Summary

This paper introduces the Byzantine Generals Problem and proposes a oral message algorithm and a signed message algorithm to solve it. Thus, it abstracts the situation that reliable computer systems handle malfunctioning components that give conflicting information to different parts of the system.

Problem

How does a reliable computer cope with the failure of one or more components that send conflicting messages to other components?

This paper abstracts this problem as a Byzantine Generals Problem. Thus, to solve the problem requires to meet the two conditions:
+ Condition A: All loyal generals decide upon the same plan of action. What if a traitorous general may send different values to different generals?
+ Condition B: A small number of traitors cannot cause the loyal generals to adopt a bad plan. How the generals reach a decision?

Contribution

+ The problem is abstracted in terms of a group of Byzantine generals try to agree upon a common battle plan by messengers, while some of the generals are traitors. The Byzantine Generals Problem is rephrased as meeting two conditions:
- C1. Any two loyal generals use the same value of v(i) sent from the i-th general.
- C2. If the ith general is loyal, then the value that he sends must be used by every loyal general as the value of v(i).
Thus, the problem is to consider how a single general sends his value to the others.

+ This paper proves that if the generals can send only oral messages, then no solution will work unless more then two thirds of generals are loyal.

+ For passing oral messages, an algorithm OM(m) is proposed to solved the Byzantine Generals Problem, and is proved to be correct when there are more then two thirds loyal generals. OM(m) is in essence kind of consensus decision-making, and requires a majority function.

+ To make the problem easier, this paper introduces signed messages -- A loyal general's signature can be verified and cannot be forged. An algorithm SM(m) is proposed to solve the problem, and be proved to be correct without putting any restriction on the ratio of loyal generals in all generals. SM(m) is in essence to detect a traitor by checking the consistency of the order a general sends.

+ Finally, this paper extends SM(m) and OM(m) to cope with Missing Communication Paths.

Flaw

One flaw of their solution is both algorithms proposed in this paper are expensive, because both requires message paths of length up to m + 1, which involves sending up to (n - 1)(n - 2) ... (n - m - 1) messages.

Although this paper explicitly figure this flaw in the conclusion section, I would prefer to see more discussion on how to adjust the proposed algorithms in the situation that we do not need very high reliability and consistency. This is the case in some highly interactive environment where highly fault-tolerant human beings can make good decisions, e.g., to refresh the same web page if some transient errors show up.

Applicability

- The signed message idea looks like nowadays digital signature when sending email ...
- At the end of this paper, it maps the Byzantine Generals Problem to real
distributed environment. The proposed solution is somehow applicable to a
small cluster that requires very high reliability.
- As pointed out in our second lecture on Jan. 20, in addition to conventional distributed system failures like fail-stop and Network failure, Byzantine generals problem captures some sort of failures that is hard to deal with -- malfunctioning components giving conflicting information to different parts of the system.

Summary
This paper presents two solutions to the Byzantine Generals problem. The first solution uses only oral messages, but requires that more than two-thirds of the generals are loyal. The second solution uses signed messages and will work for any number of traitors. The Byzantine Generals problem is an abstraction of a system with many communication (and perhaps malfunctioning) components.

Problem
The Byzantine Generals Problem is abstractly stated as: a set of generals must agree on a common battle plan using only messages to communicate; it is known that there may be traitors trying to sabotage the messages. The loyal generals must decide on the same plan of action. Moreover, the loyal generals should not be coerced into adopting a bad plan by the traitors. More concretely, a system must be reliable even with malfunctioning components.

Contributions
The authors formalize the problem into two interactive consistency conditions, using the situation where a general sends messages to his lieutenants, who carry out the order: 1) all loyal lieutenants obey the same order, and 2) if the commanding general is loyal, then every loyal lieutenant obeys the order he sends.

Using only oral messages, this problem cannot be solved with fewer than two-thirds of the generals loyal; thus, there must be at least 3m+1 generals and no more than m traitors. Using signed messages, a solution is possible with any number of traitors. Here, we assume that signatures cannot be forged and that anyone can verify the authenticity of a signature.

Flaws
Although the authors suggest that their solutions are optimal, they are still quite expensive. If reliability requires a performance sacrifice, the tradeoff may not always be worth it. Although the authors suggest that some messages may be condensed or eliminated, they do not elaborate on specific solutions. This is a theoretical paper, so we may expect a lack of practical details, but it would be interesting if they were to comment on whether they believed cheaper solutions would be feasible with future research.

Relevance
In Section 6, the authors discuss how the solutions they present can be generalized to a real system. These algorithms capture solutions for an interesting failure case in large systems: incorrect or conflicting inputs. Systems are often designed to be robust to failures, but being resilient in the face of malfunctioning units is important as well. In general, the authors suggest that the algorithms work as described in the abstract, with failures existing at the processor or communication line. In the “signed message” setting, one can rely on fixed communication lines rather than switches to ensure direct communication. Timeouts can be used to determine if messages are absent; this also includes necessary clock synchronization. Thus, these algorithms can be used and extended by a variety of distributed systems to be robust to failures and malfunctioning unts.

Summary:
This paper outlines the Byzantine generals problem and presents several solutions to it based on different underlying assumptions, including the ability to use signed or oral messages and the connectivity of the communication graph. It concludes with a brief discussion of how this theory can be applied to real systems.

Problem:
The basic Byzantine Generals problem posits a number of generals, some of whom may be traitors, surrounding a city and attempting to decide on a unified course of action while only being able to communicate via messenger. This problem is analogous to that of a number of processes within a distributed system, some of which may be faulty or compromised, attempting to reach consensus on some aspect of a system.

Contributions:
As mentioned in the summary, the different algorithms for solving the Byzantine generals problem under different conditions really are quite useful. By dealing with a fairly simple base case first (oral messages, fully connectivity) and increasing complexity (signed messages, not fully connected network), the paper establishes the inherent limitations of any solution and gives the reader a fairly clear understanding of the issues involved in solving the problem. Furthermore, this approach clarifies the boundaries of what can and can’t be solved in a given situation (for the basic oral messages situation, fewer than a third of the generals can be traitors; with signed messages, all but two can be traitorous). The concluding discussion of how to apply the algorithms to real computing situations, while brief, is also useful in providing a practical basis for the paper.

Flaws:
On the whole, the paper is quite useful as a foundational work. Some of the practical sections could be more robust, however; A2’s demand of fixed networks (rather a more standard packet switching approach) is overly stringent (though it does note that A4 overcomes this). Some of the results are less than encouraging (for instance, the necessary connectivity of loyal generals), but this is hardly the paper’s fault, given the parameters of the problem.

Applicability to real systems:
On the whole, this paper seems to provide a useful foundation for future work on dealing with Byzantine failure. It does a good job of outlining precisely where the limits of potential solutions lie, as well as provide solutions that approach optimality (the not completely connected graph algorithm had not been proven at the time of writing). Further work needs to be done examining when and how these kind of solutions should be implemented, given their cost, and whether any optimizations exist that can make the algorithms more efficient.

Summary: Lamport discusses the interactive consistency constraints for the Byzantine Generals Problem. Procedures are presented and proven sufficient for solving the problem in complete communication graphs with unsigned messages and signed messages and in non-complete communication graphs with unsigned messages.

Problem: The problem the paper seeks to address is reaching consensus in the presence of faulty, or potentially malicious nodes. Distributed systems cannot be easily guaranteed to have all nodes properly functioning at all times, so decision processes that require uniform action amongst all properly functioning nodes must be able to cope with incorrect information from malfunctioning nodes. Lamport presents this problem in the context of the Byzantine Generals Problem. The problem requires that 1) all loyal nodes must obey the same order and 2) if the commanding node is loyal, then every other loyal node obeys the order the commanding node sends. The problem is challenging because it requires reasoning about which of many messages are correct, in some cases without being able to verify the authenticity of a sender of a message.

Contributions: Lamport's first contribution is a distributed algorithm for reaching agreement in the presence of a complete communication graph with unsigned messages and at least 2/3 of the nodes functioning properly (i.e. Lieutenants being loyal). Each of n Lieutenants acts as a commander and sends their message to the other n-1 Lieutenants. These n-1 Lieutenants then act as commanders and send their message to the other n-2 commanders. The process continues for k iterations until k = m, in which case each Lieutenant sends their message to all other Lieutenants. In the end, the Lieutenants pick the message which they have received the most. The process assumes each message is a binary yes/no. This mechanism allows the loyal Lieutenants to effectively reach a uniform decision. Lamport extends this algorithm to use signed messages, which works with any number of traitors.

Most importantly, Lamport extends the use of unsigned messages to a non-complete communication graph, i.e. every node cannot directly communicate with every other node. Although IP routing typically allows all nodes in a distributed system to communicate with all other nodes, this requires significant amounts of traffic to travel between nodes in different racks. A large distributed system will see lots of traffic in the core network. With a non-complete communication graph, messages can be primarily within a rack of machines, with messages communicated to other racks only when there is a need for have the appropriate number of neighbors to make a p-regular graph. Lamport proves there can be up to m disobedient nodes in a 3m-regular graph of message exchange.

Flaws: Lamport makes some strong assumptions on the messaging system, most notably the fact that the absence of a message can be detected. This can be difficult to achieve in real network where packets may be delayed and then delivered later. One must carefully construct mechanisms to avoid delayed messages interfering with later decisions. Ideally, the signed approach is used, and a time-specific nonce is included.

Applicability: Dealing with ill-behaved nodes is important in distributed systems that make use of low-end hardware and anticipate frequent failures. In these systems, the software must be capable of dealing with faulty nodes and making correct decisions even in the presence of incorrect information.

Summary:

This paper presents an algorithm for ensuring that all nodes in a distributed system have an equivalent view of system state in the presence of faults. The algorithm achieves this by reliably flooding messages throughout the system in manner that allows each node to come to the same conclusion about the state of the system. It makes no guarantees about whether the system state is correct or meaningful.

Problem Description:

This paper addresses the problem of distributed agreement in the presence of faults. It does this through the lens of the Byzantine Generals problem. The problem is basically a commander sending an attack or retreat message to lieutenants under the suspicion that some of the lieutenants or the commander could be traitors. The traitors correspond to nodes in a distributed system that are faulty or have been compromised by a malicious agent. This problem is important to solve because it is incredibly difficult to complete meaningful work in a distributed system if individual components can not reliably agree on the state of the system.

Contributions Summary:

The major contribution is that a system with 3m+1 nodes can reach a distributed agreement in the presence of m faulty nodes. Here faulty means the nodes can possibly not respond to queries from other nodes or they can respond with incorrect messages. The agreement in this configuration can be reached using the OM(m) algorithm that basically results in every node recursively relaying messages from other nodes to the rest of the system to establish a common ground about a single set of messages from one node in the system.

The OM(m) algorithm is extended to SM(m) to introduce signed messages to relax the size of the system to m+2 nodes. In this algorithm, each node can verify whether a message relayed through another node was actually generated by the original node. This signature information in the messages allows a solution to the Byzantine generals problem with fewer nodes.

Shortcomings:

The early discussion in this paper makes many assumptions about the guarantees provided by the underlying system, and most of these assumptions are addressed in a later section.

In the context of the OM(m), the authors assumed that the identity of each sender can be verified by the receiver. However, in the presence of faulty nodes, this can only be guaranteed if signed messages are used. This means that OM(m) really doesn’t have any advantages over SM(m) in terms of message complexity savings because OM(m) doesn’t provide a solution to the Byzantine generals problem in a real context.

Additionally, both OM(m) and SM(m) are very expensive in terms of number of messages sent. It seems that these algorithms can only realistically be applied when the system has a high probability (for some value of high) of a Byzantine fault.

Application to real systems:

This paper provides some intuition about the requirements for building a reliable distributed system that is tolerant to Byzantine faults. Specifically, it seems to present a tradeoff between adding redundant nodes for reliability and signing each message to verify message integrity. I think that choosing one design over the other would depend on the cost for extra nodes as compared to the added communication and computation cost for sending signed messages.

The Byzantine Generals Problem

Summary
The authors examine the Byzantine Generals Problem and give some algorithms to solve it.

Problem
This paper addresses the difficulty of getting working components in a distributed system to agree when some of the components are faulty and can send deceptive messages.

Contributions
The authors present the Byzantine Generals Problem as an abstraction for the problem. Here traitors represent faulty components. The authors show that no solution works unless more than two thirds of the generals are loyal, and they give an optimal algorithm for this case. They also give an algorithm that uses unforgeable signatures to cope with any number of traitors. Finally, they give modified versions of their algorithms which solve the Byzantine Generals Problem where the communication network is not fully connected.

Flaws
I felt that the discussion about approximate agreement was useful, however it did not consider other problems that could be considered approximate forms of the Byzantine Generals Problem. For example, one could have an algorithm where each component sums all the inputs it receives and limits the amount that one input can sway the final result.

I also felt that the definition of regular set of neighbors was not as clear as it could have been, especially since it is not clear if this concept is related to regular graphs.

Discussion
The Byzantine Generals Problem is unlikely to go away any time soon since we will probably always have components that can fail or be compromised. However, the problem may not be that important in real systems. Using error detection can often detect if a component is faulty. It may be useful to know when a component is faulty rather than just guaranteeing that working components agree. Instead of having every component communicate to every other one, it may be more efficient to have a central, reliable and trusted leader.

Summary:

The paper provides a solution which uses majority voting to deal with faulty components in computer system, which provide conflicting information to different parts of the system. The original problem of dealing with uncontrollable faulty components is represented in terms of Byzantine Generals Problem to arrive at an algorithm that allows loyal generals to decide upon a common and right plan of action in presence of traitors who can lie.

Problem:

Reliable computer system must operate accurately in presence of its component failures. Simple broadcast strategy cannot be employed as it cant be enforced that every message, in a broadcast, sent by faulty component reveals its faulty behavior to every node in the system. The authors extend the idea of majority voting to be applicable in an environment where faulty components can cause vote collection to be unreliable to a certain extent.

Contributions:

The formulation of Byzantine Generals Problem makes it easy to reason about viable solutions and sources of unreliability in a reliable computer system. The authors elegantly simplify the problem further by describing it in terms of interactive consistency where an order from one commander is reliably communicated to and interpreted by other generals in presence of traitors. The algorithms OM and SM presented in the paper deal with unreliability out of conflicting behavior as well as universally bad behavior by nodes in a system.
Lamport shows that in absence of message signing, less than one third of components can be allowed to fail to continue to behave properly and reliably. The recursive solution OM(m) shows how one can deal with failures in a system by using collective knowledge of the failure in the system.
Lamport also formalizes minimum connectivity necessary to solve OM(m) Byzantine Generals Problem and shows that the connectivity graph has to be 3m-regular in presence of m traitors with at least 3m+1 nodes in the system.

Flaws:
The OM algorithm presented in the paper requires message communication exponential in number of traitors expected in the system. As the total size of system and expected number of traitors in the system grows, the number of messages required to be exchanged grows significantly. The overall delay arising out of such slow, synchronized communication limits the applicability of the proposed algorithms.

Applicability:

The algorithms can be used in distributed decision making problems where every member has only partial view of the system and majority voting can afford an acceptable solution in presence of faults.

Summary:

In computer systems, a malfunction component may send conflicting messages to different parts of the system. A reliable system should be able to handle this malfunction component. Authors make two models for this problem, and provide related algorithms to solve this problem under some necessary assumption.

Problem:

Malfunction components may send conflicting messages to different parts of a computer system. In a reliable system, all other normal components should be able to avoid interference from malfunction components and make some agreement.

Authors model this malfunction component problem as a group of generals with their troops around an enemy city. Communicating among generals only by messengers. Generals must agree upon a common battle plan. But one or more traitor generals try to confuse others. Author define Byzantine General Problem as :

A commanding general must send an order to his n - 1 lieutenant generals such that:

IC1: All loyal lieutenants obey the same order;
IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.


Solution:

Firstly, authors give out 4 assumptions before they propose specific algorithms:

A1: every message that is sent is delivered correctly;
A2: the receiver of a message knows who sent it
A3: the absence of a message can be detected
A4: (a) A loyal general’s signature cannot be forged, and any alteration of the contents of his signed messages can be detected;
(b) Anyone can verify the authenticity of a general’s signature

Authors describe the situation under assumption A1, A2 and A3 as messages sent as oral messages. Authors propose OM(m) algorithm, which solves the Byzantine Generals Problem for 3m+1 or more generals in the presence of at most m traitors.

Authors describe the situation under assumption A1, A2, A3 and A4 as messages sent as signed messages. Authors propose SM(m) algorithm, which solves the Byzantine General Problem if there are at most m traitors.

Authors extend these two algorithm when the assumption that a general can send messages directly to every other generals are removed. Authors also explain how the four assumptions are reasonable in real system.

Flaw:

1. if processors are pc, and they are connected by Internet or LAN, the four assumptions made by authors are too strict. More detailed and robust mechanisms are needed to make these four assumptions more reasonable.

2. Authors’ algorithms are based on majority vote, and different processors have the same weight. Maybe a probability model can be introduced and make different processors have different weight.

Relevance:

Authors analyze the feature of real system and abstract four assumptions, and under these four assumptions, authors’ algorithms can solve the Byzantine Generals Problem. Authors give out detailed proof for the validation of his algorithms. Authors make much explanation for why these four assumptions are reasonable, and I think this is also why their algorithm are effective.

Summary
This paper investigates the situation in the distributed system that some processors may be faulty and behave unpredictably to make the system unreliable. After making some assumptions, the author models this situation as the Byzantine Generals Problem and presents solutions to the problem under two settings of communication methods: oral messages and signed messages.

Problem
The main problem is Byzantine Generals Problem:
Generals are separated distantly. They can only communicate through messenger about what they observe. Based on the messages received from other generals, they should come up with a common plan of action. However, some generals may be traitors and they may send different messages to different generals to prevent loyal generals from making right decision. To handle this situation, an algorithm is needed to guarantee the following two conditions:
A. All loyal generals reach agreement on the same plan of action regardless of what traitors do.
B. A small number of traitors can't affect the loyal generals to make a bad plan.

Contributions:
1. The paper models one type of failure as the Byzantine Generals Problem and give a clear definition of it.
2. The paper considers the setting that generals communicate only through oral messages. It proves that given m traitors, no solution exists if there are at most 2m loyal generals. However, if more than 2m loyal generals are in the army, it gives an OM(m) solution to it. The main idea is to let every general acts as a commander to run OM algorithm recursively and finally use the majority value of all received values from other lieutenants.
3. The paper also considers the setting that generals communicate through signed messages. In this case, a SM(m) is given to handle the army with m traitors and any number of loyal generals. The main idea is by maintaining the message value in a set V, the algorithm guarantees all loyal general have the same set and use the same function choice to produce the same result.
4. The paper analyzes the problems faced by the algorithms in the real system and extends the OM and SM algorithm to handle a graph that is not fully connective.

Flaw:
1. Efficiency: OM and SM algorithm involves up to (n-1)(n-2)...(n-m-1) messages, which is impractical for large distributed systems.
2. In a workload-intensive environment, a nonfaulty node may become faulty or a faulty node may become nonfaulty in a short time. The algorithm may not work in this setting because it assumes all nodes won't change loyalty during the algorithm.

Applicable:
1. According to the wiki, the algorithms are applicable to replicated state machine. With non-cryptographic hashes, 2F+1 replicas are enough to survive all non-malicious Byzantine failures.
2. As discussed in the paper, the algorithms may have many practical problems in the real system such as clock synchronization problem and the problem that not all nodes can communicate with each other. Further investigations are required to handle these problems.

Summary:

This paper aims at solving the Byzantine Generals problem. It then aims at extending the solution to this problem for a reliable computer system with several processors that are used for computing the same result.

Problem:

Leslie Lamport et al provide a solution that aims at overcoming the harm caused by the presence of a malicious entity/traitor in a communication system. A faulty or failed component in such a system could cause problems by sending conflicting messages to different parts of the system.

The first such system they consider is that of a set of Byzantine generals that surround an enemy. The commanding general needs to convey the correct orders to his lieutenants for a plan of action. This requires the satisfaction of two consistency constraints:

1. All loyal lieutenant generals must obey the same order

2. If the commanding general is loyal then every loyal lieutenant obeys the commanders order

The second system they consider is a reliable computer system containing several processors. These processors perform the same task to compute a result and the correct result is chosen by a majority vote. Some of the processors in this system could be faulty causing malicious behavior and spurious messages. This problem can be mapped to the byzantine problem described above and same consistency conditions could be applied to it.

Contributions:

One of the main contributions in this paper is that it provides a solution to the byzantine problem for two types of messaging. They provide an oral message solution (OM solution) and a signed written message solution (SM solution). The oral solution is a recursive solution in which a commander sends messages (OM (m)) to his lieutenants, each lieutenant inturn sends the messages (OM (m-1) to n-2 other lieutenants….till OM (0). In each execution the message which is chosen by the lieutenant is the one which has the higher majority vote among all messages.

This solution makes some assumptions which prevent a traitor from interfering in the communication of two generals. It prevents the traitor from introducing spurious messages. It also assumes that the absence of a message is detected to protect from traitors that prevent a decision by simply not sending messages. The one bottleneck of this solution is that it needs at least 3m+1 generals in the system inorder to detect m traitors. This is because of the traitor’s ability to lie.

The signed written message solution provides for m failures with any number of generals. Here authors extend the above solution by using signed messages. These signed messages contain an unforgeable signature of the general. The signatures can also be verified for their authenticity.

They further extend this solution to more generic graph structures where they restrict the connectivity between the generals. However the principles remain as mentioned above. Likewise they extend the same algorithm mentioned above to a reliable computer system aimed at solving a particular task. The system has different processors which compute the result and perform a majority vote on their outputs to obtain a value.

Flaws:

1. The complexity of the OM solution is O(n2 * m) and that of SM is O(n*m2) which is pretty slow and could hinder the implementation of such a solution on a large scale.

2. The number of messages sent in this system is very large. It could be a major network impediment.

3. Is there a need for having multiple processors compute the same task and share it with each other to get a consensus. Is it not a waste of resources. Would it be suffice to find a malicious node and then isolate it. For reliability we could have replication of data and checkpointing and use backup nodes to restart from point of failure.

Applications:

1. The solution to the byzantine problem is useful and can be used in defense communication systems where the threat level of an attack is high. Defense computing systems would need very high levels of reliability and a system like this could be used.

2. Similarly space research organizations like NASA which would need highly reliable and secure systems could use such a system.

Summary:
This paper formalized the problem of a distributed system dealing with malfunctioning components that could give conflicting information to different parts of the system into the Byzantine Generals Problem; then proposed the algorithm to solve this problem using both oral message and signed messages. The author also identified circumstances where this problem in unsolvable.
Problem:
A failed or malicious component of a distributed system could exhibit arbitrary behavior; in particular, it could send conflicting information to different parts of the system; and a highly reliable system must cope with this kind failure.
Contribution:
1. This paper formalized this problem into an abstracted Problem where a group of generals trying to achieve consensus in face of traitors with explicit assumptions and clear statement of the goal we are trying to achieve.
2. The author then reduce the Byzantine Generals Problem into a simpler form: the Byzantine Generals Problem where one commander is sending orders and all of its loyal lieutenants should agree on a reasonable act.
3. The author examined the situation where this problem is unsolvable: when there are no more than 2/3 of the generals are loyal, and when only oral messages can be exchanged.
4. The author proposed induction-based algorithm to solve this problem when it’s actually solvable. Both scenarios are considered: oral message exchanging or signed message exchanging. The author then formally proved the correctness of their algorithms.
5. The authors then extended their solutions to deal with missing network links and discussed how to use their solution to build reliable computer systems.
Flaws:
I think the author should have explained in more detail or perhaps formalize what’s a good plan and what’s a bad plan. Even though they proposed a method for the generals to reach a decision, it’s kind of vague because they didn’t explain what a “robust” method is.
Also, throughout this paper, they discussed how to cope with m failures but didn’t discuss explicitly what to do when one has no knowledge what m is, which is a quite possible scenario. Although I think the solution is straightforward, it would be worth mentioning.
Relevance:
Theoretically this paper is of high value because it formalized the arbitrary failure problem and proposed correct (yet expensive) algorithm to cope with that. To apply it to the real systems, we need more efficient algorithm though.

Summary
This paper provides a number of thought experiments in which a group of some loyal and some traitorous generals are collaborating. Correctness proofs for various algorithms that allow the loyal generals to come to proper agreement are given. The algorithms apply to real life distributed systems where some systems may exhibit arbitrary behavior because of a byzantine failure.

Problem
A common mistake is to assume that computers will either function properly or totally fail. Often, however, bugs will cause systems to keep operating while exhibiting incorrect behavior (minor bugs may make this behavior appear correct). An even more troublesome scenario is when some systems are compromised by an attacker that wants to avoid detection and has complete control over what the systems do.

Contributions
This paper makes a number of contributions:

1. A proof is given that the Byzantine Generals Problem (BGP) cannot be solved with oral messages when a third or more of the nodes are misbehaving. The proof by contradiction works by showing that if BGP is solvable for more than three nodes and a third or more failures, the impossible three general version of BGP can be solved by making the three generals simulate multiple generals (like virtual machines). Although impossibility proofs are discouraging, they are major contributions because they are difficult and save other researchers lots of time!

2. An algorithm is provided that shows the problem CAN be solved with oral messages when less than a third of the systems are misbehaving. The algorithm is recursive. The idea is that every node should agree on what every body else received from the commander. Thus, when the algorithm is trivially small, everybody can just use the value received from the commander, but otherwise the algorithm can be used recursively so that each node A can make the other non-commander nodes agree on what A received.

3. An algorithm for the signed message case is also given. The idea is that now nodes can give proofs of what they have received, so every node keeps forwarding proofs until each node has a proven set of what everybody else received. If it is proven that everybody else received the same command, the nodes simply obey. Otherwise, the commander is a traitor, but every every node has the same set of commands that were sent from the commander, so they can each apply the same function to that set to reach a uniform decision.

Flaws
Suppose there are 10 computers in a distributed system X. The paper provides a solution for dealing with 3 computer failures, giving the impression that system X can deal with a 30% failure. In section 6, however, it is noted that communication line failures count towards the max of 3 failures that can be handled. Since there are 90 communications lines and 10 computers, we have 100 points of potential failure. Thus, in reality, X can only handle failure of 3% of the potential failure points. This is not a major flaw, as doing better isn’t really possible, but it would have been nice if the paper had pointed this out, showing how little hope there is for dealing with this problem.

Application to Real Systems
The ideas seem most applicable to a non-malicious environment such as the flight controller example. It seems plausible that being byzantine fault tolerant would be possible in these cases, but in the malicious case, it would seem like an attacker that could compromise one node could likely compromise others.

Summary:

In this paper the authors try to implement a Practical Byzantine Fault Tolerant system. Their proposed algorithm works in asynchronous environments and is faster than the previous approaches. Specifically, their implementation of NSF runs 3% slower than the unreplicated NFS. The guarantees in the paper hold only at most [n-1/3] of the replicas are faulty.
The algorithm is a form of state machine replication and guarantees safety and liveness. The state is established and monitored through the messages that are sent using this algorithm.

Problem Statement:

The problem that they are trying to solve is that the state of the art solutions at that time for having Byzantine Fault Tolerance are not practical. The reason for impracticality is that either they make unreasonable assumptions about the environment where the protocol will be running (ex. synchrony), or that the solution is prohibitively slow.

Contributions:

  • As it is clear from the title of the paper, the main contribution is to build a system that is Byzantine Fault Tolerant but can be used in practice.
  • This system can work in asynchronous environments
  • Performance overhead seems to be not prohibitively large
  • They have not made strong assumptions about the environment that their system will run in. For example, delays and out-of-order delivery is tolerated.
  • A strong adversary model is chosen. Adversary can coordinate faulty nodes, delay communications, etc.
  • Flaws of the paper

    I am not sure if the following weaknesses can have a feasible solution or not, however, it seems reasonable to me that they should be addressed if a system wants to be practical.

  • They assume independent node failures. However, as the Professor mentioned in class with the case study in Yahoo!, it seems that node failures are not totally independent. In fact, once something goes wrong, a lot of the nodes are affected. The proposed solution to use different implementations of software seems reasonable but is hard to achieve due to high overhead in agreement, maintenance, and training costs. N version programming also has been proven to have deficiencies as some corner cases are not covered by most of the programmers. Thus, it seems that removing this assumption might make their system impractical in the real world.
  • Assuming deterministic responses from the replicas also seem to be a strong assumption. They do mention this can be solved by adding an extra phase to coordinate the responses from the replicas. However, to me this increases the overhead significantly.
  • Using public key infrastructure for signing seems to be very inefficient. They could use shared secrets that are much faster, but that would mean every 2 node should have a shared secret. Maintaining these secrets might make the solution impractical. As far as I know, in today's data centers such signature mechanisms are not used and keeping the adversary out of the data center is the firewall's or the Intrusion Detection System's responsibility. Why are they not able to assume that such a technique exists and eliminate this overhead?
  • Applications

    Availability is very important for many services today. The fact that you are able to continue to provide service even though many of your nodes are down is in some cases critical (ex. for an emergency service provider) or is strongly required to remain competitive in the market (ex. Web search service providers).

    For the same reasons, performance is very important. You should be able to respond quickly to services both when you are operating normally and when you are having some faulty nodes. Therefore, low overhead is very important for these systems. Most companies and critical service providers currently have some sort of replication in order to mitigate the availability problem, but it seems that there are not guarantees. If one could introduce a solution that provides guarantees such as the ones introduced in this paper with low overhead, it would be certainly used.

    I am not sure if there currently exists a solution for Byzantine Fault Tolerance that is being used in practice and would be interested to know that.

    Post a comment