The Byzantine Generals Problem
L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.
Review due Thursday, 2/23
« Practical Byzantine Fault Tolerance | Main | The Part-Time Parliament »
L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.
Review due Thursday, 2/23
Comments
1. one or two sentence summary of the paper
The paper introduces what is the byzantine generals problem and presents several solutions to the problem under various assumptions (using only oral messages or using unforgeable written messages / all generals are fully connected or generals are connected under certain conditions) and shows that how to apply them to implement reliable computer system.
2. A description of the problem they were trying to solve (and how the problem solved)
One important motivation for the paper is to guarantee the reliability of computer systems in which malicious attacks and software errors may cause faulty nodes to exhibit arbitrary behavior (e.g. typically give conflicting information to other nodes). The paper first abstracts this situation to the byzantine generals problem which is to find an algorithm to ensure that the loyal generals will reach agreement. After obtaining solutions of the byzantine generals problem, they are applied to reliable computer systems.
3. A summary of the contributions of the paper
First, this paper draws our attention to a common failure in computer systems, that is certain components in the systems may send conflicting information to other parts of the system. Since this problem is important and will become increasingly significant in the future considering the malicious attacks and software errors are increasingly common to cause such faulty component in the system, the research of problem is really worthwhile. Second, the paper abstracts this problem as the byzantine generals problem, which is easier to discuss and obtain solutions. And the paper also discusses how to apply the solution of byzantine generals problem to ensure reliable computer systems. Third, the paper proposes algorithms to solve the problem under different hypotheses. (algorithm OM for the case where generals use only oral messages; algorithm SM for the case where generals use unforgeable written messages; and without the assumption that a general can send messages directly to every other general, these algorithms are extended by applying undirected graph).
4. The one or two largest flaws in the paper.
First, although the paper presents theoretically correct solutions to the byzantine generals problems, these solutions are expensive in both the amount of time (lieutenants have to wait for the messages from other lieutenants to make the decision) and the number of messages required (all lieutenants need receive messages from the commander and all other possible lieutenants), so that it may not be practical to be used in real systems. Second, although OM and SM algorithms are extended without the assumptions that a general can send messages directly to every other general, certain strong requirement for connectivity (e.g. n-regular graph requirement) remain which may make the algorithms not feasible.
5. A discussion of how the ideas in the paper are applicable to real systems.
Solutions to the byzantine generals problems may be used in wireless sensor network where nodes may often exhibit byzantine behaviors. Actually, the data fusion (data from different nodes in the network need to be fused) in WSN is an important topic which is similar to the byzantine generals problem that generals need to reach agreement.
Posted by: Junyan Chen | February 23, 2012 07:59 AM
This paper discusses about ideas of making a distributed system resilient to a new class of failures called "Byzantine Failures". The paper treats the problem theoretically and proves the "existence" of solutions to this relatively new (during the time of the paper) problem under various assumptions. The paper also discusses about the practicality of these theoretical solutions.
Byzantine failure is possible in any distributed system with multiple individual control processes where any individual process can arbitrarily become faulty and participate in any distributed protocols of the system. Such failures can happen unintentionally due to buggy processes/hardware or intentionally by malicious processes introduced in the system. The aim of this paper is to design distributed systems resilient to these failures or in other words, achieve "correctness" even in the presence of such faulty nodes (correctness referring the state of the system in the absence of such faulty nodes).
The major contribution of the paper is formalization of the Byzantine Generals Problem (BGP) which can be solved if the following conditions are satisfied in the presence of disloyal generals.
When the commander sends orders to all lieutenants,
IC1: All loyal lieutenants must obey the same order (specifically when commander is faulty)
IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.
Here, loyal lieutenants in the distributed system's context is equivalent to "non-faulty" processes. These are the conditions any solution for BGP needs to satisfy.
Another major contribution is the two theoretical solutions for handling byzantine failure. One where plain text messages are exchanged between the generals and another using cryptographically signed messages. The basic idea behind the first solution is that the non-faulty nodes communicate only the "majority" of orders (retreat or attack) received and tries to arrive at a valid consensus, where valid being the consensus arrived satisfies IC1 and IC2. This idea inherently requires at least (2/3rd of generals + 1) generals (or communicating processes) to be non-faulty to satisfy the conditions. In the second solution, they extend this first solution with the aid of signed messages. In both these solutions, the authors assumed that every general needs to communicate with other generals. These solution demonstrate on how any solution should approach to solve the BGP. They also add that by removing or restricting the communication topology of these processes helped in achieving better results.
Though the paper formalized the BGP (and was one of the early paper to address the problem in such detail), it failed to address some of the simplifying assumptions used in the paper. The paper assumed that there are only "finite" or very small vocabulary used in the messages. For example, {"attack", "retreat"} or in other words 0 or 1, would be a serious limitation to the capability of the system to communicate. Intuitively, the complexity of the solution magnifies with a larger set of possible commands or protocol messages. Also, though they try to map the theoretical solutions suggested in the paper to real systems, the solution still assumes certain aspects of the system that is not practical (as pointed out in the other paper). For example, bounded round-trip time for all network packets, this in turn makes it difficult to use timeout mechanism as a proxy for detecting failed nodes. Specifically, in the signed message solution discussed in this paper, it is possible for a malicious process to keep sending in messages to all processes making the "non-faulty" processes to not terminate (after a timeout).
In general, though the theoretical foundation provided by the paper is essential for understanding of the problem, the solutions suggested are very inefficient (as admitted by the authors) and still remain impractical for many above mentioned reasons.
Posted by: Venkatanathan Varadarajan | February 23, 2012 07:57 AM
This paper introduces a set of algorithms to address the "Byzantine Generals"
problem -- essentially, how to handle failures of a non-halting variety, such
that a node may continue operating and communicating, but in an incorrect (and
thus potentially malicious) manner. The algorithms presented deal with this
problem under varying circumstances and assumptions about the available
communication primitives.
The requirements of the problem (in terms of computer systems rather than
actual Byzantine generals) are that: (a) all correctly-functioning nodes must
ultimately agree to act on the same outcome of a decision, and (b) that
decision must be the "correct" one as determined by a given deciding node,
assuming the deciding node is also function correctly. The authors first
address the problem with messaging primitives that offer reliability,
sender-identifiability, and detectable failure; with this foundation they
prove that the problem is solvable only if the number of malfunctioning nodes
is strictly less than one third of the total number of nodes, and present a
"majority-rules" algorithm for doing so.
In the next variation, the authors introduce authenticated messages -- each is
sent with the unforgeable signature of its sender. Intuitively, this makes
the problem much easier; the authors demonstrate this by showing that with
this assumption, the bounds on solvability can be relaxed to allow any number
of malfunctioning nodes. The algorithm they present for this situation echoes
each message seen by a node to each other node that has not already received
it, with chained signatures such that the entire path traversed by a given
message can be verified. As long as the deciding node is functioning
properly, these are all re-signed duplicates of the original decision; if not,
each node may end up with a set of conflicting outcomes (though all will have
the same set). If this occurs, the nodes all select one of them via a
predetermined choice function (which the paper fails to mention must also be
deterministic in order to satisfy their IC1 condition).
The final variations considered are with a less-than-fully-connected
communication graph. The paper gives algorithms for this situation with both
authenticated and unauthenticated messaging. The algorithms are based on the
fully-connected ones and are given in terms of the graph connectivity (in the
unauthenticated case, for the ability to route around malfunctioning nodes)
and diameter (in the authenticated case).
As acknowledged in the paper's conclusion, the biggest problem with the
algorithms presented is their expense in time and communication. They give
the messaging complexity as (n-1)(n-2)...(n-m-1), which I believe adds up to
O(n^(m+1)) communication complexity, if I haven't botched my math.
Additionally, the algorithms are all given in terms parameterized by the
number of malfunctioning nodes they can defend against. In a real system, if
one wants to handle the worst case, one would have to accommodate all but one
node malfunctioning, assuming message authentication is available. (With
m=n-1, I guess this would lead to O(n!) messaging complexity, actually.)
Presumably a system designer would choose a relatively small percentage of n
as a not-too-pessimistic bound on a tolerable number of malfunctioning nodes
(m), but exponential (or factorial?) communication growth encourages reducing
this parameter, and thus the robustness of the system as a whole.
Posted by: Zev Weiss | February 23, 2012 05:38 AM
This paper aims to handle conflict information in different part of a system. It proposes two algorithms Oral-Message(OM) and Signed-Message(SM) algorithms, and use them in different scenarios, full-connected graph and nonfull-connected graph. Finally, the algorithms are deployed in repliable circuit design and their assumptions are discussed.
In distribted system, information about the same item can be different in different components. This may be caused by the data transfer interference, malicious components and so on. While different components need to agree on the same information to make a common decision. This lead to the requirement that in a distributed system, each reliable component needs to get the same information from other components. This problem can be reduced to the problem how one compenent transfer the same information to all the receivers and guarantees the reliable component get it.
In the paper, the sender is descirbed as the commander and the receiver is lieutenants; all of them are generals. Using an example of 3 generals with 1 traitor, we show that there is no solution. If there are more than 3m generals and at most m traitors, the algorithm OM(m) sovles the problem; that is, all honest generals get the same command. OM(k) is described in page 388; the main idea is using mathematical reduction. When a commander sends message to all lieutenants, the lieutenants exchange the message to each other and know others' messages, then make a majority vote. When lieutenants exchange messages, the the problem is reduced to a OM(m-1) problem of a lower order.
If there are at most m traitors, the algorithm SM(m) solves the problems. SM(m) is described in Page 392; in this case, a traitor cannot forge messages any more, and reliable lieunenants will just sign the received message and forward to all others. Each lieunenants will make majority vote on all messages received. Since only reliable lieunants forward message, all reliable lienunants get the same messages, and vote to get the same result.
In missing-path scenarios, for any m>0 and any p>=3m, OM(m, p) sovles the problem if there are at most m traitor(the graph is regular-p graph). For any m and d, if there are at most m traitors and the subgraphs of loyal generals has diameter d, then the algorithm SM(m+d-1) solves the problem.
Discussion: (1)The messages seem to be flooding, which is (n-1)(n-2)...(n-m-1). Which will cause large overhead in networks.
(2) To synchronize the message in all components, these two algorithm exchanges too many messages and takes too many steps, which will cause the difficulty to design protocol and each component's behavior.
(3) The algorithm aims to handle components with malicious propose or malicious function. But in actual distributed system, we can use identification verification to protect from the former, and use checksum to protect fromt the latter; these mechenism can save the message overhead compared with OM and SM
Posted by: Wenfei Wu | February 23, 2012 05:28 AM
This paper presents a problem called the Byzantine Generals Problem which is used for problems that require agreeing on the correctness of data where some of the systems may be malicious or in error. Information is presented for several methods of increasing the reliability to allow the number of locations with errors to increase while still allowing determination of the correct result.
The goal of this paper is to determine solutions to the Byzantine Generals Problem as could be used to detect errors in system communication. The idea of this problem relates to connections that need to be verified, with sources of error coming possibly any location. It would be good to be able to detect errors by having other correctly functioning systems which would in comparison produce different results. A question is how many correctly functioning locations are needed to offset incorrect results coming from others.
This paper uses a method with one of the Generals being a commander while the rest are lieutenants during the solution. In the initial setup, the commander can communicate directly with all the lieutenants. It is not known if the commander is in error or if other lieutenants are in error but the goal is to reach an agreement on the correct value with the remaining correct commander and lieutenants and avoid errors. It is said that with three or fewer Generals, it is impossible to verify correctness for this problem. As the number of communication locations increase, there is always a bound for how many can pass along an incorrect answer while still allowing the correct locations to reach an agreement. Several variations of the Byzantine Generals problem were considered which involve how many connections there are between all the locations. These solutions can help increasing the reliability of a system while allowing more failures which still result in errors that can be detected. It relates to how one system could have sent faulty data to a second system in a network and then one has to identify for certain which of the two systems caused the error and where the correct value can still be found. One modification that increases the effectiveness is adding security signatures to data so that the source of the data can be verified. This helps determining the original location of the faulty/malicious data.
A question with this paper is how realistic is it for a system to have a full Byzantine Generals solution and at what point the expense of increasing the number of "traitors" that can be tolerated by one more could not be better spent improving another part of the system.
The idea of being able to detect certain systems providing incorrect data in a system is something that could be useful in a distributed system with the ability to compare data with clear, reliable inputs to use as comparison. The algorithms as presented in this paper rely on assumptions about the connection structure between systems that would not seem to apply in most situations. If high reliability of data on a system is required at all times including during times of multiple component failures that might otherwise remain undetected apart from the differing results, then something similar to this plan would be needed. Previous systems solutions involving clearly identifiable failures would not apply here.
Posted by: Daniel Crowell | February 23, 2012 02:44 AM
This paper discusses the Byzantine General Problem and solutions to this problem in a variety of failure cases in which there are one or more traitorous actors giving out false orders. There were a variety of situations in which the author looked at to determine how to detect a traitor in each situation. Such as in cases where there all the actors can directly communicate, where the actors are connected through an intermediary, and the authors also give a solution to the classic three actor problem as well. In addition to the variety of situations mentioned in this paper the authors also looked at the effects of using various communication methods and authentication methods were in determining traitorous generals. They looked at solutions in which you could verify the message came from a particular source and Oral messages without message source verification. The authors give a use case at the end involving multiple systems for which a value is calculated using the same set of inputs for which they want to determine the correct output from possible faulty actors.
The solutions to the Byzantine General Problem given in this paper seem like they would work correctly in small scale settings (where there are a fairly limited number of actors). However since most of these solutions require communication directly between all host's in a network this method would not scale correctly. Another issue with these approaches is that they require sort of setup in advance that may have to be done manually before starting the system. Such as in the Message Signing case, for complete security the message signing key's must be delivered to each host in some manor in which they cannot be altered and can be verified before execution of the system which in it of itself is a Byzantine General Problem which may be not solvable by the algorithms given in the paper in certain cases (such as number of actors = 3). The other issue would be with synchronization. Specifically determining if a message has not been received for a specific iteration. This would require that the clocks on the machines being used are synchronized and that everything happens in fixed time. This may require that the hardware/software used for this system be set up in some custom manner to guarantee that this occurs (such as a custom operating system, which is somewhat impractical for most applications).
Posted by: Benjamin Welton | February 23, 2012 02:42 AM
Summary:
This paper discusses another type of failures in building reliable systems, namely Byzantine failure. The paper abstracts Byzantine failure situation into the Byzantine Generals Problem (BGP). It demonstrates that there is no solution for the problem with oral messages only if loyal generals are not more than two-thirds of the total. It then gives two algorithms for oral message situation (which requires N>3m+1 with m traitors) and signed message situation which works for any traitors. It further discusses communication requirements and feasibility of implementation.
Problem:
Reliable computer systems need to deal with not only halting failure, fail-stop failure, omission failure, etc. but also Byzantine failure, in which a malfunctioning component could behave arbitrarily, e.g., sending inconsistent/misleading information to other parts in the system. The problem here is how good nodes could still reach agreement and hopefully make good decisions in the presence of faults.
Contributions:
+ A very well-written theory flavor paper with good problem abstractions, nice definitions (e.g. p-regular graph), impossibility, algorithms and strict proofs. The paper brought up an important problem which is as interesting as Dijkstra's dining philosopher's problem but more worth attention. Actually the Byzantine problem was introduced in an earlier paper “Reaching Agreement in the Presence of Faults” but Lamport presented it in the form of an interesting story here.
+ By making reasonable assumptions (A1~A4), the paper comes up with recursive algorithms which apply to oral message and signed message situations. Later these algorithms are extended to less connected communication graphs. It turns out that the graph has to be 3m-regular for OM(m) algorithm, and the subgraph formed by loyal generals has to be connected for SM(m) algorithm, which is the weakest connectivity requirement for BGP to be solvable.
+ Implementation discussions are also interesting. The challenge lies in how to implement those assumptions. In turn, other two important problems had emerged – clock synchronization and signature in cryptography.
Flaws:
- It is very likely that m is hard to determine in a practical system. Even m is figured out, it may be expensive to get at least 3m+1 replicas for m failures.
- It is more like a theoretical paper although the author discusses some implementation details. I think the main concern about solutions in this paper is their communication/signatures overhead and requirement for clock synchronization. In order to handle m Byzantine failures, it requires m+1 rounds of communication, and in turn each round requires O(n^2) messages. Also, clock synchronization means a tightly-coupled environment. It would be interesting to see how BGP can be solved in asynchronous system e.g. Internet (Practical Byzantine Fault Tolerance paper talks about this).
Applicability:
Expensive BGP model is not so often considered in practical system design. Tradeoffs are often made between reliability and performance. But I admit we do need it in mission critical situation, e.g. airplanes. Actually I built a three-node fault-tolerant system which is for airplane purpose before, in which hardware are highly specialized and customized. For instance, nodes are connected by high speed optical fiber or 1394 bus, and a 14-layer backplane is specifically designed for each node. Then it seems high reliability is achieved without much sacrifice in performance in this case.
Posted by: Yizheng Chen | February 23, 2012 12:15 AM
The authors propose approaches to develop a reliable system in the presence of component failures ---including components sending conflicting information. In particular, they analyze an abstraction, the Byzantine Generals Problem (BGP), under different communication scenarios.
The main problem is to accomplish that every royal lieutenant executes same order, and they all obey the orders of a loyal commander. The challenge is that the traitors may provide conflicting informations to different lieutenants. The one commander two lieutenant example clearly demonstrates this difficulty that a lieutenant may get the same set of orders, {attack, retreat}, under two different scenarios; and there is no way for that lieutenant to identify who the traitor is.
The main contribution of the paper is the presentation of the BGP itself. Although the practical applications are not well motivated in the paper, in the recent years we see that byzantine fault tolerance can be very important for large data-centers like Amazon's or Google's. A failed component that provides conflicting information may aggregate the failure and may cause the entire system unavailable in a very short time. Therefore, the discussion on different scenarios like oral messages and signed messages along with different topological issues provide insights about how to deal with such failed components.
Another important contribution is the impossibility results. Although the theory seems to be so strong, I am surprised how easily and clearly they represent this fact. The bounds on maximum number of failed components can be very useful while designing systems. One can run experiments and estimate the expected number of to-be-failed components and determine whether it is possible to tolerate byzantine failure or not.
Obviously, one flaw or shortcoming of the paper is the inefficiency of the proposed solutions. Both OM and SM algorithms require exponential time which will definitely cause scalability problems. This fact hindered the applicability of byzantine fault tolerance for a long time. But the recent advancements clear this inefficiency and provide fast solutions and open the way for practical applicability. “Practical Byzantine Fault Tolerance” paper presents such an algorithm to deal with byzantine faults.
Malicious attacks and software errors are expected in large data-centers which means that we will end up having systems where some components behave unpredictably. In such cases, in order to have consistency and reliability, the working components of the system should agree altogether. To develop such systems, while this paper provides invaluable insights about the problem, the following works provide more practical solutions. I also believe that the ideas would be useful in case the identities of the components are unknown, or reaching a consensus to select leaders of a system; because in both cases we should be dealing with components providing conflicting information.
Posted by: Halit Erdogan | February 22, 2012 11:57 PM
This paper talks about a class of algorithms that ensure reliability and fault tolerance in a distributed system that is prone to byzantine failures and proves the limitations of the algorithms in terms of the number of traitors and the connectivity of the system and complexity of implementing the algorithms.
The problem solved by this paper is rather original at the time of this paper and it introduced a new dimension to failures in distributed systems. The chief problem addressed by this paper is: In a distributed system where faulty nodes tend to behave arbitrarily (or even maliciously), can a consensus be reached? If so, at what expense and under what limitations?
The major contribution of the paper is perhaps, the quantification of theoretical limits to achieving consensus in a distributed system with byzantine failures. The paper introduces the very idea that failed nodes may exhibit arbitrary behavior (which could be extended to the security issue of a malicious node exhibiting malicious behavior to hinder consensus) and describes under what circumstances a reliable consensus can be reached.
Some nice points in the paper about the paper's contributions:
- It is interesting how the paper simplifies and formalizes the Byzantine Generals problem and starts with a solution for the simplest, most primitive case of the problem and then progressively advances to more sophisticated algorithms and more practical systems.
- A nice aspect of the paper is that it lays down the impossibility of solving the 3 node BGP with one traitor (in the presence of oral messages). A more subtle but significant point is the simple reduction of the m node BGP with at least m/3 traitors to the 3 node case.
- The paper presents the solution of using a simple majority function to reach a consensus for the oral messages only scenario. The more obvious but involved proof presented by the paper is the one for signed messages. The result that, in the presence of signed messages, an n node BGP can support any number of traitors is rather important.
- The paper then relaxes the assumption that every node can communicate with every other node and formalizes the solution by imposing a constraint on the topology of the system connectivity.
- The paper emphasizes that the Choice() and Majority() functions used in the implementation of the solution algorithms are trivial by themselves and the true complexity of the algorithm lies in materializing the assumptions of the algorithm.
The paper does not contain any flaws as such. There are however a few things that caught my attention:
- The assumption that there is a fixed time limit to declare absence of message seems rather tough to estimate in large scale systems.
- Though the paper formalizes the solution for cases of limited connectivity in terms of network topology, it would be rather impossible to find real systems' topologies (with required properties) to apply the algorithm.
- The paper assumes that the system is inelastic (new nodes may appear or existing nodes may disappear - malicious ones especially). Some light on the behavior of the consensus algorithm in the presence of elasticity would've been interesting.
Overall, the paper is definitely one of its kind. It is one of those rare papers that look at a completely unexplored, but blatantly obvious problem, simplify it to subject it to formal mathematical treatment and generalize it to real systems. Though consensus algorithms incur a lot of message and significant processing overhead and there are few practical cases in which byzantine fault tolerance is truly required in today's large scale systems, the novelty and concrete treatment present in the paper makes it a must read!
Posted by: Venkatesh Srinivasan | February 22, 2012 11:10 PM
This paper discusses the fault tolerance problem in distributed system. By analogizing the problem to the Byzantine Generals Problem, this paper analyzes the challenges of fault tolerance problem in distributed system, provides two algorithms to solve the problem, and discusses their applications in real distributed system implementations.
The problem the paper solves is an important problem in distributed system, which is how to handle the malfunctioning components that give conflicting information to different parts of the system. The problem is abstractly re-expressed into the environment of Byzantine Generals Problem. So it can be rewrites to a new problem which is a commanding general must send an order to his n-1 lieutenant generals such that IC1) all loyal lieutenants obey the same order, and IC2) if the commanding general is loyal, then every loyal lieutenant obeys the order he sends.
This paper solves the problem mentioned above by providing a theoretic analysis and solution to the Byzantine Generals Problem. The first discussion focuses on the difficulty of the Byzantine Generals Problem. This problem cannot be solved if the generals can send only oral messages unless more than two-third of the generals are loyal. By defining oral message with assumptions that A1) the message is delivered correctly, A2) the receiver knows the sender of the message, and A3) the absence of a message can be detected, algorithm OM(m) is presented and proved can solve the problem correctly. The limitation introduced by oral messages is mentioned above, which there must be more than two-third of the generals being loyal. To overcome it, signed messages is defined with assumption A4) the signature cannot be forged and it can be verified. With signed message, algorithm SM(m) is presented. The situation with missed communication paths is also discussed.
The paper looks well-written and considers the situations the algorithm may face in a real implementation. Although it has mentioned, the efficiency of the algorithms are still a problem, no matter whether the time complexity or the number of messages. The complexity may not be able to be accepted in real environment. Even if the paper suggests the complexity can be reduced by focusing the system on specific types of faulty instead of providing a perfect system, the further discussion on how to deal with some specific types of problems is not presented in the paper. This makes the contributions of this paper still hard to be deployed into implementation. Besides, further security and cartography issues to satisfy the assumptions of algorithms remain difficult to implementations.
In sum, the paper provides a great reference on building a fault tolerance system, although lots of details need to be resolved and trade-off in implementations, which are still a difficult and complicated problems to be solved.
Posted by: Xiaoyang Gao | February 22, 2012 10:57 PM
This paper provides solutions to solve the Byzantine general problem under different assumptions and proves their correctness. It also discusses how these solutions can be applied to build a reliable system tolerating byzantine failure model.
Byzantine failure model is that disfunctional nodes can produce unpredictable output or even produce deceptive output. Tolerating byzantine failure is very difficult. This paper focuses on how to reach agreement for normal nodes in the presence of byzantine failure.
The key contribution is it solves the byzantine general problem. Even though this problem itself is in a context not related to computer system, there is obvious mapping from the byzantine general problem to building a byzantine failure tolerant system. Nodes need to reach agreement for many reasons. Which update is more recent? Is a certain node down or not? These questions decide the states of the whole system and nodes have to make decisions according to their own observation and information from other nodes. Halting failure is easier to handle with than byzantine failure because nodes would never provide output to corrupt other nodes. Since other nodes can only influence the inputs of a loyal node rather than the computation, it is necessary to have redundant inputs and avoid those deceptive inputs. This provides an intuition of the reason why the algorithm would involve so many messages. This paper also gives some methodology guidance on designing distributed systems. Like doing computation several times and then deciding the result using majority vote and making assumptions about the specific failure that may occur. Replication is also given another meaning because the more replicas we have, the less likely the whole system would be fooled by traitors.
One flaw of this paper is the algorithms are too expensive and too restricted to be practical. There are exponentially increasing number of messages involved and the k-regular graph is a unpractical topology for network. Another flaw is that when considering assumption A3 (absence of a message can be detected), a fixed maximum time for generating and transmitting a message is required. This assumption seems to not be that safe. Even though network is getting faster and faster, in extreme case there could be infinite delay during transmitting. In most cases, setting a large value can work. But it can go wrong.
Byzantine failure tolerance provides very strong security. Not many commercial systems adopt byzantine failure model because it is not necessary. However, for systems that want to survive malicious attacks or require high security, byzantine failure is an appropriate failure model.
Posted by: Xiaozhu Meng | February 22, 2012 10:33 PM
The Byzantine Generals problem is a the problem of coordinating “attacks” over a set of generals, some of whom may be traitors. There are two subproblems: orals messages, which may be interfered with, and signed messages, which may only be dropped if from a loyal general but can be forged if from a traitor. The two problems require entirely different approaches and have differing applicability. The oral problem is deceptively difficult to solve, and provably cannot be solved in the case of three generals.
This paper formalized the Byzantine Generals problem, and more generally the concept of Byzantine failure where any node may fail in any way -- including sending malicious data. This is a very different model from fail-stop, where a node just stops responding in the failure case. Furthermore, the paper described algorithms for solving both cases of the problem and furthermore proved the lower bound on how many loyal generals you need to be able to solve it -- in the case of oral messages, 3m+1 general for m traitors. In the case of signed messages, you need only m + 2 generals; any less and the problem is a non-issue.
The paper was groundbreaking, but has flaws, which likely kept it from being used for a decade. Firstly, the requirements it makes on the problem are extremely strong -- the oral generals must have an N-regular graph, which is nearly as strong a requirement as a connected graph. Furthermore, each general must be able to tell who sent the message. This is fine in the case of a fixed-line network, but any time there is switching, you must add a new set of nodes to the Byzantine problems to represent the routers and switches or somehow have an out-of-band authentication mechanism. Thus, to have such authentication over a real network, you must use some method of cryptography.
This was the primary issue we had with the paper: the oral message subproblem, which the majority of the paper is devoted to, is very difficult to apply to any real problem. This is because the requirements are so strong that you would likely already have message signing if you fulfilled them. if you have signed messages, you can devolve to the simpler case and avoid the complicated algorithm. We did, however, note that the oral problem could be useful in the key-distribution required for the signed message problem.
The signed message problem, however, is extremely applicable. Any mission critical system -- which is to say, a system which cannot fail unless it is absolutely necessary and must always return the best response possible under any conditions -- would benefit from this approach. Even more mundane systems could benefit from it to limit error propagation from slowly corrupting the entire system. Other notable examples of its use is in GPG chains of trust (under certain models) and Bitcoin, which is discussed further below.
However, our group had dissenting opinions on the paper:
One point is to clarify/suggest a different ‘primary contribution’ from this paper. The primary review suggests the primary contribution is simply “discovering” these kinds of failures or the creating of an algorithm to handle it. An alternative view is that the primary contribution is the principled formalization of fault tolerance subject to this specific model for failure.
In this sense, the paper’s formalization becomes a way to understand, model, and compensate for failures subject to a particular model. This work can then be leveraged in two different ways: (1) advancing theoretic understanding of fault tolerance in general beyond the specific failure model and (2) providing a basis for compensating for failures within this specific model in practice.
By criticizing the paper for the specific algorithms it uses only addresses the direct application (2) of the paper. Such criticism does not address point (1), which reflects the theoretic ground work this provides. By exploring this specific problem, even if limited, provides a basis for research and exploration of other failure models.
Considering bit coin, the distinction between (1) and (2) is very important. If this paper were to just have provided a fast algorithm for solving failures but not any theoretic grounding, then bitcoin wouldn’t have the theoretic backing to work from. It is because the N-Generals problem was handled theoretically that bitcoin can benefit. This is because bitcoin needs a proven theoretic backing to function (in that, without theory backing it, no one would trust it with money).
This example clearly illustrates that it is sometimes necessary to not only have a practical algorithm but also a theoretically proven algorithm.
(David, Seth, Victor, Igor).
Posted by: Anonymous | February 22, 2012 10:17 PM
This paper exposes a general problem in distributed computing called the Byzantine Generals Problem (BGP) which is an instance of the class of problems in reliable computing. The paper is concerned with the specific problem of how one can ensure that a system of computers will come to an agreement even though some members of the system can be faulty, where this faultiness could result in arbitrary behavior (sending erroneous messages, for example). To make the problem easier to think about, the authors model it as a problem involving a number of Byzantine generals who need to collectively decide whether to attack or retreat even though some of them may be traitors. The criteria for a successful algorithm are that 1) all loyal generals should decide on the same action and 2) it should be impossible for a small group of traitors to make the loyal generals choose a bad action.
BGP itself is a subproblem of this problem. In BGP, we imagine that a general is trying to send a command to her lieutenants, and the lieutenants must decide collectively what to do, even though the general and each lieutenant may be traitorous. Specifically, it must be the case that IC1) the loyal lieutenants all decide on the same action and IC2) if the commander is loyal, then the loyal lieutenants all obey his command. This subproblem is needed to solve the problem of general agreement since each general must decide what action the other generals are voting on which amounts to the problem of deciding what command each general is sending.
The authors show a number of interesting things. They show that, if the graph is strongly connected, then there is no solution unless more than 2/3 of the generals are loyal. Further, they show that, if there are at most m traitors, then there is an algorithm that works for 3m + 1 or more generals, and they give this algorithm. The authors also consider graphs that may not be fully connected but are at least 3m-regular, showing that, if there are at most m traitors, there is an algorithm that works for 3m or more generals on a p-regular graph for p >= 3m.
One flaw in this paper is that the algorithms it proposes are extremely inefficient. I did not bother to figure out the running time complexity, but at first glance it looks like the OM(.) and SM(.) have O(n^2logn) complexity and something similar for OM(., .). Further, the algorithms all rely one topological properties of the network, which is one of the fallacies of distributed computing.
On the other hand, this paper has given a good context to the epidemics paper we read earlier. In that paper, I don't believe Byzantine failures were considered. It would be worthwhile to look back at the epidemic algorithms and see how they can be repaired to account for such failures.
Posted by: James Paton | February 22, 2012 10:05 PM