Practical Byzantine Fault Tolerance
Practical Byzantine Fault Tolerance; Miguel Castro and Barbara Liskov, OSDI '99
Review due Tuesday, 2/15.
Comments
Summary: This work presents a strategy for replication in a distributed system that is tolerant to Byzantine faults. The algorithm has guarantees for both safety and liveness and improves upon previous work by working in asynchronous environments and by greatly improving real-world performance. The authors demonstrate one application of the algorithm in a fault tolerant version of NFS.
Problem: Software systems continue to become larger, more complex, and more error prone, and malicious attacks continue to increase. In a distributed system these factors can result in so-called Byzantine faults: nodes can fail silently yet continue to operate, sending arbitrary and/or incorrect results to the rest of the system. The challenge addressed here is to design a method to safely replicate information to all non-faulty nodes in a system despite an unknown set of faulty nodes. Earlier work had shown that the problem is solvable when fewer than 1/3 of the nodes are faulty, but the solutions were too slow and relied too strongly on synchrony assumptions for correctness.
Contributions: The algorithm assumes a system of 3f+1 nodes with at most f faulty nodes, and is presented in the form of a state machine where the replicas move through a series of views. When a client submits a request, the primary sends a pre-prepare message to all other replicas. If the replicas accept the pre-prepare they log it and send a prepare message to all other replicas. A replica is considered prepared once it has logged the request and the pre-prepare and has received 2f prepares from other replicas that match the pre-prepare. Once a replica is prepared it sends a commit message to all other replicas. A replica is considered committed locally once it has 2f+1 matching commits (possibly including its own). Finally, once the replica is committed locally it executes the request and sends the result to the client. The pre-prepare and prepare phases together ensure that all non-faulty nodes agree on a total order for requests within a view. The prepare and commit phases ensure that requests that commit are totally ordered across views.
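The quorum thresholds behind these phases can be sketched in a few lines (an illustrative sketch of the sizes involved, not code from the paper; the function name is my own):

```python
def quorum_sizes(f: int) -> dict:
    """Quorum thresholds for a PBFT system tolerating f Byzantine faults."""
    n = 3 * f + 1                      # total replicas required
    return {
        "replicas": n,
        "prepare_quorum": 2 * f,       # matching prepares from other replicas
        "commit_quorum": 2 * f + 1,    # matching commits (own may be counted)
        "reply_quorum": f + 1,         # matching replies the client needs
    }

# Any two sets of 2f + 1 replicas out of 3f + 1 intersect in at least
# f + 1 replicas, hence in at least one non-faulty replica -- the
# intersection argument behind the protocol's safety.
assert quorum_sizes(1)["replicas"] == 4
```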
Several optimizations are presented that make the system actually practical for real-world applications. Logs are kept small by periodically discarding messages whose requests have provably been executed; a checkpoint becomes stable once 2f+1 replicas have produced matching proofs for it. Communication is reduced by having only one replica send the complete result to the client while the rest send only digests of the result. Requests are also executed tentatively as soon as replicas are prepared, and may be rolled back to the last stable checkpoint. There is also a streamlined protocol for read-only requests.
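The garbage-collection idea can be sketched as follows (my own illustration, not the paper's code): once a checkpoint at sequence number s is stable, all logged protocol messages for requests numbered at or below s can be discarded.

```python
def truncate_log(log: dict, stable_seq: int) -> dict:
    """Drop logged protocol messages at or below the stable checkpoint.

    `log` maps a request's sequence number to its logged messages.
    """
    return {seq: msgs for seq, msgs in log.items() if seq > stable_seq}

log = {98: ["prepare"], 100: ["commit"], 101: ["pre-prepare"]}
# After the checkpoint at sequence 100 becomes stable, only entry 101 survives.
assert truncate_log(log, 100) == {101: ["pre-prepare"]}
```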
Flaws: The authors briefly addressed the fact that their model assumes that faults are independent, and suggested a few ways to avoid non-independent fault situations. This may be too strong an assumption. There have been reports that non-independent faults are actually more common, even in systems that have taken the preventative steps suggested by the authors. Also, they mentioned but did not discuss what happens when faults are, in fact, discovered in the system. It would be interesting to know what kind of guarantees their system can make about fault detection, and what steps can be taken to recover from them.
Applications: The overall problem here is still very relevant to systems design today. Consumers are using more and more 'cloud' services and storing more and more data online. The optimizations used here are, I think, more than just simple performance gains. The read-only optimization in particular reminds me of the later work on GFS. Certain characteristics of a system's workload requirements can allow us to relax some consistency constraints, and the relaxed constraints can enable significant scale and efficiency gains. In fact, the consistency model here may be stricter than necessary in many cases.
Posted by: Kristopher Kosmatka | February 16, 2011 08:03 PM
Summary: "Practical Byzantine Fault Tolerance" develops a generalized model along with optimizations for a distributed system that uses replication to maintain safety and liveness in the face of Byzantine failures. The authors go on to describe a concrete implementation in the form of a replication layer superimposed on the NFS file system.
Problem: It's clear that the goal of this paper is to broaden the range of applications that can reasonably expect to realize Byzantine fault tolerance. Due to the difficulties presented by such a fault model, prior research had failed to produce solutions that could be put to general use because of their prohibitive overhead.
Contributions: The authors of this paper enlist the results and methods of prior works and attempt to carve out and combine the most efficient procedures while adding their own optimizations. In the spirit of being practical, they take the Internet to be the prototypical context in which their system will operate, and attempt to manage the asynchronicity of such a system as a necessary evil. The concrete implementation of BFS and the performance evaluations and comparisons give credibility to the notion that they have developed a framework for building a distributed system with Byzantine failure resistance that is indeed practical.
Flaws: Throughout the paper, the authors regard the value of f, the maximum number of simultaneous failures in the system, to be a known constant. At one point they assert that a system should have no more than 3f + 1 replicas because more than that would increase complexity without any benefits to reliability. In theory the assertion is correct, but because of the paper's focus on practical application, there should have been a better evaluation of the costs and benefits of increasing the number of replicas in a system.
Applicability: As stated, the increased opportunity for malicious attacks on distributed software systems is a powerful motivator for addressing the issue of Byzantine faults and for creating practical implementations to handle such situations. Given the bound on the ratio of replicas to survivable faults established by Lamport, papers such as this one can only hope to develop faster and more general implementations which realize correct operation. The authors do an admirable job of developing and measuring their implementation, and undoubtedly their methods have been used as a base to improve upon in further studies.
Posted by: Rich Joiner | February 16, 2011 08:57 PM
Summary:
The paper presents a practical algorithm for Byzantine-fault-tolerant state machine replication in a distributed system. Unlike previous techniques, this algorithm works in asynchronous environments like the Internet. The authors evaluate their approach by implementing the algorithm in a real service, a replicated NFS file system, which is shown to be only slightly slower than a regular NFS file system.
Description of Problem:
Software is becoming so complicated that bugs are common, and networks and the Internet are growing so large that malicious attacks are becoming common as well. Bugs and malicious attacks cause nodes to become faulty and exhibit arbitrary behavior, known as Byzantine faults. We therefore need algorithms that can deal with these Byzantine faults, and these algorithms also need to be efficient so that they can be applied in the real world.
Summary of contributions:
This paper describes a Byzantine fault tolerant algorithm for state machine replication that works in asynchronous networks; previous techniques assumed synchrony or were too inefficient. The paper describes a number of optimizations that greatly speed up the algorithm in a real system, such as having only one replica send the full result of a request while the other replicas send only a digest of the result, which greatly reduces network bandwidth for large replies. It also describes a technique to change the primary if the primary is found to be faulty, so a request should eventually return a response as long as the maximum number of faulty nodes is not exceeded. The authors also implement a Byzantine fault tolerant distributed file system based on NFS and provide experimental results from their implementation.
Flaws:
The authors assume independent node failures, which seems like a strong assumption to me. They mention that software can be implemented using N-version programming, but in the real world that does not seem very cost-efficient. For example, I don't think Microsoft would invest in having multiple teams develop different implementations of their NTFS file system just so that failures would be independent. Also, as mentioned in the last lecture, different implementations could well crash on the same operation because the underlying problem is simply too complicated.
Application to real systems:
The paper shows that a Byzantine fault tolerant algorithm can be used in distributed file systems. The Internet has become so large, with so many potentially faulty nodes, that I can see how a Byzantine fault tolerant algorithm that works in an asynchronous network could provide reliable data access in the real world. However, a large number of messages must be sent as the network grows, which might become an issue unless better optimization techniques are found.
Posted by: Kong Yang | February 16, 2011 09:53 PM
Summary:
This paper describes a practical state machine replication algorithm that tolerates Byzantine faults and works in asynchronous environments with low overhead. Furthermore, a BFT-based NFS system is built to demonstrate how well the algorithm performs compared with standard NFS.
Problems:
There is much previous work on tolerating Byzantine faults. However, most of that work is too inefficient to be used in practice, either because of the high overhead of sending large numbers of messages or because it relies on synchrony assumptions, such as bounds on message delays. So we need a practical, working Byzantine fault tolerant system.
Contributions:
1. This paper presents a novel state machine replication algorithm to handle Byzantine faults in asynchronous environments, which provides both safety and liveness under the assumption that no more than floor((n-1)/3) nodes are faulty.
2. Furthermore, it proposes two important optimizations to improve system performance. First, it avoids sending most large messages by sending only a digest of the message, reducing network bandwidth requirements and CPU overhead. Second, it uses message authentication codes (MACs) to authenticate messages instead of digital signatures, which improves performance further.
3. A replication library is implemented as the basis for any replicated service, and a real BFT NFS is built using this library. Andrew benchmark results show that their BFT NFS has very low overhead compared with standard NFS while still handling Byzantine faults.
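The MAC optimization in point 2 can be sketched roughly as follows. Note this is a hedged illustration: the paper computes an MD5 digest of the message concatenated with a shared secret, while this sketch uses the standard HMAC-MD5 construction as a stand-in; the key and message contents are made up.

```python
import hashlib
import hmac

session_key = b"0123456789abcdef"          # 16-byte shared secret (illustrative)
message = b"PREPARE view=1 seq=42 digest=..."

# Sender computes a MAC over the message with the pairwise session key.
tag = hmac.new(session_key, message, hashlib.md5).digest()

# Receiver recomputes the tag with the same shared key and compares in
# constant time; only holders of the secret can produce a valid tag,
# which is far cheaper than a public-key signature.
expected = hmac.new(session_key, message, hashlib.md5).digest()
assert hmac.compare_digest(tag, expected)
```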
Flaws:
1. The most important flaw is that their algorithm is based on state machine replication, which requires the replicated program to be deterministic. They do provide some tricks to handle non-deterministic operations in Section 4.6, but many operations are non-deterministic and hard to make deterministic. For example, with the popularity of multi-core machines, multi-threaded concurrent programs are non-deterministic, and it is very hard to replay the thread scheduling order deterministically.
2. Their system model also depends on the assumption of independent node failures, but correlated failures are very common in real systems. For example, a power failure can bring down all the nodes in the same rack, and a malicious administrator can affect all the nodes in the same domain.
Applicability:
BFT is a hot topic; you can find many BFT-related publications at top conferences each year. It seems that BFT remains an active research problem in academia because it is hard and classic. But I do not know whether any real systems in industry are built to handle Byzantine failures. The failure modes of real systems in data centers or large clusters may not exhibit full Byzantine behavior, so in reality each company may build its own specialized system targeting the particular security or fault modes seen in its environment.
Posted by: Lanyue Lu | February 16, 2011 11:34 PM
Summary
This paper develops a practical strategy for tolerating arbitrary faults in distributed systems, largely based on the theoretical work on replicated state machines and Byzantine fault tolerance described by Lamport and others. The system implemented with these techniques is proven to be both correct and live, provided that fewer than a third of the nodes are faulty or compromised. A number of optimizations are explored, which make the approach viable in a real system. Finally, a fault tolerant NFS system is developed, and is shown to have reasonable overhead.
Problem
The proliferation of distributed systems across open and unreliable networks, like the internet, has made faults and malicious attacks increasingly common and problematic. While there has been significant work in trying to solve subsets of these issues individually, this paper takes the strategy of implementing Byzantine fault tolerance, which is a catch-all for independent errors of up to some degree. The major challenge with this approach is to remain correct and live, while not significantly degrading performance.
Contributions
The authors present an algorithm for Byzantine fault tolerance through the use of replicated state machines, which were originally described in previous work. The main contribution here is a protocol which can be implemented directly on various systems, addressing a number of practical issues. This includes handling primary replica faults (through a view change protocol and sequence number watermarks), discarding useless log state, and handling requests which introduce non-determinism. Liveness is also ensured using several techniques, including a backoff scheme for introducing view-change requests, and primary replica rotation to prevent malicious primaries from indefinitely impeding progress.
A number of clever optimizations are explained to allow the system to be competitive. One is that the full reply may come from just one replica, since the others need only send digests for verification. Another optimization reduces the protocol stages for read-only requests, and a third uses cheap message authentication codes instead of full digital signatures.
The other main contribution is the implementation of their BFS file system, or Byzantine Fault tolerant NFS. This proof of concept provides strong evidence that their techniques are efficient enough for real systems.
Limitations
One fundamental limitation is that correlated errors may affect all nodes simultaneously, leading to unrecoverable states. Therefore, these techniques should only be deployed if there is evidence that faults are sufficiently independent. Another such limitation is that, according to my understanding, this type of fault tolerance doesn’t prevent a compromised primary node from maliciously modifying data; it simply ensures that an incoherent state will never exist.
Also, I question the equivalence of the stability assertions in the NFS vs BFS comparison. For NFS, stability was guaranteed by going to disk, and for BFS stability was governed by replication. However, for simultaneous failures (like power failures), NFS should be able to retain more recoverable state.
Applicability
Certainly, fault tolerance will remain an important issue in distributed systems. As the sophistication of attackers continues to increase, and as the size of our systems continues to expand, the need for more rigorous and high-performing fault tolerance has become vital. This paper provides many practical techniques which can aid systems designers in achieving a higher level of reliability.
Posted by: Tony Nowatzki | February 16, 2011 11:37 PM
Summary: This paper proposes a new state-machine replication algorithm that can tolerate Byzantine failures. The authors claim that the algorithm involves far less overhead than earlier proposals, and hence, is practical to realize.
Problem statement: The increase in the number and scale of distributed systems (over the Internet and otherwise) has attracted a fresh breed of malicious attacks, making recent distributed systems more prone to Byzantine failures. Although there are proposals to counter Byzantine failures, they are all 'impractical' because of their overheads, so a new solution had to be invented. Moreover, the solution should be applicable in asynchronous systems (such as the Internet).
Summary of contributions: The primary contribution is the algorithm itself, which can be used to implement a deterministic replicated system. Clients issue requests and draw a conclusion about the answer only when they hear the same response from at least f + 1 nodes, where f is the number of failures permitted. The system implements a view-based three-phase commit logic to fulfil the request and respond to the client. A replica commits a transaction only after collecting 2f + 1 matching commit messages. The numbers are so chosen that any two quorums involved in an operation overlap in at least one non-faulty replica.
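The client-side rule can be sketched in a few lines (an illustrative sketch, not the paper's code; the function name is mine): a result is accepted only once f + 1 replicas vouch for it, since at most f of them can be faulty.

```python
from collections import Counter

def accept_result(replies: list, f: int):
    """Return the result vouched for by at least f + 1 replicas, else None."""
    if not replies:
        return None
    result, votes = Counter(replies).most_common(1)[0]
    return result if votes >= f + 1 else None

# With f = 1, two matching replies suffice; three disagreeing replies do not.
assert accept_result(["ok", "ok", "bad"], 1) == "ok"
assert accept_result(["a", "b", "c"], 1) is None
```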
The authors also propose various optimizations to make the algorithm practically feasible. A Byzantine fault tolerant implementation of NFS is also described in the paper, and was used for the performance overhead study.
Flaws: The system needs 3f + 1 nodes to guarantee tolerance of f faults. Although this matches the theoretical limit, I wonder how many real-world systems employ this many replicas. Also, the algorithm requires that the client hear from at least f + 1 replicas, which, again, sounds like asking too much of the client. Why should the client even know it's dealing with a distributed system? The paper also assumes independent failures; in a typical enterprise, code is pushed uniformly to all machines in one go, so the same bug might manifest across a cluster. Power and rack failures are also possible, which begs the question: is the assumption too strong?
Discussion: We can immediately see how the ideas can come of use in distributed systems that demand high reliability. The level of abstraction assumed by the paper (of a generic service with operations) makes it easy to implement this algorithm on top of many large scale systems currently in use - websites, storage/DB services, authentication systems etc.
I wonder if we can trade off the number of nodes required to realize acceptable fault tolerance without having to max out on the number of replicas (3f + 1). This might also open up opportunities for different QoS levels.
Posted by: Srinivasan T | February 17, 2011 12:44 AM
Summary
This paper presents a replication algorithm that can handle Byzantine faults. Their algorithm is unique in that it is designed to be used in a real world environment such as the Internet. In addition, they implemented an NFS service which uses the algorithm, showing that it is actually practical in a real world system.
Problem
In the increasingly complex and dangerous world of computers, it is becoming more and more important that systems be able to deal with various types of failures. One such type of failure is a Byzantine failure, where nodes can exhibit arbitrary behavior. Furthermore, if we want to actually protect against such faults, we need a system that can run quickly and correctly in real world situations.
Contributions
This paper has several important contributions. First of all, it presents a state-machine replication protocol that can be used to correctly handle Byzantine faults in asynchronous networks. In their algorithm, services are replicated across multiple nodes; clients request a service by sending messages to a primary node, which in turn broadcasts the request to all replicated nodes. As seen previously, the system requires 3f + 1 nodes to handle f faulty nodes. It provides guarantees for both liveness and safety, and is able to handle network failures (i.e. lost messages). Given that the goal of the authors was to come up with a usable system, they also implemented several optimizations to improve performance.
The other main contribution in this paper is their actual implementation of a system that uses their algorithm. They created a replication library which a client uses to request services from the system, and implemented BFS (Byzantine-fault-tolerant File System) on top of that. Their system runs with a fairly reasonable overhead compared to normal NFS, showing that their algorithm is practical in a real world situation.
Flaws
The authors assume a Byzantine failure model. While this is an interesting problem, I wonder how common it is in the real world. As they mention, fully independent node failures are not necessarily a safe assumption, and the steps they suggest for ensuring that this holds seem extreme (each node runs different implementations of software and operating system?). Perhaps this could be feasible in a small network, but this brings me to my second concern: is this scalable? Several aspects of their design are focused on ensuring that even with many nodes the system works, but their implementation included only four nodes. Does this work with many more nodes? Could a large network like Google use this algorithm to provide Byzantine-fault tolerance on top of GFS?
Also, I think the acronym “BFS” is a bit of a stretch for the name “Byzantine-fault-tolerant File System.” More like BFTFS.
Application
In today's world, it is very important to safely handle machines in your network exhibiting arbitrary behavior. Though some of their assumptions about machine failures may be iffy, I think the system presented in this paper is a good step forward in dealing with this type of failure and providing guarantees in spite of malicious behavior. Furthermore, improvements in hardware and software might make parts of the system even more applicable; for example, perhaps they could get away with digitally signing all messages rather than mostly using MACs.
Posted by: Ben Farley | February 17, 2011 01:11 AM
Summary:
The paper describes a detailed replication algorithm which tolerates Byzantine faults. The algorithm distinguishes itself from previous work in that it works in asynchronous settings like the Internet. It provides safety and liveness under the assumption of an upper bound on the number of faulty replicas, the bound being floor((n-1)/3). The algorithm is supported by an implementation and evaluation of an NFS server which is Byzantine fault tolerant.
Problem:
Frequent malicious attacks from the Internet and increasingly common bugs in current software cause faulty systems with arbitrary, hard-to-predict behaviour. Such faults are called Byzantine faults. The paper presents a practical algorithm for tolerating Byzantine faults which, unlike previous work, operates in asynchronous settings like the Internet.
Contributions:
The major contribution of the paper is a practical algorithm for Byzantine faults in a system with 3f+1 nodes and at most f faults. This differs from the existing work, which usually relies on synchrony assumptions. The NFS implementation of the algorithm shows its applicability and performance in real systems.
The algorithm has three phases for replication and involves a primary within changing views of the system. Message digests are used to keep messages short, and public keys are used to identify the senders. The protocol uses checkpointing to remove old log messages and earlier checkpoints. The algorithm ensures safety by assigning each request a unique sequence number, and liveness is provided by changing views when replicas are unable to get a request processed, using exponential backoff on the view-change timers.
Flaws:
As discussed in class, one flaw of such a system is the practicality of its fault model. In real systems it is often hard to restrict Byzantine faults to f nodes in a 3f+1 node system; often an entire rack of machines crashes, or buggy code is pushed that crashes the entire system rather than a small part of it.
Application to real world systems:
It is hard to say whether such a system is used in practice, specifically when failures are often not restricted to a fraction (1/3) of the size of the system. However, such systems are useful for security: the algorithm can resist attacks like malicious code or viruses affecting part of the system. Moreover, the various techniques employed for the practicality of the solution, like cryptography, message digests, changing views, and garbage collection, are used very frequently in real systems.
Posted by: Ishani Ahuja | February 17, 2011 01:29 AM
Summary
This paper presents an algorithm which guarantees correct and live operation of a replicated state machine in the presence of Byzantine failures. The system is called "practical" because it can operate in asynchronous networks which may drop messages or deliver them out of order, and because it incorporates several optimizations to achieve performance comparable to a non-replicated service. A replicated version of NFS was developed to demonstrate the fault-tolerance protocol, and was benchmarked to measure overhead introduced by the protocol.
Problem
A replicated system can encounter many types of failures, not all of which are of a well-understood nature like fail-stop. To be truly fault-tolerant, a system must tolerate arbitrary behavior from failed nodes. Such nodes may send inconsistent or inaccurate messages or arbitrarily delay responses, either because of bugs or because they have been compromised by a malicious attacker. Previous work by Lamport addressed the problem of handling such Byzantine failures.
In order to tolerate Byzantine failures in a real-world replicated system, it is necessary to also deal with an asynchronous network which is not 100% reliable and deterministic in its delivery of messages (they may be lost or arrive out of order). The implementation of a fault-tolerance protocol must also not be so complex in terms of messages exchanged or computation required that it winds up substantially degrading the performance observed by clients using the replicated service.
Contributions
The paper defines a protocol for a replicated state machine that guarantees correctness and liveness so long as less than 1/3 of the nodes are faulty. The system is, at any given time, in a "view" where one node acts as the primary and others as backups. A client contacts the primary, which then broadcasts pre-prepare messages to the backups. All nodes exchange and log prepare and commit messages, and all nodes send replies to the client (the client knows its request is successful when f+1 replies with the same result are received). Cryptographic message signing or, alternatively, message digests salted with shared secrets are used to prevent forging of messages. A checkpointing system is used to bound the amount of logging required at each node. In case of a faulty primary, other nodes can notice large delays without successful operation, and the non-faulty nodes can agree to advance to a new view, cycling to a new primary.
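The view-cycling behavior described above can be sketched simply (my own illustration, not the paper's code): view numbers increase monotonically, and the primary of view v is the replica with id v mod n, so a faulty primary is eventually rotated out.

```python
def primary_of(view: int, n: int) -> int:
    """Replica id acting as primary in the given view, for n replicas."""
    return view % n

n = 4                           # 3f + 1 replicas with f = 1
assert primary_of(0, n) == 0    # replica 0 leads view 0
assert primary_of(5, n) == 1    # after five view changes, replica 1 leads
```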
To improve the performance of the system, the authors introduce several optimizations. Communication can be reduced if the client has only one replica send a full result and the others send only the digest, and if replicas support tentative execution before committing that can be rolled back. The high computational overhead of signing and verifying messages with public-key cryptography can be avoided if messages instead use message authentication codes, which are a digest (in this case MD5) of the message concatenated with a shared secret (every pair of a client or replica node and a replica has its own 16-byte shared secret session key).
A replication library implementing the described protocol was created, and it was in turn used to create a replicated version of an NFS filesystem. The system was benchmarked and found to perform within 3% of the latency of the standard non-replicated NFS implementation in Digital UNIX.
Flaws
The authors assume that node failures are independent, and suggest having different software (possibly via N-version programming), different configurations, and different administrators for each node. In real life, a replicated service is rarely ever built with such heterogeneity, and correlated failures are possible. A remote exploit common to the software on multiple nodes could allow an attacker to take over more than an otherwise reasonable limit of less than 1/3 of the nodes in the system quickly and break the protocol.
The protocol also seems to be useful primarily in the case where replication is being done specifically for fault tolerance or high-availability. The requirement that every node (or, at least, every non-faulty node) holds a complete copy of the system state and participates in every request rules out this protocol's use in clustering for load balancing and high-performance/scalability.
Applicability
Tolerance of Byzantine failures is important because many different types of common failures exhibit Byzantine behavior, including buggy or improperly implemented software, flaky network links or server hardware, and nodes compromised by malware or a malicious administrator. A replicated service that is resistant to arbitrary behavior of some fraction of its nodes is resistant to all of these various types of failures so long as they don't affect 1/3 or more of the nodes.
Posted by: Craig Chasseur | February 17, 2011 01:59 AM
Summary
This paper describes a Byzantine fault tolerant replication algorithm that works in asynchronous settings. In addition to proving the safety and liveness conditions, it also explains strategies that make it more efficient than similar previous algorithms.
Problem
As malicious attacks and software bugs increase, faulty nodes increasingly exhibit arbitrary behavior. There is a need to tackle Byzantine behavior in today's world, where governments and industry rely more and more on online services. However, an algorithm that is slow or depends on synchrony could hurt performance. Thus, what is required is a Byzantine fault tolerant algorithm that works in a realistic asynchronous environment like the Internet and is practical to implement.
Contributions of the paper
The paper describes the motivation for achieving Byzantine fault tolerance and the system under consideration: a realistic system with network delays in which faulty nodes may behave arbitrarily. A concise theoretical explanation of the safety and liveness requirements and the choice of the number of replicas in the system is given.
The main focus of the paper is the state machine replication protocol that tolerates Byzantine faults in asynchronous networks like the Internet. The client sends a request to the primary, and the primary multicasts the request to the backups. If the client receives f+1 matching results (where f is the number of faulty nodes allowed), the operation is complete. The entire protocol depends on the message logs kept at every replica. The process is divided into three phases: pre-prepare, prepare, and commit. These phases establish a total order on requests despite out-of-order delivery; each replica also multicasts periodic checkpoint messages, and a checkpoint becomes stable once 2f+1 replicas have signed checkpoints for the same sequence number with the same digest.
Safety is ensured because the non-faulty nodes agree on the sequence numbers of requests that commit locally, and liveness is ensured by moving to a new view if a replica cannot execute a request (detected via timers).
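The quorum thresholds behind the three phases can be sketched in code. This is a minimal illustration under the paper's n = 3f+1 setting; the class and names here are invented for clarity, not taken from the authors' implementation:

```python
# Sketch of PBFT quorum checks for one request slot, assuming n = 3f + 1 replicas.
# A replica is "prepared" once it has logged the pre-prepare plus 2f matching
# prepares from distinct replicas; it is "committed-local" after 2f + 1 matching
# commits (counting its own).

class Slot:
    def __init__(self, f):
        self.f = f
        self.pre_prepared = False
        self.prepares = set()   # ids of replicas that sent matching prepares
        self.commits = set()    # ids of replicas that sent matching commits

    def prepared(self):
        return self.pre_prepared and len(self.prepares) >= 2 * self.f

    def committed_local(self):
        return self.prepared() and len(self.commits) >= 2 * self.f + 1
```

With f = 1 (four replicas), a slot becomes prepared after two matching prepares and committed-local after three matching commits.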
Various performance optimizations are described: just one designated replica sends the full reply while the others send only digests; executions are performed tentatively, and if the view changes the replica rolls back to the last stable checkpoint; and, as a major win, read-only requests are handled directly without running the full three-phase protocol.
Finally, they implement the algorithm for NFS and measure its performance. They show that the replicated service is only 3% slower than a standard unreplicated NFS server.
Concerns
As we discussed in a previous class, failures rarely happen independently; they tend to be correlated. I feel the paper's assumption that nodes fault independently is not really a "realistic" one. The paper also admits that it does not address the problem of fault-tolerant privacy. However, it is not fair to expect the paper to solve every possible problem with respect to malicious behavior.
Relevance to modern systems
The paper has a strong and clear motivation: the need to tackle arbitrary behavior given the increase in malicious attacks and software bugs. Hence a practical approach is absolutely necessary. From some research on the web, I learned that practical Byzantine fault tolerance is built upon by protocols like HQ (hybrid quorum), Q/U, and Zyzzyva.
Posted by: Karthik Narayan | February 17, 2011 02:03 AM
Summary:
This paper presents a practical state machine replication algorithm that can tolerate Byzantine failures. The protocol assumes an asynchronous distributed system of connected nodes in which the network is unreliable (it may delay, duplicate, fail to deliver, or reorder messages) and nodes are Byzantine and fail independently. The total number of replicas is 3f+1, where f is the maximum number of faulty replicas. The algorithm works as follows:
A client sends a request to the primary replica, which multicasts the request to the backups; the request then goes through a three-phase protocol involving pre-prepare, prepare, and commit phases. Replicas execute the request and send replies directly to the client, which accepts the result once it receives the same reply from f+1 different replicas. The state of each replica is stored in a log that is garbage collected (in order to keep the log a manageable size). The authors prove that the algorithm satisfies both safety and liveness and discuss some optimizations. Finally, they implement BFS (a Byzantine-fault-tolerant NFS) and measure its performance against an unreplicated NFS.
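The 3f+1 replica count mentioned above follows from a simple quorum-intersection argument: a quorum must still be reachable when f replicas are unresponsive, and any two quorums must overlap in at least f+1 replicas so that at least one correct replica bridges them. A small sketch of the resulting arithmetic (helper names invented for illustration):

```python
# With n replicas and quorums of size n - f, two quorums intersect in
# 2(n - f) - n replicas; requiring this to exceed f (so the intersection
# contains at least one correct replica) gives n >= 3f + 1.

def min_replicas(f):
    """Smallest replica count tolerating f Byzantine faults."""
    return 3 * f + 1

def max_faults(n):
    """Largest f a system of n replicas can tolerate."""
    return (n - 1) // 3
```

So four replicas tolerate one fault, and seven replicas tolerate two; going from four to six replicas buys no additional fault tolerance.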
Problem:
Most prior papers that attempted to handle the Byzantine generals problem were formulated under the assumption that the system is synchronous. However, in the real world (read: the Internet), this assumption is usually false. Packets on the Internet get delayed, lost, duplicated, or reordered, and nodes crash, all the time. Hence the solutions proposed in the earlier papers are impractical.
(It is worth mentioning here that this paper also assumes some sort of synchrony for guaranteeing liveness, albeit a weaker one.)
Contribution:
This paper bridges the gap between theory and real-world application. It proposes the first state-machine replication protocol that survives Byzantine faults in asynchronous networks. Furthermore, the authors perform a number of optimizations to make their protocol fast enough to use in practical scenarios.
Finally, this paper also describes the implementation of a Byzantine fault tolerant distributed file system, which they call BFS.
Flaws:
Protocol: The paper assumes that link failures cannot occur. The assumption that all the nodes are always connected does not gel well with real-world scenarios: packets can get lost even between non-faulty nodes, since link failure is much more common than node failure (in fact, per one study, the most common type of network failure is link failure).
Also, the assumption of independent node failures is too strong for real-world scenarios, where, say, a fire gutting hundreds of nodes in a data center is quite possible.
Implementation:
It is quite counter-intuitive that they use UDP instead of TCP for sending messages to replicas. The authors themselves acknowledge that recovering from lost messages is expensive and that they have to resort to retransmissions. This is especially true under congestion at high values of f.
For a paper that highlights the word 'practical' in italics, their four-replica (one-fault) setup with no view changes or adversarial elements does not quite add up; real systems rarely enjoy such perfect conditions. One could also argue that their system is not designed for scale, but then a replicated state machine never intends to solve that problem anyway.
Relevance: Given the paper's motivation to design a practical system and the number of papers still being published on Byzantine failures, this paper seems extremely relevant today. While it follows a replica-based approach (whose quadratic cost of inter-replica communication is unnecessary when there is no contention), there are other approaches, such as client-controlled quorum-based ones (which require a larger number of replicas) and more recent hybrid ones (HQ), that use and build upon some of the ideas presented here. Most of these newer implementations and protocols also scale well.
Aside: How does one go about determining "f"? One could over-estimate "f" (and hence provision 3f+1 replicas) and run the risk of performance degradation (as the paper notes, additional replicas can degrade performance, since more and bigger messages are exchanged), or under-estimate "f" and run the risk of incorrect computations, lost safety guarantees, and worse.
Posted by: Rohit | February 17, 2011 05:47 AM
Summary
The authors of the paper present a practical algorithm for state machine replication offering liveness to a client in light of Byzantine failures and network asynchrony.
Problem
In a world that is becoming more connected, more and more client services are moving online. An unfortunate side effect of this shift is an increase in malicious attacks, alongside arbitrary failures due to the asynchronous and unreliable nature of the Internet. Consequently, to improve reliability and guarantee liveness of a service, service providers will be required to utilize replicated services capable of withstanding Byzantine faults.
Contributions
The paper makes several contributions to algorithm design for state machine replication. Because most prior work on replicated state machines focused on the theoretical feasibility of solving the problem, most of those algorithms proved impractical to use. As such, the foremost contribution made by the authors is a practical algorithm for state machine replication capable of withstanding the asynchrony of the Internet. Additionally, the authors offer several optimizations utilized within the system to improve performance, such as message reduction and cheaper authentication, without reducing correctness or security. One final contribution is an implemented prototype of their algorithm, in the form of a Byzantine fault-tolerant NFS (BFS), used to evaluate the algorithm's performance and overhead.
Flaws
One possible point of contention within the paper is the authors' requirement of independent node failures. While most of their suggestions, such as multiple administrators and unique passwords, are feasible and easily implemented, the suggestion of N-version programming of services to reduce correlated failures is a point of concern. It is very likely that most firms will not undertake the high cost and extra hours of developing independent versions of a piece of software (with the exception of life-critical portions such as flight control software) in order to eliminate a single point of failure. Furthermore, many researchers believe that in spite of independent development, teams are likely to make similar mistakes in difficult sections of a program, resulting in "simultaneous" failures.
Applicability
Based on the performance results of the BFS, the algorithm offers low overhead and consequently minimal impact on overall system performance. These are extremely promising and attractive results which suggest the practicality of Byzantine Fault Tolerant Solutions (BFTS). However, as mentioned in the Flaws section, the practical Byzantine Fault Tolerant Algorithm is unable to cope with most software errors, limiting the efficacy to hardware failures and N-version program implementations.
Posted by: Greig Hazell | February 17, 2011 07:06 AM
This paper mainly discusses a practical replication algorithm that can tolerate Byzantine failures of at most f nodes, given that the system has at least 3f+1 replicas, while maintaining two system invariants, namely safety and liveness. To that end, the paper also discusses the overhead incurred when the algorithm is implemented as a library.
In today’s world, where people rely more and more on online storage, it is important for online services to be fault tolerant. These services face three types of faults: 1) hardware failures, 2) software failures, and 3) malicious attacks. In the present state of the art it is possible to build highly reliable hardware, but software is becoming more complex and error prone, and malicious attacks are becoming worse in their extent and impact. Such faults can be termed Byzantine failures. Previous approaches to tolerating Byzantine failures had two limitations: 1) they required synchrony in the system, and 2) they were too inefficient in practice. In this paper Miguel and Barbara present a practical replication algorithm that overcomes these limitations. The main assumption they make is that failures are independent (encouraged, for instance, by running different versions of the code so that the probability of systems failing at the same time due to the same bug is minimized).
Important contributions of the paper:
1) A description of the state machine replication protocol: a) the concept of a view; b) the algorithm's control flow: the client requests a service from the primary, the primary multicasts the request to the replicas, and the replicas send replies back to the client; c) the use of a three-phase protocol (pre-prepare, prepare, and commit) that guarantees that all non-faulty nodes agree on a single total order for the execution of requests, by assigning sequence numbers to requests.
2) The replication algorithm provides both safety (operations performed by clients are observed in a consistent way by non-faulty nodes) and liveness (clients eventually receive replies).
3) The concept of view changes to maintain liveness in the system, while at the same time restricting the rate of view changes, which could otherwise impede the progress of the system.
4) Optimizations in communication and security: a) batching requests at the primary; b) tentative execution (replying before the commit phase completes); c) having only one replica send the full reply, with the rest sending digests; d) reducing the number of message delays; e) improving read-only performance by having the client broadcast read-only requests directly to all replicas.
5) The use of MACs in place of digital signatures where possible.
6) The use of stable checkpoints to synchronise the state of the system; during checkpoint transfer, only updated information is sent rather than everything.
7) A runtime library for the client and replica sides, along with its runtime evaluation. It turns out that the algorithm incurs at most 3% overhead compared to NFS without replication.
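The MAC-based authentication in point 5 can be illustrated with a small sketch. The helper names are hypothetical, and the paper's actual "authenticators" also involve session-key establishment, which is omitted here; the point is simply that one MAC per recipient, over pairwise shared keys, replaces a single expensive public-key signature:

```python
import hashlib
import hmac

# Instead of one digital signature, the sender appends one MAC per replica,
# each computed with the secret key it shares pairwise with that replica.

def make_authenticator(message, shared_keys):
    """shared_keys: dict mapping replica id -> shared secret (bytes)."""
    return {rid: hmac.new(key, message, hashlib.sha256).digest()
            for rid, key in shared_keys.items()}

def verify_mac(message, key, tag):
    """Each recipient checks only its own MAC entry."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

Computing a vector of MACs like this is orders of magnitude cheaper than a public-key signature, which is a large part of why the protocol is fast in practice.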
Flaws:
I don’t see any major flaws in the system. Even though a lot of messages pass around the system, which can limit scalability, they are necessary for the proper functioning of the algorithm, and the authors optimise every phase they can. The algorithm processes thousands of requests per second with sub-millisecond increases in latency.
I am a bit skeptical of their assumptions: limiting the number of faulty nodes in the system and requiring failures to be independent. Do real systems follow these assumptions?
Application:
Wikipedia states that "PBFT triggered a renaissance in BFT replication research, with protocols like Q/U, HQ, and Zyzzyva working to lower costs and improve performance and protocols like Aardvark working to improve robustness." I think some of the concepts presented here, like the three-phase protocol, can be used to synchronise events in transaction processing and file system operations. Incremental cryptography and the use of MACs in place of digital signatures can be applied to other scenarios where the overhead of digital signatures should be avoided.
Posted by: Pratima Kolan | February 17, 2011 07:32 AM
Summary: This paper describes an algorithm that provides fault tolerance against byzantine failures. Unlike previous work this algorithm can work in asynchronous environments like the internet. The performance of the algorithm is shown to be better than previous work by use of several optimizations described in the paper. The authors demonstrate the applicability by implementing the algorithm in NFS.
Problem Statement: Replicated state machines are used in cloud-like environments to safeguard against malicious attacks or software faults that tend to push the system into an inconsistent state. The algorithms developed previously have been either purely theoretical or, where practical, dependent on synchrony assumptions. This paper tries to provide a more practical solution with stronger safety and liveness than previous solutions.
Contributions: The algorithm described offers safety and liveness provided at most f = floor((R-1)/3) replicas are faulty, where R is the total number of replicas. The replicated state machine assumes that the system's operations are deterministic and that the replicas start in the same state. The algorithm achieves its safety claim by driving all non-faulty replicas to a total order on the execution of incoming requests. It achieves liveness by moving the replicas through a succession of configurations called views. In a view, there is one primary and all others are backups. Clients send requests to the primary. The primary then multicasts a pre-prepare message to all the backups and logs its message. After verifying the pre-prepare message, the backups send each other prepare messages. A replica is considered prepared once it has received matching prepare messages from 2f different replicas. Once prepared, the replicas initiate the commit phase and send commit messages to all replicas. After the commit phase, the replicas execute the request and send the results back to the client, which waits for f+1 identical replies before accepting the result. The use of sequence numbers, view numbers, and message digests makes the algorithm practical in asynchronous environments. Optimizations like shorter protocol messages, tentative execution, and the use of MACs rather than costly digital signatures make the algorithm faster in the common case.
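The succession of views described above rotates the primary deterministically, so a faulty primary is eventually skipped as the view number advances. A tiny sketch of that rule (illustrative, not the authors' code; the function name is invented):

```python
# In view v, the primary is the replica with id v mod R, where R is the
# number of replicas. A backup that times out waiting for a request to
# execute broadcasts a view-change message; once 2f + 1 replicas agree,
# the system moves to view v + 1 under the next primary in rotation.

def primary_for_view(view, num_replicas):
    """Deterministic primary assignment for a given view number."""
    return view % num_replicas
```

Because every replica computes the same mapping locally, no extra agreement is needed to pick the new primary after a view change.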
Flaws: The algorithm works only for independent failures, and the example the paper gives to clarify what independent failure means is quite naïve. I cannot imagine why a software solution would want a low degree of replication while also running different service code and operating systems on each replica. Aren’t non-independent failures more serious than independent ones? I would like to know how they have been dealt with in research so far.
Applicability: This algorithm seems highly applicable to cloud-based services, which are becoming more and more common. I wonder at what level in the software stack these techniques are implemented in real services. Maybe that is not so important, but it would give me a clearer picture of how these things work.
Posted by: Paras Doshi | February 17, 2011 08:36 AM
Summary:
In this paper the authors implement a practical Byzantine fault tolerant system. Their proposed algorithm works in asynchronous environments and is faster than previous approaches. Specifically, their implementation of NFS runs only 3% slower than unreplicated NFS. The guarantees in the paper hold only if at most floor((n-1)/3) of the replicas are faulty.
The algorithm is a form of state machine replication and guarantees safety and liveness. The state is established and monitored through the messages exchanged by the algorithm.
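On the client side, the safety guarantee rests on accepting a result only once f+1 replicas agree, since at most f of them can be faulty. A minimal sketch of that rule (hypothetical function, for illustration only):

```python
from collections import Counter

# The client collects (replica_id, result) replies and accepts a result
# as soon as f + 1 replicas report the same value; with at most f faulty
# replicas, at least one of those f + 1 must be correct.

def accept_result(replies, f):
    """Return the agreed-upon result, or None if no quorum yet."""
    counts = Counter(result for _, result in replies)
    for result, votes in counts.items():
        if votes >= f + 1:
            return result
    return None
```

Note that f+1 matching replies suffice for the client even though the replicas themselves need the larger 2f+1 quorums internally; the client only needs one correct replica among the matching set.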
Problem Statement:
The problem they are trying to solve is that the state-of-the-art solutions of the time for Byzantine fault tolerance are not practical: they either make unreasonable assumptions about the environment in which the protocol will run (e.g., synchrony), or are prohibitively slow.
Contributions:
Flaws of the paper
I am not sure if the following weaknesses can have a feasible solution or not, however, it seems reasonable to me that they should be addressed if a system wants to be practical.
Applications
Availability is very important for many services today. The fact that you are able to continue to provide service even though many of your nodes are down is in some cases critical (ex. for an emergency service provider) or is strongly required to remain competitive in the market (ex. Web search service providers).
For the same reasons, performance is very important: you should be able to respond quickly both when operating normally and when some nodes are faulty, so low overhead matters for these systems. Most companies and critical service providers currently have some sort of replication in order to mitigate the availability problem, but it seems there are no guarantees. If one could introduce a solution that provides guarantees such as the ones in this paper at low overhead, it would certainly be used.
I am not sure if there currently exists a solution for Byzantine Fault Tolerance that is being used in practice and would be interested to know that.
Posted by: Fatemah | February 17, 2011 09:12 AM