
Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance; Miguel Castro and Barbara Liskov, OSDI '99

Review due Thursday, 2/23.

Comments

As the title states, the authors aim for a practical system that implements state-machine replication with Byzantine fault tolerance. So they do not merely prove theoretical correctness; they also evaluate the algorithm with an implementation and spend time on optimizations that speed that implementation up.

In their model, the authors assume an unreliable network that can drop, delay, or reorder messages, plus a limited form of asynchrony: nodes and messages can be delayed, but the delay is bounded. The presence of such a bound lets the algorithm sidestep the impossibility results for consensus in fully asynchronous systems. These are both reasonable assumptions in modern systems, where oversight and repair can bound how long a node stays down.

Their system uses three main measures to limit the adversary's ability to do damage. First, all messages are authenticated (signed or MACed) to prevent impersonation. Second, changes require agreement from a quorum of 2f+1 replicas before being accepted (where the system has 3f+1 nodes and at most f of them can be faulty). Third, while one primary node is selected to direct operations, every node observes its behavior and can initiate replacement proceedings if it becomes suspicious.
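To make the quorum arithmetic concrete, here is a minimal sketch (my own illustration, not the authors' code) of why 2f+1 out of 3f+1 works: any two quorums of size 2f+1 overlap in at least f+1 replicas, so they always share at least one correct replica.

    # Quorum arithmetic for a system of n = 3f + 1 replicas (illustrative only).
    def quorum_sizes(f):
        n = 3 * f + 1              # total replicas needed to tolerate f faults
        quorum = 2 * f + 1         # matching messages needed before a change is accepted
        overlap = 2 * quorum - n   # minimum intersection of any two quorums: f + 1
        return n, quorum, overlap

    for f in range(1, 4):
        n, q, o = quorum_sizes(f)
        # the overlap always exceeds f, so at least one shared replica is correct
        print(f"f={f}: n={n}, quorum={q}, any two quorums share >= {o} replicas")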

The final protocol is quite simple and clean. It does require some intelligence on the part of the client (in particular, knowing the value of f). The few operations used can each be optimized separately. Thus, while the cryptography overhead is a significant factor in the delay, it can be isolated and in fact does not dominate. Their measurements show that the replicated service takes only about 1.03 times as long as standard NFS, a difference small enough that the protocol is definitely feasible for industry applications. The authors attribute the gain largely to relying on replication for stability instead of the synchronous disk writes NFS must perform.

There are three things I wish the authors had included in their report. First, there is no indication of how well the system performs when scaled up, or how far it can be scaled. By selecting a single-primary/multiple-backup model, an intrinsic limit on system size has been set. Can the protocol be changed to allow multiple primaries? Second, how well does it perform in the presence of a fault? I understand that Byzantine faults aren't easy to simulate, but when comparing two different fault-tolerant systems I need to know how the performance of each suffers when it thinks a fault is occurring. Given this system's slow detection mechanism and incremental response, I expect the delay when a fault is suspected to be quite high. Third, it was not clear to me after reading how machines recover from delayed messages that leave them out of a round of decision making. This last may be because I missed something in the article and not through any fault of the authors.

I feel there ought to be more flaws in this protocol, mostly because Byzantine fault tolerance is a hard problem and this paper claims to have a simple solution that doesn't demand much from its underlying system model.

This paper describes an algorithm for a replicated system that can tolerate Byzantine faults (i.e., some nodes of the system behaving arbitrarily). The algorithm offers liveness and safety (meaning the results are the same as those of a centralized system executing the commands atomically) provided at most floor((n-1)/3) out of the n total replicas are simultaneously faulty. The paper also describes an implementation of a Byzantine-fault-tolerant file system (BFS), a modified version of NFS, built on the algorithm. The experiments on that system show BFS having a 3% average and 35% maximum overhead over normal NFS on an emulated software-development workload.

Unlike previous systems we have read about, this algorithm is developed under a system model that allows malicious attacks, including an adversary that coordinates multiple faulty nodes or delays communication. It also makes few assumptions about the underlying network and does not require requests to be delivered in order. I think it is a great contribution to develop an algorithm, and a sample implementation with acceptable performance, under this kind of adverse system model. In addition, the algorithm allows messages to be authenticated using message authentication codes (MACs), which can be computed three orders of magnitude faster than digital signatures. Although using MACs becomes slower as the number of replicas increases, the authors made a choice that works well for a reasonable number of replicas.

This system is the opposite of Amazon's Dynamo in terms of consistency and availability. For every request it requires at least 2f+1 nodes to be available, where f is the number of faulty nodes the system can tolerate. The higher the value of f, the more faulty nodes the system can tolerate; this strengthens consistency, but it also makes a request more likely to fail in case of a network partition or failures. In addition, the algorithm requires frequent communication between replicas, so adding more replicas without increasing f, purely for availability, would degrade performance.

The experiments only show measurements for the normal case where there are no faulty nodes. It would also be good to see measurements for cases with different numbers of faulty nodes. The paper does not cover the recovery of faulty nodes, but it is important for them to be recovered in a relatively short time; otherwise the resubmitted failed requests could reduce the throughput of the whole system. I think this system is good for cases where safety and consistency are important. However, for services that require high availability and have geographically diverse server locations, it might not be suitable, for the reasons in the previous paragraph.

Practical Byzantine Fault Tolerance

Castro and Liskov present an interesting implementation of a replicated NFS service that is Byzantine fault tolerant. It's a valuable paper as it points out many areas for future study, practical secure key exchange being one notable example. It also shows that the benefits of replication can be had for a very small performance overhead.

One of the first things that popped out about this paper was that it claims to support no more than floor((n-1)/3) faulty nodes. According to the original Byzantine generals paper, with unforgeable (signed) messages the solution can survive up to m traitors (Theorem 2). It appears that the reason they require more replicas for every possible traitor is that they are not using public-key signatures on every message (Section 5.2).

If you do use public keys, the implementation details are a little unclear. The public key for every server could be stored with the client, but that seems impractical, as adding or removing a server would require maintenance at the client. Common key-exchange algorithms like Diffie-Hellman do not address the problem of being sure of the actual identity of the other party. And if you are unsure of the identity of the replica being talked to, more impracticalities come up in how to connect to a replica at all. If the client connects to a replica via a DNS name, what is to prevent a traitor from hijacking the DNS system and directing all client messages to a traitor? Even if that were fixed, the client could still be misdirected by an attack like ARP spoofing.

In short, the implementation presented is perhaps resistant to a naive traitor but not a more artful one.

A great positive of this design is that the algorithm ensures the effects of access-revocation operations are observed consistently by all clients. The lack of consistency in the treatment of ACLs was a major complaint about Grapevine.

The third optimization listed (Section 5.2) is for read-only operations. Are there really any read-only operations? Even reading data from a file ends up altering the file's atime.

There seems to be some interplay between settings that was not discussed. Section 4.2 says that the primary starts the protocol for a request immediately unless the number of messages for which the protocol is already in progress exceeds a maximum. Then in Section 4.3 they mention that k in H = h + k has to be chosen so that the system does not have to wait for a checkpoint to become stable. Is this the reasoning behind the maximum from Section 4.2? It seems that using a hard-coded window size might not be appropriate; some adaptive windowing scheme like the one used in TCP would be interesting to pursue.
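As a concrete sketch of the window being discussed (the names and the constant are my own, not the paper's code): a replica only accepts messages whose sequence number lies between the low watermark h, the sequence number of the last stable checkpoint, and the high watermark H = h + k.

    # Illustrative watermark check; K is a hypothetical hard-coded window size.
    K = 128

    def in_window(seq, h, k=K):
        """Accept a message only if h < seq <= H, where H = h + k."""
        return h < seq <= h + k

An adaptive scheme would simply replace the constant K with a value adjusted at runtime, much as TCP adjusts its congestion window.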

In this paper, the authors propose a replication algorithm that is able to tolerate Byzantine faults, and the algorithm's performance is tuned so it runs efficiently on real systems.

In the replicated state machine system, one copy is the primary and the others are backups. The primary accepts a request from the client and distributes it to all the replicas; after three phases of processing and communication, the replicas send their replies directly to the client. The client waits for a specified number of matching replies before continuing. All the communication between replicas, and the number of replies the client has to receive, are designed to make sure the system generates consistent and correct results in the presence of no more than floor((n-1)/3) faulty replicas, where n is the total number of replicas.
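As a sketch of that client-side rule (illustrative names, not BFS code), the client can accept a result once f + 1 replicas report the same value, since at most f replicas are faulty and therefore at least one of the matching replies comes from a correct replica.

    from collections import Counter

    def accept_reply(replies, f):
        """Return a result once f + 1 matching replies have arrived, else None."""
        counts = Counter(replies)
        for result, votes in counts.items():
            if votes >= f + 1:      # at least one of these replicas must be correct
                return result
        return None                 # keep waiting, or retransmit the request

    assert accept_reply(["ok", "ok", "lie"], f=1) == "ok"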

If the primary replica fails, the backups detect it by timeout and send a view-change request to all the replicas. Once a specified number of replicas agree, the view is changed and a new primary is selected to continue the work.
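A tiny sketch of that mechanism as I read it (illustrative, not the authors' code): the primary for view v is simply the replica with id v mod n, so after a timeout every replica moves to view v + 1 and a deterministic new primary takes over once enough view-change messages have been gathered.

    # Illustrative view-change bookkeeping.
    def primary_for_view(view, n):
        return view % n                      # deterministic primary assignment

    def ready_for_new_view(view_change_votes, f):
        # roughly 2f matching view-change messages from other replicas are
        # needed before the new primary can announce the new view
        return view_change_votes >= 2 * f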

The purpose of the system in this paper is to work in the presence of many faulty nodes, whether controlled by attackers or failing on their own for some reason. So each component of the system is suspicious of the messages it receives and tries to collect enough matching messages from different replicas to confirm each one. This is how the authors handle Byzantine failures. To make the system work in practice, the authors did a lot of work to reduce the communication overhead: they use MACs for authentication instead of the more costly RSA signatures, and they send a digest instead of the whole request. But I think that whatever optimizations they make, the system itself has too much communication overhead, which will definitely limit its scalability. In designing a system, I think we should not integrate the maintenance framework into the real work of the system. If we want to stop attackers, we should do something outside the system instead of inside it; if we want to detect random failures, we should detect them directly instead of doing so at the cost of losing trust in all the replicas. Nevertheless, I think it can work very well when the number of replicas is limited.
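The two cost savings mentioned can be illustrated with standard library primitives (this is my own sketch, not the paper's implementation, and the hash algorithm here is an arbitrary choice): a per-receiver MAC computed over the message with a shared session key, and a fixed-size digest standing in for a large request body in protocol messages.

    import hashlib, hmac

    def authenticator(message, session_keys):
        """One MAC per receiving replica, keyed with the shared session key."""
        return {replica: hmac.new(key, message, hashlib.sha256).digest()
                for replica, key in session_keys.items()}

    def request_digest(request):
        """A fixed-size digest that protocol messages carry instead of the full request body."""
        return hashlib.sha256(request).digest()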

They also implemented a Byzantine-fault-tolerant file system (BFS) and tested it with a micro-benchmark and the Andrew benchmark. Based on their comparison, the overall performance of BFS is only 3% worse than standard NFS. I expected the performance to be worse because of the communication overhead; I think the small gap is because the replicas run on the same local network, so the network latency is very low compared to a real distributed deployment. Also, if the workload were heavier, the performance would likely be much worse.

In conclusion, I think the algorithm in this paper is a good choice when fault tolerance is very important for the system.
