
Why Do Computers Stop and What Can Be Done About It?

Jim Gray. Why Do Computers Stop and What Can Be Done About It? Tandem Tech Report TR-85.7, June 1985.

Reviews due Thursday, 4/12.

Comments

Summary:

The paper discusses some of the causes of computer system failures and identifies the ones with the most impact. Fault-tolerant hardware can provide very high availability; it is software, operators, and system misconfiguration that cause most outages. Techniques to improve software reliability are discussed so that the MTBF (mean time between failures) or MTTR (mean time to repair) improves.

Problem addressed:

Conventional systems at the time had an MTBF of two weeks, which was unacceptable for many applications. Fault-tolerant systems were an order of magnitude better, and studying them could provide hints for better system design, whether by improving the MTBF or the MTTR.

Contributions:

The paper identifies that hardware can be relied upon given the state-of-the-art fault-tolerance techniques of the day, and hardware keeps getting better. Any further reliability gains in hardware are offset by the flakiness of other system components.
The paper makes a convincing argument that software systems are becoming the primary cause of failures. Software systems continue to grow at a prolific rate, with a lot of code added each day, and bugs are hard to guard against. The author presents lessons learned from effective hardware fault-tolerance techniques that can be applied to software.
The paper also identifies that most bugs are "Heisenbugs" - non-repeatable - and hence can be tackled by process pairs, modularity, and fail-fast techniques (a retry sketch follows these notes).
The author points out that system misconfigurations, operator errors, etc. are subject to the diligence of human operators, and the main improvement that can be made here is automation.
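
To make the Heisenbug point concrete, here is a minimal sketch (not from the paper; the function names and failure probability are invented) of the fail-fast-and-retry idea: because a Heisenbug depends on transient state, failing fast and retrying from a clean state will usually succeed.

    import random

    class TransientFault(Exception):
        """Stands in for a Heisenbug: a failure tied to transient state."""

    def flaky_operation():
        # Hypothetical operation that occasionally trips a Heisenbug.
        if random.random() < 0.3:
            raise TransientFault("timing-dependent failure")
        return "ok"

    def fail_fast_retry(op, attempts=3):
        # Fail fast, discard state, and retry; a Heisenbug that depended
        # on transient state is unlikely to recur on a clean retry.
        for _ in range(attempts):
            try:
                return op()
            except TransientFault:
                continue  # state discarded; try again afresh
        raise RuntimeError("failure persisted: likely a Bohrbug")

    print(fail_fast_retry(flaky_operation))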

Flaws:

The empirical data presented here is not as formal as in most papers we have read earlier. A lot of assumptions are made, and everything appears approximate.
The whole study thus appears informal and subjective; still, the ideas presented make intuitive sense.
Treating everything as a transaction is a difficult design choice, and, as the author points out, the kernel/OS is hard to program in that manner.

Performance:

The main goal of the paper is to improve reliability by identifying causes of failures and the measures that can be taken against them: either increase the MTBF via replication, checksums, etc., or decrease the MTTR through a host of techniques. The relationship between the two is sketched below.
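
To see why both levers matter, the standard relation availability = MTBF / (MTBF + MTTR) can be worked through with illustrative numbers (the figures below are for illustration, not from the paper):

    # Availability = MTBF / (MTBF + MTTR); numbers are illustrative.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Two weeks between failures, 90-minute repair:
    print(availability(14 * 24, 1.5))          # ~0.9956
    # Same MTBF, but MTTR cut to ten seconds by instant fail-over:
    print(availability(14 * 24, 10 / 3600.0))  # ~0.999992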

SUMMARY
In "Why do computers stop and what can be done about it?" Jim Gray analyzes causes and anatomy of computer failures, and presents a range of techniques to create systems with extremely long MTBF.


PROBLEM
Computer failures are intolerable in certain environments. Given that software, hardware, and people will inevitably fail at some point, how should systems be designed to maximize their MTBF?


CONTRIBUTIONS
* anatomy of a system failure: how hardware, software and users react to failure

* extremely fast MTTR can hide failures from users

* redundancy should be used in conjunction with modularity

* design criteria for fault-tolerant hardware (modular such that each module: has MTBF in excess of a year, is fail-fast, has detectable failures, redundant with fail-over)

* observation that most failures are caused by software and administration

* different approach to maintenance of different components (fix hardware bugs ASAP, avoid software updates until they are necessary)

* fault-tolerant execution: modularity, containment, process-pairs, transactions (a process-pair sketch follows this list)
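
As a rough illustration of the process-pair item above, here is a minimal sketch under simplifying assumptions: threads stand in for the two processes, and the "heartbeat" is just a timestamp the backup watches.

    import threading, time

    class ProcessPair:
        # Threads stand in for a real primary/backup process pair.
        def __init__(self, timeout=1.0):
            self.timeout = timeout             # how long silence is tolerated
            self.last_beat = time.monotonic()

        def primary(self, beats):
            for _ in range(beats):
                self.last_beat = time.monotonic()  # "I'm alive"
                time.sleep(0.2)
            # primary stops beating here: a simulated fail-fast crash

        def backup(self):
            while time.monotonic() - self.last_beat < self.timeout:
                time.sleep(0.1)                # poll the heartbeat
            print("backup: primary went silent, taking over")

    pair = ProcessPair()
    threading.Thread(target=pair.primary, args=(5,)).start()
    pair.backup()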


FLAWS
* a lot of guesses without much justification

* I disagree that hardware necessarily gets better with time. If increasing software complexity causes more bugs, increasing hardware complexity should also result in worse reliability.

* I disagree with the notion that self-configured systems with minimal maintenance are a practical panacea for system instability. Such systems may become too complex (hence bugs in vendor code) or inflexible (users can't do everything they want).


PERFORMANCE
reliability, availability, MTBF, MTTR

Summary
This paper analyzes faults on the Tandem system and discusses software fault tolerance and various solutions for dealing with faults.

Problem
The problem this paper attempts to address is that the level of failures typically experienced in transaction systems, and the downtime associated with those faults (90 minutes every 10 days), is unacceptable and intolerably interrupts customers' operations.

Contributions
* Observation that operator and software faults, rather than hardware faults, are responsible for the majority of failures.
* Categorization of faults.
* Proposal and analysis of mechanisms for dealing with these faults: software modularity through processes and messages, fault containment through fail-fast software modules (sketched below), the assertion that most software bugs are soft in that they go away when you look at them, and process-pairs for fault-tolerant execution.
* Recommendation of different strategies for dealing with hardware and software bugs.
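
The fail-fast idea in the list above can be sketched in a few lines (the bank-account invariant is invented for illustration): a module checks its inputs and invariants and halts immediately rather than limping along with corrupt state.

    def withdraw(balance, amount):
        # Fail-fast module: validate invariants up front and stop
        # immediately rather than continue with corrupted state.
        assert amount >= 0, "fail fast: negative withdrawal"
        new_balance = balance - amount
        assert new_balance >= 0, "fail fast: overdraft would corrupt state"
        return new_balance

    print(withdraw(100, 30))  # 70
    # withdraw(100, 200)      # halts at once with an AssertionError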

Flaws
* The author made too many assumptions and did not adequately define failure.
* Counting bugs per line of code may not be the correct way to categorize bugs.

Relevance
In spite of the flaws in the paper, the message is still relevant today.

Summary:
This paper analyzes the main causes of failure in a commercial system, based on some studies and observations, and points out that software failure is the critical issue. The author then describes a few approaches to reducing system downtime, such as persistent process pairs and transactions.

Problem Addressed:
Users' availability requirements for systems kept rising, to the point that a system should, from the users' point of view, effectively never fail (99% uptime is too low). Improving availability to that level required a well-organized analysis of commercial system failures, plus straightforward and powerful solutions that do not require system administrators or programmers to do more complex things.

Contributions:
The analysis that clearly points out the real major causes of failure is valuable for figuring out a new direction of improvement toward better availability.
Combining and organizing existing mechanisms for system improvement keeps the solution simple and convincing, even though the description stays at a high level.
The simple and straightforward classification of failures, solutions, and observations is one of the strengths of this paper; on the other hand, it needs to be correct, and it would benefit from more supporting evidence or proof.

Possible Improvements:
As an analysis of the Tandem system, this is a good observation and analysis, but it would have been much better if the author had also examined some other system for comparison, or checked whether the solutions could be applied there too.
Even though this is a high-level, conceptual paper, it could be more powerful with some experiments and results using the solutions it describes.

Performance:
This paper clarifies the directions existing systems should take to reach the next level of reliability and availability. Collecting, organizing, and applying concepts such as modularity and transactions to create simple and convincing solutions makes it easy to imagine the next step from the point of view of system administrators and programmers.

Summary
The article analyzes causes of failure in a fault-tolerant Tandem system, and then summarizes techniques to avoid the most common failures.

Problem
No one has researched why fault-tolerant systems fail. By understanding why they fail, these systems can be improved to further increase fault tolerance.

Contributions
In order to have an MTBF in the range of years, components must be modular and fail-fast. The article begins by observing how these techniques have been successfully applied to hardware design to increase MTBF greatly.

The author then analyzes the causes of failures in a recent Tandem system (with an MTBF of 7.8 years). Surprisingly, administration and software account for many more failures than physical problems do.

Most of the failures due to software were caused by bugs that are tripped only infrequently. Since a specific state is required to crash the process, the author proposes using a pair of processes to decrease the MTTR for this most common class of bug. Simply switching processes when one faults can leave the system in a bad state, so transactions are proposed to ensure the system is never left inconsistent (a sketch follows).
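
A minimal sketch of that transaction idea, with an invented undo log and account store (this is an illustration, not Tandem's implementation): every write records how to reverse itself, so a failure mid-update can be rolled back to a consistent state.

    class Transaction:
        # Minimal undo-log transaction: on failure, roll back so the
        # system is never left in an inconsistent intermediate state.
        def __init__(self):
            self.undo_log = []

        def write(self, store, key, value):
            self.undo_log.append((store, key, store.get(key)))  # how to undo
            store[key] = value

        def abort(self):
            for store, key, old in reversed(self.undo_log):
                if old is None:
                    del store[key]
                else:
                    store[key] = old

    accounts = {"a": 100, "b": 50}
    t = Transaction()
    t.write(accounts, "a", 70)  # debit one account...
    t.abort()                   # ...simulated crash before the credit
    print(accounts)             # {'a': 100, 'b': 50}: consistent again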

Possible Improvements
The paper could have benefited from a bit more precision. Failure is never formally defined in the article; from the dump/restart discussion at the beginning, I assumed that any unplanned restart counted as a failure. Without a precise definition of failure, the failure statistics are not as compelling. In many cases the author also uses estimates without much justification beyond his experience, such as the claim that only 50% of system failures are reported.

Many benefits of reliability techniques are discussed, but the costs are not mentioned. Without an idea of how much overhead is introduced by a new algorithm it is hard to analyze if the new idea is worthwhile. Even a rough estimate of overheads introduced would have been nice.

Performance
Without doubt this article focused on improving system reliability.

Summary:
The paper discusses the different types of failures that happen in a system (a fault-tolerant system, to be precise) and various solutions for improving the reliability and availability of the system.

Problem:
Though there have been many studies of system failures, most of them focused on conventional systems. Gray's study of commercial fault-tolerant systems showed that failures are much less frequent there (as expected), but the ratio among the different types of failures remains the same: software faults and administrative errors far outnumber hardware failures. In this paper, Gray takes a cue from hardware and proposes modularity and redundancy in software as a way to improve software reliability.

Contributions:
- Using proven concepts from another domain to solve a problem in software - in this case, the modularity and redundancy used in hardware to improve reliability.
- Proposal and analysis of different solutions for fault tolerance in software - lockstep, different types of checkpointing (sketched below), and process pairs.
- Re-use of the concepts of 'transactions' and 'sessions' from the database domain for fault tolerance - in execution, communication, and storage.
- Categorization of faults on a commercial system, using real data from the field, to prove his point that software and administrative faults are the main contributors to system failures.
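
One of the checkpointing variants mentioned above can be sketched as periodic state checkpointing (all names and the checkpoint interval are invented): the primary saves its state at intervals, and after a crash the backup resumes from the last checkpoint rather than from scratch.

    import copy

    class CheckpointedWorker:
        # Primary checkpoints its state periodically; after a crash the
        # backup resumes from the last checkpoint instead of from zero.
        def __init__(self):
            self.state = {"processed": 0}
            self.checkpoint = copy.deepcopy(self.state)

        def work(self, items, crash_at=None):
            for i in range(items):
                if i == crash_at:
                    raise RuntimeError("simulated primary crash")
                self.state["processed"] += 1
                if self.state["processed"] % 10 == 0:
                    self.checkpoint = copy.deepcopy(self.state)

        def take_over(self):
            self.state = copy.deepcopy(self.checkpoint)  # backup resumes
            return self.state

    w = CheckpointedWorker()
    try:
        w.work(25, crash_at=23)
    except RuntimeError:
        print("resuming from", w.take_over())  # {'processed': 20}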


Flaws:
- Overall, the paper stays at a very conceptual level. No real details are provided on the design and implementation of any of the solutions it proposes.
- It is not clear how the concept of a 'transaction' can be applied to all situations. Some activities cannot be reversed - e.g., a print spooler cannot do anything about the pages it has printed, and a CD burner cannot do anything about a partially written CD.
- The paper is based on just one type of system (a commercial system made by Tandem Computers). It is not a good idea to act based on just one type of application running on one type of system.
- The study completely discounts failures in 'infant' systems. There could be much more behind such failures than 'faulty components' - including installation problems and ease of use (how well operators learn the system). Though 'infant' failures are a third of all failures, the study seems to avoid them completely.
- The paper mentions support from the kernel/OS in several places. I am not clear on why the solutions in the paper would really require kernel support (beyond the facilities any general-purpose OS provides).

Relevance:
The ideas in the paper are quite relevant and would definitely improve software reliability (at extra cost and, possibly, performance). I would imagine that it is possible to create highly reliable systems based on these concepts. However, given the lack of implementation details in the paper, it is hard to say whether a general-purpose, fault-tolerant system could be built this way.

Summary

In this paper, the author analyzes the various factors that can cause system failures in commercially available fault-tolerant systems; roughly 25% of failures are claimed to be due to software problems. Toward the end, the author presents various software fault-tolerance techniques that can reduce the system failures caused by software bugs.

Problem Description

One obvious solution in case of a large system failure is to restart the system, but it takes around 90 minutes after a restart for the system to reach a stable state. This 90-minute outage is not acceptable given that it can occur once every 10 days. To reduce the frequency of these outages, the author suggests using modularity and redundancy when developing software, in order to make the software fault-tolerant.

Summary of Contributions

Some of the contributions of the paper are as follows.

1. The author classifies system failures into four major categories: failures due to administration, software, hardware, and environment. By doing so, the author gives a very clear picture of how frequently system failures occur because of each of these factors.
2. The paper also discusses a model of software failure, the Bohrbug/Heisenbug hypothesis. Using this model, the author shows that most bugs are Heisenbugs that go away once you look at them (see the sketch after this list). The model is useful because it suggests that more focus should be put on handling Heisenbugs at runtime.
3. The author also discusses various techniques for developing fault-tolerant software, e.g., process pairs, transactions for data integrity, and fault-tolerant communications. These methods are useful because they yield fault-tolerant software, which reduces the frequency of system failures.
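
As an aside, the classic demonstration of a Heisenbug (not from the paper) is an unsynchronized read-modify-write: the lost updates below depend entirely on thread timing, so the count varies from run to run, and the bug tends to vanish under a debugger's serialization.

    import threading

    counter = 0

    def increment(n):
        global counter
        for _ in range(n):
            tmp = counter       # read
            counter = tmp + 1   # write: a lost update occurs if another
                                # thread wrote in between (timing-dependent)

    threads = [threading.Thread(target=increment, args=(100000,))
               for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)  # often less than 200000, and different on every run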

Flaws

One of the major flaws of this paper is the author's methodology for classifying the causes of system failures. Firstly, the author bases his results on data from only one system. Secondly, the author himself is somewhat dubious about the correctness of the data. Finally, the author makes a lot of assumptions without justifying them.

Performance

The main goal of this paper is to present a software implementation model that can make software more reliable, in order to reduce the frequency of system failures.

Summary:
This work summarizes and analyzes the major failure mechanisms in computer systems and offers suggestions on how to improve the mean time between failures (MTBF). It is shown that hardware generally causes the fewest errors, while software and administration are the larger contributors to system problems.

Problems Addressed:
The primary problem addressed is that of increasing the MTBF of a system. Redundancy has been proposed as a way to improve a system's MTBF, but according to von Neumann a very large degree of redundancy is required to make a difference. The author addresses this issue with a straightforward and simple solution.

Contributions:
An effective means of improving system fault tolerance is to combine redundancy with modularity. A system divided into smaller parts is more reliable, since multiple modules are unlikely to fail at the same time as long as they are independent. The author presents several requirements for a fault-tolerant system: 1) modularize the system, 2) design modules to have a relatively long MTBF, 3) make modules either do the right thing or fail completely, 4) make modules alert the rest of the system to a failure (heartbeat), and 5) use redundant modules that sit idle, ready to take over for a failing module. These principles apply mostly to hardware, but very similar principles apply to software, revolving around the notion of a process-pair that provides consistency and verification checking during computation. Several methods for implementing the process-pair are presented; the persistence approach is chosen, along with the use of transactions. The persistence approach is easy to program but forgets state during a failure; by utilizing transactions, the state can be restored to what it was before the fault, since the transaction knows how to undo things. Together, the two mechanisms make for a relatively simple implementation of the process-pair idea (sketched below).
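
A minimal sketch of that combination, with an invented log format: the persistent backup wakes up with empty (amnesiac) state, and redoing the committed transactions from a durable log restores what it forgot.

    def apply(state, op):
        # Apply one committed operation (key, delta) to the state.
        key, delta = op
        state[key] = state.get(key, 0) + delta

    committed_log = [("a", +100), ("b", +50), ("a", -30)]  # durable log

    # The primary fails; the persistent backup starts with empty state
    # and recovers by redoing committed transactions from the log.
    backup_state = {}
    for op in committed_log:
        apply(backup_state, op)
    print(backup_state)  # {'a': 70, 'b': 50}: state restored at takeover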

Flaw:
It is mentioned that bug fixes are often not installed right away, due to the risk of change and the attitude that since everything was working, it should continue to work. However, in an increasingly wired world with more network interconnectivity, there may be more outside forces acting on a machine that could provoke bugs within the system. This is, of course, especially important for security issues.

Performance:
Reliability is obviously a critical component of this work; it is achieved through the careful decomposition of a system into modules, after which a certain level of redundancy is implemented that increases the MTBF by orders of magnitude.

Paper Review: Why Do Computers Stop and What Can Be Done About It? [Gray]

Summary:

In this paper, Gray presents some results, admittedly somewhat
subjective, of a study on the causes of failures and reduced
availability in Tandem computer systems - ostensibly the most reliable
commercial computing systems of the time. He then presents a position,
based on those observations, that software failures are more often the
root cause of low availability, and this can be improved by applying
to software the high-availability techniques that are effective in
hardware in the form of process-pairs and transactions.

Problem:

The problem is that without much empirical evidence to motivate them,
systems are often designed without attention to fault-tolerance.
While not explicitly stated, the presumption too often is that software
can be made to be bug-free rather than acknowledging that it is more
failure-prone than hardware.

Contributions:

This paper makes a strong point that system administration
(configuration changes) and software dominate the causes of failure
in systems. The key to high-availability is tolerating operations
and software faults.

System components don't have to be perfect - you can design for
high-availability by acknowledging component failure.

Flaws:

Discussion with my fellow reviewers convinced me that the paper is flawed in that it is very loose with its notion of what a "failure" is, and the author's notion of it seems to change throughout the paper.

The paper is poorly titled: it suggests that it is about computers
stopping, but really it is about (I think) application availability,
and it cheerleads for applying fault-tolerant techniques in operating
system and application software.

Performance Impact and Relevance:

Although the commercial computing environment has changed significantly
since the time of this writing, I find everything Gray says to be
relevant today and fault-tolerant design is still a major concern.
Some of the things he suggests, such as (1) process-pairs using
the process as the protection domain, (2) using message passing, and (3) avoiding "infant" products, put constraints on performance.
That is, they run counter to performance-enhancing proposals with more
radical operating systems designs. This suggests that reliability
has a number of trade-offs that limit design and choice of systems
(operating system and hardware).