
Why Do Computers Stop and What Can Be Done About It

Jim Gray. Why Do Computers Stop and What Can Be Done About It? Tandem Technical Report TR-85.7, June 1985.

Reviews due Thursday, 11/20.

Comments

In, "Why computers stop," Gray discusses the origins and implications are of failure. As part of this discussion, he draws on the interviewed experiences of systems administrators of large cluster setups. Among other things, he determined that operator mistakes, then software, then the environment, and only then hardware was the order of the most significant contributors to machine failures.-

We gain from this a written expression of what it means for a system to be available and reliable. Intuitively, reliability is how long we can expect a system to run between failures, and availability is the fraction of time that the system is up over many such failures (thus including time to repair). Gray also points out some promising trends, the most critical being that software flaws tend to be soft. Frequent computer users know this intuitively: usually, when something goes wrong, a simple restart of the application or of the machine sets everything right again. The programmer is increasingly familiar with this kind of bug as well, where reproducing the error is frequently more difficult than fixing it. Another positive finding is that hardware tends to have a very high infant-mortality rate. This sounds bad, except that infant mortality is the best kind of fail-fast behavior one can have; the part fails before you've had a chance to rely on it for anything. Gray further posits that hardware will trend toward increased reliability, moving the problem further into highly malleable software. In software, Gray notes a few methods which we now employ together as often as possible: modularity, containment, redundancy, and recoverability. Of the five methods he suggests for recoverability, all are seen now, but the "hard-to-program" downsides have occasionally been obviated by increased modularity (such as the separation of a database from the front end of a web application).
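
To make the definitions concrete, here is a minimal sketch (my own illustration, not from the paper) of how availability follows from the two quantities Gray works with, mean time between failures (MTBF) and mean time to repair (MTTR):

    # Availability over many failure/repair cycles (illustrative numbers, not Gray's data).
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # A system that fails about every two weeks and takes ~1.5 hours to repair:
    print(availability(mtbf_hours=14 * 24, mttr_hours=1.5))   # about 0.996, i.e. ~99.6%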

What isn't promising about the outlook is that software is trending toward the more complex, and toward the more unmanageable (from a bug-finding perspective). Gray discusses the incentives (or lack thereof) for operators to update their installed software, and concludes that they are in direct conflict. On the one hand, new software introduces new bugs, and more of them, due to increased complexity. On the other hand, new software fixes the old bugs and offers more features (which I imagine might include a more resilient architecture). What's missing from Gray's discussion here is a notion of how software is built. He seems to assume a waterfall model, where stages of deployed software are designed, built, debugged, and released in repeating monolithic batches. Some newer software development methodologies place their emphasis on different points of development. XP and other agile methodologies test first and develop to satisfy those tests. This has two implications for reliability and what I will call deployability (answering "should I update my software?"): the lists of known bugs and known ex-bugs are verified at every point, and the environment is designed to tolerate incremental improvements and feature additions. The latter, especially, may mean to a systems administrator that there is a significantly stronger benefit to upgrading.

Problem Outlined
In Why Do Computers Stop and What Can Be Done About It, Jim Gray notes that an increasing share of the problems in complex systems are the result of human error rather than hardware failure. He discusses the nature of these faults and how to mitigate their effects.

The Solution
It seems like Gray feels that transactional behavior is the key to fault tolerance in software systems. The approach: have groups of programs run beside each other, dedicated to providing a service. Did one of them encounter an unexpected signal? Did its TCP session die? Did another program just write over the data it was supposed to read? That's OK: just fail the transaction associated with its task; another program in the group will take over for it.

Gray advocates bundling operations together as transactions, reducing the complexity of the failure space in an effort to help programmers isolate problems faster. He also likes the fail-fast behavior of transactions. The mentality: don't let programs limp along after a failure; instead, make them give up and alert some authority to the existence of a problem in a particular module.
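
As a small illustration of that fail-fast mentality (my own sketch, not code from the paper), a module checks everything it depends on and gives up loudly at the first sign of trouble rather than limping along:

    # Hypothetical fail-fast module: it either does its job correctly or raises
    # immediately, so a supervisor or backup process can abort the transaction
    # and take over.
    def debit(accounts, name, amount):
        if amount <= 0:
            raise ValueError("non-positive amount; refusing to guess what was meant")
        if accounts.get(name, 0) < amount:
            raise RuntimeError("insufficient funds; failing fast instead of limping along")
        accounts[name] -= amount
        return accounts[name]

    accounts = {"alice": 10}
    try:
        debit(accounts, "alice", 50)
    except RuntimeError as failure:
        print("module failed fast:", failure)   # alert some authority; let a peer retry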

Finally, Gray feels transactions help a system handle seemingly inexplicable “heisenbugs” gracefully.

What's Missing
Transactions, though useful, are not a silver bullet. Reasoning about the behavior of many concurrent transactions is often made difficult by the existence of a scheduler that retries failed transactions at a later time (an approach commonly used in these systems) or immediately instructs an already-running program to take over.

Similarly, software redundancy is also not a silver bullet. It is hard to guarantee that a backup copy of a process won't encounter the same problem that caused the first to fail.

On a related note, using transactions as an approach to dealing with heisenbugs encourages building software on top of transactional routines that are poorly understood. If you don't know the cause of the bug, how can you be sure its frequency of incidence won't increase to an unacceptable level later?

Why do computers fail and what can be done about it?

Summary

The author analyzes the failure statistics of large fault-tolerant
systems and classifies sources of failure on the basis of frequency of
occurrence. He then establishes that system administration and software
failures dominate. So he proposes and evaluates various approaches, such as
"process pairs" and "transactional communication and storage", for building
fault-tolerant, self-recovering software components.

Description of the problem being solved

The problem is to find the major sources of failures in large
systems used in availability-critical applications (like hospitals) and then
address these sources of failures to build a non-stop computer with a mean
time to failure of several years.

Contributions of the paper

1) A thorough analysis of real-world failure statistics and the
establishment of the fact that hardware components are doing better in terms of
fault tolerance but software components are not.

2) Identifying the process as a clean unit of modularity, service, fault
containment and failure, and noting that processes should be fail-fast: they
should either work or quickly fail, but not do anything in between these
two extremes.

3) Process pairs to increase availability through redundancy, inspired
by the hardware redundancy approach. This is a very practical approach to
improving existing software as compared to other approaches, like improving the
software development process, for building robust components of the future.

4) Classification of bugs as bohr-bugs and heisen-bugs is insightful.

5) Using transactions coupled with process pairs to arrive at an easy-to-program
and yet very fault-tolerant software building approach.


Flaws in the paper

1) Does not talk about process state backup for persistent process
pairs in multi-processor or multi-threaded scenarios.

2) Could have quantified the ease of building persistent process
pairs by providing details on how much effort it took them to build software
using this approach.

Techniques used to achieve performance

1) Redundancy to achieve fault tolerance: process pairs are a
combination of processes where the backup process takes over when the primary
fails.

2) Partitioning data and storing them at different geographical
locations to limit the scope of a failure.

3) Remote replication for data availability: this resembles today's cloud
storage concept.

Tradeoff made

1) Trading space and performance for fault tolerance and availability:
storing state for the backup process to recover from consumes additional space as
well as computing resources, but provides availability.

Another part of OS where this technique could be applied

1) This concept can be applied to faulty parts of modern operating
systems too. For example, Prof. Swift has worked on storing extra state to recover
from device driver failures, which occur most often.

2) Partitioning of data and redundancy are the basis of many storage
device techniques like RAID.

Summary
This paper analyzes failure data reported to Tandem Corporation by its customers as well as gives an overview of other studies at the time to suggest ways to make systems more reliable.
Problem
The paper reports that at the time, well managed systems could expect 99.6% availability, which is unacceptable to critical systems like patient monitoring, financial transaction processing and the like.
Contributions / Techniques
Notes that hardware is fairly easy to make highly available through the use of redundant and fail-fast modules.
The author analyzes failure data to determine what are most important types of failures to address, and gives recommendations about how to deal with them.
Infant mortality, which refers to product immaturity – avoid immature products
Software bugs – the author notes that in production machines, most software bugs are “soft”. By this he means they are seemingly transient and very difficult to debug. Therefore he analyzes primary ways to structure software to provide reliability in the face of these kinds of bugs: modularity for fault isolation, process pairing to provide “hot” backup in the case of a crash, ample data checking to detect errors, and transactions to provide a clean recovery mechanism.
Administrative errors – the author recommends automating administration as much as possible and making simple interfaces. He uses the example of a hardware device that can be installed without instructions because it is “obvious.”
Flaws
This paper is very well written. The only flaw I can see is that it seems to draw conclusions from the failure data of one Tandem system. While useful, it would be nice to see analysis of data from a broader set of systems.
Other uses:
Almost every software and hardware system strives for reliability, so these techniques are applicable nearly everywhere. Notably, transactions with ACID properties are widely used in database applications. In operating systems one area this may be useful is in making changes to critical system data (configuration files, registry, etc.) that is saved on disk. More generally though, file systems attempt to provide ACID like properties through the use of journaling.
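
As one concrete example of giving critical system data ACID-like behavior, a common pattern (my own sketch, not from the paper) is to write a new configuration file under a temporary name and atomically rename it into place, so a crash leaves either the old or the new version but never a half-written one:

    import os, tempfile

    def atomic_write(path, data):
        """Replace `path` with `data` so that a crash never leaves a torn file."""
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())   # make sure the new contents are on disk
            os.replace(tmp, path)       # atomic on POSIX: old or new, nothing in between
        except BaseException:
            os.unlink(tmp)
            raise

    atomic_write("app.conf", b"retries = 3\n")   # "app.conf" is a hypothetical file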

Summary
Despite work toward making computer systems more reliable, most systems still suffer failures. In “Why Do Computers Stop and What Can Be Done About It?” Jim Gray analyzes failure data from users and postulates ways to make systems more reliable.

Problem
Typical, well-managed systems of the time had 99.6% availability – translating to about ninety minutes of down time every two weeks. While 99.6% availability sounds very good, there are applications for which this availability is not sufficient. With the rise of dependence on computer systems, availability becomes more and more important. Certain applications require a system that is practically perpetually available; to arrive at such systems, it is important to plan for systems in which a variety of failures can be tolerated and recovered from without affecting the availability of the service which the system provides.
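
The ninety-minute figure follows directly from the availability number; a quick check of the arithmetic (using the paper's 99.6%):

    # Downtime implied by 99.6% availability over a two-week window.
    two_weeks_minutes = 14 * 24 * 60            # 20,160 minutes
    downtime_minutes = (1 - 0.996) * two_weeks_minutes
    print(round(downtime_minutes))              # ~81 minutes, roughly the "ninety minutes" quoted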

Contributions
• Analyzing failure data to find what failures were caused by: most prominent were failures due to “infant mortality” – failures due to new software or hardware that still had bugs that were being worked out. Looking beyond the “infant mortality” problem, the most common source of errors was system administration, followed by software, then hardware, and finally environmental failures.
• Using the failure data to present an approach by which failures could be minimized: designing systems to tolerate operation and software faults.
• Observing that hardware-related failures are relatively rare and that, since hardware is becoming more reliable, these sorts of failures are likely to become less of a problem with time. Software-related errors, by contrast, are currently more abundant and are likely to become even more so as systems become more complex. This implies that the focus should be put upon improving software-related reliability.

Flaws
• Failure statistics are not necessarily exact – some failures were not reported, possibly causing the proportion of the causes of failures to differ from what is described. Furthermore, the statistics only come from one specific system, increasing the potential that the statistics are not broadly applicable (for example, perhaps this specific system had better than average hardware reliability and lower than average software quality, possibly inverting the proportions of the failures attributed to those categories).

Techniques
New techniques are not defined in this paper; rather, it suggests that available techniques should be employed in order to increase reliability. Jim Gray implies that sacrificing some performance and adding some additional complexity in order to achieve higher reliability is acceptable.

Summary

The paper provides statistics on the reasons for system outages. Infant mortality failures, system administration and software faults are identified as the major sources of system outages. It addresses the software fault-tolerance issue by quantitatively discussing the benefits of modular software, fail-fast processes, process pairs with transactions, and fault-tolerant storage through replication.

Problem attempted

Computer systems used in critical applications like patient monitoring, stock markets, etc. have to be systems which virtually never fail. The focus of this paper is to achieve this objective through various mechanisms that guarantee a very high Mean Time Between Failures (MTBF).

Contributions

1) The paper provides useful statistics on the reasons for system outages. About a third of the failures are due to infant mortality (i.e. products that
are yet to stabilize), another one third due to system maintenance issues, and 25% due to software faults.

2) The main conclusions drawn from these statistics are: hardware fault-tolerance can be achieved through redundancy; maintenance difficulties can be reduced by simplifying maintenance interfaces and by not meddling with the system unnecessarily; software bug fixes need not be installed unless the bug is very critical.

3) Software crashes far less often than one would expect with a bug in about every 1000 lines of code. This is primarily due to modularity through processes, where the process is the unit of failure. Terminating a misbehaving process is always a simple and effective solution.

4) The paper suggests process pairs as an effective method of improving software fault tolerance. Here a pair of processes, namely the primary and backup processes, are dedicated to servicing a requestor.

5) In one implementation of process pairs, the primary process sends state changes and reply messages to its backup after each major event. As this is complicated to code, the author suggests another implementation of process pairs (called persistence), where the backup wakes up with a null state, unaware of all that happened before the primary's failure.

6) When persistent process pairs are used with transactions, the cleanup of the inconsistent database and system state left by the uncommitted transactions of the failed primary is taken care of automatically.
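
A rough sketch of how points 5 and 6 fit together (my own illustration with made-up names, not the paper's code): the backup in a persistent process pair keeps no checkpointed state, and on takeover it simply asks the transaction mechanism to undo whatever the failed primary left uncommitted before serving new requests.

    # Hypothetical persistent process pair; the transaction log, not checkpoints,
    # restores consistency after a takeover.
    class TransactionLog:
        def __init__(self):
            self.in_flight = {}                 # txid -> undo action

        def begin(self, txid, undo):
            self.in_flight[txid] = undo

        def commit(self, txid):
            del self.in_flight[txid]

        def abort_all(self):
            for undo in self.in_flight.values():
                undo()                          # roll back uncommitted work
            self.in_flight.clear()

    def backup_takeover(log, database):
        # The backup wakes up with a null state: it only cleans up via the
        # transaction mechanism, then starts handling requests afresh.
        log.abort_all()
        return database                         # now back in the pre-transaction state

    db, log = {"balance": 100}, TransactionLog()
    log.begin("t1", undo=lambda: db.update(balance=100))
    db["balance"] = 42                          # primary crashes mid-transaction here
    print(backup_takeover(log, db))             # {'balance': 100}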

Flaws

1) The paper doesn't address the system becoming unavailable due to malicious attacks like denial-of-service attacks. In general, the issue of systems being robust against intentional attacks is not addressed at all.

Tradeoffs

1) Reliability and availability vs. performance -- transactions and process pairs are sure to bring down performance, but they provide effective fault tolerance. A similar tradeoff exists between reliability and cost, where having redundant hardware improves reliability but increases cost, though cost is not a very serious concern for critical systems.

Techniques used

1) Modularity - Software must have a modular design for effective fault isolation and minimal redundancy.

2) Fault tolerance through redundancy is a classic technique used in storage (RAID).

Introduction:
This paper by Jim Gray is a high-level analysis of different modes of failure in computer systems. The author analyzes real-world failure characteristics and concludes that software and human errors are the primary causes of system failures. He analyzes techniques used in hardware to decrease failure rate and applies the same concepts to software.

What were they trying to solve:
Mission-critical computer systems have a steep penalty for failures. Even when an error occurs, the time to restart the system back to an operational state is very high. Also, errors are more prone to happen during peak loads, which further exacerbates the availability problem. While hardware reliability improved steadily with technology, the reliability of systems actually decreased with time, overtaken by higher software complexity.

Contributions:
The MTTF of hardware systems can be increased by a large factor by:

Increased Modularity
Make them fail-fast: they either work correctly or stop working
Prompt detection of faulty hardware.
Extra modules to pick up load in case of failure.

The following observations about system failures were noted:

Most systems fail due to administrator, user or software errors. "Infant software", which is newly installed software, accounts for a large chunk of failures.
Most software bugs are soft: they are hard to reproduce.

Simpler maintenance interfaces decrease administrative errors.
Software errors can be minimized by:

Software modularity provides software isolation.
Like in hardware, software modules must be fail-fast.
Redundancy at process level using process pairs
Transaction based system for maintaining ACID properties.

Communication systems can be made fault-tolerant by introducing session semantics: All communication follows a well-ordered protocol.
Storage systems can be made fault-tolerant by replication
Flaws:
The paper is too general and only provides high-level solutions to problems. Real systems often have to make tradeoffs between performance and fault-tolerance, as well as between cost and fault-tolerance.
The analysis of errors is largely empirical and extrapolated from a small sample space. While general trends may hold true, some of the specific assumptions may not hold true in general.

Techniques used
Hierarchical self-contained Modules
Redundancy at each level.
Transactions to maintain consistency
Tradeoffs:
Redundancy vs utilization: Redundancy comes at a cost of resources not being utilized for actual work.
Software complexity vs reliability: Reliable software in general has to handle cases which may never or rarely occur.
Another part of the OS where the technique could be applied:
Redundancy and parallelism are used extensively wherever recovery from error is required. For example, RAID can mirror data onto two devices.
Modularity is also extensively used to provide isolation. SIPs in Singularity is a good example of that.

Summary
Jim Gray, in this paper, presents his views on software reliability and availability. He shows that hardware fault tolerance works and draws parallels with similar solutions in the software domain. He claims that by including modularity and redundancy in software design, the mean time between failures (MTBF) can be increased.

Problem
Jim Gray analysed the system outage reports from vendors to determine the top reasons for system failure. Based on his analysis, he concluded that most of the outages are caused by administrative/operator errors, followed by software failures/bugs (25%). He noted that hardware failures were comparatively rare. So by reducing the software failure rate, the MTBF of the system can be increased.

Contributions
- He uses concepts that have been used in hardware systems and tries to apply them to software systems to increase their fault tolerance. Just like in hardware systems, software should be decomposed into fail-fast independent modules. (e.g. SIP from Singularity).
- He also suggests using redundancy of modules so that failures are not even visible to the users. By combining process-pairs with transactions, we can have redundancy and hence high availability without saving too much state or increasing the load on the programmer.
- Apart from acting as a backup, secondary processes also help in handling Heisenbugs (transient bugs). As the backup process starts with a clean slate, it has a high chance of succeeding if the failure was caused by a transient bug (e.g. a race condition).
- Using similar concepts of redundancy(multiple paths) and session management, availability of a communication network can also be increased. Redundancy(replication) of data also helps in providing fault tolerant storage systems.

Flaws
- Process pairs become complicated in a multi-threaded process where there is no set point at which all transactions have ended. As a result, we may not have any point at which we can snapshot the system and safely restart.
- Not everything can be rolled back (network output, printing), which gives transactions a limited scope. So the guarantee of atomicity may not always be feasible.

Technique
- Modularity and redundancy are the basic techniques suggested by the author to ensure fault tolerance.
- For achieving this level of fault tolerance and high MTBF, hardware costs (for twice the hardware) and memory size (for snapshots/transactions) increase.
- The basic techniques suggested in this paper are used in almost every fault-tolerant system (even in our day-to-day activities, e.g. duplicate keys).

Summary:
This paper presents some causes of and remedies for failures in computer systems (in 1985), which are still pertinent. Administration and software soft-faults are the primary causes of failure. The remedies (as proposed in this paper) are transactions, process pairs, reliable storage, etc.
Problem:
This paper, by the famous Jim Gray, presents a view of how computer systems looked back in 1985 and studies the failure statistics of a commercially available fault-tolerant system. Further, it proposes different approaches to software fault-tolerance such as persistent processes, process pairs and transactions.

Contributions:
One of the major contributions of the paper is to differentiate between the concepts of availability and reliability. Modularity and redundancy can increase the reliability of the system, thereby in turn improving its availability.
It isolates the different causes behind failures of a commercially available computer system at Tandem. The major sources of faults reported are administration, maintenance, software and operations. The author proposes defensive programming and process pairs to increase redundancy and thereby improve fault tolerance. Further, transactions and persistent processes can provide ACID guarantees to operations in the system.

Flaws:
The paper is actually reflecting the opinion of a Turing award winner, Jim Gray. I believe finding a flaw in the paper would be inappropriate, though my opinion may differ at some points. For the Tandem statistics, under-reporting and lack of knowledge of events forced the author to leave some failure causes unresolved in many instances (e.g. double failures).

Techniques used:
Prevention is better than cure. Redundant modules can always help in avoiding the damage caused by soft faults in a system. Automation, persistent processes and transactions can come to the rescue for fault-tolerant communication as well as execution.

Tradeoffs:
Redundancy (even at a factor of 2, far less than the Von Neumann model requires) can increase the cost of the systems. There is a famous philosophical riddle: "If a tree falls in a forest and no one is around to hear it, does it make a sound?" Along these lines, I believe redundancy of certain modules is irrelevant if their failure does not affect the overall availability of the system.

Alternative uses:
Transactions and the ACID properties are well known in the domain of databases. Redundancy has found a place in reliable storage systems (RAID). Modularity is a well-known concept used in the design of object-oriented OSes and, in the form of the process abstraction, in almost all OSes.

Summary:
The paper talks about different reasons for system failure. It also explains mechanisms like modularity plus redundancy to improve the reliability of a system.

Problem:
Critical applications like patient monitoring and online transaction processing require high availability. How to provide a highly available system is the main problem discussed in this paper.

Work Summary / Contribution:
1. At the hardware level, high availability is obtained by constructing fail-fast modules and adding redundant modules. A redundant module is used in case the primary module fails.
2. According to the paper, one third of failures are caused by infant mortality (immature products). So customers should stay away from such products, which have not been tested properly.
3. Most software bugs are soft, so re-execution in case of failure solves the problem in most cases. The issue is who should re-execute and from where re-execution should start: if another process re-executes, how do we undo the changes made by the failed process, and if a backup process re-executes from the state the failed process left behind, how do we communicate that state between the process pair? (See the retry sketch after this list.)
4. The paper talks about different approaches to process pairs. Each approach has its issues; the persistent approach is easier to implement but leaves the system in a mess. If transactions are combined with the persistent approach, this provides excellent fault-tolerant execution. Transactions make sure that if a failure happens in the middle of a transaction, the system is brought back to the pre-transaction state.
5. Fault-tolerant communication can be obtained by having multiple data paths.
6. System configuration and system maintenance are the reasons for many system failures. The only solution to such failures is reducing the interaction between the system operator and the system.
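
A minimal sketch of the re-execution idea from point 3 (my own illustration, not the paper's code): retry the work as a fresh attempt a bounded number of times, on the assumption that a soft (transient) bug usually will not recur.

    import random

    def retry(work, attempts=3):
        """Re-run an operation from a clean start, assuming soft failures vanish on retry."""
        last = None
        for _ in range(attempts):
            try:
                return work()
            except RuntimeError as err:         # soft failure: discard partial work, try again
                last = err
        raise last                              # a hard (Bohr) bug: give up and report it

    def flaky_request():
        if random.random() < 0.5:               # toy stand-in for a transient fault
            raise RuntimeError("transient fault")
        return "ok"

    print(retry(flaky_request))
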
Flaws:
1. The paper draws conclusions about failures from the data of only one Tandem system.

Tradeoffs:
1. The tradeoff is made between high availability/reliability and cost. If the hardware is doubled for high availability, the cost is doubled as well.

Another part of OS where technique can be applied:
1. The communication fault-tolerance technique is used currently in network systems to deliver packets reliably. On the Internet, nodes are connected by multiple paths, so if some link fails an alternative path is used to deliver packets.
2. The fault-tolerant storage technique is used in RAID to provide high reliability and availability. Two or more disks are combined and one of them works as a mirror, so in case one disk fails the mirror disk can be used (sketched below).
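
A toy sketch of the mirroring idea in point 2 (my own illustration, not real RAID code): every block is written to both copies, and a read falls back to the mirror when the primary copy is lost.

    # Toy RAID-1-style mirror: two dictionaries stand in for two disks.
    class Mirror:
        def __init__(self):
            self.primary, self.mirror = {}, {}

        def write(self, block, data):
            self.primary[block] = data          # every write goes to both disks
            self.mirror[block] = data

        def read(self, block):
            try:
                return self.primary[block]
            except KeyError:                    # primary copy lost: serve from the mirror
                return self.mirror[block]

    m = Mirror()
    m.write(0, b"superblock")
    del m.primary[0]                            # simulate a failed primary disk
    print(m.read(0))                            # still available from the mirror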

Summary
The paper analyses the different reasons for failures in computer systems and identifies the main ones to be software and administration. It then proposes an approach involving modularity, automatic configuration, process redundancy, and a transaction model to alleviate the failure effects and improve system availability.

Problem Statement
The key problem addressed in this paper is system failures and resultant availability figures. The paper goes on to identify a viable approach to address this problem and build reliable systems.

Contributions
- The primary contribution of the paper is a rather multifaceted analysis of failure conditions in systems, and an evaluation of a few ways to address these.
- The paper identifies software and administration as two key causes for failures.
- The paper stresses automating system configuration to the greatest extent possible to reduce human errors during maintenance, which adversely affect availability.
- The paper proposes a pair redundancy model for processes coupled with transactional processing to improve software availability.

Flaws/shortcomings
- The paper uses the Tandem systems alone for its analysis of reliable systems.
- The paper tends to generalize from instances a lot.

Technique used
The key technique used is redundancy at the process level. That is coupled with transactional processing to achieve reliability.

Tradeoff
Complexity and development overhead are the tradeoffs in using process redundancy and the transactional model. Also, this would possibly tax the CPU more.

Applications
The process redundancy concept is employed in Cisco's latest router operating system, Cisco IOX. Here, each process has a redundant standby process which takes over when the current master process fails. The standby in turn creates another standby process when it becomes the master.


In this paper, Jim Gray describes the sources of failure in fault-tolerant systems. He presents failure statistics based on the Tandem NonStop system, analyzes them and then discusses various approaches to software fault-tolerance.

Many computer applications such as online transaction processing and patient monitoring require high availability. In these systems outages are unacceptable. In commercial fault-tolerant systems the major sources of failure are administration and software. The author presents the advantages and drawbacks of some approaches which can be used for software-fault tolerance.

In my opinion the major contribution of the paper is its systematic analysis of the failures of a system. The author collects data, presents failure statistics, analyzes them, and finally proposes some solutions. This paper is worth reading as it shows how a systematic study of a topic should be done. Another contribution of the paper is that it shows that most failures are caused by software and administration and not by hardware. This happens because the techniques used for fault-tolerant hardware are quite successful. For this reason the author discusses various techniques that could be used for software fault-tolerance. The author proposes to decompose the system into modules, so that a failure of a module does not propagate beyond the module. The modules should be designed fail-fast because fail-fast software has small detection latency. The author then presents the advantages and drawbacks of several approaches to designing process-pairs. Moreover, he argues that by using transactions the application programmer doesn't have to handle many errors. Transactions combined with persistent processes can give excellent fault-tolerant execution. Moreover, persistent processes are simple to program. Fault-tolerant communication can be obtained by combining transactions with sessions. As far as storage is concerned, we can achieve fault-tolerant storage by replicating the data.

In my opinion one flaw of the paper is that only the Tandem system was considered. The statistics might have been different if other systems were also considered. Finally, although the paper presents a very good analysis of the data, I think that there are some assumptions when evaluating it (e.g some failures were under-reported). However, I don’t think it is easy not to make these assumptions and find totally “real” data.

The paper doesn't present new techniques. It points out the need for software fault-tolerance. The approaches discussed are presented in the second paragraph. The author proposes to use modularity and redundancy in order to achieve higher availability. A disadvantage of this approach is that more space is needed and of course more hardware. This means that the cost is higher.

Summary: Besides a careful analysis of computer failures, this paper is mostly a position paper. It advocates self-healing software composed of modules that fail fast in case of errors. It proposes redundancy and transaction-like processing for increased reliability and availability.

Problem: In many circumstances even small chances of failure are not acceptable. By classifying many reported failures, they observed that the largest part is caused by administration and software bugs. Administration failures are caused by humans, but 'to err is human', so they are ultimately attributed to software usability. After heavy testing, the bugs left in software are likely to be unreproducible and go away when retried. Finally, they note that monolithic systems, besides being hard to understand, tend to fail completely in case of errors.

Contributions & Reliability Techniques: They show that availability depends not only on the Mean Time Between Failures (MTBF) but also decreases with a large Mean Time To Repair (MTTR). They propose modules as the unit of failure or replacement. Redundant modules are much less likely to all fail than single modules. Besides, when one module is down, the other can instantly take over, reducing MTTR. However, this insight is successfully used in hardware, but it is less obvious how to apply in a software environment. The author's insight is that in production code, many software bugs are irreproducible. Moreover, one doesn't have to retry from start, merely from the last consistent state and still 'miss' the bug. They define transactions as group of operation that either fail or (atomically) take the system from one consistent state to another. Finally, they propose combining process pairs (where processes can be physical or logical entities) with transaction as the solution for software module redundancy. If one process fails, the other one takes over from the last checkpoint state, possibly undoing changes of failed transaction (and is unlikely to fail).
For this to work, one needs reliable communication between processes and reliable storage to maintain the checkpoint state. For these, replication is again the key to reliability: multiple data paths for communications with resumable sessions, and storage in multiple locations (and if possible different environment). The storage process itself is viewed as a transaction with ACID properties.
As for the reduction in human error, they advocate self-configuring systems.

Weaknesses: Not all software can be organized in transactions, and in some cases the overhead for maintaining and communicating state may not be acceptable.
They disregard "infant mortality" (most likely caused by reproducible bugs), but infancy can be half the life of an IT product in today's fast innovation pace. Also, 1 failure among 100 computers each running 10h is counted as 1 failure/1000h. This is a good approximation, but it omits actual aging: one computer that runs 1000h may fail more often simply because of aging components. Similar considerations could apply to software, where eliminating "infancy bugs" may cause code bloat and an increase in transient bugs.

WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT

SUMMARY
This paper performs an analysis of commercial fault-tolerant systems and discusses different approaches to software fault-tolerance that current systems use. It analyzes the reported failures in the systems of four clients of Tandem Computers Inc. and groups the failures into categories: administration, software, hardware and environment.

PROBLEM
This paper determines that faults in a system are usually due to administration and software errors. Software grows constantly and it is more and more difficult to have perfect coding practices and code testing. There are fixes for software faults that can be installed to solve known bugs. But we need systems that are software fault-tolerant. This paper presents some of the solutions to this problem.

CONTRIBUTIONS
This paper establishes the difference between availability and reliability. It states that availability is doing the right thing within the specified time and reliability is not doing the wrong thing. Based on these concepts it studies the availability and reliability obtained through different techniques.
Hardware fault-tolerant systems achieve high availability through modularity and redundancy. These techniques are also found in software fault-tolerant systems.
Software modularity consists in decomposing a large system into separate modules. Processes as the unit of execution provide isolation and can be stopped when they are not operating correctly. Fail-fast software modules mean that processes either work properly or fail, so recovery activity can be started sooner.
Most production software bugs are soft. Software bugs may be due to strange hardware conditions; if the program is restarted and the instruction executed again, it will most likely work. A solution for these errors is process pairs. They consist in having extra software modules that can execute in different ways to obtain higher availability, overcoming different types of errors: lockstep, state checkpointing, automatic checkpointing, delta checkpointing and persistence.
Transaction mechanisms give very good results when used with process pairs. Transactions are groups of operations in which either all the operations succeed or none do; this way you always know the state of the system. Using transactions with persistent process pairs solves the problem of the unknown state for the backup process.
Fault tolerant communication and storage are both obtained through duplication of the sent messages over different paths and duplication of the data.
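
To make one of those flavors concrete, here is a small sketch (my own, not from the paper) of state checkpointing: the primary sends each state change to its backup, so the backup holds a current copy and can take over if the primary fails.

    # Hypothetical state-checkpointing process pair.
    class Backup:
        def __init__(self):
            self.state = {}

        def checkpoint(self, key, value):
            self.state[key] = value             # apply the delta sent by the primary

    class Primary:
        def __init__(self, backup):
            self.state, self.backup = {}, backup

        def handle(self, key, value):
            self.state[key] = value             # do the work...
            self.backup.checkpoint(key, value)  # ...then checkpoint it to the backup

    backup = Backup()
    primary = Primary(backup)
    primary.handle("balance", 100)
    assert backup.state == primary.state        # the backup can take over from here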

FLAWS
It is difficult to find flaws in this survey paper. The data collected at the beginning of the paper seems not very accurate and based on lots of assumptions, although it is enough to give a good idea of what causes failures in a system. It is not clear how many systems they are looking at, and whether they were all the same version or were acquired at different times, which may have caused some systems to share the same problems due to infant errors or development errors common to all systems. We should also consider that all the systems are from the same company.

PERFORMANCE
The techniques proposed are techniques to reduce software and hardware failures in any system. These techniques are modularity, redundancy, fail-fast modules and transactions for software errors.
Redundancy trades space and price (in the case of hardware) for reliability. Modularity implies more design effort for a more reliable and available program.
All these techniques are widely used in OSes. Checkpointing is used in file systems to recover after a crash. Redundancy can be used in distributed systems to improve response time. Transactions are used in any part of the OS where we want atomicity and consistency.

Summary:
The paper discusses a variety of causes for failures of computer systems and identifies the ones which have the most impact. Fault-tolerant hardware can provide very high availability, while most errors or outages are contributed by software and administration problems, so several techniques are provided in this paper to improve software reliability so that the MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) can be improved.

The goal the paper was trying to deal with:
In order to reduce the frequency of outages caused by software and administration, the author's goal is to make software fault-tolerant by suggesting techniques including modularity and redundancy.

Contributions
1. The paper makes a convincing argument that most system errors or outages are caused by software and administration, instead of hardware failures, and proposes some mechanisms to deal with these faults and to improve availability.
2. A variety of techniques are discussed in this paper for developing fault-tolerant software. These techniques include process pairs, transactions for data integrity, fault-tolerant communications and so on. These methods are quite useful for improving MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair).
3. The paper wisely reuses the concepts of 'transactions' and 'sessions' from the database domain for fault tolerance in execution, communication and storage. Transactions are groups of operations that form a consistent transformation of state, with the ACID properties: Atomic, Consistent, Isolated, and Durable.
4. The paper suggests that customers should avoid immature products to achieve high availability.

Flaws:
(1) The first flaw of this paper is that a lot of guesses are made without justifying them.
(2) The second flaw of this paper is that the author derives these results only by looking at the data from one type of system (a commercial system made by Tandem Computers); the author should derive the results from observing more systems.
(3) I do not agree that hardware necessarily gets better with time. If increasing software complexity triggers more bugs, increases in hardware complexity should also reduce reliability.

Tradeoffs:
The major goal of the paper is to increase fault tolerance; however, one tradeoff is that costs also increase due to redundancy.

Summary
"Why Do Computers Stop" discusses techniques that can help increase the Mean Time Between Failures (MTBF). The paper first discusses an analysis of failures from one product, and then explains properties of fault-tolerant systems.

Description of Problem
This paper presents properties that can be implemented to help improve the MTBF of more conventional machines that fail about once every two weeks.

Summary of Contributions
- Subjectively concluding that software failures, operator actions, system configuration and system maintenance constitute a large portion of the failures of a system.
- Reinforcing the idea that modularity, independence, and parallelism are necessary to improve reliability: software modularity with processes and messages for simplicity, and parallelism in software and hardware so that one component can fail while a redundant secondary carries on.
- Bringing together a number of ideas from previous authors, such as Mourad and Neumann, as well as associating transient software bugs with the Heisenberg uncertainty principle.
- Categorizing failure causes (although subjectively) in order to isolate the primary causes of system failures.
- Presenting the idea that fault tolerance needs to be designed into the system, not added as an afterthought.

Flaws
The paper's primary flaw is the low level of detail and concrete evidence presented. The author could have analyzed the faults in a number of different product lines and applied stronger rules to the categorization that was done. In addition, the paper presents ideas and data that could use more support.

Techniques
The paper does not necessarily present any new techniques, but expresses a need for developing software fault-tolerance techniques into systems. The techniques of this paper focus on replication for fault-tolerant storage, robustness in the communication layers for fault-tolerant communication, and transactions and persistent processes for software fault tolerance. The primary tradeoff is improved MTBF for additional design features. The techniques could be applied to all developers of an operating system to help them think about fault handling at the design stage since this is important for increasing the MTBF.

Summary
This paper analyzes the sources of failure in a fault-tolerant system and discusses a number of issues relating to software fault tolerance.

Problem
Systems that are unavailable due to failure create delays and problems, especially in specific systems and peak times. We would like to reduce the frequency of such failures and/or minimize the time to recover from failures.

Contributions
Notions of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) are used to define a straightforward notion of availability. The author notes that it is common in a hardware setting to use redundancy (such as hardware pairs) and modularity to significantly increase the MTBF. As well, if the MTTR is small, it is perceived as a delay rather than a failure. Redundancy allows some parts to fail while the rest continue on.

The paper analyzes sources of failures, namely: administration (including operator actions), hardware, software, and environmental factors. In order to decrease administration and operator failures, systems should be self-maintaining or easy to maintain. Hardware is already fairly reliable, considering methods such as redundancy, and the hardware itself will continue to become more reliable.

Software fault-tolerance is considered in depth, as an area that can be improved, although it may not be as evident. Particularly, it is argued that in more or less proven systems, fixing remaining bugs can do more harm than good. Furthermore, in proven systems, remaining bugs are generally soft (as with hardware). That is, they are hard to duplicate, and upon retry, often disappear. Thus, methods applied to hardware (such as pairs) can be successfully applied to software and this sort of fault tolerance can be more fruitful than attempting to fix the remaining bugs. The author suggests using persistent process-pairs with transactions to guarantee data integrity.

Summary
This paper presents an analysis of the causes and anatomy of system failures and shows that software and administration are the major contributors. Various approaches to software fault-tolerance are discussed, and persistent process pairs with transactions are shown to provide a highly fault-tolerant system.

Problems:
System failures are completely unacceptable in certain environments. There had been no analysis of commercial fault-tolerant systems. Software faults and administrative errors are much more common than hardware failures. This paper proposes using modularity and redundancy in software as a way to improve software reliability.

Contributions:

• An extremely low MTTR can hide failures: they are perceived as delays.
• Hierarchically decompose the system into modules. Redundancy increases availability. Both of these techniques provide continuous service even if a few components fail.
• Modules are designed fail-fast: a module either functions properly or stops.
• Administration and maintenance should be simplified so as to require minimal operator intervention.
• For high availability, avoid immature products.
• Software faults are soft ("Heisenbugs") and are difficult to reproduce.
• Process pairs provide high tolerance in a transactional system. When the primary process fails, the backup comes up and takes care of all operations. The transactional system keeps track of incomplete operations and rolls them back.
• Transactions plus resumable communication sessions give fault-tolerant communications (a small session sketch follows). Transactions plus data replication give fault-tolerant storage.
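
A small sketch of the "resumable session" half of that last point (my own illustration, not the paper's protocol): messages carry sequence numbers, acknowledgements say where to resume, and duplicates arriving over an alternate path are simply ignored.

    # Hypothetical resumable session: sequence numbers make retransmission over
    # an alternate path harmless, because the receiver drops duplicates.
    class Receiver:
        def __init__(self):
            self.next_seq, self.delivered = 0, []

        def accept(self, seq, payload):
            if seq == self.next_seq:            # new message: deliver it exactly once
                self.delivered.append(payload)
                self.next_seq += 1
            return self.next_seq                # ack tells the sender where to resume

    class Sender:
        def __init__(self, receiver):
            self.receiver, self.next_seq = receiver, 0

        def send(self, payload):
            self.next_seq = self.receiver.accept(self.next_seq, payload)

    session = Sender(Receiver())
    session.send("debit $5")
    session.receiver.accept(0, "debit $5")      # duplicate over another path: ignored
    print(session.receiver.delivered)           # ['debit $5']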

Flaws:
The resulting hardware system is said to have an MTBF measured in decades or centuries, but there are no actual facts supporting this. The paper relies on a lot of assumed data, so we don't know how reliable the calculations are. The author admits that data is missing, so how can the results be so certain? The paper also doesn't include infant systems in the study; they account for about 30% of total system failures and should have been part of the study.

Performance:
The paper talks about improving the reliability of the system by modularization and redundancy. They improve software fault tolerance by introducing process-pairs with transactions.

Tradeoffs: The software is made more complex for higher reliability. Process pairs require a lot of effort to program correctly.
Redundancy can be applied to almost every application area for fault tolerance.

Summary: The author of this paper presents the argument that administration and software problems are the leading causes of machine failure, based on a study of the Tandem NonStop system. Using these statistics the author presents different suggestions to help prevent these types of faults.
Problem to Solve: The problem the author is trying to solve is to categorize, in a real study, the causes of system failures. They are also trying to point out how fault-tolerance is achieved in some of these situations and determine how MTBF can be increased in certain situations by introducing minimal amounts of redundancy.
Contributions: One thing they mention is that software should be modular. Modularity in software allows one piece of code to fail and be isolated from the rest, and allows for easy fixes because the fault can be traced to a single part of the code. Another thing they mention is the idea of fail-fast software. This means that if the software is wrong it should signal a fault and stop processing; otherwise it should function as normal. This is important because you now know that if a process is running it should only be doing correct things. One contribution they make is the introduction of the ACID properties for transactions. ACID is a useful technique that allows the programmer to reset the system to a consistent state. They also introduce the notion of process pairs for fault-tolerant execution. This is an important technique that truly makes fault tolerance work in practice through different types of redundancy.
Flaws: Although the authors present advice on improving system performance, it seems like a lot of the advice is really only targeting the system maintenance part of the problem. Sure, system redundancy helps with system configuration, but this is a very small part of system configuration as we know it today.
What tradeoff is made: One tradeoff is increased hardware costs for increased fault tolerance.

This paper discusses why computer systems fail. Jim Gray motivates the problem by aggregating and categorizing anecdotal evidence, then formally expresses availability and discusses how we can increase availability.

The main contribution is the systematic study of why systems fail. Surprisingly, systems mostly fail because of non-hardware faults. The key to achieving high availability is therefore the ability to detect and mask all types of failures. Detection is made easy if systems are fail-fast: they stop and complain loudly that something bad happened. Masking failures is achieved by redundancy. Finally, having modular systems prevents propagating faults and minimizing restart time increases availability in the presence of transient errors. Unfortunately, no redundancy scheme is proposed for masking operator faults.

The paper sets a modest goal and straightforwardly answers the question, which makes finding a flaw much more difficult. One criticism might be that Jim Gray's world is a world of concrete guarantees: the database is consistent or not, the transaction either aborts or commits, the system must be 99.99% available or the deal is off. In a world where eventual consistency and k-safety are starting to be accepted as properties that are "good enough" for an increasing number of applications, it would be interesting to expand the notion of availability to include those techniques as well and make a comparison of how each property affects the availability of the system.

The paper demonstrates why building reliable systems is hard: redundancy, modularity and transactions are essential if one wants to achieve high availability. This trade-off impacts all aspects of design and significantly affects the final cost of the system.
