CS 736 – Spring 2006
Lecture 20: Fault Tolerance
- General Terms
- Model software as a stochastic process that fails randomly
i. While each failure has a specific cause, there is enough randomness in the system that this is a good approximation
- MTTF (mean time to failure) == reliability: how long the system is continuously available
- MTTR == mean time to repair
- MTBF (mean time between failures) = MTTF + MTTR, deprecated
- Availability = MTTF / (MTTF + MTTR); downtime per year at each level:
i. 99% ~3.65 days
ii. 99.9% ~9 hours
iii. 99.99% ~1 hour
iv. 99.999% ~5 minutes
v. 99.9999% ~30 seconds
vi. Note: can get high availability by shrinking MTTR or growing MTTF
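A minimal sketch for checking the availability arithmetic above. The function names and the 365-day year are assumptions made for illustration, not part of the lecture.

```python
# Assumption: a 365-day year; leap years are ignored.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def downtime_seconds_per_year(avail: float) -> float:
    """Expected downtime per year at the given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


# Downtime per year for two through six nines.
for a in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
    hours = downtime_seconds_per_year(a) / 3600
    print(f"{a:.6f} -> {hours:.4f} hours/year")

# Two hypothetical designs (times in hours) with the same availability:
# one fails rarely but repairs slowly, the other fails often but repairs fast.
print(availability(mttf=1000, mttr=1))
print(availability(mttf=10, mttr=0.01))
```

The last two lines illustrate the note above: shrinking MTTR by 100x buys the same availability as growing MTTF by 100x.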
- Fault – problem with HW or SW
- Error – corrupted state due to executing faulty HW or SW
i. Latent – not read, or read but no impact
ii. Effective – caused a failure
- Failure – visible problem, incorrect behavior, due to an error
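The fault / error / failure chain can be illustrated with a small sketch. The `Inventory` class below is hypothetical and exists only to show where each term applies: the fault is the bug in `remove`, the error is the stale `cached_total`, and the failure appears only when some caller reads that field.

```python
class Inventory:
    """Hypothetical example: per-item counts plus a cached grand total."""

    def __init__(self) -> None:
        self.items: dict[str, int] = {}
        self.cached_total = 0  # derived state that must track items

    def add(self, name: str, qty: int) -> None:
        self.items[name] = self.items.get(name, 0) + qty
        self.cached_total += qty

    def remove(self, name: str, qty: int) -> None:
        self.items[name] -= qty
        # FAULT: the code forgets to subtract qty from cached_total.
        # From here on, cached_total is an ERROR (corrupted state).

    def count(self, name: str) -> int:
        # Never reads cached_total, so the error stays LATENT on this path.
        return self.items[name]

    def total(self) -> int:
        # Reads the corrupted state: the error becomes EFFECTIVE and the
        # wrong answer returned to the caller is the visible FAILURE.
        return self.cached_total


inv = Inventory()
inv.add("bolt", 10)
inv.remove("bolt", 4)
print(inv.count("bolt"))  # correct answer, error still latent
print(inv.total())        # wrong answer, the failure
```

A system that only ever calls `count` would never observe a failure, even though the fault and the error are both present.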
- Failure models:
- All depends on what "correct" means. Often not specified
- Timing – miss a deadline
- Output – produce incorrect output
- Omission – skip an output
- Crash – skip an output, produce no more output
- Byzantine
i. Anything can happen, including malicious behaviors
- System response:
i. Fail-stop processors map all failures onto crash failures
1. Halt on failure: the processor halts before making an erroneous state transition
2. Failure status: the failure of a processor can be detected
3. Stable storage: processor state is separated into volatile storage, which is lost on failure, and stable storage, which is preserved uncorrupted
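The three fail-stop properties can be sketched as a toy simulation. Everything here is an assumption for illustration: the class name, the use of an invariant `check` to detect a bad transition, and plain dictionaries standing in for stable and volatile storage. A real fail-stop processor would get these guarantees from hardware or replication rather than a Python check.

```python
import copy
from typing import Callable, Dict

State = Dict[str, int]


class FailStopProcessor:
    """Toy model: halt on failure, visible failure status, stable storage."""

    def __init__(self, stable: State, check: Callable[[State], bool]) -> None:
        self.stable: State = copy.deepcopy(stable)  # preserved across failure
        self.volatile: State = {}                   # lost on failure
        self.check = check                          # hypothetical invariant
        self.halted = False                         # failure status

    def step(self, transition: Callable[[State, State], None]) -> bool:
        """Run a transition on copies; commit only if the result is valid."""
        if self.halted:
            return False  # a halted processor produces no more output
        new_stable = copy.deepcopy(self.stable)
        new_volatile = copy.deepcopy(self.volatile)
        try:
            transition(new_stable, new_volatile)
            ok = self.check(new_stable)
        except Exception:
            ok = False
        if not ok:
            # Halt BEFORE the erroneous transition becomes visible.
            self.halted = True
            self.volatile = {}
            return False
        self.stable, self.volatile = new_stable, new_volatile
        return True

    def restart(self) -> None:
        """Recover from stable storage only; volatile state starts empty."""
        self.halted = False
        self.volatile = {}


def withdraw_30(stable: State, volatile: State) -> None:
    stable["balance"] -= 30
    volatile["last_withdrawal"] = 30


def withdraw_1000(stable: State, volatile: State) -> None:
    stable["balance"] -= 1000  # would violate the invariant


p = FailStopProcessor({"balance": 100}, check=lambda s: s["balance"] >= 0)
p.step(withdraw_30)
p.step(withdraw_1000)
print(p.halted, p.stable, p.volatile)
```

Because the bad transition ran on a copy, stable storage still holds the last valid state when the processor halts, which is what lets another processor take over from it.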
- Big question: Evaluation
- How do you evaluate reliability / fault tolerance?
- General Approaches
- Point 1: where do you provide fault tolerance?
i. In the application
ii. In a library
iii. In the OS
iv. In the HW
- Point 2: what is protected?
i. Application persistent state
ii. Application transient state
iii. OS transient state
iv. Hardware state
- Fault Avoidance
i. Prevention: make sure bugs never enter code
1. Type safe languages
2. Software engineering techniques
ii. Removal: Remove bugs from existing programs
1. Bug finding tools
a. Model checking
b. Theorem proving
2. Code reviews
iii. Work-around
1. Firewall
2. Human workaround
- Fault Tolerance
i. Redundancy – execute multiple times
1. Single-version / Multi-version
2. Spatial / temporal
ii. Diversity:
1. N-version programming
2. Recovery blocks
iii. Isolation: confine errors to a single component
iv. Modularity: keep components small
v. Error detection
vi. Recovery
1. Forwards / Backwards
2. Concealing / revealing
3. Basic approaches:
a. Logging / retry
b. Checkpoint / restore
c. Replicate (process pairs)
d. Alternate versions
e. Transactions (undo)
f. Reveal faults up the stack
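Several of the ideas above (recovery blocks, checkpoint / restore, alternate versions, backward recovery, revealing faults up the stack) fit together in one small sketch. This is an illustration under stated assumptions, not a definitive implementation: `recovery_block`, `RecoveryBlockFailed`, the two sort routines, and the acceptance test are all hypothetical names, and the checkpoint is an in-memory deep copy rather than anything durable.

```python
import copy
from typing import Any, Callable, Dict, Sequence

State = Dict[str, Any]


class RecoveryBlockFailed(Exception):
    """Every alternate failed; the fault is revealed up the stack."""


def recovery_block(
    state: State,
    alternates: Sequence[Callable[[State], None]],
    acceptance_test: Callable[[State], bool],
) -> str:
    """Try each alternate in turn; roll back to the checkpoint on failure."""
    checkpoint = copy.deepcopy(state)  # checkpoint before any alternate runs
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):  # error detection
                return alternate.__name__
        except Exception:
            pass  # treat a crash the same as a failed acceptance test
        # Backward recovery: restore the checkpointed state in place.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailed("all alternates failed the acceptance test")


def fast_sort(state: State) -> None:
    # Hypothetical buggy primary version: silently drops duplicates.
    state["data"] = sorted(set(state["data"]))


def simple_sort(state: State) -> None:
    # Independently written alternate: in-place insertion sort.
    data = state["data"]
    for i in range(1, len(data)):
        j = i
        while j > 0 and data[j - 1] > data[j]:
            data[j - 1], data[j] = data[j], data[j - 1]
            j -= 1


def make_acceptance_test(expected_len: int) -> Callable[[State], bool]:
    def test(state: State) -> bool:
        d = state["data"]
        in_order = all(a <= b for a, b in zip(d, d[1:]))
        return len(d) == expected_len and in_order
    return test


state = {"data": [3, 1, 2, 3]}
used = recovery_block(state, [fast_sort, simple_sort], make_acceptance_test(4))
print(used, state["data"])  # prints: simple_sort [1, 2, 3, 3]
```

The primary's failure is concealed from the caller because an alternate succeeds after rollback. Only when every alternate fails does the exception reveal the fault to the layer above.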