Fault Tolerance

CS 736 – Spring 2006

 

Lecture 20:  Fault Tolerance

 

  1. General Terms
    1. Model software as a stochastic process that fails randomly

                                              i.     Although each failure has a specific cause, there is enough randomness in the system that this is a good approximation

    1. MTTF (mean time to failure) == reliability: how long the system stays continuously available
    2. MTTR (mean time to repair) == time to repair
    3. MTBF (mean time between failures) = MTTF + MTTR, deprecated
    4. Availability = MTTF / (MTTF + MTTR)

                                              i.     99%         ~3.65 days of downtime per year

                                             ii.     99.9%       ~9 hours per year

                                           iii.     99.99%      ~1 hour per year

                                           iv.     99.999%     ~5 minutes per year

                                            v.     99.9999%    ~30 seconds per year

                                           vi.     Note: can get high availability by shrinking MTTR or growing MTTF
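
The availability formula and the downtime figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the 365-day year are assumptions made for illustration, not part of the lecture.

```python
# Assumption: a 365-day year; the names below are illustrative only.
SECONDS_PER_YEAR = 365 * 24 * 3600


def availability(mttf, mttr):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def downtime_per_year(avail):
    """Expected seconds of downtime per year at a given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for avail in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        secs = downtime_per_year(avail)
        print(f"{avail:.6f}  {secs / 3600:10.3f} hours  ({secs:12.1f} s)")

    # Shrinking MTTR by 10x buys the same availability as growing MTTF by 10x.
    print(availability(1000, 1), availability(10000, 10))
```

The last line illustrates the note above: a system with MTTF 1000 and MTTR 1 has the same availability as one with MTTF 10000 and MTTR 10 (in any consistent time unit).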

    1. Fault – problem with HW or SW
    2. Error – corrupted state caused by executing faulty HW or SW

                                              i.     Latent – not read, or read but no impact

                                             ii.     Effective – caused a failure

    1. Failure – visible problem, incorrect behavior, due to error
  1. Failure models:
    1. All depends on what "correct" means. Often not specified
    2. Timing – miss a deadline
    3. Output – produce incorrect output
    4. Omission – skip an output
    5. Crash – skip an output, produce no more output
    6. Byzantine

                                              i.     Anything can happen, including malicious behaviors

    1. System response:

                                              i.     Fail-stop processors map all failures onto crash failures

1.   Halt on failure: the processor halts before making an erroneous state transition

2.   Failure status: the failure of a processor can be detected

3.   Stable storage: processor state is separated into volatile storage, which is lost on failure, and stable storage, which is preserved uncorrupted.
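
The three fail-stop properties can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not a real fail-stop implementation: the class `FailStopProcessor`, the `Halted` exception, and the `invariant` callback (standing in for whatever error detection the hardware or software provides) are all hypothetical names.

```python
import copy


class Halted(Exception):
    """Raised when the (simulated) processor has halted."""


class FailStopProcessor:
    def __init__(self, invariant):
        self.invariant = invariant  # hypothetical error detector
        self.volatile = {}          # lost on failure
        self.stable = {}            # preserved uncorrupted on failure
        self.failed = False         # failure status, visible to others

    def step(self, transition):
        """Apply one state transition, halting instead of committing a bad one."""
        if self.failed:
            raise Halted("processor has halted")
        # Run the transition on copies so an erroneous result is never committed.
        new_volatile = copy.deepcopy(self.volatile)
        new_stable = copy.deepcopy(self.stable)
        try:
            transition(new_volatile, new_stable)
            ok = self.invariant(new_volatile, new_stable)
        except Exception:
            ok = False
        if not ok:
            self._halt()
            raise Halted("erroneous transition detected")
        self.volatile, self.stable = new_volatile, new_stable

    def _halt(self):
        self.failed = True   # property 2: failure can be detected
        self.volatile = {}   # property 3: volatile state is lost


if __name__ == "__main__":
    p = FailStopProcessor(lambda v, s: s.get("balance", 0) >= 0)
    p.step(lambda v, s: s.update(balance=10))
    try:
        p.step(lambda v, s: s.update(balance=-5))  # violates the invariant
    except Halted:
        print("halted; stable storage:", p.stable)
```

Because the bad transition ran on copies, stable storage still holds the last good state after the halt, which is what lets another component recover from it.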

  1. Big question: Evaluation
    1. How do you evaluate reliability / fault tolerance?                 

 

  1. General Approaches
    1. Point 1: where do you provide fault tolerance

                                              i.     In the application?

                                             ii.     In a library?

                                           iii.     In the OS?

                                           iv.     In the HW?

    1. Point 2: what is protected?

                                              i.     Application persistent state

                                             ii.     Application transient state

                                           iii.     OS transient state

                                           iv.     Hardware state

    1. Fault Avoidance

                                              i.     Prevention: make sure bugs never enter code

1.   Type safe languages

2.   Software engineering techniques

                                             ii.     Removal: Remove bugs from existing programs

1.   Bug finding tools

a.    Model checking

b.    Theorem proving

2.   Code reviews

                                           iii.     Work-around

1.   Firewall

2.   Human workaround

    1. Fault Tolerance

                                              i.     Redundancy – execute multiple times

1.   Single-version / Multi-version

2.   Spatial / temporal

                                             ii.     Diversity:

1.   N-version programming

2.   Recovery blocks

                                           iii.     Isolation: confine errors to a single component

                                           iv.     Modularity: keep components small

                                            v.     Error detection

                                           vi.     Recovery

1.   Forwards / Backwards

2.   Concealing / revealing

3.   Basic approaches:

a.    Logging / retry

b.    Checkpoint / restore

c.    Replicate (process pairs)

d.    Alternate versions

e.    Transactions (undo)

f.     Reveal faults up the stack
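
Several of the ideas above (recovery blocks, checkpoint / restore, alternate versions, and revealing faults up the stack) fit together in one small sketch. This is a minimal illustration under assumptions: `recovery_block`, `RecoveryBlockFailed`, and the example sort routines are hypothetical names, and the state is assumed to be a plain dictionary that can be deep-copied.

```python
import copy


class RecoveryBlockFailed(Exception):
    """All alternates failed; the fault is revealed to the caller."""


def recovery_block(state, acceptance_test, alternates):
    """Try each alternate in turn; roll back to the checkpoint on failure."""
    checkpoint = copy.deepcopy(state)          # checkpoint before starting
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):         # error detection
                return state
        except Exception:
            pass                               # treat a crash like a failed test
        # Backward recovery: restore the checkpointed state.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailed("no alternate passed the acceptance test")


def buggy_sort(state):
    # Primary version with a deliberate bug: it drops the largest element.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state):
    # Alternate version: slower in principle, but correct.
    state["data"].sort()


def make_sorted_test(expected_len):
    def test(state):
        d = state["data"]
        return len(d) == expected_len and all(a <= b for a, b in zip(d, d[1:]))
    return test


if __name__ == "__main__":
    state = {"data": [3, 1, 2]}
    recovery_block(state, make_sorted_test(3), [buggy_sort, simple_sort])
    print(state["data"])
```

Here the primary version fails the acceptance test, the state is restored from the checkpoint, and the alternate version succeeds, which conceals the fault from the caller. If every alternate fails, the exception reveals the fault up the stack with the original state restored.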