Nooks

-       Does it require reading individual drivers?

-       Performance hit?

 

Notes from reviews:

 

  1. Nooks
    1. Approach:

                                              i.     Improve reliability by tolerating dominant cause of failure

1.   Don't bother making everything reliable

2.   Try to make it integrate well with existing OS

3.   Make it compatible with existing drivers / OS / applications

                                             ii.     Key pieces

1.   Isolation / fault containment: prevent driver from corrupting OS/application

2.   Recovery: get driver running again after a failure

  1. How to do modularization?
    1. Device drivers
    2. Existing modules
    3. Known to cause errors
  2. How to do isolation?
    1. Isolation

                                              i.     Lightweight kernel protection domains

                                             ii.     Prevent driver from writing to OS

                                           iii.     Allow writes to driver-private data

                                           iv.     XPC (extension procedure call): invoke code in another domain

    1. Interposition

                                              i.     Inject code transparently

                                             ii.     Like VMM – but boundary is kernel/driver

                                           iii.     Done at load time, not compile time

1.   Note: can choose where to put it!

                                           iv.     Wrappers on driver/kernel interface

                                            v.     Result:

1.   Recompile driver because binary interface changes (macros -> functions)

2.   Pretty much no code changes to drivers

                                           vi.     QUESTION: What happens when modules invoke other modules?
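
A minimal sketch of load-time interposition, written in Python rather than kernel C. Every name here (`KERNEL_API`, `kmalloc`, `load_driver`, `example_driver_init`) is a hypothetical stand-in, not Nooks code. The point it illustrates: the loader binds the driver's imports to wrappers, so the driver source is unchanged and the wrapper is the place where a domain switch or parameter check could go.

```python
from typing import Callable, Dict, List

# Hypothetical kernel entry point that a driver may call.
def kmalloc(size: int) -> bytearray:
    return bytearray(size)

KERNEL_API: Dict[str, Callable] = {"kmalloc": kmalloc}

call_log: List[str] = []  # records every call that crossed the boundary

def make_wrapper(name: str, target: Callable) -> Callable:
    """Wrap one kernel entry point (a real wrapper would also switch domains)."""
    def wrapper(*args):
        call_log.append(name)   # interposed code runs first
        return target(*args)    # then the call is forwarded to the kernel
    return wrapper

def load_driver(driver_init: Callable[[Dict[str, Callable]], object]):
    """Bind the driver's imports to wrappers at load time, not compile time."""
    wrapped = {n: make_wrapper(n, f) for n, f in KERNEL_API.items()}
    return driver_init(wrapped)

# An unmodified "driver": it only sees the symbol table it was given.
def example_driver_init(api: Dict[str, Callable]) -> int:
    buf = api["kmalloc"](64)
    return len(buf)

result = load_driver(example_driver_init)
```

Because the binding happens in `load_driver`, the same driver could be loaded with or without wrappers, which is one reading of "can choose where to put it".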

    1. Object tracking

                                              i.     Allow safe-sharing

                                             ii.     Validate shared parameters

                                           iii.     Map between kernel and driver-private data

                                           iv.     QUESTION: What happens on a multi-processor?
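
A rough sketch of the object-tracking idea, assuming shared kernel objects can be modeled as Python dicts. The class and method names (`ObjectTracker`, `share`, `driver_view`, `sync_to_kernel`, `release_all`) are hypothetical. It shows the three roles listed above: the driver works on a private copy, handles are validated, and the tracker maps between the kernel object and the driver copy.

```python
import copy
from typing import Any, Dict

class ObjectTracker:
    def __init__(self) -> None:
        self._kernel_objs: Dict[int, Any] = {}    # handle -> kernel object
        self._driver_copies: Dict[int, Any] = {}  # handle -> driver-private copy
        self._next_handle = 1

    def share(self, kernel_obj: Any) -> int:
        """Register a kernel object and give the driver a private copy."""
        handle = self._next_handle
        self._next_handle += 1
        self._kernel_objs[handle] = kernel_obj
        self._driver_copies[handle] = copy.deepcopy(kernel_obj)
        return handle

    def driver_view(self, handle: int) -> Any:
        """Validate the handle before the driver may touch the object."""
        if handle not in self._driver_copies:
            raise KeyError(f"untracked object handle {handle}")
        return self._driver_copies[handle]

    def sync_to_kernel(self, handle: int) -> None:
        """Copy the driver's changes back (assumes dict-like objects)."""
        self._kernel_objs[handle].update(self.driver_view(handle))

    def release_all(self) -> int:
        """Drop every tracked object, e.g. when unloading a failed driver."""
        count = len(self._kernel_objs)
        self._kernel_objs.clear()
        self._driver_copies.clear()
        return count

tracker = ObjectTracker()
packet = {"len": 0}
handle = tracker.share(packet)
tracker.driver_view(handle)["len"] = 128   # driver writes its private copy
before_sync = packet["len"]                # kernel object not yet changed
tracker.sync_to_kernel(handle)
after_sync = packet["len"]                 # change is now visible to the kernel
```

The sketch has no locking, so it says nothing about the multi-processor question above.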

  1. How to detect failures?
    1. HW: processor fault
    2. SW

                                              i.     Bad parameter

                                             ii.     Excessive resource consumption

    1. External

                                              i.     Human

                                             ii.     SW agent
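
The hardware and software detectors above can be sketched as one checked call. This is an illustration only: `detect_failure` and its parameters are hypothetical, a Python exception stands in for a processor fault, and the return value of the driver operation stands in for a memory request passed to the kernel. External detection (a human or software agent) is not modeled.

```python
from typing import Callable, Optional

def detect_failure(driver_op: Callable[[], object], alloc_budget: int) -> Optional[str]:
    """Run one driver operation; return a failure reason, or None if it looks fine."""
    try:
        requested = driver_op()      # bytes the driver asks the kernel for
    except Exception as exc:         # stand-in for a processor fault
        return f"hardware fault: {type(exc).__name__}"
    if not isinstance(requested, int) or requested < 0:
        return "bad parameter"
    if requested > alloc_budget:
        return "excessive resource consumption"
    return None

reasons = [
    detect_failure(lambda: 1 // 0, 100),   # faults while running
    detect_failure(lambda: -1, 100),       # passes a nonsensical size
    detect_failure(lambda: 500, 100),      # exceeds its budget
    detect_failure(lambda: 10, 100),       # well behaved
]
```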

 

  1. How to do recovery?
    1. Normal approach (without nooks): what is it?

                                              i.     Reboot

    1. Alternatives:

                                              i.     unload driver

    1. Restart driver

                                              i.     Unload completely

                                             ii.     Protection domains and object tracking allow complete unloading without driver help

1.   Like how the OS can clean up after a process

                                           iii.     Restart driver

1.   Needs user-level knowledge of how to restart

2.   Issues: where does configuration data come from?

a.    Solved in shadow drivers

    1. Shadow drivers

                                              i.     Restart, then replay the log to roll forward to the state at the crash
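
A minimal sketch of restart plus log replay, assuming driver state is just a configuration dict. `Driver`, `ShadowDriver`, `configure`, and `recover` are hypothetical names, not the shadow-driver implementation. It also suggests an answer to the configuration-data question: the shadow recorded the configuration calls as they passed through.

```python
from typing import Dict, List, Tuple

class Driver:
    """Stand-in for a real driver; all its state is lost on restart."""
    def __init__(self) -> None:
        self.config: Dict[str, int] = {}

    def configure(self, key: str, value: int) -> None:
        self.config[key] = value

class ShadowDriver:
    """Sits between kernel and driver, logging state-changing calls."""
    def __init__(self) -> None:
        self.driver = Driver()
        self.log: List[Tuple[str, int]] = []

    def configure(self, key: str, value: int) -> None:
        self.log.append((key, value))       # record before forwarding
        self.driver.configure(key, value)

    def recover(self) -> None:
        self.driver = Driver()              # restart: fresh instance
        for key, value in self.log:         # replay to the pre-crash state
            self.driver.configure(key, value)

shadow = ShadowDriver()
shadow.configure("mtu", 1500)
shadow.configure("promisc", 1)
crashed = shadow.driver                     # pretend this instance failed
shadow.recover()
```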

  1. Issues:
    1. Performance overhead

                                              i.     Where does it come from?

1.   New code in system

a.    Wrappers

b.    Object tracking

c.    Domain change (change page table)

2.   Existing code running slower

a.    More TLB misses

b.    More cache misses due to copying

    1. Implementation overhead
    2. Dependence on interface stability
    3. Assumptions

                                              i.     Are drivers fail-stop?

1.   What if driver writes bad data to device?

                                             ii.     Are driver failures heisenbugs?

                                           iii.     Can we virtualize this interface? Is it too ugly?

    1. What happens to applications?
    2. QUESTION: What is the real contribution?

                                              i.     Pointing out that drivers are the problem

                                             ii.     Pointing out that compatible driver isolation is possible

                                           iii.     Pointing out that driver isolation can have reasonable performance

                                           iv.     Pointing out the importance of recovery

  1. Evaluation
    1. Fault-injection for testing ability to detect faults / recover

                                              i.     QUESTION: is this a good technique?

                                             ii.     QUESTION: What do we learn from these results?

1.   Nooks stopped the faults we injected

                                           iii.     What are the limitations?

1.   How realistic are faults?

a.    Didn't wait a long time for faults to have an effect

2.   How realistic is the fault distribution?

a.    Uniform distribution across fault types

3.   How realistic was recovery?

a.    Reloaded same code w/o faults
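
A toy fault-injection harness, to make the limitations above concrete. The fault types and function names are hypothetical and are not the faults used in the paper. Faults are drawn uniformly across types, and every injected fault fails immediately, so a 100% containment rate here says little about faults that corrupt state silently or surface much later.

```python
import random
from collections import Counter

FAULT_TYPES = ["null_deref", "bad_index", "bad_arith"]  # hypothetical fault classes

def faulty_driver(fault: str) -> int:
    """Driver operation with one injected fault; each one fails right away."""
    data = [1, 2, 3]
    if fault == "null_deref":
        return None.field        # AttributeError, like a null dereference
    if fault == "bad_index":
        return data[10]          # IndexError, like an out-of-bounds access
    if fault == "bad_arith":
        return 1 // 0            # ZeroDivisionError
    return sum(data)

def run_trials(n: int, seed: int = 0) -> Counter:
    rng = random.Random(seed)    # fixed seed so runs are repeatable
    outcomes: Counter = Counter()
    for _ in range(n):
        fault = rng.choice(FAULT_TYPES)   # uniform across fault types
        try:
            faulty_driver(fault)
            outcomes["undetected"] += 1
        except Exception:
            outcomes["contained"] += 1
    return outcomes

outcomes = run_trials(30)
```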

    1. Performance

                                              i.     Need to show speedup / CPU utilization separately

                                             ii.     Otherwise the CPU increase is masked in tests that are not CPU-bound

                                           iii.     QUESTION: what about multiple drivers at once?
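
A worked example of why the two metrics need separate reporting. The numbers are made up for illustration and are not measurements from the paper; the helper names are hypothetical. In an I/O-bound test the throughput can stay flat while the CPU cost per unit of work roughly doubles.

```python
def relative_throughput(base_tput: float, new_tput: float) -> float:
    """Throughput with isolation relative to the baseline (1.0 = unchanged)."""
    return new_tput / base_tput

def cpu_per_unit(cpu_util: float, tput: float) -> float:
    """CPU utilization spent per unit of throughput."""
    return cpu_util / tput

# Hypothetical I/O-bound test: same throughput, higher CPU utilization.
baseline = {"tput": 100.0, "cpu": 0.10}
isolated = {"tput": 100.0, "cpu": 0.20}

rel = relative_throughput(baseline["tput"], isolated["tput"])
cost_ratio = (cpu_per_unit(isolated["cpu"], isolated["tput"])
              / cpu_per_unit(baseline["cpu"], baseline["tput"]))
```

Reporting only `rel` would suggest no overhead; `cost_ratio` shows the extra CPU that would matter once the machine becomes CPU-bound, for example with multiple isolated drivers at once.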

    1. Complexity:

                                              i.     22,000 lines of code. Is this a lot or a little?

    1. QUESTION: how do you balance performance drops and increases in reliability / availability?
  1. QUESTION: What is your take?
    1. Paper issues

                                              i.     Could have written paper as "How to make Linux device drivers execute reliably"

1.   Talk about changes to Linux data structures

                                             ii.     Instead, presented as:

1.   Architecture

a.    Generic approach, not many choices

b.    E.g. could use virtual machines, could use software fault isolation, could use Java

2.   Implementation

a.    Specific set of choices, specific OS, specific isolation technique

                                           iii.     Makes paper more general, stronger