Paper Discussion - Fall 2008 - CS 736: Improving the Reliability of Commodity Operating Systems

Improving the Reliability of Commodity Operating Systems

Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Oct. 2003.

Reviews due Tuesday, 12/2.

Posted by Michael Swift on November 25, 2008 10:58 AM

Comments

In this paper the authors present Nooks, an operating system subsystem that allows existing OS extensions to execute safely in commodity kernels. The major goal of Nooks is to provide isolation and recovery and be compatible with existing systems and extensions. The authors describe the architecture of Nooks and then present experimental results that demonstrate that Nooks provides a substantial reliability improvement.

Extensions such as device drivers have become increasingly prevalent in commodity systems such as Linux and Windows. Extensions are a leading cause of operating system failure. For example, in Windows XP, drivers cause 85% of reported failures. For these reasons, the authors decided to create Nooks, a subsystem that reduces the number of crashes due to extensions and that can be used in today’s platforms.

One of the paper's main contributions is that its architecture can be applied to commodity operating systems. The authors don’t propose a new extension architecture. Instead, a new reliability layer, the Nooks Isolation Manager (NIM), is inserted between the extensions and the OS kernel. NIM prevents extension errors from damaging the kernel through its isolation mechanisms: every extension executes within its own lightweight kernel protection domain. A new kernel service, the Extension Procedure Call (XPC), transfers control between protection domains. The Nooks interposition mechanisms, implemented as wrapper stubs, ensure that all control flow goes through XPC and that all data transfer is managed by Nooks' object-tracking code. The object-tracking functions control all modifications to the data structures that extensions manipulate and provide object information for cleanup when an extension fails. Nooks also supports recovery functions, which detect and recover from a variety of extension faults. Another contribution of the paper is the test methodology. The authors use synthetic fault injection to insert faults into Linux kernel extensions, and they also manually insert common faults. They test a variety of cases, such as system crashes, non-fatal extension failures, and recovery errors. Finally, they measure the impact of their method on system performance, using several kinds of benchmarks to evaluate the cost.

A flaw of the system is that it is designed to protect the OS from misbehaving extensions, but not to detect erroneous extension behavior. As a result, Nooks can’t recover from this kind of fault. However, Nooks still performs better even when faults of this type occur (Figure 7).

The main techniques used in the paper are the insertion of a reliability layer that ensures isolation, interposition, and recovery; the XPC mechanism; and wrapper stubs, among others. Nooks significantly improved the reliability of the system: the experimental results show that 99% of crashes were eliminated. The tradeoff is added overhead, which in some cases is substantial. Experiments with various benchmarks showed that the performance penalty can reach 60%. As the authors state, when performance matters more than reliability, isolation may not be appropriate.

Posted by: Avrilia Floratou | December 2, 2008 07:55 AM

Improving the reliability of Commodity Operating systems

Summary

The paper addresses the problem of driver-caused unreliability in commodity
operating systems. It proposes executing drivers in lightweight protection
domains for fault resistance and tracking the driver's memory accesses so
that a failed driver can be shut down cleanly and recovered. The lightweight
protection domains are implemented with a synchronized copy of the kernel
page table and Extension Procedure Calls (XPC).

Description of the problem being solved

The paper seeks a practical solution to the unreliability of commodity
operating systems. The focus is on drivers because they constitute a major
portion of modern operating systems, are written by third-party developers,
and are often not well tested or mature.

Contributions of the paper

1) Novel lightweight protection domains as a mechanism for isolating drivers
in the popular macrokernel operating systems, where drivers share the same
address space as the kernel. These protection domains are heavier than
threads (because of the page table copy and object copying) but lighter than
processes.

2) Extension procedure calls (XPC) as an approach to restricting driver
accesses to kernel memory. Object accesses from extension to kernel are
converted to XPCs. Accesses to heavily used objects are optimized by
maintaining shadow copies that are synchronized at the start and end of each
XPC.

3) The solution proposed needs minimal or no change to the tons of existing
drivers. So, it is backward compatible. This is a huge win for easy adoption
of the solution.

4) A method for recovering the extension after faults: object tracking allows
the extension's state to be cleaned up safely. Also, in a subsequent paper
the author proposes monitoring the state of the extension to build shadow
device drivers. Hence recovery of faulty device drivers is seamless and
automated, which is a huge plus.
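The shadow-copy idea in contribution 2 can be sketched as a toy Python model. This is not Nooks code: `xpc_call`, the dictionary standing in for a kernel object, and the two driver functions are hypothetical names chosen for illustration, and the sketch assumes the copy-back simply does not happen when the extension faults.

```python
import copy

def xpc_call(kernel_obj, extension_fn):
    """Run extension_fn on a private shadow copy of kernel_obj (a dict).

    The object is copied in before the call and copied back only when the
    call returns normally, so a faulting extension never leaves a
    half-updated kernel object behind.
    """
    shadow = copy.deepcopy(kernel_obj)   # synchronize at XPC start
    result = extension_fn(shadow)        # extension touches only the shadow
    kernel_obj.clear()
    kernel_obj.update(shadow)            # synchronize at XPC end
    return result

def good_driver(pkt):
    pkt["len"] += 4
    return "ok"

def buggy_driver(pkt):
    pkt["len"] = -1                      # corrupts its own copy...
    raise RuntimeError("driver fault")   # ...then faults before returning

packet = {"len": 60, "proto": "tcp"}
xpc_call(packet, good_driver)
after_good = dict(packet)

try:
    xpc_call(packet, buggy_driver)
except RuntimeError:
    pass
after_bad = dict(packet)
```

In this model the kernel's `packet` reflects the good driver's update but is untouched by the buggy one, which is the property the shadow copies are meant to provide.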

Flaws in the paper

1) The reliability of the wrappers themselves is in question. Writing
wrappers and object access optimizations across thousands of driver classes
and driver versions is a huge task.

2) Copying objects during XPC adds overhead. It is also unclear whether Nooks
has architectural support for good performance on all architectures. For
example, x86 pays the cost of a TLB flush during XPC calls.

3) New devices and upgraded versions come to market often, and each needs
driver changes. Version changes in drivers might require changes to the Nooks
layers for that driver too. However, a solution that works seamlessly for all
versions of drivers is a hard problem.

Techniques used to achieve performance

1) Abstraction : Lightweight protection domains are a layer of abstraction
that provide the necessary isolation for the driver extensions.

2) Batching : Batching is used as an optimization technique reducing the
number of XPC calls.
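As a rough illustration of the batching point, the sketch below queues calls and crosses the protection boundary once per batch instead of once per call. The class `DeferredXPC`, the `transfer` callback, and the `crossings` counter are hypothetical names for this example only and do not correspond to the Nooks API.

```python
class DeferredXPC:
    """Queue calls and deliver them in one simulated domain crossing."""

    def __init__(self, transfer):
        self._transfer = transfer   # function that performs one crossing
        self._pending = []
        self.crossings = 0          # how many crossings were paid for

    def defer(self, fn, *args):
        # Record the call; nothing crosses the boundary yet.
        self._pending.append((fn, args))

    def flush(self):
        # Deliver every queued call in a single crossing.
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        self.crossings += 1
        return self._transfer(batch)

def run_batch(batch):
    # Stand-in for the target domain executing the batched calls in order.
    return [fn(*args) for fn, args in batch]

queue = DeferredXPC(run_batch)
for i in range(5):
    queue.defer(pow, i, 2)
results = queue.flush()
```

Five deferred calls cost one crossing here, so the fixed per-crossing cost (such as a TLB flush) is amortized over the batch.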

Tradeoff made

1) Performance vs. Reliability: Additional page table maintenance, object
tracking, object copying during XPC calls, and conversion of memory accesses
from extension to kernel into XPC calls all come with a cost but improve
reliability.

2) Backward Compatibility Vs Performance: A solution that works with today's
device drivers with minimal changes achieves backward compatibility at the
cost of performance. For example, use of sophisticated languages or advanced
architectures might provide similar reliability at the cost of backward
compatibility/adoption.

Another part of OS where this technique could be applied

This technique can be applied in all systems where third-party, untrusted
code is executed on a common platform. Some apt places are:

1) Browsers and plugins : Reliable browsers are following approaches of
isolating plugins through processes ( e.g. OP Browser at UIUC by Prof. Sam
King ).

2) Facebook platform: Applications written by third-party developers should
not crash the Facebook site.

3) Google App Engine: Third party website hosted should not crash the hosting
platform.

Posted by: Leo Prasath Arulraj | December 2, 2008 07:48 AM

Summary:
This paper describes Nooks, a reliability subsystem that prevents the majority of driver-caused crashes with little or no change to existing driver and system code. It isolates existing drivers within lightweight protection domains inside the kernel address space, preventing them from corrupting the kernel, and keeps track of each driver’s use of kernel resources so that it can roll back in case of failure.

Problems:
Most computer crashes are attributable to kernel extensions. OS extensions are becoming a primary component of the system; drivers are complex and hard to write, are written by programmers with less experience in kernel organization, and are prone to errors. The core operating system kernel is becoming more robust, but extensions aren’t, and they are the major cause of failures. Thus, improving system reliability requires tolerating faults in device drivers and falling back gracefully in case of failure.

Contributions:
• The major contribution is executing each extension in its own lightweight protection domain, with the same privileges as the kernel but limited write access to kernel data structures. The kernel and extension domains communicate through Extension Procedure Calls (an LRPC-style mechanism).
• The Nooks interposition mechanism ensures that all communication between kernel and extensions goes through XPC via wrappers and that all data transfer between them is seen and managed by the object-tracking code.
• The object-tracking code maintains a list of the kernel data structures an extension uses, controls the manipulation done by extensions, and provides the objects needed for cleanup when an extension crashes.
• Recovery functions detect software faults, such as an extension invoking a kernel service improperly, and hardware faults, such as a processor exception raised by reading unmapped memory. Faulty behavior can also be detected from outside Nooks, i.e. by a user or a program. Recovery functions use the object-tracking code to recover.
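The interplay between object tracking and recovery described in these bullets might look roughly like the following. It is a simplified model under stated assumptions: `ObjectTracker`, `call_extension`, the `"e1000"` label, and the resource names are hypothetical, and a Python exception stands in for a processor exception.

```python
class ObjectTracker:
    """Remember which kernel resources each extension currently holds."""

    def __init__(self):
        self._owned = {}   # extension name -> list of (resource, release_fn)

    def record(self, ext, resource, release_fn):
        self._owned.setdefault(ext, []).append((resource, release_fn))

    def release_all(self, ext):
        # Release in reverse order of acquisition and report what was freed.
        released = []
        for resource, release_fn in reversed(self._owned.pop(ext, [])):
            release_fn(resource)
            released.append(resource)
        return released

def call_extension(tracker, ext, fn, *args):
    """Invoke extension code; on a fault, clean up instead of crashing."""
    try:
        return ("ok", fn(*args))
    except Exception:
        # Recovery path: free everything the failed extension was holding.
        return ("failed", tracker.release_all(ext))

freed = []
tracker = ObjectTracker()
tracker.record("e1000", "skb-1", freed.append)
tracker.record("e1000", "irq-9", freed.append)

def faulty_probe():
    raise MemoryError("bad pointer dereference")

status = call_extension(tracker, "e1000", faulty_probe)
```

The caller gets an error status back and the tracked resources are released, rather than the fault propagating into the rest of the system.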

Flaws: Developing and testing Nooks requires considerable effort and may itself introduce errors. Nooks hurts performance, and the paper could have included a table comparing its performance against other approaches to driver reliability.

Techniques: Reliability is achieved by adding a layer of indirection between the kernel and extensions and by using an object tracker to keep track of kernel data structures and the changes extensions make to them, so that the system can roll back on failure.

Tradeoffs: They traded off performance for reliability.

Another area of application: Object tracking and recovery function is similar to transaction, and thus can be applied to transactional memory and similar areas. This feature can also be applied to any user application.

Posted by: Rachita Dhawan | December 2, 2008 07:46 AM

Nooks' core contribution is a viable design for driver isolation that does not require modifications (almost ever) to the host operating system or the drivers (outside of extension interface redirection). The implication is clear: reliability can be significantly improved via Nooks without obsoleting most of the extension code which enables the OS commercially. This is an immediately realizable item of research, which in concert with its contributions to the space probably earned it the Best Paper Award.

Nooks' central principles are important. First, that it does not seek to prevent all faults, but rather most of them. Second, that it deliberately ignores the existence of malicious extensions. Nooks is designed as a safety net for implementation errors made by well-meaning extension developers, not a perfect fault-containment system. By making this clear, Swift ignores the chase for a perfect solution and instead chases something which can be implemented and which has the potential to achieve its goals.

Isolation is achieved in Nooks by manipulating page tables to restrict an extension's ability to write in the kernel address space. This is where Nooks starts to impose a performance penalty, because changing page table entries invalidates the TLB and must be triggered by introducing a shim between the two ends of existing kernel-to-extension calls. Lightweight Remote procedure calls are the mechanism chosen to introduce this shim (similar to the sort suggested by a Bershad paper, cited), and thus Nooks inherits the safety provided by LRPC and also the low cost of that safety. The performance penalty that is measurable with Nooks is directly related to the rate of remote procedure calls. Notable, however, is that this penalty is only realized when the system is under heavy duress, and untagged TLB flushing on x86 ("kill everything at once") seems to be the greater cause, rather than the actual work performed by Nooks.
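The page-table approach to isolation can be modeled in a few lines. This is a conceptual sketch, not how Nooks or any real MMU is programmed: the dictionary "page tables", the page names, and the helpers `make_extension_table` and `write` are all assumptions made for illustration.

```python
class ProtectionFault(Exception):
    """Raised when a domain writes to a page it may only read."""

def make_extension_table(kernel_table, extension_pages):
    # Start from a copy of the kernel's mappings, downgraded to read-only,
    # then grant read-write access to the extension's own pages.
    table = {page: "r" for page in kernel_table}
    for page in extension_pages:
        table[page] = "rw"
    return table

def write(table, memory, page, value):
    # Every write is checked against the current domain's page table.
    if table.get(page) != "rw":
        raise ProtectionFault(page)
    memory[page] = value

kernel_table = {"kernel_text": "rw", "kernel_heap": "rw"}
memory = {"kernel_heap": 0, "ext_heap": 0}
ext_table = make_extension_table(kernel_table, ["ext_heap"])

write(ext_table, memory, "ext_heap", 7)        # allowed: extension's own page
try:
    write(ext_table, memory, "kernel_heap", 99)
    blocked = False
except ProtectionFault:
    blocked = True                             # stray kernel write is caught
```

Switching between `kernel_table` and `ext_table` is the step that, on real x86 hardware without tagged TLBs, forces the flush this review identifies as the main cost.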

Nooks' incredibly high 99% reported success rate is, notably, only for system crashes and livelocks. Driver failures where the system does not crash or generate an exception are not detected by Nooks, and this includes a large class of errors where the device is left in a non-functioning state. This is the most Nooks can do without external aid or application-specific instruction, because the only guaranteed common reaction to failure is a processor exception or a system crash. Similarly, recovery when restarting an extension is not always effective; the file system tested in the paper was particularly prone to corruption when a fault caused a restart of the driver. Unfortunately, this possibility makes Nooks a scary addition to a production server. Here, an application-specific recovery plan reduced the number of bad recoveries from 90% to 10%, which suggests that there is an alternative solution (though it may not be general).

Nooks cannot stop a driver from arriving at an otherwise inconsistent state because it is not naturally aware of the invariants of that code, and that inconsistent state may be problematic for recovery. Application-specific approaches work for recovery, but they must be written and they must be themselves correct. An alternative may be available from the programming languages world: Dynamic invariant detection (such as "Dynamically Discovering Likely Invariants," by Ernst et al). The analysis of Daikon is unsound, and may over-report the set of actual invariants, but with the promised results Nooks could detect internal faults before they generate an unrecoverable failure.

Posted by: Tack | December 2, 2008 07:37 AM

Swift et al present Nooks, a mechanism for improving the reliability of existing operating systems. Nooks provides isolation and recovery, without sacrificing compatibility with existing driver code.

Nooks had to overcome a significant set of problems. First, the authors created special wrapper code on entry functions for protection of transfers between the kernel and drivers. Second, they created an object tracking interface that manages access to various kernel data structures. The monolithic design of Linux which manipulates data structures from multiple locations directly made implementing this change a lot harder. Finally, the special interaction of driver code with the kernel, where multiple control transfers happen for a single operation motivated the design of a lightweight protection mechanism with an associated procedure call component to guarantee isolation.

The major contribution of the work is that it addresses a real problem and gives a feasible solution. Evidence of this is the design for backward compatibility instead of relying on a new kernel, a new architecture or type-safe languages.

One part of the paper that could be improved is the evaluation methodology. The authors state that reliability is increasingly important because of the high costs associated with failures. However, small desktop computers with bugs in sound card or network drivers will rarely cause expensive failures. I doubt that a user would ever complain to the IT support group if his sound card would suddenly stop working but everything can be fixed by a restart. Recovery, therefore, has small value in this environment. Isolation is still important, as avoiding a "blue screen of death" in the event of a random driver failure will improve user satisfaction, but it can be achieved with simpler and more conventional protection mechanisms. The real value of this work can be demonstrated by investigating driver failures on commercial installations, which might be different than the failure model used in the paper. Finally, I am not convinced that backwards compatibility is as important: Windows 2000 had an entirely different driver architecture than Windows 98, but every hardware vendor that wanted to stay in business ported his drivers to the new architecture.

The system design makes it straightforward to use existing driver code with the Nooks layer to achieve better reliability. On the other hand, this comes at significant programming effort. Maybe a transactional interface would both guarantee atomicity and make bookkeeping easier by delaying effects of all operations until commit.

Posted by: Spyros Blanas | December 2, 2008 06:30 AM

Introduction:
This paper introduces the Nooks reliability subsystem, which aims to improve the reliability of an operating system by isolating potentially faulty extensions from other parts of the kernel. The Nooks system also provides mechanisms to aid cleaning up after an extension and recovering from an extension fault.

What were they trying to solve:
After performance, reliability is probably the most important consideration in modern operating systems. Most OS faults arise in device drivers, which is a serious problem since device drivers often need to execute in privileged mode and can therefore bring down the whole system when they fault. Also, there is no well-defined way to recover gracefully from a crash.

Contributions:
Extensions are isolated from each other and the kernel through lightweight kernel protection domains within the kernel address space.
Maintaining backward compatibility with existing systems is a primary design goal. Communication between the kernel and an extension is carried out via extension procedure calls (XPC). The Nooks interposition mechanism converts existing data interchange into XPCs using generated wrapper stubs.
Object Tracking keeps track of what resources are being used by an extension. This also ensures isolation by copying data back and forth between the kernel and the extension.
The Nooks recovery manager handles both software and hardware faults. For hardware faults, recovery is always triggered, whereas for software faults whether or not recovery is triggered is encoded as a policy.
Deferred call mechanism for batching multiple function calls for performance.
Sharing of wrapper code across similar drivers in the same class.
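The recovery-manager rule above (hardware faults always trigger recovery, software faults consult a policy) can be written down as a small decision function. The `Fault` enum, `should_recover`, and the policy key `recover_on_software_fault` are hypothetical names used only for this sketch.

```python
from enum import Enum

class Fault(Enum):
    HARDWARE = "hardware"   # e.g. a processor exception
    SOFTWARE = "software"   # e.g. a failed parameter check in a wrapper

def should_recover(fault, policy):
    """Decide whether a detected fault triggers extension recovery."""
    if fault is Fault.HARDWARE:
        return True                      # always recover from hardware faults
    # Software faults defer to the configured policy (default: do not recover).
    return policy.get("recover_on_software_fault", False)

strict = {"recover_on_software_fault": True}
lenient = {}

decisions = [
    should_recover(Fault.HARDWARE, lenient),
    should_recover(Fault.SOFTWARE, lenient),
    should_recover(Fault.SOFTWARE, strict),
]
```

Keeping the software-fault decision in data rather than code is one way to let an administrator tune it per extension without touching the recovery manager.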

Flaws:
As noted in the paper itself, putting extensions in different domains causes a performance hit on x86 systems.
The overhead of copying data between kernel and extensions.

Techniques used to improve performance:
Modularizing and Isolating systems for reliability.
Batching multiple calls to minimize overhead.

Tradeoffs:
Maintaining Backward compatibility vs Higher Reliability
Performance vs Reliability: making the extensions run in non-privileged mode would increase reliability but would incur additional overhead.

Another part of the OS where the technique could be applied:
Any software where there are additions/extensions/plugins which are embedded into the application: Browsers, Audio Players etc

Posted by: priyananda | December 2, 2008 04:46 AM

Summary

The paper describes the architecture and implementation of Nooks, a reliability subsystem that substantially enhances OS reliability by isolating the OS from driver failures. It assesses the improvement in reliability that Nooks offers and measures Nooks' impact on performance. Nooks recovers from about 99% of the faults that crashed Linux.

Problem attempted

A significant percentage of the crashes in commodity operating systems is due to driver failures. The objective of the paper is to develop a reliability subsystem that is compatible with existing operating systems and drivers and that can eliminate almost all of the system crashes caused by buggy drivers (and extensions in general).

Contributions

1) The Nooks project addresses the reliability problem in OSes by adding a transparent subsystem in the OS that makes the OS fault resistant against buggy driver codes. The goals of this subsystem are isolation of kernel from driver failures, automatic recovery from driver failures and compatibility with existing OSes.

2) Nooks achieves isolation by running extensions in a lightweight protection domain that has kernel privileges but read-only access to kernel memory. A synchronized copy of the kernel page table is maintained for each extension domain to give extensions read access to the kernel.

3) An extension Procedure Call (XPC) is used to transfer control between the kernel and extension domains. XPC manages control transfer through two functions - one for kernel to extension and the second one for the reverse.

4) Nooks interposition mechanism ensures that all control transfers between kernel and extension take place through XPC and all writes to kernel objects is managed by object tracker. The object tracker records all kernel objects in use by an extension.

5) Transparency - the essential aspect of backward compatibility - is implemented through wrappers. A wrapper performs three crucial tasks: validating the parameters by checking with the object tracker that the pointers are valid, creating a copy of kernel objects within the extension's protection domain, and performing an XPC.
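The three wrapper tasks in item 5 can be sketched as follows. This is a hedged toy model: `make_wrapper`, `InvalidPointer`, the integer "pointers", and the `set_mtu` extension function are hypothetical, and a plain function call stands in for the XPC.

```python
import copy

class InvalidPointer(Exception):
    """Raised when a parameter does not refer to a tracked kernel object."""

def make_wrapper(tracked_objects, extension_fn):
    """Build a stub with the same call shape as extension_fn."""
    def wrapper(handle):
        # 1. Validate the parameter against the object tracker's records.
        if handle not in tracked_objects:
            raise InvalidPointer(hex(handle))
        # 2. Copy the kernel object into the extension's domain.
        private = copy.deepcopy(tracked_objects[handle])
        # 3. Transfer control (a plain call stands in for the XPC here).
        return extension_fn(private)
    return wrapper

def set_mtu(dev):
    dev["mtu"] = 9000
    return dev["mtu"]

tracked = {0x1000: {"dev": "eth0", "mtu": 1500}}
wrapped = make_wrapper(tracked, set_mtu)

result = wrapped(0x1000)
try:
    wrapped(0xDEAD)            # never registered with the tracker
    rejected = False
except InvalidPointer:
    rejected = True
```

Because the caller invokes `wrapped` exactly as it would invoke `set_mtu`, neither side needs to know the stub is there, which is the transparency the item describes. Copy-back of modified objects is deliberately left out of this sketch.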

Flaws

1) The paper does a good job of self-critiquing: the authors report the overhead imposed by changing protection domains on an extension procedure call (particularly the TLB flush that occurs on x86).

2) The authors also mention the necessity to observe kernel-extension interaction for every extension to determine the set of objects to be tracked by the object tracker.

The performance measurements indicate that the overhead in performance caused by Nooks can be as high as 60%. But I believe that this is a reasonable and inevitable trade off for what Nooks gives in return - substantially higher reliability

Tradeoffs

1) Compatibility vs Completeness - A complete isolation of the operating system would eliminate all crashes due to drivers but such an implementation will almost certainly necessitate proposing new architectures that are incompatible with existing OSes.

2) Reliability vs Performance - This is an almost necessary tradeoff that has to be made by any implementation that seeks to improve reliability

Techniques used

1) Transparency through stubs - Writing stubs is a general technique used for providing transparency. For example, RPCs use stubs to mimic the local procedure call mechanism

Posted by: Balasubramanian Sivan | December 2, 2008 04:04 AM

Summary:
85% of Windows XP failures are caused by drivers. Nooks is a reliability subsystem for commodity OSes that helps isolate the OS from extension failures. In 2000 fault-injection tests, it recovered automatically from almost 99% of faults. Backward compatibility and the use of lightweight kernel protection domains are two major features of Nooks.

Problem:
This paper presents Nooks, which improves OS reliability by isolating the OS from driver failures. Nooks differs from earlier work in two important ways. First, it targets existing commodity OSes instead of new extension architectures (such as Singularity, exokernel, etc). Second, it works with extensions written in C (the New Jersey approach) rather than requiring type-safe languages.

Contributions:
Nooks seeks to achieve three major goals. The first is isolation of the kernel from driver failures; detecting failures in the Nooks Isolation Manager layer helps achieve this.
The second is automated recovery from driver failures, which enables running applications to continue execution even if they depend on a failed extension. Recovery is done in two phases: disabling interrupt processing for the particular device, followed by execution of a user-mode recovery agent.

Last but not least is backward compatibility, which enables Nooks to apply to existing systems and extensions with minimal changes to each (22K lines of code, compared to a 30M-line code base for Linux kernel 2.4).

Other than this, Nooks presents an implementation for Linux, a commodity OS. This clearly shows the magnitude of effort behind Nooks. Things like implementing NIM in ring 0 to facilitate portability show good foresight :-).

Flaws:
As pointed out by the authors, Nooks recovers from 99% of the system crashes. But the sample space for the experiments behind this result is just synthetic random bugs. How these bugs were generated is somewhat unclear in the paper, as is how closely they represent real-world scenarios. What about the remaining 1% (4 deadlock scenarios): are those the actual common cases?
The authors mention portability to Solaris and Windows. However, aggressive policies (such as Windows XP's policy of crashing on a kernel processor exception) will clearly force changes to the recovery mechanisms.
The authors mention using configuration files for user-mode recovery. What are these, and how are they generated or written?

Techniques Used:
Overall a great paper. (SOSP '03 best paper!)
Wrappers and XPC stubs are symbolic of isolation primitives. Automated recovery mechanisms for system and non-fatal crashes are symbolic of fault-resistance against mistakes and not abuses (malicious drivers). Protection for virtual memory and TLB accesses using conventional page-table architectures.

Tradeoff:
Reliability vs. performance. For kernel and device driver developers, reliability is of the utmost importance until they have working extensions. Performance can be optimized later, either with faster hardware or by fine-tuning the driver with more efficient data structures or techniques. Nooks focuses on reliability.

Alternative Uses:
Sand-boxing and resource containers follow similar principles for isolation. LRPC is along similar lines as is XPC. Reference validator (OSDI '08) is somewhat along similar lines as in NIM. Protection domains are used in the form of processes for almost all OS.

Posted by: Mohit Saxena | December 2, 2008 02:18 AM

Improving the Reliability of Commodity Operating Systems

SUMMARY
This paper introduces Nooks, a subsystem that improves reliability for commodity OSes with kernel extensions. It isolates extensions in lightweight protection domains in the kernel so that their faults do not crash the system; when a fault is detected, a recovery mechanism is invoked. This is achieved by introducing a reliability layer that sits between the kernel and the extensions: the Nooks Isolation Manager (NIM). Nooks can detect and recover from 99% of the faults that make the Linux kernel crash.

PROBLEM
While the cost of failures in current systems keeps rising, extensions are becoming more abundant and are the most common cause of failure in commodity OSes. To improve reliability, the authors propose the NIM, which must satisfy three goals: isolation of the kernel from extension failures, automatic recovery from failures, and compatibility with existing systems and extensions.

CONTRIBUTIONS
It introduces a subsystem that supports current C-extensions and runs on a commodity OS without special hardware support. No changes are necessary to the hardware, OS or the existing extensions. This is a feature that most proposed solutions did not have.

They implement NIM that has four main functions:
Isolation: implements the protection domain with virtual memory and a control transfer mechanism (XPC) between kernel and extensions.
Interposition: all the control flow between extensions and the kernel happens through the XPC and all the data is followed by the object tracking function.
Object tracking: keeps track of the kernel structures used by the extension and uses that information to perform recovery.
Recovery: mechanism that can return control to the extension with an error code or execute a recovery program for extension faults. Returns the system and extension to a known state.

They use wrapper stubs between kernel and extension functions. These wrappers provide transparency to the kernel and the extension by providing the same API as the function calls and hide the Nooks layer.

Detailed testing is done by injecting faults into very different extensions: device drivers (network and sound cards), a kernel subsystem (file system), and an application-specific kernel extension (Web server). The results show that system crashes are reduced by 99% and non-fatal faults by 60%.

FLAWS
Although no architectural modifications are needed to the Linux kernel or to the extensions, both still need some modification. It is not very clear which modifications some of the extensions need. All the testing is performed with synthetically injected faults. It is also not clear whether the tests were run with all the extensions loaded simultaneously.

PERFORMANCE
They provide reliability through an additional layer between the kernel and the extensions that isolates them, tracks extension operations to detect faults, and runs a recovery mechanism when a fault is detected.
Compatibility and transparency are chosen over completeness. Making Nooks compatible with existing OS and extensions makes it not be able to detect all the faults.
Object tracking, shadow copies and wrappers can be used in transactions, servers systems and remote procedure calls.

Posted by: Paula Aguilera | December 2, 2008 02:16 AM

Summary
This paper describes Nooks, a fault resistant, transparent virtualization layer between the kernel and device drivers and other kernel extensions, enabling recovery from many driver and extension failures.

Problem
Device drivers are a common source of OS failures; they are often written by less experienced programmers or programmers not involved in writing the kernel. It is difficult, if not impossible to test the interactions between all the sets of kernel extensions that will be run together on real configurations. In order for a reliability subsystem to be viable for existing systems with many existing drivers and extensions, there must not be a prohibitive impact on performance and we would like to leave the code of existing drivers and extensions unchanged.

Contributions
Nooks aims to isolate the kernel from failures in drivers and extensions, to recover from such failures, and to be backward compatible. An isolation mechanism provides an extension with a lightweight kernel protection domain, with the same processor privilege as the kernel but limited write access. The interposition mechanism provides transparency; that is, stubs in the interposition mechanism appear as the kernel extension API to the extension and as the extension’s entry points to the kernel. Object-tracking functions enforce that kernel objects are copied into the extension’s domain (and copied back after changes have been made). This allows resources to be released after a failure. A recovery agent, by default, attempts to reload and restart the extension, although other actions may be taken.

Flaws
While I am convinced that, by and large, transparency is achieved, I am left curious as to why one of the eight extensions required changes and what the nature of those changes was.

Techniques & Tradeoffs
The techniques and tradeoffs pursued are largely as a result of the identified design concerns, specifically backward compatibility and transparency in the kernel-extension interface, specifically from the viewpoint of the extension. For these, we must trade a bit of performance; not all extensions must use the isolation manager - the authors suggest that the performance tradeoff can be evaluated on an extension by extension basis. For many common drivers, the performance hit is fairly negligible. As well, the authors gain some simplicity and performance by a design that attempts to prevent and recover from most, but not all failures, and by focusing design efforts on preventing mistakes rather than abuse.

Posted by: Sam Javner | December 2, 2008 02:04 AM

Summary:
This paper introduces a new kernel subsystem, Nooks, whose goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver and system code. To achieve this, Nooks uses a variety of methods to isolate the OS, including such classic hits as protection domains, a form of LRPC dubbed Extension Procedure Call (XPC), and wrapper stubs. Nooks isolates extensions within lightweight protection domains inside the kernel address space while requiring little or no change to extension and kernel code; its solution is practical, backward-compatible and efficient.
The goal the paper was trying to deal with:
Computer reliability is still an unsolved problem, and the cost of failures continues to rise. OS extensions have become prevalent, and they are the main cause of system failures. For example, drivers cause 85% of failures in Windows XP, while in Linux device drivers trigger 7x as many errors as the rest of the kernel. For these reasons, the goal of this paper is to implement Nooks, which can eliminate most downtime caused by drivers, prevent system crashes by using isolation, and keep applications running by using recovery.
Contributions
1. This paper designs and builds a new kernel subsystem that prevents the majority of driver-caused crashes, requires no changes to existing drivers, requires only minor changes to the OS, and minimally impacts performance.
2. The paper presents a subsystem that isolates the OS from device driver failures by executing each driver in a lightweight kernel protection domain, which is a privileged kernel-mode environment with restricted write access to kernel memory. This approach provides isolation between the kernel and device drivers without compromising backward compatibility.
3. Nooks' interposition mechanisms are implemented with wrapper stubs around either kernel-supplied or driver-supplied functions. Wrapper code sharing is also an important feature of Nooks: by sharing wrapper code among different drivers in a class or across classes, the total amount of code added to the kernel is greatly reduced.
4. Nooks tracks a driver's use of kernel resources to hasten automatic clean-up during recovery. It also allows existing OS extensions to execute safely in commodity kernels, and it tracks and validates all modifications to kernel data structures.
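
The "restricted write access" in item 2 can be pictured with a toy page table. The sketch below is an assumption-laden illustration, not how Nooks is implemented: `Memory`, `ProtectionFault`, and the page names are hypothetical, and the rule that the kernel may write every page is a modeling assumption.

```python
class ProtectionFault(Exception):
    """Raised when a domain writes to memory it does not own."""


class Memory:
    """Toy page table: every page is readable by all, writable only by its owner."""

    def __init__(self):
        self.pages = {}   # page name -> (owner domain, value)

    def map_page(self, name, owner, value=0):
        self.pages[name] = (owner, value)

    def read(self, domain, name):
        return self.pages[name][1]        # reads are unrestricted

    def write(self, domain, name, value):
        owner, _ = self.pages[name]
        # Assumption: the kernel domain may write anywhere; extensions may not.
        if domain != "kernel" and owner != domain:
            raise ProtectionFault(f"{domain} may not write {name}")
        self.pages[name] = (owner, value)


mem = Memory()
mem.map_page("kernel_runqueue", owner="kernel", value=42)
mem.map_page("driver_buffer", owner="e1000", value=0)

mem.write("e1000", "driver_buffer", 7)         # allowed: the driver's own page
print(mem.read("e1000", "kernel_runqueue"))    # allowed: kernel memory is readable
try:
    mem.write("e1000", "kernel_runqueue", 0)   # blocked: a kernel page
except ProtectionFault as fault:
    print("fault:", fault)
```

A stray write becomes a catchable fault instead of silent corruption, which is what gives the recovery code something to react to.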
Flaws:
(1) The first flaw in this paper is that recovery might be safe only for dynamically loaded extensions.
(2) The second flaw is that parameter checking of Nooks is incomplete.
(3) Because extensions run in kernel mode, they may execute privileged instructions and may loop forever (note that Nooks does detect livelock).
(4) The performance section of the paper shows that the execution time of most benchmark programs goes up due to Nooks mechanisms. One of the benchmarks was even slowed down by a factor of two. Of course reliability is very important, but we should question whether it is reasonable to gain somewhat more reliability at the cost of a factor-of-two reduction in speed.
The techniques used to achieve performance:
Isolation, interposition, object tracking and recovery are the main techniques used to achieve performance in this paper. A lightweight kernel protection domain is implemented, and write access is confined to a limited portion of the kernel’s address space. The Nooks interposition mechanisms make sure that all control flow between the kernel and extensions goes through the XPC mechanism and that all data flow between the kernel and extensions is managed by Nooks’ object-tracking code; extensions and the kernel communicate through wrapper stubs. Object tracking maintains a list of kernel data structures that are manipulated by an extension, controls all modifications to those structures, and provides object information for cleanup when an extension fails. Object-tracking code also verifies the type and accessibility of each parameter being passed. Since extensions are decoupled from the kernel, Nooks can freely release extension-held kernel structures, such as objects or locks, during the recovery process.
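
As a rough illustration of how object tracking supports cleanup, the sketch below keeps a per-extension list of held resources and releases the leftovers when the extension fails. `ObjectTracker` and the resource names are hypothetical, and releasing in reverse acquisition order is an assumption made here, not something the paper is quoted as requiring.

```python
class ObjectTracker:
    """Records which kernel resources each extension currently holds."""

    def __init__(self):
        self.held = {}   # extension name -> list of (kind, handle)

    def acquire(self, ext, kind, handle):
        self.held.setdefault(ext, []).append((kind, handle))

    def release(self, ext, kind, handle):
        self.held[ext].remove((kind, handle))

    def recover(self, ext):
        """Forget a failed extension and return what must be freed, newest first."""
        return list(reversed(self.held.pop(ext, [])))


tracker = ObjectTracker()
tracker.acquire("sound", "lock", "dsp_lock")
tracker.acquire("sound", "buffer", 0x1000)
tracker.acquire("sound", "irq", 5)
tracker.release("sound", "buffer", 0x1000)   # normal release path

# The extension faults; recovery frees whatever it had not released itself.
print(tracker.recover("sound"))
```

Because the tracker, not the extension, holds the authoritative list, cleanup does not depend on the failed extension's own (possibly corrupted) state.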
Tradeoffs:
(1) Nooks does not provide a complete isolation or fault tolerance for all possible extension errors. Nooks runs extensions in kernel mode for backward compatibility, so it cannot prevent extensions from deliberately executing privileged instructions that corrupt system state.
(2) Nooks does not prevent infinite loops inside the extension, but it does detect livelock between the extension and the kernel with timeouts.
(3) Nooks checks parameters passed to the operating system, but it cannot do a complete job given Linux semantics (or lack thereof). Its current implementation of recovery is limited to extensions that can be killed and restarted safely.
(4) Nooks improves reliability at the cost of increased execution time.
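
The timeout idea in tradeoff (2) can be sketched with a step budget. This is a deliberately simplified model, assuming the supervisor can observe each unit of extension work: `run_with_budget` is a hypothetical helper, and generator steps stand in for elapsed time.

```python
import itertools


def run_with_budget(steps, budget):
    """Drive an extension step by step; declare it hung once it exceeds `budget`."""
    used = 0
    for _ in steps:
        used += 1
        if used > budget:
            return "timed out"
    return "completed"


def well_behaved():
    for _ in range(3):
        yield    # three units of work, then the call returns


def stuck():
    for _ in itertools.count():
        yield    # never finishes on its own


print(run_with_budget(well_behaved(), budget=10))
print(run_with_budget(stuck(), budget=10))
```

A timeout can only report that progress has stopped; deciding what to do next (for example, restarting the extension) is a separate recovery policy.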

Posted by: Tao Wu | December 2, 2008 01:25 AM

Summary: Based on the observation that OS extensions/drivers are the most common source of failures, Nooks proposes a best-effort approach to improving the reliability of commodity OSes, with minimal changes to the kernel and existing extensions. The approach is to interpose an Isolation Manager (NIM) between the kernel and extensions. NIM performs sanity checks and maintains the state needed for error recovery.

Problem: OS extensions are often less reliable than the kernel, and yet they execute with full kernel privileges, which makes them able to bring the entire system down. Academic solutions involving specialized hardware, languages or OS redesign don't account for the reality of large amounts of commodity software (OS and extensions) that won't be easily replaced. Moreover, errors will still happen and recovery is desirable.

Contributions: Nooks shows that even under the constraints of legacy software, a best-effort approach can have a huge impact on reliability. It makes the distinction between malicious code and simply buggy code, and tries to address the latter. It proposes lightweight (hard to break by honest mistakes) protection domains that share the kernel address space, and it makes error recovery a design goal. The proposed design philosophy (the use of a reliability layer) is general, but a Linux implementation is offered as proof of concept. I also liked the idea that a user program is allowed to use Nooks when it sees fit.

Reliability Techniques: Isolated drivers execute in protection domains that offer read access to the entire kernel address space but write access only to their own domain. Extension Procedure Calls (XPC) ensure safe transfer of control between the kernel and extensions. An XPC modifies the page tables to restrict write rights to certain areas and saves the caller's context on a stack in the new domain. Copies of kernel pages may be required to allow writes. A reliability layer monitors the communication between the kernel and the drivers. Errors are detected as bad parameters, timeouts, or page faults caused by bad writes; clean termination (resource reclamation, garbage collection) and various recovery mechanisms may then be attempted. This requires additional bookkeeping by an object tracker.
XPC calls are batched to improve performance.
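
Batching can be sketched as a queue that is drained in a single simulated domain crossing. The class below is a hypothetical illustration: `BatchedXPC`, `call_deferred`, and the batch size of 4 are invented here and do not correspond to real Nooks or Linux interfaces.

```python
class BatchedXPC:
    """Queue deferred calls and deliver them in one simulated domain crossing."""

    def __init__(self, target, batch_size):
        self.target = target
        self.batch_size = batch_size
        self.pending = []
        self.crossings = 0

    def call_deferred(self, arg):
        self.pending.append(arg)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.crossings += 1              # one domain switch for the whole batch
        for arg in self.pending:
            self.target(arg)
        self.pending = []


sent = []
xpc = BatchedXPC(target=sent.append, batch_size=4)
for packet in range(10):
    xpc.call_deferred(packet)
xpc.flush()                              # deliver the final partial batch
print(len(sent), "packets in", xpc.crossings, "crossings")
```

Ten calls cost three crossings here instead of ten, which is the kind of saving that matters when each crossing implies a page-table change and TLB flush. The price is added latency for calls waiting in the queue.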

Tradeoffs: Performance suffers both because of the CPU overhead of wrapper work and because of TLB flushes.

Another place where the technique could be/was applied: wrappers for allocation routines.
Although this is a potentially general isolation technique (applicable to other software that executes extensions), it needs kernel privileges in the current implementation.

Weaknesses:
- I think that the main weakness is that it requires a lot of 'inside knowledge' about the isolated drivers. On one hand Nooks attempts to not change existing extensions, but on the other hand it seems to require such a deep understanding of the extension that one might just as well change it.
- Security needs to be addressed anyway, so why not a common solution?
- some of the testing
- It would be interesting to evaluate Nooks using the "test of time" (maybe on a University machine) rather than by injecting faults.

Posted by: Daniel Luchaup | December 2, 2008 01:18 AM

Summary:
The paper addresses reliability in commodity operating systems. Specifically it develops a subsystem which can be incorporated into the OS to enhance reliability by furnishing a layer of isolation between the kernel and various extensions.

Problem addressed:
The primary problem addressed is the reliability, or lack of it, of many operating system extensions. In contemporary OSes, exception conditions in an extension often crash the entire OS. A more desirable behaviour would be to terminate the extension without corrupting the kernel.
Likewise, commodity systems lack restartability for isolated entities in the OS, in spite of the research that has gone into restartability. The paper addresses this issue of restartability too.

Contributions:
- The paper introduces a feasible layer of isolation between the kernel and the extension. This is achieved using wrappers, lightweight protection domains and XPC mechanisms.
- Commendable reliability and recovery are achieved with relatively minimal code changes.
- The Nooks layer allows for backward compatibility, thereby making it easier to deploy.

Flaws/Limitations:
The big flaw I see with the approach is performance. The lightweight protection domains require TLB flushes on existing x86-based processors. The call-by-value-result semantics require copying of parameters. Likewise, the XPC mechanism sets up a new stack on which to run the procedure. The way I see it, the original kernel-modules concept was favored in Linux over a microkernel and user-level services approach for performance. With Nooks, a lightweight protection domain seems to me to be only slightly lighter than a separate process, so isolated extensions suffer in terms of performance.

Technique:
The technique used is the addition of a layer of indirection. The Nooks NIM is a new layer between the kernel and the extensions. This layer provides services such as validation, object tracking, and recovery.

Tradeoff:
The tradeoff in running the extension in a separate protection domain is between reliability and performance. The paper achieves higher reliability at the cost of performance.

Alternate uses:
The key idea here is to surround the extension with a wrapper that oversees it and provides reliability and recovery. Thus this technique is applicable in any environment that uses plugins or extensions, e.g., Microsoft Office or the Firefox web browser with their numerous extensions.

Posted by: Varghese Mathew | December 2, 2008 12:40 AM

Summary
In this paper, Swift et al. present a subsystem, Nooks, that increases the reliability of commodity operating systems while remaining compatible with existing hardware and driver modules. Nooks has been designed to gracefully handle and recover from unintentional (not malicious) errors in drivers without requiring a system restart or leaving the system in a corrupted state.

Problem
The majority of crashes in present-day operating systems are caused by bugs in drivers/modules. These modules are not written by the people who actually designed the operating system, and their authors are sometimes not that experienced. So despite efforts to make the operating system code bug-free, these modules cause the entire OS to crash. Existing solutions either require hardware/software modifications (capabilities, microkernels) or totally isolate the subsystem (VMs).

Contributions
- Nooks introduces a layer of indirection at the interface between the kernel and the extensions (drivers/modules). All communication between these two layers passes through the Nooks Isolation Manager (NIM) by using an Extension Procedure Call (XPC).
- NIM helps in isolating the extensions by maintaining a separate domain for each extension and copying the kernel data structures it uses into this space. This prevents failures/bugs in the extensions from causing a crash or corrupting the system.
- XPC provides a lightweight switch from the kernel's protection domain to the extension's domain and vice versa. Wrapper stubs emulate the interfaces on both sides, thereby providing a transparent layer that allows the modules to work without any changes.
- Since all calls to the kernel are made through the wrapper stubs, they can track the space/objects used by each extension and prevent corruption. This ability to track objects also helps in recovery/cleanup after a driver failure.

Flaws
- Maintaining separate isolated domains has a definite negative impact on the performance because of the TLB flush and extra copy.
- Despite the tool that creates the wrapper stubs, each of these methods needs to be manually verified to ensure correct functioning (e.g., not deleting an object that is used across calls, and deleting ones that are used only once). With the increasing number of modules and their complexity, this wouldn't be easy to extend to all modules.

Design
- Nooks adds a layer of indirection to give it control over the interaction between extensions and kernel and isolate the extensions. To some extent, it is like a VM between the kernel and modules.
- To reduce the total number of context switches, Nooks batches the XPCs between the kernel and extensions (especially network drivers).
- Nooks compromises on the performance of the operating system for better reliability and backward compatibility (not using segmentation to avoid TLB flushes).
- Any system that allows plug-ins or untrusted code that is not completely isolated can use the concepts discussed here (e.g., Firefox plug-ins).

Posted by: Tushar Khot | December 1, 2008 11:40 PM

Improving the Reliability of Commodity Operating Systems

Summary
This paper implements a reliability layer between device drivers and the kernel in order to improve reliability in the face of driver failures caused by mistakes rather than abuse. This layer isolates the driver from the kernel with wrapper stubs that the driver interfaces with instead of interfacing directly with the kernel.

Description of Problem
Extensions, also known as drivers or modules, account for 85% of operating system failures. In addition, the cost of failures, in the form of service and help calls, gives real value to designing reliable systems. The number of extensions continues to increase, and the programmers of these extensions typically do not have the experience needed to make them reliable.

Summary of Contributions:
- Implementation of a reliability layer that has little effect on existing extensions
- Developing a way to isolate common programming extension faults that were not malicious
- Authoring a paper that clearly shows the advantages and disadvantages.

Flaws
Although Nooks does reduce system faults, the performance cost of achieving a more reliable system through increased dynamic checking does not persuade me that Nooks is an answer to reliability. To start with, 85% means that 15% of failures will definitely not be helped by this system. In addition, Nooks has a large number of exceptions to solving the reliability problem. For example, Nooks is limited to extensions that can be killed and restarted, unable to handle infinite loops and deadlock, and unable to know if the driver gets stuck in a non-functional state. In the reliability experiments, only 60% of the faults caused by fault injection were resolved, and for the faults injected by hand the paper does not state what failures they caused. The reduction in performance of 25% during kernel compilation, the 18% decrease under high-frequency sending, and the 60% decrease on a kernel HTTP driver are not convincing.
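
The coverage argument above can be turned into a back-of-the-envelope calculation. The sketch multiplies the share of failures attributed to drivers by the fraction of driver faults that are handled. It is illustrative only: the 85% figure was reported for Windows XP, while the 60% figure above and the 99% crash-elimination figure quoted in other reviews in this thread come from Linux fault-injection experiments, so combining them is an assumption. The function name is hypothetical.

```python
def share_of_failures_avoided(driver_share, recovery_rate):
    """Fraction of all OS failures avoided if only driver faults are addressed."""
    return driver_share * recovery_rate


DRIVER_SHARE = 0.85   # share of failures attributed to drivers (Windows XP figure)

for rate in (0.60, 0.99):
    avoided = share_of_failures_avoided(DRIVER_SHARE, rate)
    print(f"recovery rate {rate:.2f}: about {avoided:.2f} of all failures avoided")
```

Under these assumptions the benefit is capped at 85% of failures no matter how good recovery is, and a 60% recovery rate would address roughly half of all failures.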

Techniques
The techniques used to achieve the desired solution include virtualization of the kernel/driver interface and dynamic input and object validation. The main tradeoff is reliability for performance. The technique of virtualizing an interface to change its behavior can be used at nearly any interface. The technique and design presented in this paper could be very valuable as a developer's extension analyzer to help work out bugs prior to software release.

Posted by: Cory Casper | December 1, 2008 11:11 PM

Summary
Despite continuing research in the area, current operating systems still suffer from reliability issues. In "Improving the Reliability of Commodity Operating Systems," Swift et al. describe Nooks, a subsystem implemented in the Linux kernel that improves system reliability by isolating drivers from the kernel.

Problem
Current operating systems are not yet perfect – all too often, these systems suffer crashes and other related reliability problems. Research has provided systems with improved reliability over prominent modern operating systems, but the legacy requirements of many modern systems prohibit full exploitation of some of the reliability research. Swift et al. observed that the vast majority of operating system problems arise from kernel extensions such as device drivers. This being the case, the authors describe a kernel subsystem called Nooks, which attempts to improve reliability by isolating these problematic extensions.

Contributions
· Pragmatically approaching the reliability problem in modern operating systems by developing a method of isolating less reliable components (kernel extensions) and preventing them from causing more severe system problems that could be applied with minimal difficulty to current systems.
· Attempting to automatically recover from failures. By assuming that most problems are not malicious, the authors hope that simply restarting a failed extension will avoid the problem which occurred and thus allow the extension to continue functioning.
· Experimenting on a variety of real extensions (sound card driver, network driver, file system extension, in-kernel web server) in a real operating system (Linux). The results obtained, the elimination of 99% of crashes due to extension failures, help validate this method as a pragmatic approach to improving the reliability of extensions.

Flaws
· Not giving an account of what share of overall crashes extensions account for on Linux. The authors mention that drivers account for 85% of reported failures in Windows XP, but do not give any such specific data for Linux. While the 99% elimination of crashes due to extensions in Linux is quite a notable achievement, it would be valuable to know just how prevalent crashes due to extensions are in Linux.
· The performance hit of Nooks is shown to be significant (in some cases performing about half as well as in the native case). Such a hit may not be acceptable in such a low-level system as the kernel. It seems though, that Nooks could, at the very least, be used in the testing and debugging phase to catch bugs (the authors state that a variety of bugs were found upon implementation of Nooks) and improve reliability in that way.

Techniques
Overall, Nooks willingly gives up performance in order to achieve reliability. It does this by introducing additional code which forces all extension calls to go through wrappers. These wrappers, in turn, allow for monitoring and control of the various system resources that an extension attempts to access. By so doing, Nooks can sometimes prevent system crashes. In a broad sense, these isolation techniques were applied early on with processes, each process having access to a memory space distinct from that of other processes. This idea of isolation may someday be applied to threads, where some combination of the performance benefits of threads and the isolation benefits of processes may be desired.

Posted by: Mark Sieklucki | December 1, 2008 09:30 PM

Summary: The authors of this paper present Nooks, a subsystem for operating systems that sits between the kernel and extensions to improve system reliability. Nooks requires no major architectural changes to current OS structures, but is able to recover from up to 99% of selected crashes through isolation, recovery and backwards compatibility.
Problem to Solve: As systems have evolved, the need for reliability has increased, yet system reliability has decreased with the introduction of support for thousands of physical devices that operating systems weren't originally designed for. Device drivers and extensions have been a primary cause of system crashes. These faults don't have to result in system crashes if extensions are isolated from the kernel.
Contributions: First, the authors of the paper present one of the first extension isolation mechanisms that requires no changes to the current operating system architecture. This is essential because it allows Nooks to be implemented with minor changes to the operating system code, without rewriting several pieces of the OS or the applications/devices which run on it. Secondly, the authors introduce the notion of the Nooks Isolation Manager. This manager is what truly separates the extensions from the kernel, isolating them from each other, and it provides four functions: isolation, interposition, object tracking and recovery. They also introduce XPCs, which are similar to RPCs. XPCs allow for transparent communication between extensions and the kernel, but also allow for object tracking. Another contribution is that this entire recovery process is automated, something that other approaches weren't able to do. This automation reduces the manpower needed to keep these systems up and running and makes recovery quick once the system has detected the fault.
Flaws: One flaw of the paper is that the overhead experienced because of Nooks can decrease throughput by up to 60%. This is a large overhead which could be infeasible in some real-life scenarios. It shows that if CPU usage is already high, it is unlikely that Nooks will be a solid solution to increase reliability. Also, I believe it would have been interesting to explore the CPU utilization further to discover the threshold at which performance begins to degrade, because the degradation is not nearly as high with VFAT or e1000.
Tradeoffs: The major tradeoff here is reliability for increased performance overheads. Although the overhead is small in some cases, it is larger in others, making the system less appealing in those cases. Another tradeoff is that recovery is only applicable to extensions, and only extensions which can be remotely killed and restarted can be kept from crashing the system.

Posted by: Holly Esquivel | December 1, 2008 09:18 PM
