
Improving the Reliability of Commodity Operating Systems

Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. in Proceedings of the 19th ACM Symposium on Operating Systems Principles, Oct. 2003.

Reviews are due for this or the other paper on Thursday, 4/19.

Comments

Summary
This paper introduces the reader to Nooks, the core concept of which is isolating the OS from problems caused by faulty drivers. Nooks uses a variety of tricks to isolate the OS, including such classic hits as protection domains, a form of LRPC dubbed Extension Procedure Call (XPC), and wrapper stubs. The wrapper stubs are particularly attractive, as they enable isolation with minimal changes to driver code.

Problem
Computers crash, and a lot of these crashes are caused by kernel extensions/drivers. Most efforts prior to Nooks seem content to remain largely theoretical, as they require massive changes in existing code or new programming approaches. Nooks plays to the chief virtues of programmers (laziness, impatience, hubris) by offering a solution that requires little to no change and most (if not all) of the benefits of prior efforts.

Contributions
Object tracking, if it doesn't add too much overhead, seems like a big win. Tracking an extension's use of kernel resources allows relatively easy fault recovery and policy creation/management.
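As a rough illustration of why tracking makes recovery easy, here is a toy sketch in Python. It is not the Nooks implementation; the class, method names, and resource names (`ObjectTracker`, `acquire`, `recover`, `buf0`, `irq11`) are all made up for the example. The point is only that if every resource handed to an extension is recorded, cleanup after a fault is a simple walk over that record.

```python
from collections import defaultdict


class ObjectTracker:
    """Toy model of per-extension resource tracking (all names hypothetical)."""

    def __init__(self):
        # extension name -> {resource id: callback that frees the resource}
        self._held = defaultdict(dict)

    def acquire(self, extension, resource_id, release):
        """Record that `extension` now holds `resource_id`."""
        self._held[extension][resource_id] = release

    def release(self, extension, resource_id):
        """Normal path: the extension gives the resource back itself."""
        release = self._held[extension].pop(resource_id)
        release(resource_id)

    def recover(self, extension):
        """Failure path: reclaim everything the extension still holds."""
        leaked = self._held.pop(extension, {})
        for resource_id, release in leaked.items():
            release(resource_id)
        return sorted(leaked)


freed = []
tracker = ObjectTracker()
tracker.acquire("net_driver", "buf0", freed.append)
tracker.acquire("net_driver", "irq11", freed.append)
tracker.release("net_driver", "buf0")      # driver frees one buffer normally
reclaimed = tracker.recover("net_driver")  # driver faults; tracker cleans up the rest
```

The same record could also back a policy check (for example, a cap on how many buffers one extension may hold), which is the "policy creation/management" angle.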

An emphasis on recovery/resistance, rather than tolerance. The authors' premises (mainly that extensions are not malicious) seem to be proven correct by the 99% crash reduction.

Just on a conceptual/personal level I like the sort of "virtualization-lite" approach of Nooks. It allows the OS to schedule and manage memory as it normally would, which eliminates one of the problems with native virtualization. The protection domains seem to offer up similar benefits to "application virtualization" like the JVM.

Flaws
Nooks causes a pretty large performance hit, which makes me wonder if XPC is the right solution. Not that I have a better one... Maybe Nooks could observe fault frequency and somehow gradually allow drivers with low fault rates direct access to the kernel over time (though it wouldn't work so well with nondeterministic bugs)? Failing that, you could just inform Nooks that certain extensions are allowed direct access to the OS.
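To make that suggestion a bit more concrete, here is a minimal sketch of such a policy in Python. Nothing like this is in the paper; `TrustPolicy`, its methods, the threshold, and the driver names are all hypothetical. A driver stays isolated until it has a long enough run of fault-free calls, and an administrator can also whitelist a driver outright.

```python
class TrustPolicy:
    """Hypothetical policy: isolate a driver until it has a long fault-free streak."""

    def __init__(self, calls_needed=1000):
        self.calls_needed = calls_needed
        self.clean_calls = {}   # driver -> consecutive fault-free calls
        self.pinned = set()     # drivers explicitly marked as trusted

    def record_call(self, driver, faulted):
        # Any fault resets the streak back to zero.
        streak = self.clean_calls.get(driver, 0)
        self.clean_calls[driver] = 0 if faulted else streak + 1

    def pin_trusted(self, driver):
        self.pinned.add(driver)

    def needs_isolation(self, driver):
        if driver in self.pinned:
            return False
        return self.clean_calls.get(driver, 0) < self.calls_needed


policy = TrustPolicy(calls_needed=3)
for _ in range(3):
    policy.record_call("driver_a", faulted=False)   # clean streak of 3
policy.record_call("driver_b", faulted=False)
policy.record_call("driver_b", faulted=True)        # streak reset by a fault
policy.pin_trusted("driver_c")                      # administrator override
```

A streak counter like this does nothing about rare nondeterministic bugs: a driver can pass any finite threshold and still fail later, which is the weakness noted above.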

While Nooks is in my estimation a good immediate solution to reliability issues, I wonder if it unintentionally encourages/rewards sloppy programming.

Reliability
It's hard to argue with the experimental results. Nooks does seem to offer large increases in reliability, though at a high performance cost.

Summary

In this paper, the authors describe a subsystem called Nooks, which greatly enhances OS reliability by isolating the operating system from the majority of device driver failures. The authors first present the overall architecture of Nooks and then discuss how it keeps driver failures from crashing the kernel using isolation, interposition, and object tracking. They also discuss how Nooks' recovery function detects and recovers from failures.

Problem Description

Most OS failures are caused by buggy device drivers; it is reported that 85% of Windows XP failures result from device drivers. To eliminate most of these failures, the authors present Nooks, a reliability subsystem that seeks to enhance OS reliability by isolating the OS from device drivers.

Summary of Contributions

The paper's major contribution is a subsystem that isolates the OS from device driver failures by executing each driver in a lightweight kernel protection domain, which is a privileged kernel-mode environment with restricted write access to kernel memory. The main advantage of this approach is that it provides isolation between the kernel and the device driver without compromising backward compatibility.
Another interesting feature of Nooks is its interposition mechanisms, which transparently integrate existing extensions into the Nooks environment. Interposition code ensures that all control flow between the driver and the kernel passes through the Extension Procedure Call (XPC) mechanism. By doing so, Nooks introduces a layer of indirection between the driver and the OS, so it can monitor the control flow between them and identify irregular behavior.
Nooks' interposition mechanisms are implemented using wrapper stubs, which execute either kernel-supplied or driver-supplied functions. An interesting feature of Nooks is wrapper code sharing: the wrapper code is shared among multiple drivers in a class or across classes, which reduces the total amount of code added to the kernel.
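The wrapper idea can be sketched in a few lines of Python. This is only an analogy under stated assumptions, not the Nooks code: `make_wrapper`, `ExtensionFault`, `driver_send`, and the 1500-byte limit are hypothetical, and a plain function call stands in for the XPC domain switch. The stub checks arguments before the call crosses the boundary and turns any failure in the callee into a reported fault rather than a crash.

```python
class ExtensionFault(Exception):
    """Raised when a wrapped call fails or is given bad arguments."""


def make_wrapper(target, validate, on_fault):
    """Build a stub that stands in for `target` (a kernel or driver function)."""
    def stub(*args):
        if not validate(*args):          # check parameters before crossing domains
            on_fault(target.__name__, "bad arguments")
            raise ExtensionFault(target.__name__)
        try:
            return target(*args)         # stands in for the XPC domain switch
        except ExtensionFault:
            raise                        # already reported by a nested stub
        except Exception as exc:         # contain the failure instead of crashing
            on_fault(target.__name__, repr(exc))
            raise ExtensionFault(target.__name__) from exc
    return stub


faults = []


def log_fault(name, reason):
    faults.append((name, reason))


def driver_send(packet_len):
    """Hypothetical driver entry point with a latent bug for large packets."""
    if packet_len > 1500:
        raise ValueError("packet too large")
    return packet_len


# The same make_wrapper is reused for any driver function of this shape,
# loosely mirroring how wrapper code is shared across drivers.
send = make_wrapper(driver_send, lambda n: isinstance(n, int) and n >= 0, log_fault)
```

Calling `send(100)` goes straight through, while `send(-1)` is rejected by the validation step and `send(9000)` trips the driver bug; both failures surface as an `ExtensionFault` and are logged through `log_fault`.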

Flaws

The performance section of the paper shows that the execution time of almost all of the benchmark programs increases due to Nooks' mechanisms. One of the benchmarks was even slowed down by a factor of two, which is quite significant. Reliability is important, but providing it at the cost of a twofold slowdown does not look feasible.

Reliability

The authors' main goal is to provide reliability by isolating device drivers from the operating system and by adding layers of indirection between them. Although they manage to prevent 99% of the failures, this also results in significant performance degradation for some applications.


Paper Review: Improving the Reliability of Commodity Operating Systems [Swift et al.]

Summary:

Based on the observation that device drivers, or kernel extensions, in
commodity operating systems are responsible for as much as 85% of
reported failures, this work proposes Nooks: an in-kernel isolation
manager and a user-mode recovery agent that together keep driver
problems from crashing the operating system and perform a
driver-specific recovery action, resulting in greater reliability.

Problem:

The problem is that commodity operating systems, such as Windows and
Linux, are susceptible to failures caused by errors that arise from
kernel extensions. This is because the drivers have complete kernel
privilege and access to kernel memory.

Contributions:

* Nooks is thorough in that it intercepts kernel-to-extension calls,
extension-to-kernel calls, and even direct kernel data accesses so
that it can validate suspect operations and detect failures. It
achieves a high level of isolation with minimal to no visible change
in the interface between kernel and driver.

* Nooks has fault detection with integrated, configurable recovery
making it a candidate basis for future recovery subsystems in
commodity operating systems.

Flaws:

* The paper states that some of the object tracker code can only be
written "by examining the kernel extension interface". If I understand
correctly, this means that one needs to examine the source code for a
given driver to determine whether it is Nooks-compatible. Moreover, any
given driver, or new version of a driver, may require the Nooks object
tracker to be updated. In that subset of cases, the goal of "backward
compatibility" (and not having to change the driver itself) is undermined
because Nooks instead has to be adapted specifically for that driver.

Reliability Impact:

The evaluation of Nooks in the paper uses 8 drivers and some number
of synthetic device driver faults. In this setting, Nooks avoided as
much as 99% of the crashes seen on unmodified Linux. However, this
protection comes at a performance cost that varies by type of
extension, so the technique may be best suited to workstations that
are not performance critical and that might use a variety of suspect
drivers.
