
U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain Resort, Colorado, December 1995, pp. 40-53.

Reviews due Thursday, 3/8

Comments

Summary
This paper presents a user-level network architecture that aims to minimize the latency and efficiency loss associated with the many small messages used in parallel programs.

Problem
The problem this paper attempts to solve is the large per-message overhead of the traditional UNIX networking architecture, and the difficulty this presents for applications that send many small messages (such as parallel programs, systems using RPC, and networked file systems).

Contributions
The primary contribution of this paper is the demonstration of the necessity of moving the send and receive paths for network messages out of the kernel. This allows buffers and network abstractions to be managed at the user level, allowing each application to choose the facilities that are optimal for the task at hand. In order to make this work for all user processes, U-Net provides each process with virtual access to the networking hardware. This way the role of the system is only to multiplex between processes using network resources, rather than to enforce specific abstractions or mechanisms.

Another important contribution is U-Net Active Messages, an implementation of Generic Active Messages built on top of U-Net. Active Messages is a protocol used for communication in multiprocessor systems that ensures reliable delivery as long as no catastrophic failures occur.

Flaws
Overall I felt that this paper was solid. Probably the largest flaw, as the authors themselves mention, is that their direct-access architecture is difficult if not impossible to implement in a completely correct fashion given the hardware available at the time the paper was written.

Performance
The performance measurements in this paper appear to be properly undertaken and provide solid evidence for the superiority of the proposed approach.

Summary:

The paper presents U-Net, a communication architecture that provides a virtual view of a network interface to enable user-level access to high-speed communication networks. To lower the latencies caused by intermediate software (read: the kernel) processing of packets on high-speed networks, the authors implement U-Net over off-the-shelf hardware.

Problem:

The increased availability of high-speed local area networks shifted the bottleneck in local area communication from the limited bandwidth of the network fabric to the software processing capability of the end hosts.

Contributions:

The authors present a strong case for optimizing the processing time for small messages, citing examples from systems using RPCs, etc. (for bigger messages, the cost of processing is amortized).
The fundamental idea of providing user-level access to the device has motivations and design choices similar to the exokernel approach: higher performance and greater flexibility, with the kernel acting only as a resource multiplexer. Here the authors show a working implementation that achieves the desired goal of full bandwidth utilization within the limitations imposed by the available hardware.
They also show how the flexibility to tweak the TCP/UDP implementations, and better buffer management at the application level, helps achieve better bandwidth utilization than the kernel can provide.

Flaws:

The multiplexing issues between different applications are slightly glossed over by presenting hardware limitations as the problem. The kernel can multiplex any number of applications, albeit slowly. (Isolation and fairness in multiplexing aren't covered by the tagging approach.)
"Small" messages here were on the order of 800 bytes (fairly large packets by today's Ethernet standards, where the MTU is on the order of 1500 bytes). This is attributed to the hardware and may well not be a flaw in their approach.
This approach offers enough flexibility to be misused. (TCP-unfriendly stacks become easier to write at the application level; kernel control of misbehaving hosts becomes difficult and may potentially increase processing time.)

Performance:

The paper is all about maximizing utilization of resources in the form of bandwidth and software processing. The zero-copy mechanisms avoid unnecessary copying; applications manage their own buffers, which are less scarce than kernel buffers, leading to better management for a carefully designed application. The extra level of indirection of going through the kernel is removed. The increased flexibility in using the network interface can lead to custom-designed protocols offering increased performance.

Summary
This paper describes an attempt to increase network utilization by decreasing processing overhead. The general idea is to allow user-level processes access to the networking interface. This approach avoids using the kernel and offers a more flexible development environment (don't need to muck with the kernel).

Problem
As the authors state, the processing overheads introduced by calling the kernel to handle the network stack limit your overall network utilization, particularly in a "high-speed LAN." The overhead makes performance especially abysmal when the majority of your data is small messages trying to be sent as quickly as possible. In existing implementations, you'd have to call the kernel for each message!

Additionally, having the kernel do your network processing decreases the flexibility of the system. By removing the networking stack from the kernel and making it available to user-level processes, the barrier to creating specialized communications protocols is effectively destroyed.

Contributions
Decreasing latency by removing the networking stack from the kernel. To do this in a sensible way, the authors have come up with the idea of endpoints. Endpoints allow user-level processes to behave as though they own the network interface. This is essentially just a handle for the U-Net implementation to use to communicate directly with a process. Applications that do not know about endpoints function normally, and the kernel must be modified to support emulated endpoints. This allows for transparent support of U-Net. Endpoints also allow for a sort of protection mechanism, as an application can only ever use its own endpoint(s).
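The endpoint-as-handle idea can be sketched loosely like this (pure Python; `Endpoint`, `UNetLikeMux`, and the other names are invented for illustration, not U-Net's actual interface). The protection property is simply that a process can only drain endpoints it owns:

```python
# Illustrative sketch of endpoint ownership (invented names, not U-Net's
# real API). Isolation falls out of the abstraction: a process can only
# operate on the endpoints it created.

class Endpoint:
    def __init__(self, owner_pid):
        self.owner_pid = owner_pid   # process that created this endpoint
        self.recv_queue = []         # messages delivered to this endpoint

class UNetLikeMux:
    def __init__(self):
        self.endpoints = []

    def create_endpoint(self, pid):
        ep = Endpoint(pid)
        self.endpoints.append(ep)
        return ep

    def receive(self, pid, ep):
        # A process may only drain endpoints it owns.
        if ep.owner_pid != pid:
            raise PermissionError("endpoint belongs to another process")
        return ep.recv_queue.pop(0) if ep.recv_queue else None
```

In this toy model the "kernel" role shrinks to creating endpoints and enforcing the ownership check, which mirrors the division of labor the paper argues for.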

Direct-Access U-Net is a pretty interesting idea, though it does not appear as though it was implemented at the time this paper was written. The general idea is that to be truly efficient, U-Net should try to avoid any sort of intermediate buffering ("true zero copy"). Data would be sent directly to and from application data structures. The authors state that direct-access was not implemented due to bandwidth limitations of their network.

Flaws
This is perhaps wishful thinking, but U-Net's particular interest in small messages implies the authors had RPC in mind. It would've been nice to have seen benchmarks of a U-Net RPC implementation.

It seems to me as though U-Net would introduce its own overheads. If all an application wanted to do was open up a UDP socket and send datagrams, you'd have to either recreate or link in some sort of UDP library.

As other people have said in their comments, it would've been nice to address U-Net's resource management.

Performance / Relevance
The authors provide a decent set of benchmarks for U-Net, and as desired, U-Net certainly improves latency and flexibility. The authors even went so far as to recreate TCP and UDP, both of which exhibit great improvements in latency and round-trip times. Pretty hard to argue with that, but I can't help but feel as though this general idea of taking things out of the kernel is a step backwards. Sort of along the lines of using assembly languages over a higher level language.

Summary:
The paper talks about U-Net, a user-level virtual view of a network interface that bypasses most kernel abstractions provided by the OS in order to gain a performance improvement.

Problem:
As networks (especially local area networks) became faster, the bottleneck in network communication became the processing done before sending and after receiving a message at the network interface. This is because messages go through multiple layers (and copies) across the different levels of abstraction provided by the kernel. U-Net moves all the network protocols into application space, so they can make use of knowledge of application behavior and/or share buffers with the applications (to avoid copies), and hence get much better performance.

Contributions:
- Moving the kernel out of the way for all network communication except initial setup. This allows low-latency communication, especially for smaller messages.
- Moving all protocol implementations to user space, allowing the implementations to use application-level knowledge and share buffers with applications, avoiding multiple copies across network layers.
- A virtual network interface that gives each process the illusion that it owns the network interface.
- Claims supported through a practical implementation and experiments.
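The buffer-sharing point above can be illustrated with a loose analogy (pure Python; both function names are invented): copying through the kernel duplicates the data at every hop, whereas handing the interface a view onto the application's own buffer does not duplicate anything.

```python
# Toy analogy for the copy-avoidance argument, not real kernel code.

def send_with_copies(app_buf):
    kernel_buf = bytes(app_buf)      # copy 1: application -> kernel
    device_buf = bytes(kernel_buf)   # copy 2: kernel -> device
    return device_buf

def send_zero_copy(app_buf):
    # memoryview shares the underlying storage; no bytes are duplicated.
    return memoryview(app_buf)

data = bytearray(b"small message")
view = send_zero_copy(data)
data[0:5] = b"SMALL"                 # mutation is visible through the view
assert bytes(view[0:5]) == b"SMALL"
```

The real mechanism is DMA directly to and from application buffers, but the accounting is the same: the fewer copies per message, the lower the per-message software overhead.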

Flaws:
- One of the major things missing in the paper is how performance scales on a loaded network. With a loaded network there will be more dropped packets and more retries; it would be interesting to see how protocols implemented at user level cope with the extra buffering and multiple retries.
- From the paper it looks like U-Net requires network hardware support to gain significant performance. For that kind of hardware, implementing a network protocol inside it wouldn't be a big deal either, so I think such hardware plus a thin kernel module may be a better choice for extreme-performance networking.

Relevance:
U-Net achieves what it set out to do: reduce the processing latency for messages. It argues (and proves, at least in some cases) that the kernel is more of an obstacle than an abstraction for fast networks. But the fact remains that most operating systems still keep most of the protocol implementation in the kernel.

Summary:
This paper proposes “U-Net”, a mechanism that virtualizes the network interface at user level, enabling users to communicate in a distributed environment with low latency, efficient bandwidth use, and flexible protocols and programming interfaces.

Problem:
In environments that require low latency and high bandwidth for small packets, such as distributed computing, the cost of going through the kernel (copying packets many times, and a general-purpose network programming interface) was the performance bottleneck.

Contributions:
In this paper, the virtual network interface was implemented not only as an emulated mechanism but also on real network adapters, so it could function at native speed. The fact that they used “off-the-shelf” network interface cards and computers was really important in showing how reasonable and powerful the mechanism is.

Virtualizing and multiplexing the interface provides isolation between processes. Also, when too many services try to use U-Net, the two-level implementation (native and emulated) lets each client service use whichever level suits its importance.

The evaluation measured various situations, and the authors did not claim the system is perfect; they analyzed fairly how well U-Net meets its requirements. They successfully removed communication overhead by simplifying the message path (removing the kernel from it) and showed the flexibility of the mechanism by implementing example protocols.

Flaws:
Since it is easy to create a large number of endpoints, and the endpoints will compete with each other, it might be better to have a mechanism below the application level that manages endpoints to provide fair and safe multiplexing.
The choice of ATM doesn't seem the best for today's environment, but since the system targets very local, high-bandwidth networks, it might be fine.

Performance:
Removing the common kernel path did very well on performance at the time, and the way they did it was impressive. I saw views similar to the exokernel and LRPC papers: providing high performance and flexibility by removing a fixed, general-purpose mechanism.

Summary

The authors propose the U-Net communication architecture to provide efficient low-latency communication by removing the kernel from the communication path, and to offer a high degree of flexibility by moving the entire protocol stack to user space.

Motivation

Due to the increased speed of high-speed LANs, the bandwidth bottleneck shifted from the network fabric to the software path traversed by messages at the sending and receiving ends. Secondly, placing protocol processing in the kernel makes it quite difficult to support new protocols and message send/receive interfaces. The authors address these problems by removing the kernel from the communication path and implementing the protocol stack in user space.

Contributions

Major contributions of this paper are as follows.
1. The authors propose the use of endpoints to communicate with the U-Net network. As a result the kernel is completely removed from the communication path, which improves communication latency. U-Net also supports emulated endpoints that are serviced by the kernel; these consume no additional network interface resources but cannot offer the same level of performance as regular endpoints.
2. Since U-Net advocates removing the kernel from the communication path, it has to enforce its own protection policy. U-Net establishes protection boundaries among the processes accessing the network using endpoints and communication channels. As a result, an application cannot interfere with the communication channels of another application on the same host.
3. U-Net targets “true zero copy”: data can be sent directly out of application data structures without intermediate buffering, and the network interface can transfer arriving data directly into user-level data structures. This optimization also improves communication latency.
4. To show the flexibility of the U-Net architecture, the authors implemented TCP and UDP modules for U-Net. Evaluation shows that the U-Net-based implementations perform significantly better than the traditional implementations and scale well with increasing packet size.
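As a rough picture of what a user-level UDP module like the one in point 4 involves (a toy sketch: the header layout follows RFC 768, but `build_udp_datagram` is an invented name, and the real module would hand the result to a U-Net send queue rather than return it):

```python
import struct

# Toy sketch of user-space UDP: the application builds the 8-byte UDP
# header itself (RFC 768 layout: source port, destination port, length,
# checksum) and would hand header+payload to the network interface
# directly, with no kernel on the path.

def build_udp_datagram(src_port, dst_port, payload):
    length = 8 + len(payload)   # UDP length covers header and payload
    checksum = 0                # checksum is optional over IPv4
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload

dgram = build_udp_datagram(5000, 53, b"query")
assert len(dgram) == 13
```

Because the header is built in user space, an application is free to fuse this step with its own buffer management, which is exactly the kind of application-specific optimization the kernel implementation forecloses.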

Flaws
The authors did not explain the architecture of the user-level network interface very clearly. For example, it is not clear how network resources are managed in the case of a large number of regular endpoint requests. Are there any priorities assigned to a subset of applications? Are they using some other method to allocate regular endpoints fairly among applications?

Performance

The authors pursue a threefold goal. First, they improve round-trip communication latency; since decreasing latency helps bandwidth, this also positively affects bandwidth. Second, they achieve high bandwidth for small-packet communication by keeping the communication overhead low. Finally, since the protocol stack is implemented in user space, applications can achieve further speedups by integrating application-specific information into protocol processing.

Summary
To support more flexibility while reducing the processing overhead of sending messages over a network, a new user-level transport architecture is proposed. The basic architecture is discussed, and implementations of several protocols built on top of the new architecture are compared.

Problems Addressed
With more traditional communication architectures, the processing overhead for each message sent is relatively high, especially when the messages are small. This means that network bandwidth may not be fully utilized due to a processing bottleneck at the interface. This work addresses that by attempting to reduce the amount of processing required per message. The authors also feel applications could benefit from a more flexible architecture that can be optimized for particular applications. To support this, they remove network processing from the kernel and move it into user space. This paper is thus similar to some previously read in class that attempt to determine the optimal set of components to include in the kernel and what set to provide in user space.

Contributions
In order to reduce the processing overhead for a message transfer, the kernel is removed from the critical path, which eliminates the system call overhead and allows buffers to be managed efficiently. Hardware is proposed that multiplexes the physical network interface among the various processes. This provides a virtual network interface for each process and gives the process the illusion that it has sole ownership of the network interface. Processes interacting with the network create endpoints that have memory regions set up to hold messages for sending. These messages are then accessed directly by the network interface and sent over the network. As in the LRPC paper, a lot of the processing overhead is removed by reducing the number of times data must be copied within the system. Ideally data is copied once into the endpoint segment and then sent over the network. Similarly, when receiving, data should be brought in by the network interface and placed in the segment for a particular endpoint; the process associated with the endpoint can then access the data directly.
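The descriptor-queue flow described above might be modeled roughly as follows (pure Python; `Segment` and `FakeNI` are invented stand-ins, not the paper's data structures): the application claims a free buffer in its communication segment, fills it, and posts a descriptor; the network interface consumes descriptors directly, with no system call in between.

```python
from collections import deque

class Segment:
    """Invented model of a communication segment with its three queues."""
    def __init__(self, nbuf, bufsize):
        self.buffers = [bytearray(bufsize) for _ in range(nbuf)]
        self.free = deque(range(nbuf))   # indices of free buffers
        self.send = deque()              # descriptors: (buf_index, length)

    def post_send(self, data):
        i = self.free.popleft()          # claim a free buffer
        self.buffers[i][:len(data)] = data
        self.send.append((i, len(data))) # post a descriptor for the NI

class FakeNI:
    """Stands in for the network interface polling the send queue."""
    def transmit(self, seg):
        i, n = seg.send.popleft()        # consume the next descriptor
        wire = bytes(seg.buffers[i][:n])
        seg.free.append(i)               # return the buffer to the free queue
        return wire
```

The point of the structure is that both sides touch only shared memory on the fast path; the kernel is needed only to set the segment up.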

Flaw
It was mentioned that the number of endpoints the network interface has direct control over is limited, and when all are in use, virtual endpoints are created that are controlled by the kernel. These virtual endpoints have reduced performance due to kernel overhead, and it was not clear how endpoints are distributed to processes when a large number of processes require network resources. In cases such as virtual machines, I could imagine a number of processes all vying for endpoints.

Performance
The primary area of performance addressed here is that of latency. It was the main motivation behind the work to reduce the amount of overhead to process messages and improve the round trip latency of a message. This was accomplished both by removing a lot of the message processing from the kernel and the introduction of a method for buffer management where the amount of copying of the data is significantly reduced.

Summary
The paper advocates direct user access to the networking stack in order to allow the user to best handle transmissions and receptions. The network interface is replaced by a virtual network interface that simply multiplexes access.

Problem
The standard networking stack does not handle small messages well and does not allow the flexibility for users to implement their own policies.

Contributions
The article begins by describing a number of situations where small packet transmission latencies greatly affect application performance. It makes the point that the message overhead for the many small packets transferred is quite significant and could be reduced.
They claim placing the network stack in user control reduces system call overhead and allows better buffer management, which some applications could exploit.

The authors present the idea of a virtual network interface that multiplexes access among many processes. The virtual interface provides isolation by attaching an ID to each packet, which indicates which virtual interface should receive the packet. The ID in this system is the same as the ATM virtual circuit identifier. The system is tested with Active Messages benchmarks for both ATM and IP traffic. U-Net is shown to have performance comparable to other optimized ATM interfaces and was able to interact efficiently with IP.
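The tag-based demultiplexing described here can be sketched as follows (illustrative Python with invented names; in the real system the mux lives in the network interface and the tag is the ATM VCI):

```python
# Sketch of tag-based demux: each incoming packet carries an identifier
# and is delivered to the queue of the matching virtual interface.

class Demux:
    def __init__(self):
        self.queues = {}                 # tag -> list of delivered payloads

    def register(self, tag):
        self.queues[tag] = []

    def on_packet(self, tag, payload):
        q = self.queues.get(tag)
        if q is None:
            return False                 # unknown tag: drop the packet
        q.append(payload)                # deliver to the owning interface
        return True
```

Isolation here is structural: a packet can only land in the queue whose tag it carries, so no inspection above layer 2 is needed, which is also why the scheme leans so heavily on ATM.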

Flaws
I would have liked to have the resource allocation algorithm explained further. I assume it was some kind of round robin based on packet or bit count. In a heavily loaded system this could be an interesting parameter to modify.

They mention kernel emulation to reduce the number of endpoints needed. I assume this was required to keep the U-Net overhead to a minimum for applications that could not benefit from the improved flexibility. As many applications might need to use the emulated U-Net endpoint it would have been interesting to see what overhead this emulation adds to the calls that use it.

Finally, the reliance on ATM for the packet IDs makes this system less practical in today's world of IP. It would have been nice if they had at least suggested a method of creating message tags for other protocols.

Performance
The system improved both latency and efficiency: decreased latency from reduced per-packet processing times, and increased efficiency because the full capacity of the network link could be utilized.

U-Net: A User-Level Network Interface for Parallel and Distributed Computing [von Eicken, et al.]

Summary:

This paper presents a communications architecture that can be used to
avoid kernel system call and buffer copying overhead by virtualizing the
network interface and providing mux/demux for more direct user-level
access to the network.

Problem:

The problem U-Net addresses is that common operating systems were not
able to fully realize the performance that higher-speed networks (such
as ATM OC-3 and fast ethernet) offered, because of the processing
overhead in commodity operating systems' kernels.

Contributions:

* U-Net assists convenient development and protected operation of
RPC or asynchronous RPC messaging protocols by enabling the
construction of very low-latency protocols to run in user processes.

* The performance validation using a real parallel application (i.e. the
Split-C benchmarks) clearly demonstrates the improvements U-Net
offers, at least for operations typical in parallel processing
clusters.

* The authors definitely "think different", considering even the
potential advantage of doing compiler-assisted protocol development,
with all protocols compiled together and run in user-space rather than
some in kernel and some in user-space.

Flaws:

* Since the use of ATM was never popular at the network "edge" (instead
hybrid ethernet and ATM techniques such as ATM LAN Emulation were used
c. 1995), their implementation's reliance on the ATM (layer-2) VPI/VCI
to demux traffic and deliver it to the right process is not practical.
Today, on ethernet, one would have to resort to inspection beyond
layer-2 MAC address to demux, since application endpoints can't be
identified by just the layer-2 header as with ATM.

* It isn't clear, especially some 10+ years after this work,
whether simple network stack tuning or loading these new protocols into
the kernel could achieve similar "small message" performance for some
apps.

* Since this writing, industry has provided protocol-specific (TCP)
offload network interface cards to shorten the critical path for
application network traffic. It seems this is another viable option,
especially combined with some sort of DMA from the card. Or perhaps
it would be sufficient to just address the kernel/application buffer
integration issue by improving existing APIs.

Performance Relevance:

The performance impact of the U-Net architecture is specifically
directed at local-area, scaled-out clusters of machines exchanging small
messages. The technique's primary value is to decrease latency,
while retaining reliability, by keeping potentially suspect
application-specific network stack code within a user process.
