CS 736 - Spring 2007 - Paper Discussion: Lightweight Remote Procedure Call

Summary:

The paper describes Lightweight RPC (LRPC), a communication facility designed and optimized for communication between protection domains on the same machine. By making such communication cheap, LRPC makes a small-kernel design practical while keeping the large-grained protection model of RPC.

Problem:

The RPC model can be used on a single machine to provide security through the large-grained protection model it offers. Traditional RPC designs, however, perform poorly when both ends of the call are on the same machine. LRPC addresses this.

Contributions:

The paper optimizes RPC so that a system can be structured around a smaller kernel. The authors identify the common case for cross-domain RPCs on the same machine: most calls are simple, passing simple parameters and returning simple values. They optimize for these cases.
The thread control transfer is a smart technique: the client's thread "continues" execution in the requested procedure in the server domain.
Data transfer is highly optimized. Through the use of A-stacks, the number of copy operations for arguments is greatly reduced compared to message passing.
The bind-time optimizations that reduce run-time overhead (pre-allocation of A-blocks, etc.) are also well thought out.
Domain caching on multiprocessors works well because the context switches are handled transparently. (The performance improvement does not seem significant in the paper's results.)

Flaws:

The optimizations that are specific to the Modula 2 implementation, such as the separate stacks for arguments and execution, may not be portable across different kernels. (Other optimizations may be identified for other languages and platforms.)
The common-case figures seem higher than one would expect. Treating all Unix system calls as RPCs is also questionable (some calls can be handled directly by the kernel even in the small-kernel approach).

Performance:

The paper's main emphasis is performance optimization for RPCs that stay on a single machine. The common case is highly optimized through concurrency, run-time optimizations, copy optimizations, thread donation (avoiding the kernel's discretion on when to schedule), etc.

Posted by: Archit Gupta | March 6, 2007 12:39 PM

Summary
This paper presents a remote procedure call implementation that is more efficient than previous methods for communication between protection domains on the same machine.

Problem
The problem that the authors are trying to solve is that RPC is a convenient structure for programming and building systems that require multiple protection domains with the advantages including "modular structure, easy system design, implementation, and maintenance; failure isolation, enhancing debuggability and validation; and transparent access to network services." But this convenience comes at a cost of using methods primarily developed for cross-machine communication as part of one system.

Contributions
The primary contribution of this paper is the optimization of RPC for the case of multiple domains on the same machine while keeping the advantages of current (at the time of the paper) RPC systems, and the design decisions that make this possible.

Another important contribution of this paper is its discussion of the use and performance of RPC on systems contemporary to the paper, based on instrumented versions of several operating systems in wide use at the time. In this section they show that most RPC calls stay within the same machine, meaning that they "crossed protection, but not machine, boundaries."

Flaws
While the section that they wrote on justification was very useful and it was good to see measurements of RPC use in allegedly real systems, they don't discuss the workload involved or what applications are being used and whether these are typical use cases among all target users.

Relevance
This paper is relevant as RPC is still important today.

Performance and Evaluation
The evaluation and performance justification done in this paper is a step up from what we have seen in previous papers. The authors clearly demonstrate that their system performs better under what they believe is a standard workload.

Posted by: Aaron Bryden | March 6, 2007 09:49 AM

SUMMARY
In "Lightweight Remote Procedure Call" Bershad et al. present LRPC, a new way to implement RPC that requires less processing overhead than traditional methods. The need for LRPC is motivated by studying common uses of RPC. LRPC is then evaluated and compared to other implementations.

PROBLEM
RPC is slow. That forces developers to compromise security for the sake of performance by coalescing services into the same domain.

CONTRIBUTIONS
Identifying the "common case" of RPC usage patterns and optimizing the implementation for it. In particular, the authors discover that most RPC calls do not cross machine boundaries and do not involve complex data structures.

Description, implementation and evaluation of LRPC

FLAWS
I am a bit wary about how stable the "common case" assumptions can be. For example, perhaps RPC overhead is the very reason calls are rarely made across different machines and involve only simple data structures. By making RPC efficient, the authors may have made it much more attractive for wide use, thus undermining their own premise of what is common.

Perhaps Null calls aren't necessarily the best indicator of future performance. For example, I would think that Null calls are very cache-friendly, which might cause the latency of RPC to be underestimated.

Not really a flaw, but I am not sure that data about what was common in 1990 is necessarily applicable today.

PERFORMANCE
Authors focus on time-to-completion in the common case as the performance metric and clearly demonstrate that LRPC performs well in that regard.

Posted by: Vladimir Brik | March 6, 2007 09:41 AM

Summary:
This paper describes the implementation of Lightweight RPCs, which is an implementation of RPCs tuned for intra-machine communication.

Problem:
Traditional RPC implementations had a lot of overhead that is not necessary for intra-machine calls, even though such calls were the common case.

Contributions:
* A recognition that there should be an optimized form that would take advantage of the fact that most calls are local
* A description of a way in which this could be implemented
* A semantics for the call that was the same from the programmer's perspective (except for the program you run to generate stubs) as traditional RPCs, which in turn was almost the same as the semantics for standard procedure calls.

Flaws:
There were some things that the original RPC paper discussed that this one didn't. The big one that comes to mind is exceptions. Traditional RPCs handle Mesa exceptions, but does their LRPC implementation support them? If not, would changing them to support exceptions be easy? Hard?

I'm also not a fan of them calling all system calls RPCs on Unix in their motivation section... I felt that was pushing it.

Performance:
They were all about the latency (/time-to-completion). All their timing charts were how long the LRPCs took or about things that affect how long they take. (Though they did mention throughput at one point, I suspect it's just that a higher possible throughput would be possible if the latency was lower.) They improved the latency by reducing the number of operations on the critical path, such as copying data.

Posted by: Evan Driscoll | March 6, 2007 08:46 AM

Summary:
The paper discusses a lightweight implementation of RPC that achieves large performance improvements over RPC for cross-domain procedure calls (i.e. procedure calls made within the same machine) with simple datatypes.

Problem:
The authors found that over 97% of all Remote Procedure Calls made are cross-domain (i.e. between processes within the same machine) rather than cross-machine. They also found that a majority of them used simple arguments. LRPC was an attempt to exploit these observations to improve performance.

Contributions:
- Extending protected procedure calls from capability systems to optimize cross-domain RPC, which they found was an "overwhelmingly" common case.
- Passing arguments without multiple copying, unlike RPC. LRPC maps A-stacks on server and client domains, allowing them both to access the same copy of arguments directly.
- Use of independent A-stack queue locks (instead of a global lock) to reduce waiting, and hence scale better on multi-processor systems
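
The third point, one lock per A-stack queue rather than a single global lock, can be pictured with a rough sketch. This is only an illustration in Python: the class name `AStackQueue`, its methods, and the sizes used are hypothetical and are not taken from the paper's implementation.

```python
import threading
from collections import deque


class AStackQueue:
    """Free list of argument stacks for one interface, guarded by its own lock.

    Hypothetical sketch: names and sizes are assumptions, not the paper's code.
    """

    def __init__(self, n_stacks, stack_size):
        self._lock = threading.Lock()  # private to this queue, not global
        self._free = deque([None] * stack_size for _ in range(n_stacks))

    def acquire(self):
        # Only callers of this particular queue contend for this lock.
        with self._lock:
            return self._free.popleft() if self._free else None

    def release(self, stack):
        with self._lock:
            self._free.append(stack)


# Two interfaces get two independent queues, so calls on one never
# wait for a lock held on behalf of the other.
file_queue = AStackQueue(n_stacks=2, stack_size=4)
window_queue = AStackQueue(n_stacks=2, stack_size=4)
stack = file_queue.acquire()
```

With a single global lock, every call on every interface would serialize on the same lock; here contention is limited to callers that share a queue.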

Flaws:
- The initial measurements that convinced the authors of the need for LRPC sound a little skewed. I doubt the environment selected was truly representative of workstations. If you pick a machine where there is not much networking, of course all traffic would be local. Also, there was no analysis of which aspect of RPC makes it slow (the marshalling/demarshalling, network, copies).

- The stub-generator creates code that works with both LRPC and RPC, and a check is made on each call to decide which branch to execute. This allows dynamic changes in server locations, but requires more code and adds the overhead of the check (which may not be a problem for true remote calls, but could be for local calls). Another approach could have been to generate only one code path, based on a compile-time switch.

- Authors talk about recovering from a dead-locked/indefinite server thread. A timeout would have been a simple mechanism to handle the scenario.
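
The per-call check discussed in the second point above can be sketched as follows. This is a minimal illustration, assuming the binding carries a flag set at bind time; `make_stub`, the `"remote"` key, and the two transport functions are hypothetical names, not the stub generator's actual output.

```python
def make_stub(local_call, remote_call):
    """Build a client stub that picks a transport on every call."""

    def stub(binding, *args):
        # Assumption: the binding records at bind time whether the
        # server lives on another machine.
        if binding["remote"]:
            return remote_call(*args)  # conventional RPC path
        return local_call(*args)       # lightweight same-machine path

    return stub


def local_add(a, b):
    return ("local", a + b)


def remote_add(a, b):
    return ("remote", a + b)


add = make_stub(local_add, remote_add)
local_result = add({"remote": False}, 2, 3)
remote_result = add({"remote": True}, 2, 3)
```

The cost under discussion is the single branch at the top of `stub`; the compile-time alternative would emit only one of the two paths and drop the branch, at the price of fixing the server's location when the stub is built.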

Relevance:
I think the basic idea - of optimizing RPC implementation to achieve better performance on local cross-domain procedure calls - is very relevant. Abstracting the optimization in the generated stubs allows application developers to focus on clean design and implementation, rather than worry about performance and make design implementation tradeoffs to achieve the same.

Posted by: Base Paul | March 6, 2007 07:58 AM

Summary
This paper describes an RPC implementation that is specifically optimized for communicating across protection domains on a single machine. The end result is essentially a leaner version of traditional RPC (due to optimizing for localhost communication), thus Lightweight Remote Procedure Call.

Problem
LRPC was created because measurements of RPC connections were taken and the results made it clear that the overwhelmingly common case is a machine communicating with itself. Additionally, the authors demonstrate that most RPC calls do not really make use of complex or large parameters. The common usage of RPC plainly illustrates that the additional overhead is not needed.

Contributions
A decent start into investigating the performance characteristics of RPC?

Using idle processors to store information seems like a pretty novel thing for [L]RPC to be doing.

LRPC binding is a fair bit different than RPC binding. In LRPC binding, we see that instead of using a network, memory is used to pass call information back and forth.

Flaws
This might be an unreasonable nit, but have RPC usage patterns changed since 1990? If so, perhaps LRPC wasn't sufficiently general... Additionally, using only three operating systems to sample data seems a bit questionable.

It's a bit disappointing that the authors neglect to show that truly cross-machine communication routed through LRPC is nearly as efficient as with plain RPC.

As mentioned previously, a more in-depth analysis of what makes LRPC so much more responsive would be interesting.

Performance / Evaluation
This paper actually has some nice evaluation in it! The reader can plainly see that LRPC can give a performance increase of 3x (and higher).

Posted by: Jon | March 6, 2007 07:51 AM

Summary:
This paper proposes a communication facility called Lightweight Remote Procedure Call which is optimized for cross domain communication in a machine by simplifying the mechanisms of RPC.

Problem:
RPC was proposed as a mechanism for cross-machine communication, but at that time most RPC calls were used for communication inside a single machine. Since RPC was designed for cross-machine communication, it was too complicated and had too much overhead as a communication method inside the machine. A method offering simplicity, efficiency, performance, and ease of use was needed.

Contributions:
First of all, the authors have improved performance dramatically by redesigning the architecture and mechanism of the procedure call. Communication overhead was reduced by simplifying the steps for transferring the message (arguments) between client and server. Binding is one of the mechanisms that helps reduce the steps, by completing some registration before communication starts.
They used a shared memory space as a message board to pass the messages and used a Binding Object to provide access control. Reducing the number of times the data is transmitted and copied, along with other improvements, made the procedure call 3 times faster.
The authors also propose a method to improve performance on multiprocessor environments by keeping the context (the whole domain's environment) on an idling processor so that the number of context switches is reduced as much as possible.
LRPC provides simplification, performance, and safety at the same time. The paper also mentions the side effects of the LRPC mechanism but does not solve them.

Flaws:
Some of the performance measurements use the Null procedure call. It might be useful for measuring a theoretical minimum time, but it would be more interesting to see more performance evaluation on real-life cases, with information about the stub interface and call interface involved. A real-life example might also support the claim that using an A-stack with a fixed size is acceptable.

Performance:
Improving throughput by reducing the procedures and interfaces to the minimum that fulfills the requirements seems to be successful. The idea of keeping the environment (context) on the processor seems to fit current machines with multiple cores, but the number of cores is still small, so it seems difficult to gain the full benefit of that mechanism. The paper shows that analyzing how the system is used and what the current needs are is very important and useful when optimizing and redesigning.

Posted by: Hidetoshi Tokuda | March 6, 2007 12:50 AM

Summary

In this paper, the authors propose Lightweight Remote Procedure Calls for efficient cross-domain communication. LRPC uses a capability-based approach for control transfer and communication, and the large-grained protection model of RPC. Their results show that the LRPC-based model can be three times faster than traditional approaches for communication between protection domains.

Problem Description

In small-kernel operating systems, communication between protection domains on the same machine dominates the overall communication. Using the traditional RPC-based model incurs unnecessary overhead for communication between protection domains on the same machine. As a result of this cost, system designers combine weakly related subsystems into the same protection domain, compromising security for performance. To overcome this problem, the authors propose LRPC for efficient communication across protection domains on the same machine. Since communication between protection domains on the same machine is the common case, the authors try to make it as efficient as possible.

Contributions

Some of the contributions of this paper are as follows.
1. The authors did initial studies to characterize the frequency of cross-machine communication, measure the size of the data being communicated, and calculate the time taken by cross-domain RPC. The authors based their model on these studies. As a result their LRPC model was quite sound and showed promising speedups. This approach teaches us the importance of extensive study of the underlying problem before developing a solution for it.
2. A procedure is represented as a call stub in the client's domain and an entry stub in the server's domain. The LRPC stub blurs the boundary between the protocol layers. This can improve the performance of LRPC because the overhead involved in crossing the boundaries is reduced.
3. The authors have made a good effort to identify the common case, which is cross-domain communication on the same machine, and to make this common case fast, as suggested by Amdahl's law. For the uncommon case, i.e. cross-machine communication, the authors propose using the traditional RPC-based model, so the performance of the uncommon case is not affected. As a result of this methodology, the overhead involved in cross-domain communication is significantly decreased.
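
The appeal to Amdahl's law in the third point can be made concrete with a small calculation. The function below is the standard formula; the fraction and speedup factor in the example are made-up illustrative values, not measurements from the paper.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the work is made `factor` times faster.

    Standard Amdahl's law: 1 / ((1 - fraction) + fraction / factor).
    """
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical example: if half of a program's time went to cross-domain
# calls and those calls became 3 times faster, the whole program would
# run 1.5 times faster.
overall = amdahl_speedup(fraction=0.5, factor=3.0)
```

The larger the share of time spent in the accelerated case, the closer the overall speedup gets to the per-call speedup, which is why it matters that the optimized case really is the common one.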

Flaws

The authors attribute the high overhead of conventional RPC to a number of factors such as stub overhead, message buffer overhead, and access validation, but they do not provide any evaluation showing what percentage of the overhead is due to each of these factors. Secondly, the authors show that overall LRPC performs 3 times better than conventional RPC, but again they do not evaluate how much each feature of LRPC contributed to the overall speedup. These studies would give much more insight into the actual problem and might have shown whether there are other opportunities for improvement.

Performance

The authors are trying to make the common case fast by improving the performance of cross-domain communication on the same machine. By making the common case fast, the authors are in essence making the system more responsive and are also improving the throughput of the system. Secondly, using the LRPC model, the developers need not coalesce weakly related subsystems into the same protection domain, thus making the system more secure.

Posted by: Atif Hashmi | March 6, 2007 12:05 AM

Summary
Lightweight remote procedure calls are RPCs designed for efficient cross-domain communication. The common case optimization was motivated by the observation that most RPC calls are not to other machines.

Problem
RPCs are heavily used for cross-domain communication on the same machine, but they are not designed or optimized for this.

Contributions
The authors present an analysis of RPC use on three machines and discover that RPCs cross machine boundaries less than 5% of the time. Next, the delay of an RPC call is broken down into individual components: stub overhead, message buffer overhead, access validation, message transfer, scheduling, context switching, and dispatching. A number of implementation decisions and tricks are presented in order to minimize the execution time of the most common RPC calls. Idle processors are used to further increase LRPC's performance by caching the context. Performance analysis consisted of comparing the new LRPCs to the highly optimized RPCs on the Taos system.

Flaws
I would have liked to have seen more on the distribution of calls they used. It is unclear from the article if a single call was used in the performance evaluation or many different calls were used. A number of their optimizations are highly dependent on the type of call. If the same call were used over and over caching would be highly effective and only a single size argument stack would be needed.

An analysis of how much each optimization improved performance would have also been nice. It would be helpful for future work to know which of the implementation decisions had the largest impact and also facilitate comparison to other approaches and operating systems.

Performance
The main performance goal of the idea is responsiveness. The entire premise of the paper is to get quicker response times from RPC calls. Efficiency and scale-out are also improved.

Posted by: Kevin Springborn | March 5, 2007 10:24 PM

Summary
Based on research indicating that most procedure calls within a system are simple and cross-domain, as opposed to complex or inter-machine, a simple mechanism is proposed that is shown to significantly reduce procedure call time for simple calls. This has a big impact on overall system performance since most procedure calls use the simplified mechanism.

Problems Addressed
To improve procedure call performance, the following four aspects of the mechanism were addressed: 1) simplifying control transfer, 2) simplifying data transfer, 3) simple stub generation, and 4) allowing for optimizations in a concurrent environment.

Contributions
The binding of clients and servers is done so that much of the work required to make a connection is carried out before a request is actually made. This reduces the time to service a request since much of the checking required by a procedure call is already completed. During the binding process a "Binding Object" is created that is basically a capability to a certain exported server interface. Once the client holds the binding object, access to the service is easy and quick since minimal security checks need to be made. During the creation of the binding object, shared memory regions are set up that provide space for passing parameters. The caller's parameters, to be consumed by the server, are first put in the shared memory region; the server then reads the parameters from the shared region and processes the request. The return parameters are then put back into the same shared memory region, ready for the client to read. Because the memory region is shared by both the client and the server, parameters need not be copied multiple times to pass them between the two. This reduction in the time required to pass parameters seems to have the largest effect on procedure call performance. An optimization was also proposed for multiprocessor machines that have idle processors. On a call from a client to a server, CPUs are swapped if an idle processor is available. This allows the client to maintain its state on its processor while the server runs on the previously idle CPU, eliminating the need for a context switch.
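
The bind-then-call flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Kernel` class, its methods, and the use of a random token looked up in a kernel-side table to recognize a genuine binding are all hypothetical choices made for illustration, not the mechanism the paper uses.

```python
import secrets


class Kernel:
    """Toy kernel: hands out binding objects and validates them on each call."""

    def __init__(self):
        self._bindings = {}  # token -> (procedure table, shared region)

    def bind(self, procedures, region_size=4):
        # Bind-time work: allocate the shared region and record the binding.
        token = secrets.token_hex(8)  # assumption: unguessable token
        region = [None] * region_size
        self._bindings[token] = (procedures, region)
        return token, region

    def call(self, token, proc_name):
        # Call-time work: one table lookup, then run the server procedure
        # directly on the region the client already filled in.
        if token not in self._bindings:
            raise PermissionError("forged or revoked binding object")
        procedures, region = self._bindings[token]
        procedures[proc_name](region)


def add_server(region):
    # Server side: read arguments from the shared region, write the result back.
    region[2] = region[0] + region[1]


kernel = Kernel()
binding, shared = kernel.bind({"add": add_server})
shared[0], shared[1] = 2, 3   # client writes its arguments once
kernel.call(binding, "add")   # no further copying of arguments
result = shared[2]            # client reads the result in place
```

The client and server touch the same list object, which stands in for the shared memory region: the arguments are written once and read in place, and the expensive setup happens in `bind` rather than in `call`.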

Flaw
The system is built on the fact that the kernel can distinguish between authentic binding objects and forged objects. This fact was mentioned in the article however it was not explained how this is achieved and what makes a binding object distinguishable from a forged object.

Performance
It seems as though the efficiency of resource consumption was the primary performance goal of this work. This was accomplished by using simple shared memory for parameter passing, creating a binding object used as a capability granted to a client, and allowing for a reduced number of context switches on multi-processor machines.

Posted by: Nuri Eady | March 5, 2007 08:08 PM

Paper Review: Lightweight Remote Procedure Call [Bershad, et al.]

Summary:

This paper outlines many optimizations to Remote Procedure Calls (RPC)
to execute them much faster when they are intra-machine. (The authors
generalize RPC to mean any cross-[protection-]domain call, including both
intra-machine and truly remote extra-machine calls.) They show significant
performance gains by making RPCs lightweight, primarily by having the
client's (caller's) thread execute the "remote" call in the server's domain.

Problem:

The problem was that RPCs had significant performance issues due to
argument copying and scheduling overhead. This encouraged software
system designers to restrict their designs and force subsystems into the
same domain, compromising modularity and safety of their designs to
avoid even intra-machine RPCs.

Contributions:

* The authors were guided by the observation that most communication
traffic is intra-machine, cross-domain and is simple in form. This
allowed them to focus on performance improvements to an existing IPC
mechanism by optimizing for the "common case".

* The insight of seeing cross-domain but intra-machine calls as
"Remote" procedure calls, thus allowing the unification of most
intra-machine IPC with RPC semantics. This has a configurability
benefit (over say, sockets) in not requiring the applications
themselves to prepare data as network-ordered bits on the wire, so
that this overhead can be avoided when the client and server are
colocated on the same machine.

Flaws:

* I found their evidence for most (>99%) of RPCs being intra-machine to
be scanty.

Specifically, the authors treated all unix system calls as if they
were RPCs (because they can technically be considered cross-domain).
For Unix, it would have been more appropriate to report what
percentage of socket-based calls are truly remote. Today, one might
evaluate this for say, a web server, albeit that wasn't a typical
application for the time of this writing.

Also, their cross-domain RPC measurements are for obscure research
operating systems, and do not specify the applications they were
running nor measure many instances of each system.

Performance Relevance:

If operating systems pervasively use RPCs, the improvements herein
would improve throughput, responsiveness, and time to completion for
various tasks. It also has some potential protection reliability
benefits because it reduces incentive to make design compromises.

Posted by: Dave Plonka | March 5, 2007 05:33 PM

Summary
The paper describes the components and network protocol of a well-performing remote procedure call architecture.

Problem
The authors wanted to be able to make remote procedure calls across a network with a quick and simple interface to facilitate distributed computing.

Contributions
The process of making a remote procedure call was abstracted to behave very similar to a local procedure call. The paper presented three layers of abstraction: application, stub, and RPCRuntime components. The user code calls a stub automatically created by Lupine. The stub handles packing and unpacking messages and hands off data to the RPCRuntime, which manages network transmissions.

A central Grapevine database manages matching callers with callees and restricts who can export a service. Grapevine allows for three levels of destination specification when conducting an RPC: general call type, specific call instance, and specific network address. This allows callers to be dynamically matched with suppliers.

The paper also presented a number of implementation details and a network protocol used to increase the performance of RPCs. The transport protocol is highly tuned for small packet transmissions, e.g. ACKing every packet. An idle process pool is used on the server in order to avoid process creation and teardown delays. Subsequent requests are directed towards the same process using the process identifier, allowing one less ACK in the common case.

Flaws
I would have liked to have seen more of a justification for the binding location mechanism. Possibly due to my lack of distributed programming experience, I do not see why three levels of specification are needed. I would imagine that both the client and server applications would be written by the same author (the 100 fold increase in execution overhead makes small remote calculations impractical). I can imagine many cases where a static IP (specified at either compile time or on startup) is all that will be needed to locate the exporter. If both the client and server are written by the same person it seems a more efficient resource locator could be created in the case it is needed. I’m not convinced the benefits of a general purpose resource locator outweigh the overhead imposed.

Performance
The paper is attempting to improve scale-out by making remote procedure calls over the network easy, which allows the application author to leverage more machines.

Posted by: Kevin Springborn | February 28, 2007 06:10 PM
