Lightweight Remote Procedure Call
Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight Remote Procedure Call. ACM Trans. on Computer Systems 8(1), February 1990, pp.37-55.
Review due Thursday, 10/30.
Comments
Summary: The Lightweight Remote Procedure Call mechanism addresses the inefficiency of normal RPC in the common case, where small, simple calls are made between domains on the same machine. This is done by avoiding the overhead associated with the execution path required for the complex but infrequent cases.
Problem: Small-kernel OSes aim at placing components in separate protection domains, with an RPC-like mechanism used for communication. But this 'one-size-fits-all' approach doesn't account for the real usage pattern: although the bulk of communication is between domains on the same machine and involves small, simple parameters, the overhead of the heavyweight machinery (marshaling, access validation, scheduling, dispatch, etc.) is still paid.
Contributions: This paper points to a real mismatch between design and usage pattern. The authors provide a communication mechanism, designed for concurrency, that maintains domain protection at a lower cost by dispatching a shared argument stack directly to the server domain. Linkage records are used to facilitate the transfer of control. Security is enforced by Binding Objects, which are capabilities obtained at bind time.
Techniques: The main technique is to optimize the common execution path. The LRPC caller still calls into a stub, but instead of a complicated dispatch mechanism, the stub simply uses the A-stack (shared with the callee) to pass arguments, so no unnecessary data copying is done. A-stacks are preallocated at bind time. Execution stacks (E-stacks) on the server are associated with A-stacks dynamically rather than at bind time, so an explosion in E-stack space is avoided. E-stacks are 'activated' directly by the kernel, with no dispatching overhead. A-stacks can be shared between procedures in the same interface with similar size requirements. Stubs are generated in assembly.
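To make the common path concrete, here is a minimal C sketch of the client-stub side, assuming a hypothetical A-stack layout and a made-up trap function kernel_lrpc_call (the real stubs are generated in assembly and the trap is a Taos kernel primitive):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical A-stack: a preallocated argument area shared
     * (read/write) between the client and server domains at bind time. */
    struct a_stack {
        char args[256];
    };

    /* Stand-in for the kernel trap that transfers control into the server
     * domain; in LRPC this validates the Binding Object, pushes a linkage
     * record, and runs the call on a server E-stack. */
    static long kernel_lrpc_call(void *binding_object, int proc_index,
                                 struct a_stack *as)
    {
        (void)binding_object; (void)proc_index; (void)as;
        return 0;   /* placeholder */
    }

    /* Client stub for a made-up interface procedure write(fd, len). */
    long write_stub(void *binding_object, struct a_stack *as,
                    int fd, size_t len)
    {
        /* Single copy: arguments go straight onto the shared A-stack;
         * there is no marshaling into an intermediate message. */
        memcpy(as->args, &fd, sizeof fd);
        memcpy(as->args + sizeof fd, &len, sizeof len);

        /* Trap to the kernel, which dispatches directly into the server's
         * procedure using the same A-stack already mapped in its domain. */
        return kernel_lrpc_call(binding_object, /* proc_index = */ 2, as);
    }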
Tradeoff: There is an upfront cost paid at bind time for setting up the A-stacks. Multiple A-stacks allow for concurrency, but the concurrency limit is fixed at bind time. In the uncommon case, RPC-like overhead (marshaling, Modula2+ stubs, etc.) is still incurred, so LRPC maintains two coexisting mechanisms.
(b) another part of the OS where the technique could be applied.
Weaknesses/Flaws: The shared-stack optimization only holds for their language environment; it would not work for the more common case of C-like languages.
The mechanism fits well into a system that already uses RPC, where communication is done only through arguments and return values. It is unlikely that one could take it a step further, where a local LRPC caller/callee would also share global data. In a way, they are not comparing apples to apples: the assembly-generated stubs increase LRPC's performance, while the Modula2+ stubs decrease RPC's performance; this blurs the comparison of the rest of the mechanisms (which would be more accurate if assembly-generated stubs were also used for the old RPC).
Posted by: Daniel Luchaup | October 30, 2008 10:28 AM
Summary
This paper describes a lightweight remote procedure call mechanism specialized for calls between protection domains on a single machine.
Problem
Statistically, most remote procedure calls occur between domains on a single machine rather than between machines, most use fixed size parameters known at compile time, and most transfer a small number of bytes. How can we use this information to safely improve performance?
Contributions
The authors perform experiments and compile data to show that the majority of remote procedure calls use simple arguments with a small number of bytes transferred. Furthermore, the majority of remote procedure calls occur between domains, and there is indeed overhead involved in traditional remote procedure call systems. The authors’ lightweight remote procedure call mechanism shows that much of this overhead, specifically, within the stubs, can be avoided in the common case.
Specifically, traditional remote procedure call mechanisms involve four copies of procedure arguments (stub stack -> message in client domain -> kernel domain -> server domain -> server stack), even in remote procedure calls between domains on the same machine. The authors simplify this for cross-domain calls; they introduce a separate argument stack associated with an execution stack for the server stub. The client and server domains share this argument stack so that only a single copy of the arguments, onto the argument stack, is necessary.
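For contrast, here is a hedged C sketch of the single-copy path from the server's side, again with a hypothetical A-stack layout and invented names: the server stub reads its arguments straight out of the shared argument stack and leaves the result there for the client.

    #include <stddef.h>
    #include <string.h>

    /* The same hypothetical A-stack as on the client side: because it is
     * mapped read/write into both domains, the bytes written by the client
     * stub are exactly the bytes the server stub sees. */
    struct a_stack {
        char args[256];
    };

    /* Placeholder for the real service code behind write(fd, len). */
    static long server_write_impl(int fd, size_t len)
    {
        (void)fd;
        return (long)len;
    }

    /* Server-side stub: unpack arguments directly from the shared A-stack,
     * call the procedure, and leave the return value on the same A-stack
     * for the client stub to pick up; no message crosses the kernel. */
    void write_server_stub(struct a_stack *as)
    {
        int fd;
        size_t len;
        memcpy(&fd, as->args, sizeof fd);
        memcpy(&len, as->args + sizeof fd, sizeof len);

        long rc = server_write_impl(fd, len);
        memcpy(as->args, &rc, sizeof rc);   /* result travels back in place */
    }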
Techniques
The authors are focused on achieving the best performance in the common case (remote procedure calls between domains with small, fixed parameters). In order to achieve the best performance in these cases, they are willing to let more complicated cases make use of the traditional remote procedure call mechanisms that already handle these cases.
Specifically, the LRPC stub generator produces stubs in assembly language that handle simple arguments well. For more complex parameters, Modula2+ code is produced that resembles traditional stubs. In addition, the first instruction of a generated stub branches to a more traditional stub in the cross-machine case. In each situation, the common case shows improved performance, while the less common case can still be handled using existing mechanisms.
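A small C stand-in for the dispatch structure described here (the names are invented; the real first-instruction branch is in the generated assembly): the rare cross-machine case is sent to the traditional RPC path, while the common local case falls through to the fast path.

    /* Hypothetical binding record filled in at bind time. */
    struct binding {
        int   remote;           /* nonzero if the server is on another machine */
        void *binding_object;   /* LRPC capability for the local fast path     */
        void *rpc_handle;       /* handle for the traditional RPC runtime      */
    };

    /* Placeholders standing in for the two call paths. */
    static long lrpc_fast_call(void *bo, int proc, void *a_stack)
    {
        (void)bo; (void)proc; (void)a_stack;
        return 0;
    }
    static long rpc_marshal_and_send(void *h, int proc, const void *args,
                                     unsigned len)
    {
        (void)h; (void)proc; (void)args; (void)len;
        return 0;
    }

    /* Generated stub: in the real system the first instruction of the
     * assembly stub performs this branch. */
    long read_stub(struct binding *b, void *a_stack,
                   const void *args, unsigned len)
    {
        if (b->remote)
            /* Uncommon case: fall back to the full marshaling RPC path. */
            return rpc_marshal_and_send(b->rpc_handle, /* proc = */ 1, args, len);

        /* Common case: arguments are already on the shared A-stack. */
        return lrpc_fast_call(b->binding_object, /* proc = */ 1, a_stack);
    }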
Undoubtedly, there are other facilities in systems that, while designed to work in the general case, are not optimized for the most common case or cases, and the general approach taken here would likely prove useful in increasing the performance of those facilities and systems.
Posted by: Sam Javner | October 30, 2008 09:26 AM
Summary
In “Lightweight Remote Procedure Call,” Brian N. Bershad et al. describe their implementation of the RPC communication mechanism. They attempt to improve RPC performance for communication across protection domains located on the same machine.
Problem
Communication via the RPC mechanism is much more resource intensive than simple procedure calls. This is because the calls need to go through the various mechanisms that marshal and unmarshal the calls so that they can be transmitted over a network. Bershad et al. found that many remote procedure calls do not actually go across the network. These calls, which are made across domains located on the same machine, still incur the overhead associated with regular RPC. Noting that in many cases this overhead is not strictly necessary, Bershad et al. show that RPC can be made more efficient by creating lightweight RPC, which would detect cross-domain calls on the same machine and optimize them.
Contributions
• Improving the common case of small, local remote procedure calls while maintaining the regular RPC semantics; the system automatically detects whether an optimization for a call can be made.
• Fully maintaining the security present in RPC for the lightweight calls. By making LRPC more efficient, programmers were less likely to use alternate methods and thus less likely to give up safety.
Flaws
• When evaluating the system, especially making the initial claims about RPC overhead, the authors do not provide details as to where the overhead for RPC comes from. When trying to optimize away the overhead, it would have been beneficial to know exactly where the overhead was coming from.
• Similarly, the claim that most RPC is done locally is not well supported. It would have been helpful to know more about the details of the type of workload that was used that led to this claim.
Performance Techniques
The primary motivation behind LRPC is addressing the common case and optimizing it. By noting that most remote procedure calls are both small and across domains on the same machine, RPC was optimized to ensure that the overhead of these calls is minimized. By concentrating on the most common case and making it fast, performance is improved for most calls. This technique is commonly used in computer architecture and is closely tied to Amdahl's law, which quantifies how much speeding up the common path improves overall performance. Architects always try to ensure that the most common path of execution is performed most efficiently.
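For a rough sense of scale (illustrative numbers, not a measurement from the paper): the authors report that only a few percent of calls cross machine boundaries and that LRPC runs roughly three times faster than the existing RPC for local calls. Treating f = 0.95 as the fraction of RPC time spent in local calls and s = 3 as their speedup, Amdahl's law gives

    \[
      S_{\text{overall}} \;=\; \frac{1}{(1 - f) + f/s}
                         \;=\; \frac{1}{0.05 + 0.95/3}
                         \;\approx\; 2.7
    \]

so the overall win tracks the per-call win closely precisely because the optimized case dominates.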
Posted by: Mark Sieklucki | October 30, 2008 08:28 AM
Lightweight Remote Procedure Call
This paper introduces Lightweight Remote Procedure Call (LRPC), an RPC mechanism used in small-kernel systems for communication between different protection domains on the same machine. It is based on the use of capabilities to provide large-grained protection. Previous solutions used traditional RPC, designed for inter-machine communication, and therefore paid a higher overhead than desirable. The facility is implemented in Taos, the OS for the Firefly multiprocessor, and shows a factor-of-three improvement over the traditional RPC approach.
They demonstrate that most communication in an OS is between domains on the same system and involves small data transfers. The goal of LRPC is to optimize this type of communication without sacrificing security.
They propose keeping the same semantic model as RPC. A similar binding mechanism is used to give the caller access to the server. Each procedure has a procedure descriptor (PD); the kernel keeps a list of the procedures in the system, and each entry contains a PD with the procedure's entry address and the size of the argument stack it may use. Each client receives a Binding Object, which has to be checked by the kernel on every call and is how the client accesses the server.
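A rough sketch of this bookkeeping as C structures (the field names are guesses for illustration, not the actual Taos definitions):

    #include <stdint.h>

    /* One procedure descriptor (PD) per procedure in an exported interface. */
    struct proc_desc {
        void    (*entry)(void *a_stack);   /* entry address in the server       */
        uint32_t  a_stack_size;            /* argument-stack size for this proc */
        uint32_t  max_simultaneous_calls;  /* how many A-stacks to preallocate  */
    };

    /* The kernel's per-interface list of procedure descriptors. */
    struct proc_desc_list {
        unsigned          count;
        struct proc_desc *procs;
    };

    /* Capability returned to the client at bind time; it is presented on
     * every call and validated by the kernel before control is transferred. */
    struct binding_object {
        uint64_t               key;   /* unforgeable token checked by the kernel */
        struct proc_desc_list *pdl;   /* the interface this binding refers to    */
    };

    /* Linkage record: where to resume in the caller's domain when the
     * server procedure returns. */
    struct linkage_record {
        void *caller_return_addr;
        void *caller_stack_ptr;
    };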
Simple control transfer is one of the mechanisms that provides the improvement over RPC. The caller's thread executes the call in the server's domain, which eliminates scheduling and dispatching overhead.
Simple data transfer works well for these calls, where simple data is passed. The caller copies its arguments onto a shared argument stack, similar to what is done for local procedure calls. The client has a different argument stack for each of the procedures. This eliminates the need for further copying of the parameters.
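A compressed sketch of the kernel's side of such a call, with the actual domain switch reduced to placeholder helpers (the real steps involve virtual-memory and stack-pointer manipulation inside Taos that a C sketch cannot show faithfully):

    /* Minimal stand-ins for the objects involved. */
    struct binding_object;   /* capability presented by the caller     */
    struct a_stack;          /* shared argument stack, already written */
    struct e_stack;          /* execution stack in the server's domain */

    /* Placeholder helpers standing in for kernel internals. */
    static int  binding_is_valid(struct binding_object *bo) { (void)bo; return 1; }
    static struct e_stack *grab_e_stack(void) { return (struct e_stack *)0; }
    static void push_linkage_record(void) { /* save caller's return PC and SP */ }
    static void switch_to_server_domain(struct e_stack *es) { (void)es; }
    static void upcall_server(int proc, struct a_stack *as) { (void)proc; (void)as; }

    /* The common-case call path: validate, link, switch, upcall.
     * No message queues, no scheduler hand-off, no extra argument copies. */
    int lrpc_kernel_call(struct binding_object *bo, int proc, struct a_stack *as)
    {
        if (!binding_is_valid(bo))           /* access validation, done once   */
            return -1;
        push_linkage_record();               /* remember where to return       */
        struct e_stack *es = grab_e_stack(); /* bound to the A-stack on demand */
        switch_to_server_domain(es);         /* same thread keeps running      */
        upcall_server(proc, as);             /* server reads args off A-stack  */
        return 0;
    }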
The mechanisms are designed with multiprocessor systems in mind by avoiding shared data structures. Multiprocessor systems may provide an even bigger improvement in latency than uniprocessors.
The ways of improving performance are optimizing the common case through a simple control transfer mechanism, simple data transfer, simple stubs, and designing the facility for multiprocessor systems. They increase the performance of simple calls by sacrificing some performance for more complex calls.
Posted by: Paula Aguilera | October 30, 2008 07:54 AM
Summary:
The paper describes Lightweight RPC (LRPC), a communication facility designed and optimized for communication between protection domains on the same machine. LRPC adopts a common-case approach to communication, exploiting simple control transfer, simple data transfer, simple stubs, and multiprocessors. LRPC performs well for the majority of cross-domain procedure calls by avoiding needless scheduling, excessive run-time indirection, unnecessary access validation, redundant copying, and lock contention.
The goal the paper was trying to achieve:
Employing the full RPC technique for cross-domain communication results in a loss of performance, and avoiding it results in a loss of structure. The paper therefore implements LRPC, a communication facility designed and optimized for communication between protection domains on the same machine; it simplifies aspects of RPC and is intended for small-kernel operating systems, avoiding the cost incurred by using full RPC.
Contributions
1. The paper implements LRPC, which simplifies aspects of RPC and improves the performance of communication between protection domains on the same machine.
2. By finding that most RPC calls do not cross machine boundaries and do not involve complex data structures, the paper identifies the "common case" of RPC usage patterns and optimizes the implementation for it.
3. Unlike RPC, LRPC passes arguments while avoiding unnecessary copying: LRPC maps A-stacks into both the server and client domains, allowing them to access the same copy of the arguments directly.
Flaws:
One flaw in the paper is that it assumes there is no per-thread application state. Another is that although the high overhead in conventional RPC can be attributed to a number of factors, such as stub overhead and message buffer overhead, no evaluation is provided showing what percentage of the overhead is due to each individual factor. A third flaw is the heavy dependence on the argument stack pointer to avoid copying onto the execution stack.
Techniques used:
Four techniques are used: simple control transfer lets the client's thread execute the requested procedure in the server's domain; simple data transfer uses a shared argument stack to eliminate redundant data copying; simple stubs allow the generation of highly optimized stubs; and the design for concurrency avoids shared-data-structure bottlenecks.
Tradeoffs:
RPC can serve the communication needs of both local and remote subsystems, but local communication is treated as an instance of remote communication, and simple operations are considered in the same class as complex ones. LRPC by itself does not handle remote communication, but it achieves a level of performance for cross-domain communication that is significantly better than conventional RPC systems, while still retaining their qualities of safety and transparency.
Another part of the OS where the technique could be applied:
The technique of copy avoidance can be used elsewhere to substantially reduce overhead. For example, results from copy avoidance in the IO-Lite system [V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system"], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server and a 137% improvement over an Apache server.
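As a toy illustration of the copy-avoidance idea (plain POSIX shared memory here, not IO-Lite's actual interfaces): once two processes map the same buffer, handing data from one to the other becomes a pointer exchange rather than a copy through the kernel.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BUF_SIZE 4096

    /* Producer side: create a named shared buffer and write the payload
     * into it once. A consumer that maps the same name sees those bytes
     * directly, with no per-transfer copy. */
    int produce(const char *name, const char *msg)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, BUF_SIZE) < 0)
            return -1;

        char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
            close(fd);
            return -1;
        }

        strncpy(buf, msg, BUF_SIZE - 1);   /* the only copy of the payload */
        buf[BUF_SIZE - 1] = '\0';

        munmap(buf, BUF_SIZE);
        close(fd);
        return 0;
    }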
Posted by: Tao Wu | October 30, 2008 07:36 AM
Lightweight Remote Procedure Call (LRPC)
Summary
The authors first establish that most RPC calls are inter-domain within the same machine rather than inter-machine, and that existing RPC mechanisms did not perform well for this common case. They then optimize each of the sources of overhead in conventional RPC and develop LRPC, which performs about 3x better in most cases.
Description of the problem being solved
The problem is to optimize traditional RPC for the common case of inter-domain calls rather than just reusing the techniques used for inter-machine calls. Specifically, the authors are trying to optimize the stub, message buffer, access validation, message transfer, scheduling, context switch, and dispatch overheads of conventional RPC.
Contributions of the paper
1) Novel and simple control transfer mechanism: LRPC allows the client's thread to execute the requested procedure in the server's domain. This simple method removes the scheduling and context-switch overheads of conventional RPC.
2) Simple data transfer mechanism: the shared argument stack is a simple and efficient model for data transfer, yet is kept secure by kernel validation during LRPC calls.
3) Simple stubs: since data transfer and control transfer in LRPC are simple, LRPC can generate highly optimized stubs.
4) Idle processors for concurrency: LRPC minimizes the use of shared data structures and also caches domains on idle processors (see the sketch below).
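A toy sketch of the bookkeeping behind item 4 (invented names; the paper's actual mechanism exchanges the caller's processor with an idle one already spinning in the server's context): each processor is tagged with the domain whose context it has cached, and a call prefers a processor whose tag already matches the server.

    #define NCPUS 4

    /* Per-processor record of which protection domain's context (page
     * tables, mappings, etc.) is currently loaded; -1 means none cached. */
    struct cpu_state {
        int idle;
        int cached_domain;
    };

    static struct cpu_state cpus[NCPUS];

    /* On an LRPC into server_domain, look for an idle processor already
     * caching that domain; if one exists, the call can run there and the
     * expensive context switch is skipped entirely. */
    int find_cached_idle_cpu(int server_domain)
    {
        for (int i = 0; i < NCPUS; i++)
            if (cpus[i].idle && cpus[i].cached_domain == server_domain)
                return i;
        return -1;   /* no match: switch contexts in place as usual */
    }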
Flaws in the paper
1) The paper does not provide enough data or motivation to believe that processors are idle often enough that a special optimization of 'domain caching on idle processors' will be useful. Also, the data provided in the performance evaluation section to demonstrate the usefulness of this optimization comes from a simulated workload rather than a real workload.
Techniques used to achieve performance
1) Optimizing common cases and making them faster.
2) Using idle resources to do some pre-work even before the work arrives, so that time can be saved by not having to do it later. Specifically, in this paper, idle processors are used to cache domains.
3) Zero-copy data transfer by sharing buffers (A-stacks).
Tradeoff made
1) Maintaining extra state information to get a performance improvement (space vs. time). Extra details about the most frequent LRPC calls are kept so that idle processors can be hinted to cache the corresponding domains.
2) Relaxing security constraints slightly for performance. For example, there is a possibility that the client or server might continue to change the A-stack once control has transferred to the other domain; LRPC relies on a degree of mutual trust between the domains involved to address this.
Another part of OS where this technique could be applied
1) The technique of optimizing for common cases can and should be applied to almost all systems. Hardware caching is an example: caches make common cases faster at the cost of extra hardware.
2) Zero-copy data transfer techniques can be used wherever data needs to be transferred between two parties on the same machine. For example, this was used in the exchange heaps of the Singularity operating system.
3) Using idle resources speculatively is also a widely used technique. For example, a thread library could use idle processors to prepare for frequently used threads.
Posted by: Leo Prasath Arulraj | October 30, 2008 07:16 AM