LRPC

1. Questions
   1. How stable is "common case"?
      i. ANSWER: look forward at what it could be used for, e.g. system calls?
   2. Didn't address exceptions
   3. No analysis of what the slow part was – is that necessary?
   4. Dynamic switching of stubs – is it necessary? Could it be done at compile time?
   5. Didn't compare cross-machine performance.
      i. COMMENT: a good evaluation explains things well, but leaves you wanting more because you are interested. Nobody evaluates everything. A bad evaluation leaves you confused and wanting more because you didn't understand things.
   6. Only compared 3 operating systems
   7. Not sure whether it is general enough
      i. COMMENT: how much more general could it be? Often there is a natural performance cliff where the next piece of complexity causes a big performance hit.
   8. How do you tell an authentic binding object?
      i. ANSWER: it can be a table index, just like a file descriptor (see the sketch below)

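A minimal C sketch of the "table index" answer, assuming a hypothetical per-process binding table and lookup helper (names are illustrative, not from the paper):

/* Hypothetical sketch: validating a binding object the way a kernel
 * validates a file descriptor – an index into a per-process table,
 * range-checked and liveness-checked on every call. */
#include <stddef.h>
#include <stdbool.h>

#define MAX_BINDINGS 64

struct binding {
    bool  in_use;          /* slot allocated at bind time        */
    void *server_entry;    /* server stub address cached at bind */
};

static struct binding binding_table[MAX_BINDINGS];   /* one table per client */

/* Returns the binding if the object (an index) is authentic, else NULL. */
static struct binding *binding_from_object(int binding_object)
{
    if (binding_object < 0 || binding_object >= MAX_BINDINGS)
        return NULL;                        /* out of range: reject   */
    if (!binding_table[binding_object].in_use)
        return NULL;                        /* never bound or revoked */
    return &binding_table[binding_object];
}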

2. Why use RPC for structuring a system
   1. Easy to use compared to alternatives – the compiler handles most of the details
   2. Easy to build a protected subsystem
   3. Allows moving components out of the kernel if it is fast enough
   4. More reliable
   5. Easier to extend
   6. Faster RPC makes it possible to structure systems differently; this brings up the issue of evaluating new capabilities

3. Overview
   1. General approach:
      i. Analyze a system
      ii. Find an untapped opportunity; some common behavior that can be optimized
         1. E.g. small arguments
         2. Fixed-size arguments
         3. Unstructured arguments (e.g. buffers vs. types needing marshalling)
         4. Unnecessary operations (e.g. copying data)
      iii. Measure the overhead that you could remove; best-case performance
      iv. Build an optimized version that takes advantage of the opportunity
      v. Go on to fame and fortune
   2. Opportunities
      i. RPC used for structuring systems:
         1. Client / server (e.g. Windows services, name server)
         2. NFS file server – used for sending requests to the server
      ii. Common case is not remote calls / large arguments
         1. Common case is local calls when used in systems with microkernels (only 1-6% of calls are remote)
         2. Common case is small, fixed-size arguments
            a. 60% were < 32 bytes
            b. 80% of arguments were fixed size at compile time
            c. 2/3 of procedures have fixed-size arguments
   3. QUESTION: how legitimate is this study? Look both at existing microkernel systems and at future uses (e.g. system calls)
   4. QUESTION: why not look at socket applications? Could look at domain sockets (local sockets). Internet applications relying on RFCs aren't going to convert to RPC.

4. How LRPC works
   1. Approach
      i. Do everything in advance
         1. e.g. allocating stacks
         2. e.g. setting up dispatch (no dynamic dispatch in the server)
      ii. Remove unnecessary copies
         1. Memory copies are a huge cause of performance problems
   2. Doing things in advance (data structures sketched below)
      i. Bind
         1. Create a procedure descriptor (PD) list with an entry for each exported procedure
         2. Allocate shared A-stacks and corresponding linkage records (for the caller's return address) for each procedure
            a. QUESTION: Why allocate stacks for all procedures?
               i. ANSWER: want contiguity for easy range checking
         3. Return a binding object to the client runtime to identify the binding
         4. KEY POINT: the binding object contains the server function address – no need for dispatch in the server
      ii. Call
         1. QUESTION: How do they know whether a call is local or remote?
            a. ANSWER: at bind time, cache a bit of information
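
A minimal C sketch of what the bind-time state could look like, assuming hypothetical structure and field names (the paper describes the concepts – PDs, A-stacks, linkage records, the binding object – but not this exact layout):

/* Illustrative bind-time structures: one procedure descriptor per exported
 * procedure, pre-allocated A-stacks paired with linkage records, and a
 * binding object that caches the server entry point and whether the
 * server is local. All names are hypothetical. */
#include <stddef.h>
#include <stdbool.h>

struct linkage {
    void           *caller_return_addr;  /* filled in by the kernel per call */
    struct linkage *prev;                /* allows nested LRPC calls         */
};

struct a_stack {
    char           args[256];   /* shared argument area (size illustrative) */
    struct linkage link;        /* linkage record paired with this A-stack  */
    bool           free;        /* currently on the client's free queue?    */
};

struct proc_desc {
    void           *server_entry;  /* server stub address: no server dispatch */
    size_t          arg_size;      /* fixed argument size known at bind time  */
    struct a_stack *astacks;       /* contiguous array -> easy range checking */
    int             n_astacks;
};

struct binding_obj {
    bool              is_local;   /* cached at bind time: LRPC vs. remote RPC */
    struct proc_desc *pd_list;    /* one PD per procedure in the interface    */
    int               n_procs;
};

The contiguous astacks array is one way to get the "contiguity for easy range checking" noted above, and is_local is the cached bit that lets the client stub pick the local path at call time.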

   3. Copy avoidance
      i. Client stub grabs an A-stack off the queue
      ii. Pushes the arguments onto the A-stack
      iii. Passes the A-stack, binding object, and procedure identifier in registers to the kernel
      iv. Kernel (see the call-path sketch after this list):
         1. Verifies the binding and procedure identifier
         2. Locates the procedure descriptor
         3. Verifies the A-stack and locates the linkage for that A-stack
         4. Verifies ownership of the A-stack
         5. Records the caller's return address in the linkage
         6. Pushes the linkage onto the thread (so calls can nest)
         7. Finds an execution stack (E-stack) in the server for the call to run on (from a pool)
         8. Updates the thread to point at the E-stack
         9. Changes the processor address space
         10. Calls into the server stub at the address in the PD
      v. Notes:
         1. Can use a separate argument stack because the language (Modula2+) supports it; C would need to copy arguments to the E-stack
            a. What else could you do? Put the A-stack/E-stack on attached pages
         2. By-reference objects are copied to the A-stack by the client stub
            a. PRINCIPLE: the client does the copying work
            b. PRINCIPLE: the client stub does the work, the kernel verifies (e.g. choosing the A-stack)
            c. QUESTION: What are alternatives to having the client stub do the copying?
         3. What about thread-local storage in the server? Or thread-init routines for DLLs?
      vi. QUESTION: What about writing with shared memory?
         1. ANSWER: no isolation
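
A hedged C sketch of the kernel-side call path listed above, reusing the structures from the bind-time sketch; the helper functions (verify_astack, alloc_e_stack, switch_address_space, upcall, etc.) are placeholders, not real kernel APIs:

/* Illustrative kernel path for a local call: verify what the client stub
 * handed over, then dispatch directly to the server stub address cached
 * in the procedure descriptor. */
struct thread;                 /* details elided in this sketch */
struct e_stack;

extern struct thread      *current_thread(void);
extern struct binding_obj *find_binding(int binding_object);
extern struct a_stack     *verify_astack(struct proc_desc *pd, void *astack);
extern struct e_stack     *alloc_e_stack(struct proc_desc *pd);
extern void set_thread_stack(struct thread *t, struct e_stack *es);
extern void switch_address_space(struct proc_desc *pd);
extern void push_linkage(struct thread *t, struct linkage *lk);
extern void upcall(void *server_entry, struct a_stack *as);

int lrpc_call(int binding_object, int proc_id, void *astack, void *ret_addr)
{
    struct binding_obj *b = find_binding(binding_object);
    if (!b || proc_id < 0 || proc_id >= b->n_procs)
        return -1;                                   /* bad binding or proc id  */

    struct proc_desc *pd = &b->pd_list[proc_id];     /* locate the PD           */
    struct a_stack *as = verify_astack(pd, astack);  /* range check + ownership */
    if (!as)
        return -1;

    as->link.caller_return_addr = ret_addr;          /* record return address   */
    push_linkage(current_thread(), &as->link);       /* so calls can nest       */

    set_thread_stack(current_thread(), alloc_e_stack(pd));  /* E-stack from pool */
    switch_address_space(pd);                        /* move into the server    */
    upcall(pd->server_entry, as);                    /* no dispatch in server   */
    return 0;
}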

   4. Copying safety
      i. Normal RPC makes copies of the arguments
         1. Many times – up to 4 times
      ii. QUESTION: What is the benefit of copying?
         1. Ensures copy (call-by-value) semantics; client changes can't corrupt the server
      iii. LRPC uses shared A-stacks accessible to both processes
         1. The client can overwrite the A-stack while the server is accessing it
      iv. Solution (see the server-stub sketch below):
         1. The server can copy/verify data only if needed
         2. Not needed for opaque parameters (e.g. buffers)
         3. The server can integrate validity checks with the copying
         4. Adds at most one extra copy (on top of the initial one)
         5. COMMENT: more like a system call, where the kernel validates parameters
         6. ISSUE: complicates the server
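
A minimal sketch, assuming a hypothetical fixed-size argument record, of how a server stub could integrate validation with the one extra copy so that later client writes to the shared A-stack cannot change what was checked:

/* Illustrative server stub: copy the argument record out of the shared
 * A-stack into private memory, then validate and use only the private
 * copy. Names and the argument layout are made up for this example. */
#include <string.h>
#include <stdint.h>

struct open_args {                /* example fixed-size argument record */
    uint32_t name_len;
    char     name[64];
};

/* Returns 0 on success, -1 if the arguments are malformed. */
int server_stub_open(const struct open_args *shared /* lives in the A-stack */)
{
    struct open_args priv;                        /* server-private copy    */
    memcpy(&priv, shared, sizeof priv);           /* the one extra copy     */

    if (priv.name_len == 0 || priv.name_len > sizeof priv.name)
        return -1;                                /* validate the copy only */
    priv.name[priv.name_len - 1] = '\0';          /* never trust `shared`   */

    /* ... hand &priv to the real implementation, never `shared` ... */
    return 0;
}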

   5. Reliability
      i. What do you do if a server thread crashes?
      ii. QUESTION: what is the key problem?
         1. ANSWER: the client thread has been taken over by the server; you can't just time out because the server is actively using it
      iii. Solution: duplicate the client thread's state into a new thread

5. Evaluation
   1. COMMENT: a good evaluation explains why the performance is better, it doesn't just show that it is better
   2. Example: was just on a PC meeting where one paper showed a 100x speedup but didn't explain it. The PC felt that the authors didn't understand their own system, because the code they explained didn't justify a 100x increase. Result: paper dinged.
      i. In this paper: the authors didn't show which pieces of RPC were bad
      ii. Showing the minimum possible (best-case) performance gets around this from the other direction

6. Commentary
   1. Limitations:
      i. Assumes no per-thread application state
      ii. Relies on the argument stack pointer (a separate argument stack) to avoid copying / changing protection on the execution stack
   2. Idea used in Windows NT
      i. Dave Cutler drove from MS over to UWash for a meeting
      ii. The Windows version (LPC) is different
         1. No shared stacks
         2. Pre-allocated shared memory if large objects are needed
         3. Handoff scheduling for low latency
         4. Still has to copy messages many times
            a. Into a user-mode message
            b. Into a kernel-mode message
            c. Into a server message
            d. Onto the server stack
         5. Quick LPC (sketched below):
            a. Dedicated server thread
            b. Dedicated shared memory with the server thread
            c. Event pair for signaling message arrival / reply arrival
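
A portable analogue of the event-pair handoff – a minimal C sketch using two POSIX semaphores in shared memory. This only illustrates the request/reply signaling pattern between a client and its dedicated server thread; it is not the NT event-pair API and does not capture NT's handoff scheduling:

/* Client and its dedicated server thread share a message buffer and two
 * semaphores, one per direction. The semaphores must be set up once with
 * sem_init(&ep->..., 1 /* shared between processes */, 0) before use. */
#include <semaphore.h>

struct event_pair {
    sem_t request_ready;   /* client -> server: message is in shared memory */
    sem_t reply_ready;     /* server -> client: reply is in shared memory   */
};

/* Client side: publish the request, wake the server, wait for the reply. */
void client_call(struct event_pair *ep)
{
    /* ... write arguments into the dedicated shared-memory region ... */
    sem_post(&ep->request_ready);
    sem_wait(&ep->reply_ready);
    /* ... read results back out of shared memory ... */
}

/* Dedicated server thread: loop waiting for requests. */
void server_loop(struct event_pair *ep)
{
    for (;;) {
        sem_wait(&ep->request_ready);
        /* ... read arguments, do the work, write the reply ... */
        sem_post(&ep->reply_ready);
    }
}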

   3. How important is fast IPC?
      i. Systems are never fast enough
      ii. If code is called frequently, there is always the temptation to move it into the kernel