LRPC

1. Questions
   1. How stable is "common case"?
      i. ANSWER: look forward at what it could be used for, e.g. system calls?
   2. Didn't address exceptions
   3. No analysis of what the slow part was – is that necessary?
   4. Dynamic switching of stubs – is it necessary? Could it be done at compile time?
   5. Didn't compare cross-machine performance.
      i. COMMENT: a good evaluation explains things well, but leaves you wanting more because you are interested. Nobody evaluates everything. A bad evaluation leaves you confused and wanting more because you didn't understand things.
   6. Only compared 3 operating systems
   7. Not sure whether it is general enough
      i. COMMENT: how much more general could it be? Often there is a natural performance cliff where the next piece of complexity causes a big performance hit.
   8. How do you tell an authentic binding object?
      i. ANSWER: it can be a table index, just like a file descriptor (see the sketch below)

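A minimal C sketch of the "table index" answer, assuming a hypothetical per-process binding table and lookup helper (names are illustrative, not from the paper):

/* Hypothetical sketch: validating a binding object the way a kernel
 * validates a file descriptor – an index into a per-process table,
 * range-checked and liveness-checked on every call. */
#include <stddef.h>
#include <stdbool.h>

#define MAX_BINDINGS 64

struct binding {
    bool  in_use;          /* slot allocated at bind time        */
    void *server_entry;    /* server stub address cached at bind */
};

static struct binding binding_table[MAX_BINDINGS];   /* one table per client */

/* Returns the binding if the object (an index) is authentic, else NULL. */
static struct binding *binding_from_object(int binding_object)
{
    if (binding_object < 0 || binding_object >= MAX_BINDINGS)
        return NULL;                        /* out of range: reject   */
    if (!binding_table[binding_object].in_use)
        return NULL;                        /* never bound or revoked */
    return &binding_table[binding_object];
}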

2. Why use RPC for structuring a system
   1. Easy to use compared to alternatives – the compiler handles most of the details
   2. Easy to build a protected subsystem
   3. Allows moving components out of the kernel if it is fast enough
   4. More reliable
   5. Easier to extend
   6. Faster RPC makes it possible to structure systems differently; this brings up the issue of evaluating new capabilities

3. Overview
   1. General approach:
      i. Analyze a system
      ii. Find an untapped opportunity; some common behavior that can be optimized
         1. E.g. small arguments
         2. Fixed-size arguments
         3. Unstructured arguments (e.g. buffers vs. types needing marshalling)
         4. Unnecessary operations (e.g. copying data)
      iii. Measure the overhead that you could remove; best-case performance
      iv. Build an optimized version that takes advantage of the opportunity
      v. Go on to fame and fortune
   2. Opportunities
      i. RPC used for structuring systems:
         1. Client / server (e.g. Windows services, name server)
         2. NFS file server – used for sending requests to the server
      ii. Common case is not remote calls / large arguments
         1. Common case is local calls when used in systems with microkernels (only 1-6% of calls are remote)
         2. Common case is small, fixed-size arguments
            a. 60% were < 32 bytes
            b. 80% of arguments were fixed size at compile time
            c. 2/3 of procedures have fixed-size arguments
   3. QUESTION: how legitimate is this study? Look both at existing microkernel systems and at future uses (e.g. system calls)
   4. QUESTION: why not look at socket applications? Could look at domain sockets (local sockets). Internet applications relying on RFCs aren't going to convert to RPC.

4. How LRPC works
   1. Approach
      i. Do everything in advance
         1. e.g. allocating stacks
         2. e.g. setting up dispatch (no dynamic dispatch in the server)
      ii. Remove unnecessary copies
         1. Memory copies are a huge cause of performance problems
   2. Doing things in advance (data structures sketched below)
      i. Bind
         1. Create a procedure descriptor (PD) list with an entry for each exported procedure
         2. Allocate shared A-stacks and corresponding linkage records (for the caller's return address) for each procedure
            a. QUESTION: Why allocate stacks for all procedures?
               i. ANSWER: want contiguity for easy range checking
         3. Return a binding object to the client runtime to identify the binding
         4. KEY POINT: the binding object contains the server function address – no need for dispatch in the server
      ii. Call
         1. QUESTION: How do they know whether a call is local or remote?
            a. ANSWER: at bind time, cache a bit of information
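
A minimal C sketch of what the bind-time state could look like, assuming hypothetical structure and field names (the paper describes the concepts – PDs, A-stacks, linkage records, the binding object – but not this exact layout):

/* Illustrative bind-time structures: one procedure descriptor per exported
 * procedure, pre-allocated A-stacks paired with linkage records, and a
 * binding object that caches the server entry point and whether the
 * server is local. All names are hypothetical. */
#include <stddef.h>
#include <stdbool.h>

struct linkage {
    void           *caller_return_addr;  /* filled in by the kernel per call */
    struct linkage *prev;                /* allows nested LRPC calls         */
};

struct a_stack {
    char           args[256];   /* shared argument area (size illustrative) */
    struct linkage link;        /* linkage record paired with this A-stack  */
    bool           free;        /* currently on the client's free queue?    */
};

struct proc_desc {
    void           *server_entry;  /* server stub address: no server dispatch */
    size_t          arg_size;      /* fixed argument size known at bind time  */
    struct a_stack *astacks;       /* contiguous array -> easy range checking */
    int             n_astacks;
};

struct binding_obj {
    bool              is_local;   /* cached at bind time: LRPC vs. remote RPC */
    struct proc_desc *pd_list;    /* one PD per procedure in the interface    */
    int               n_procs;
};

The contiguous astacks array is one way to get the "contiguity for easy range checking" noted above, and is_local is the cached bit that lets the client stub pick the local path at call time.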

   3. Copy avoidance
      i. Client stub grabs an A-stack off the queue
      ii. Pushes the arguments onto the A-stack
      iii. Passes the A-stack, binding object, and procedure identifier in registers to the kernel
      iv. Kernel (see the call-path sketch after this list):
         1. Verifies the binding and procedure identifier
         2. Locates the procedure descriptor
         3. Verifies the A-stack and locates the linkage for that A-stack
         4. Verifies ownership of the A-stack
         5. Records the caller's return address in the linkage
         6. Pushes the linkage onto the thread (so calls can nest)
         7. Finds an execution stack (E-stack) in the server for the call to run on (from a pool)
         8. Updates the thread to point at the E-stack
         9. Changes the processor address space
         10. Calls into the server stub at the address in the PD
      v. Notes:
         1. Can use a separate argument stack because the language (Modula2+) supports it; C would need to copy arguments to the E-stack
            a. What else could you do? Put the A-stack/E-stack on attached pages
         2. By-reference objects are copied to the A-stack by the client stub
            a. PRINCIPLE: the client does the copying work
            b. PRINCIPLE: the client stub does the work, the kernel verifies (e.g. choosing the A-stack)
            c. QUESTION: What are alternatives to having the client stub do the copying?
         3. What about thread-local storage in the server? Or thread-init routines for DLLs?
      vi. QUESTION: What about writing with shared memory?
         1. ANSWER: no isolation
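
A hedged C sketch of the kernel-side call path listed above, reusing the structures from the bind-time sketch; the helper functions (verify_astack, alloc_e_stack, switch_address_space, upcall, etc.) are placeholders, not real kernel APIs:

/* Illustrative kernel path for a local call: verify what the client stub
 * handed over, then dispatch directly to the server stub address cached
 * in the procedure descriptor. */
struct thread;                 /* details elided in this sketch */
struct e_stack;

extern struct thread      *current_thread(void);
extern struct binding_obj *find_binding(int binding_object);
extern struct a_stack     *verify_astack(struct proc_desc *pd, void *astack);
extern struct e_stack     *alloc_e_stack(struct proc_desc *pd);
extern void set_thread_stack(struct thread *t, struct e_stack *es);
extern void switch_address_space(struct proc_desc *pd);
extern void push_linkage(struct thread *t, struct linkage *lk);
extern void upcall(void *server_entry, struct a_stack *as);

int lrpc_call(int binding_object, int proc_id, void *astack, void *ret_addr)
{
    struct binding_obj *b = find_binding(binding_object);
    if (!b || proc_id < 0 || proc_id >= b->n_procs)
        return -1;                                   /* bad binding or proc id  */

    struct proc_desc *pd = &b->pd_list[proc_id];     /* locate the PD           */
    struct a_stack *as = verify_astack(pd, astack);  /* range check + ownership */
    if (!as)
        return -1;

    as->link.caller_return_addr = ret_addr;          /* record return address   */
    push_linkage(current_thread(), &as->link);       /* so calls can nest       */

    set_thread_stack(current_thread(), alloc_e_stack(pd));  /* E-stack from pool */
    switch_address_space(pd);                        /* move into the server    */
    upcall(pd->server_entry, as);                    /* no dispatch in server   */
    return 0;
}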

   4. Copying safety
      i. Normal RPC makes copies of the arguments
         1. Many times – up to 4 times
      ii. QUESTION: What is the benefit of copying?
         1. Ensures copy (call-by-value) semantics; client changes can't corrupt the server
      iii. LRPC uses shared A-stacks accessible to both processes
         1. The client can overwrite the A-stack while the server is accessing it
      iv. Solution (see the server-stub sketch below):
         1. The server can copy/verify data only if needed
         2. Not needed for opaque parameters (e.g. buffers)
         3. The server can integrate validity checks with the copying
         4. Adds at most one extra copy (on top of the initial one)
         5. COMMENT: more like a system call, where the kernel validates parameters
         6. ISSUE: complicates the server
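
A minimal sketch, assuming a hypothetical fixed-size argument record, of how a server stub could integrate validation with the one extra copy so that later client writes to the shared A-stack cannot change what was checked:

/* Illustrative server stub: copy the argument record out of the shared
 * A-stack into private memory, then validate and use only the private
 * copy. Names and the argument layout are made up for this example. */
#include <string.h>
#include <stdint.h>

struct open_args {                /* example fixed-size argument record */
    uint32_t name_len;
    char     name[64];
};

/* Returns 0 on success, -1 if the arguments are malformed. */
int server_stub_open(const struct open_args *shared /* lives in the A-stack */)
{
    struct open_args priv;                        /* server-private copy    */
    memcpy(&priv, shared, sizeof priv);           /* the one extra copy     */

    if (priv.name_len == 0 || priv.name_len > sizeof priv.name)
        return -1;                                /* validate the copy only */
    priv.name[priv.name_len - 1] = '\0';          /* never trust `shared`   */

    /* ... hand &priv to the real implementation, never `shared` ... */
    return 0;
}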

   5. Reliability
      i. What do you do if a server thread crashes?
      ii. QUESTION: what is the key problem?
         1. ANSWER: the client thread has been taken over by the server; you can't just time out because the server is actively using it
      iii. Solution: duplicate the client thread's state into a new thread

5. Evaluation
   1. COMMENT: a good evaluation explains why the performance is better, it doesn't just show that it is better
   2. Example: was just on a PC meeting where one paper showed a 100x speedup but didn't explain it. The PC felt that the authors didn't understand their own system, because the code they explained didn't justify a 100x increase. Result: paper dinged.
      i. In this paper: the authors didn't show which pieces of RPC were bad
      ii. Showing the minimum possible (best-case) performance gets around this from the other direction

6. Commentary
   1. Limitations:
      i. Assumes no per-thread application state
      ii. Relies on the argument stack pointer (a separate argument stack) to avoid copying / changing protection on the execution stack
   2. Idea used in Windows NT
      i. Dave Cutler drove from MS over to UWash for a meeting
      ii. The Windows version (LPC) is different
         1. No shared stacks
         2. Pre-allocated shared memory if large objects are needed
         3. Handoff scheduling for low latency
         4. Still has to copy messages many times
            a. Into a user-mode message
            b. Into a kernel-mode message
            c. Into a server message
            d. Onto the server stack
         5. Quick LPC (sketched below):
            a. Dedicated server thread
            b. Dedicated shared memory with the server thread
            c. Event pair for signaling message arrival / reply arrival
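
A portable analogue of the event-pair handoff – a minimal C sketch using two POSIX semaphores in shared memory. This only illustrates the request/reply signaling pattern between a client and its dedicated server thread; it is not the NT event-pair API and does not capture NT's handoff scheduling:

/* Client and its dedicated server thread share a message buffer and two
 * semaphores, one per direction. The semaphores must be set up once with
 * sem_init(&ep->..., 1 /* shared between processes */, 0) before use. */
#include <semaphore.h>

struct event_pair {
    sem_t request_ready;   /* client -> server: message is in shared memory */
    sem_t reply_ready;     /* server -> client: reply is in shared memory   */
};

/* Client side: publish the request, wake the server, wait for the reply. */
void client_call(struct event_pair *ep)
{
    /* ... write arguments into the dedicated shared-memory region ... */
    sem_post(&ep->request_ready);
    sem_wait(&ep->reply_ready);
    /* ... read results back out of shared memory ... */
}

/* Dedicated server thread: loop waiting for requests. */
void server_loop(struct event_pair *ep)
{
    for (;;) {
        sem_wait(&ep->request_ready);
        /* ... read arguments, do the work, write the reply ... */
        sem_post(&ep->reply_ready);
    }
}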

   3. How important is fast IPC?
      i. Systems are never fast enough
      ii. If code is called frequently, there is always the temptation to move it into the kernel