VMWARE

Notes from reviews:
-----
What if hash collides? A: use chaining; check page completely

Want more information!
- see ASPLOS paper
- see OSDI networking paper

Random sampling: is it good enough? ANSWER: look at results!

-----

NOTE FOR FUTURE REVIEWS: include a summary of (a) what kind of
performance is being improved, and (b) the high-level approach to
improving performance
NOTE: No relevance needed

-----


Virtual machine overview

- use SW to implement privileged machine interface
- allows multiple OS to run on a single machine
- e.g. when the guest changes the page table register, it traps into
software -- the VMM or hypervisor -- which updates the real hardware,
perhaps in a different way.
- VMM interposes on all access to hardware: e.g. privileged
instructions, i/o. Reimplements these in software. E.g. for a block
device, translates block numbers. Like virtual memory
- OS running in a vm is called a "guest"

QUESTION: why not add blades instead of running 10 VMs on a host?
ANSWER:
- management costs lower on a single node, (e.g. fewer things to break)
- can have workloads that require only a fraction of a blade
- encapsulates state for migration, save & restore, checkpoint, resume
- simplifies mgmt (e.g. no hardware dependencies in OS)

QUESTION: How much is slowdown of virtual machine? Where does it come
from?
ANSWER:
SPECint: 0-10%
SPECjbb: 1-2%
Compile, apache, 2d graphics: 40-80%
OVERHEAD COMES FROM:

- every entry into the kernel (trap, system call) costs ~2x because it
must enter the VMM first

- all privileged instructions (e.g. cli, load tss, iret, modify pte)
cause overhead


VM Memory management:
- guest OS sees "physical" addresses, that it puts in a page table
- VMM translates "physical" addresses into "machine addresses" that
the HW understands.
- Essentially, double translation
- Optimization: stick the composed VA -> MA translation directly in a
shadow page table (a cache that the HW walks); on a fault, consult
the VA -> PA -> MA translation
- the VMM's pmap holds the PA -> MA translation
- Key point: can change PA->MA translation in VMM: can share pages,
swap out pages
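The double translation above can be sketched in a few lines. This is an illustrative model, not VMware's code: `guest_pt`, `pmap`, and `shadow` are toy dictionaries standing in for the guest page table, the VMM's PA -> MA map, and the shadow page table the hardware walks.

```python
# Sketch (not VMware's actual code): two-level address translation.
# The guest page table maps VA -> PA; the VMM's pmap maps PA -> MA; the
# shadow page table caches the composed VA -> MA mapping for the MMU.

guest_pt = {0x1000: 0x5000}   # VA -> "physical" (guest-visible) page
pmap     = {0x5000: 0x9000}   # PA -> machine page, owned by the VMM
shadow   = {}                 # VA -> MA, what the hardware actually walks

def translate(va):
    """Return the machine address for va, filling the shadow on a miss."""
    if va not in shadow:                 # shadow page fault
        pa = guest_pt[va]                # walk the guest's page table
        shadow[va] = pmap[pa]            # compose and cache VA -> MA
    return shadow[va]

print(hex(translate(0x1000)))            # 0x9000
```

The key point from above shows up here: the VMM can change `pmap` (to share or swap a page) and just invalidate the affected shadow entries, without the guest noticing.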

VM uses

- server consolidation: take many apps that were on different servers
and put them on one. Saves management costs, power, computers

- run multiple OS: can run an OS for one application, or for testing

VM styles

- hosted: runs on top of existing OS, uses OS services to allocate
memory and I/O. E.g. send() for sending a packet, read() for reading
a block off disk. Benefit: uses existing driver, co-exist with
native environment for speed

- Native: no underlying OS. VMM provides own drivers, own everything

QUESTION: why the choice? Native faster but coexists less well --
nothing runs at native speed, e.g. graphics drivers

- Pure virtualization: run unmodified OS. Anything OS does must be
handled by VMM. E.g. OS updates page table, VMM must capture and
update real page table

- Paravirtualization: modify OS to avoid tricky cases. E.g. have OS
make calls to update va->pa mappings rather than edit page
table. E.g. have special drivers that call VMM rather than emulating
existing HW.

QUESTION: why the choice? Pure virtualization more flexible, can run
anything. Paravirtualization faster, easier to implement

VMWARE PROBLEM:

- Want to statistically multiplex OSes, so memory is used efficiently
(avoid the cost of buying lots of memory)
- Can't control how the guest OS manages memory directly; must tell it
that it has a fixed amount of memory (e.g. no hotplug / hot-remove)

GOAL:

- Vary machine memory allocated to a VM
- separate policy (how much memory, what memory) from mechanism (how
do you get pages)

Mechanisms: for removing / adding pages

Policies: for determining how much memory a VM should have, when to
reclaim, how much to reclaim (or grant)

----------------------------
Mechanisms:
----------------------------

1. Worst case: paging. Can select pages from a VM and swap them to
disk
QUESTION: why is this so bad? It's the same mechanism the OS uses
ANSWER: incomplete knowledge of what pages should be taken. E.g. might
interfere with working set policy of OS -- page not used, but
because process not scheduled but will be soon, or page is pinned
in OS for good reasons.
ANSWER: double-paging. VMM may swap a LRU page, then OS swaps same
page -- double cost, single benefit

QUESTION: VMWARE uses random paging. WHY? Is it a good or bad thing?

2. Better case: ask OS to give back memory. Ballooning


Key idea: allocate physical pages in OS, give them to VMM

IMPLEMENTATION: write a kernel driver (or could be usermode) that
allocates pages and pins them in memory.

QUESTION: does VMM have to swap them?
A: no, balloon driver owns pages, doesn't care about contents

QUESTION: why does this work?
A: OS will either give memory off free list to driver, or will remove
other things from memory to give to driver. E.g. driver has higher
priority on memory than existing uses
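The balloon mechanism can be shown as a toy model. The class and method names below are illustrative (not VMware's or any real kernel's API): the driver allocates and pins guest pages, and the machine pages backing them become reclaimable because their contents no longer matter.

```python
# Toy model of ballooning (illustrative names, not a real driver API).
# The balloon driver allocates and pins guest-physical pages; the VMM
# can then reuse the machine pages backing them.

class Guest:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))  # guest free list
        self.balloon = []                     # pages pinned by the driver

    def inflate(self, n):
        """Balloon driver asks the guest allocator for n pages and pins
        them. If the free list were short, a real guest would evict its
        own pages (using its own replacement policy) to satisfy this."""
        for _ in range(n):
            self.balloon.append(self.free.pop())
        return self.balloon[-n:]              # page numbers handed to the VMM

g = Guest(total_pages=8)
reclaimable = g.inflate(3)
print(len(g.free), len(reclaimable))          # 5 3
```

This captures the key observation below: the guest, not the VMM, picks which pages are given up.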


KEY OBSERVATION: pages are explicitly reclaimed from a VM/OS. Not
chosen at random. Unlike global page replacement policies,
e.g. clock. VMM has to ask an OS for a page, or swap a page from an
OS.


3. Best case: share pages

Idea: if two pages have same contents, only need to store one copy.
Example: all the pages that are zero filled
Example: two VMs run same kernel, same binaries --> probably text
segments have same bits

BIG PROBLEM: how do you find duplicate pages? compare all pages?

ANSWER: build an index

IMPLEMENTATION:
1. pick candidate pages
2. hash contents, search hash table
3. On hit: compare full pages, set up COW
QUESTION: why bother comparing if hashes match?
ANSWER: hash is small enough that collisions could exist
QUESTION: why not make hash bigger?
ANSWER: tradeoff -- a bigger hash costs memory overhead, a smaller
hash costs full-page comparisons on false matches

4. On miss: add page as a "hint" - something that could be shared

Questions:
1. what pages do you pick? A: random
2. How many, how often? A: some rate mechanism. Trade overhead for
earlier detection of sharing. E.g. 100 pages / 30 seconds.
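Steps 1-4 above can be sketched as a small scanner. This is a simplified illustration (the `table`/`hints` structures and the truncated SHA-1 are my choices for the sketch, not the paper's exact data structures):

```python
# Sketch of content-based page sharing: hash, full compare, hint.
import hashlib

table = {}        # hash -> contents of a canonical shared page
hints = {}        # hash -> candidate page seen once (not yet shared)

def scan(page: bytes):
    """Return 'shared' if page matched an existing copy, else record a hint."""
    h = hashlib.sha1(page).digest()[:8]   # short hash: collisions possible
    if h in table:
        if table[h] == page:              # full compare: hash match != equal
            return "shared"               # real code would mark both pages COW
        return "collision"
    if h in hints:
        if hints[h] == page:
            table[h] = page               # promote the hint to a shared copy
            return "shared"
        return "collision"
    hints[h] = page                       # miss: remember page as a hint
    return "hint"

zero = bytes(4096)
print(scan(zero), scan(zero))             # hint shared
```

Note the full-page compare on a hash hit, answering the question above: the hash is only a filter, equality must be verified before setting up copy-on-write.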

REMEMBER: need more memory to improve performance, so not worth
spending too much time finding memory if the overhead outweighs the
performance gains


NOTE: HOW IMPORTANT IS THIS? 7-18%? Why not just buy more RAM?

------------------------
POLICY
------------------------

Big question: how do you allocate pages between virtual machines?
- global vs. local policy?

PROBLEM: Working set for an OS is not quite like for an application;
applications don't adapt to changing memory sizes but an OS does

PROBLEM: often have a desired performance goal or priorities for an
OS; want some minimum performance.

--> drives towards local policy

VMWARE POLICY: proportional share (we'll see this later)

Key idea:

- some pool of resources R
- Want to allocate fractions of it to different users
- would like a minimum guarantee, but efficient use of excess capacity

Solution:
- give each user a set of shares, like stock shares in a company
- minimum guarantee: each user gets at least its #shares / total
#shares fraction of the resource
- At any time, actual allocation is its #shares / total #shares of
users currently demanding the resource

Idea: under heavy use, get strict proportion. Under light use, can get
more in proportion to others who want more and their shares

Way to think about it: everybody who wants the resource buys lottery
tickets with their shares. Winner picked at random from all shares
bid. If you don't need the resource, you don't buy tickets
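The lottery view can be sketched directly. This is a minimal illustration of proportional-share draws (the function and share numbers are made up for the example):

```python
# Sketch of the lottery view of proportional share: each user who wants
# the resource bids all its shares; the winner is drawn with probability
# proportional to shares among the active bidders.
import random

def draw_winner(bids, rng=random):
    """bids: {user: shares}. Pick one user with probability shares/total."""
    total = sum(bids.values())
    ticket = rng.uniform(0, total)
    for user, shares in bids.items():
        ticket -= shares
        if ticket <= 0:
            return user
    return user  # guard against floating-point edge cases

# Users not bidding simply aren't in the dict, so active users split
# the idle users' fraction in proportion to their own shares.
wins = {"A": 0, "B": 0}
rng = random.Random(0)
for _ in range(10000):
    wins[draw_winner({"A": 300, "B": 100}, rng)] += 1
print(wins["A"] / 10000)   # roughly 0.75
```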


So: under full demand by everyone, all pay same price per page: shares
/ pages granted. When not everybody has full demand, some with fewer
shares will get more pages


RECLAMATION: when pages needed, search for VM that is paying the least
for its memory (e.g. got some memory when others didn't want it.)

Algorithm: dynamic min-funding revocation.

Example

VM 1: 100 shares
VM 2: 100 shares

Total memory: 400 MB

VM 1 starts running, acquires 256 MB for 100 shares
price = 100/256 = 0.39

VM2 starts running, gets remainder: 144 MB for 100 shares
price = 100/144 = 0.69

When VM2 wants more memory, it comes from VM1

Now VM1 has 200 MB, VM2 has 200 MB, both pay the same price
(100/200 = 0.5) - in equilibrium
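The example above can be run as code. A minimal sketch of min-funding revocation, treating 1 MB as the allocation unit for simplicity (the function names are mine, not the paper's):

```python
# Worked example as code: price = shares / pages; min-funding revocation
# takes pages from whichever VM currently pays the least per page.

def price(shares, pages):
    return shares / pages

def revoke(vms, n_pages):
    """vms: {name: [shares, pages]}. Reclaim n_pages one at a time from
    the VM paying the lowest price for its memory."""
    for _ in range(n_pages):
        victim = min(vms, key=lambda v: price(*vms[v]))
        vms[victim][1] -= 1
    return vms

vms = {"VM1": [100, 256], "VM2": [100, 144]}
revoke(vms, 56)                      # VM2 demands 56 more MB
vms["VM2"][1] += 56                  # grant the reclaimed memory to VM2
print(vms["VM1"][1], vms["VM2"][1])  # 200 200: equal prices, equilibrium
```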


NOTE: reclamation is kind of expensive; need to activate balloon or
swap pages.

QUESTION: is this the right policy? It doesn't guarantee timeliness,
just a minimum. NOTE: Real problem is not minimum guarantee, but how
to efficiently use memory above that.

---------
PREVENTING UNDERUSE OF MEMORY:

Problem: OS may have memory it is not using, e.g. free list or pages
not being referenced (e.g. non-pageable kernel pages that aren't
referenced). Could be better used in another VM.

QUESTION: does this problem arise in a normal OS?
ANSWER: yes, but handled by normal working set or clock algorithm -
unreferenced pages get replaced

QUESTION: why different in a VMM?
ANSWER: don't want to take a specific page (leave that to OS), but
want to measure memory usage and reclaim any page (with balloon)

SOLUTION:
Tax on idle memory
Concept: charge more for unused memory than used memory -- represents
a lost opportunity

Tax = t
Normal cost = 1
Taxed cost = 1/(1-t)
t = 0 --> taxed cost = 1, normal cost (counts as one page)
t = 0.5 -> taxed cost = 2, one idle page counts as two used pages
t = 1 --> taxed cost = infinite (counts as infinite pages)


Shares per page = shares / (pages * (frac-used + taxed-cost * frac-idle))


Example: 50% of pages idle, 100 shares, 100 pages, varying tax rate:

t = 0    rate = 100/(100*(0.5 + 0.5))  = 1
t = 0.5  rate = 100/(100*(0.5 + 1))    = 0.66
t = 1    rate = 100/(100*(0.5 + inf))  = 0
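The same computation as a small function, matching the three rows above (a sketch of the pricing rule, not production code):

```python
# Idle memory tax: shares-per-page "price", with each idle page charged
# 1/(1-t) times a used page.

def shares_per_page(shares, pages, frac_idle, t):
    """Price a VM pays per page; idle pages cost 1/(1-t) each."""
    taxed_cost = float("inf") if t >= 1 else 1 / (1 - t)
    frac_used = 1 - frac_idle
    charged = pages * (frac_used + taxed_cost * frac_idle)
    return 0.0 if charged == float("inf") else shares / charged

for t in (0, 0.5, 1):
    print(t, round(shares_per_page(100, 100, 0.5, t), 2))
# 0 -> 1.0, 0.5 -> 0.67, 1 -> 0.0
```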

Result: when pages needed, rate will look lower for those needing
pages

VMware choice: t = 0.75 (taxed cost = 4: one idle page counts as four
used pages)

----------

Detecting idle fraction:
QUESTION: how different than determining idle pages in an OS?
ANSWER: want to avoid bad interaction with OS page management
SOLUTION: randomly scan 100 pages every 30 seconds per VM, use as a
statistical estimate. Done by making a page invalid and catching the
trap in the VMM
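The sampling idea can be sketched as follows. This is a toy model of the estimator (the `touched_pages` set stands in for "pages that faulted after being invalidated"; the names are mine):

```python
# Sketch of statistical sampling: invalidate a random sample of pages,
# count how many are touched before the next scan; the untouched
# fraction estimates idleness without tracking every page.
import random

def estimate_idle(touched_pages, total_pages, sample_size, rng=random):
    """touched_pages: set of pages the guest referenced this period."""
    sample = rng.sample(range(total_pages), sample_size)
    untouched = sum(1 for p in sample if p not in touched_pages)
    return untouched / sample_size

rng = random.Random(1)
touched = set(range(500))            # guest actively uses half its pages
print(estimate_idle(touched, 1000, 100, rng))   # close to 0.5
```

Sampling 100 of 1000 pages gives an estimate with a standard error of about 0.05 here, which is cheap compared to tracking references on every page.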

QUESTION: from scan rate, how do you compute idleness?
QUESTION: how do you handle fluctuation?

ANSWER: Exponentially Weighted Moving Average:
new value = x * (last sample) + (1 - x) * (old value)

How choose x?
- high x weights towards recent values, responds quickly
- low x weights history more, takes a while to respond

QUESTION: what do you want in this situation?
ANSWER: quickly respond to needs for more memory, slowly handle
decrease.
HOW DONE: take the max of the slow and fast averages (plus the
current period's sample)
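A minimal sketch of that rule, assuming we smooth the sampled *active* (touched) fraction; the smoothing constants 0.5 and 0.1 are illustrative, not VMware's values:

```python
# Fast and slow EWMAs of the sampled active fraction. Taking the max
# (including the current sample) makes the estimate rise immediately
# when a VM starts touching memory and decay only gradually after it
# stops.

def ewma(old, sample, x):
    return x * sample + (1 - x) * old

class ActivityEstimator:
    def __init__(self):
        self.fast = 0.0   # high x: weights recent samples, responds quickly
        self.slow = 0.0   # low x: weights history, responds slowly

    def update(self, sample):
        self.fast = ewma(self.fast, sample, 0.5)
        self.slow = ewma(self.slow, sample, 0.1)
        return max(self.fast, self.slow, sample)

est = ActivityEstimator()
print(est.update(1.0))   # jumps to 1.0 at once: the current sample wins
print(est.update(0.0))   # decays gradually instead of dropping to 0
```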


----------------------
Other policies

- Admission control: ensure a VM only runs if there is enough memory
(memory for all mins, memory + swap for all maxes)

- Usage levels to trigger behavior:
  - ballooning
- swapping pages
- suspend a VM