VMWARE
Notes from reviews:
-----
What if the hash collides? A: use chaining; check the page completely
Want more information!
- see ASPLOS paper
- see OSDI networking paper
Random sampling: is it good enough? ANSWER: look at the results!
-----
NOTE FOR FUTURE REVIEWS: A summary of (a) what kind of performance is
being improved, and (b) the high-level approach to improving performance
NOTE: No relevance needed
-----
Virtual machine overview
- use SW to implement the privileged machine interface
- allows multiple OSes to run on a single machine
- e.g. when the guest changes the page table register, it traps into
  software -- the VMM or hypervisor -- which updates software state
  instead, and perhaps drives the real hardware in a different way
- VMM interposes on all access to hardware: e.g. privileged
  instructions, I/O. Reimplements these in software. E.g. for a block
  device, translates block numbers. Like virtual memory
- OS running in a VM is called a "guest"
QUESTION: why not add blades instead of running 10 VMs on a host?
ANSWER:
- management costs are lower on a single node (e.g. fewer things to break)
- can have workloads that require only a fraction of a blade
- encapsulates state for migration, save & restore, checkpoint, resume
- simplifies mgmt (e.g. no hardware dependencies in the OS)
QUESTION: How much is the slowdown of a virtual machine? Where does it
come from?
ANSWER:
SPECint: 0-10%
SPECjbb: 1-2%
Compile, apache, 2d graphics: 40-80%
OVERHEAD COMES FROM:
- every entry into the kernel (trap, system call) costs 2x, to enter the
  VMM first
- all privileged instructions (e.g. cli, load tss, iret, modify pte)
  cause overhead
VM Memory management:
- guest OS sees "physical" addresses, which it puts in its page tables
- VMM translates "physical" addresses into "machine addresses" that
  the HW understands
- Essentially, double translation
- Optimization: stick the VA -> MA translation directly in the page
  table the HW walks (as a cache); on a fault, consult the
  VA -> PA -> MA translation
- shadow page tables in the VMM hold this cached VA -> MA translation;
  the VMM keeps its own PA -> MA map
- Key point: the VMM can change the PA -> MA translation: can share
  pages, swap out pages
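CODE SKETCH (illustrative, not VMware's actual data structures): a tiny,
self-contained model of the double translation and the shadow-page-table
cache; the arrays guest_pt, pmap, and shadow_pt are made-up stand-ins for
the guest page table, the VMM's PA -> MA map, and the shadow page table.

    #include <stdio.h>

    #define NPAGES 8

    /* Guest page table: VA page -> "physical" page (what the guest thinks). */
    static int guest_pt[NPAGES] = {3, 1, 7, 2, 0, 5, 6, 4};
    /* VMM pmap: "physical" page -> machine page (what the HW really uses). */
    static int pmap[NPAGES]     = {6, 2, 0, 5, 1, 7, 3, 4};
    /* Shadow page table: cached VA page -> machine page, -1 = not cached. */
    static int shadow_pt[NPAGES] = {-1, -1, -1, -1, -1, -1, -1, -1};

    /* Translate a virtual page to a machine page, filling the shadow entry
     * on a miss (the "fault" path in the notes). */
    static int translate(int vpage)
    {
        if (shadow_pt[vpage] >= 0)          /* fast path: HW-walkable VA -> MA */
            return shadow_pt[vpage];

        int ppage = guest_pt[vpage];        /* guest's VA -> PA */
        int mpage = pmap[ppage];            /* VMM's PA -> MA */
        shadow_pt[vpage] = mpage;           /* cache the composed mapping */
        return mpage;
    }

    int main(void)
    {
        printf("vpage 2 -> mpage %d\n", translate(2));  /* miss, then cached */
        printf("vpage 2 -> mpage %d\n", translate(2));  /* hit */
        return 0;
    }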
VM uses
- server consolidation: take many apps that were on different servers
  and put them on one. Saves management costs, power, computers
- run multiple OSes: can run an OS for one application, or for testing
VM styles
- hosted: runs on top of an existing OS, uses OS services to allocate
  memory and do I/O. E.g. send() for sending a packet, read() for
  reading a block off disk (see the file-backed disk sketch below).
  Benefit: uses existing drivers, co-exists with the native environment
  for speed
- Native: no underlying OS. VMM provides its own drivers, own everything
QUESTION: why the choice? Native is faster but coexists less well --
nothing from the existing environment runs at native speed, e.g.
graphics drivers
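CODE SKETCH (illustrative assumption, not VMware's hosted product): a
hosted VMM can service a guest block read by calling host OS services;
here the guest "disk" is assumed to be a file named guest-disk.img and
the guest's block read becomes a host pread().

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 512

    /* Toy hosted-VMM disk backend: a guest block read is turned into a
     * pread() on a backing file provided by the host OS. */
    static int guest_block_read(int backing_fd, long block, void *buf)
    {
        ssize_t n = pread(backing_fd, buf, BLOCK_SIZE, block * BLOCK_SIZE);
        return n == BLOCK_SIZE ? 0 : -1;
    }

    int main(void)
    {
        int fd = open("guest-disk.img", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[BLOCK_SIZE];
        if (guest_block_read(fd, 0, buf) == 0)
            printf("read block 0 of the guest disk via the host OS\n");
        close(fd);
        return 0;
    }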
- Pure virtualization: run an unmodified OS. Anything the OS does must
  be handled by the VMM. E.g. the OS updates its page table, the VMM
  must capture this and update the real page table
- Paravirtualization: modify the OS to avoid the tricky cases. E.g. have
  the OS make calls to update VA->PA mappings rather than edit the page
  table directly (a minimal sketch follows the next question). E.g. have
  special drivers that call into the VMM rather than emulating existing
  HW.
QUESTION: why the choice? Pure virtualization is more flexible, can run
anything. Paravirtualization is faster, easier to implement
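CODE SKETCH (illustrative only, not a real hypercall interface): in
paravirtualization the modified guest calls an explicit entry point such
as the hypothetical hypercall_update_mapping() below, instead of writing
its own page table and relying on the VMM to trap and emulate the write.

    #include <stdio.h>

    #define NPAGES 8

    /* Toy "machine" page table that only the VMM may touch. */
    static int machine_pt[NPAGES];

    /* Paravirtual path: the guest asks the VMM to install a mapping via an
     * explicit call (a "hypercall").  The VMM validates and applies it;
     * the PA -> MA step is omitted to keep the sketch small. */
    static void hypercall_update_mapping(int vpage, int ppage)
    {
        if (vpage >= 0 && vpage < NPAGES)
            machine_pt[vpage] = ppage;
    }

    int main(void)
    {
        /* Guest kernel code, after paravirtual modification: */
        hypercall_update_mapping(3, 5);
        printf("vpage 3 now maps to page %d\n", machine_pt[3]);
        return 0;
    }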
VMWARE PROBLEM:
- Want to statistically multiplex OSes, so we can have efficient memory
  usage (avoid the cost of buying lots of memory)
- Can't control how the guest OS manages memory directly; must tell it
  that it has a fixed amount of memory (e.g. no hotplug / hot-remove)
GOAL:
- Vary the machine memory allocated to a VM
- Separate policy (how much memory, what memory) from mechanism (how
  do you get pages)
Mechanisms: for removing / adding pages
Policies: for determining how much memory a VM should have, when to
reclaim, and how much to reclaim (or grant)
----------------------------
Mechanisms:
----------------------------
1. Worst case: paging. Can select pages from a VM and swap them to disk
QUESTION: why is this so bad? it's what the OS itself uses
ANSWER: incomplete knowledge of which pages should be taken. E.g. might
interfere with the working set policy of the OS -- a page may look
unused only because its process isn't scheduled (but will be soon), or
the page is pinned in the OS for good reasons.
ANSWER: double-paging. The VMM may swap out an LRU page, then the OS
swaps the same page -- double cost, single benefit
QUESTION: VMWARE uses random paging. WHY? Is it a good or bad thing?
2. Better case: ask the OS to give back memory. Ballooning
Key idea: allocate physical pages inside the guest OS, give them to the
VMM
IMPLEMENTATION: write a kernel driver (or could be usermode) that
allocates pages and pins them in memory.
QUESTION: does the VMM have to swap them?
A: no, the balloon driver owns the pages, doesn't care about their
contents
QUESTION: why does this work?
A: the OS will either give memory off the free list to the driver, or
will remove other things from memory to give to the driver. E.g. the
driver has higher priority on memory than existing uses
KEY OBSERVATION: pages are explicitly reclaimed from a VM/OS. Not
chosen at random. Unlike global page replacement policies, e.g. clock:
the VMM has to ask an OS for a page, or swap a page from an OS.
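CODE SKETCH (a toy model, not the real balloon driver): inflate =
allocate and pin pages inside the guest and tell the VMM they can be
reused; deflate = free them back to the guest. malloc() and the
vmm_reclaim_page() stub stand in for pinned guest pages and the real
guest-to-VMM channel.

    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE   4096
    #define MAX_BALLOON 1024

    /* Pages the driver has allocated and "pinned" inside the guest. */
    static void *balloon[MAX_BALLOON];
    static int balloon_size;

    /* Hypothetical notification to the VMM that this guest page is now
     * free to be unmapped and reused for another VM. */
    static void vmm_reclaim_page(void *page)
    {
        printf("VMM may reuse machine page backing %p\n", page);
    }

    static void balloon_inflate(int npages)
    {
        while (npages-- > 0 && balloon_size < MAX_BALLOON) {
            void *page = malloc(PAGE_SIZE);   /* stand-in for a pinned guest page */
            if (!page)
                break;                        /* guest is under real pressure */
            balloon[balloon_size++] = page;
            vmm_reclaim_page(page);
        }
    }

    static void balloon_deflate(int npages)
    {
        while (npages-- > 0 && balloon_size > 0)
            free(balloon[--balloon_size]);    /* give pages back to the guest */
    }

    int main(void)
    {
        balloon_inflate(4);   /* VMM asked this VM to give up 4 pages */
        balloon_deflate(2);   /* pressure eased; return 2 pages to the guest */
        return 0;
    }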
3. Best case: share pages
Idea: if two pages have the same contents, only need to store one copy.
Example: all the pages that are zero-filled
Example: two VMs run the same kernel, same binaries --> probably the
text segments have the same bits
BIG PROBLEM: how do you find duplicate pages? compare all pages?
ANSWER: build an index
IMPLEMENTATION (see code sketch below):
1. pick candidate pages
2. hash contents, search a hash table
3. On hit: compare the full pages, set up COW
QUESTION: why bother comparing if the hashes match?
ANSWER: the hash is small enough that collisions could exist
QUESTION: why not make the hash bigger?
ANSWER: tradeoff memory overhead for comparison cost on a false match
4. On miss: add the page as a "hint" - something that could be shared
Questions:
1. what pages do you pick? A: random
2. How many, how often? A: some rate mechanism. Trade overhead for
earlier detection of sharing. E.g. 100 pages / 30 seconds.
REMEMBER: the point of reclaiming memory is to improve performance, so
it's not worth spending so much time finding shareable memory that it
overrides the performance gains
NOTE: HOW IMPORTANT IS THIS? 7-18%? Why not just buy more RAM?
------------------------
POLICY
------------------------
Big question: how do you allocate pages between virtual machines?
- global vs. local policy?
PROBLEM: the working set for an OS is not quite like that of an
application; applications don't adapt to changing memory sizes, but an
OS does
PROBLEM: often have a desired performance goal for an OS, or priorities
between OSes; want some minimum performance.
--> drives towards a local policy
VMWARE POLICY: proportional share (we'll see this later)
Key idea:
- some pool of resources R
- want to allocate fractions of it to different users
- would like a minimum guarantee, but efficient use of excess capacity
Solution:
- give each user a set of shares, like stock shares in a company
- a user's fraction is #shares / total #shares -- this is the minimum
  guarantee
- at any time, the amount of resource you get is your #shares / total
  #shares demanded
Idea: under heavy use, you get your strict proportion. Under light use,
you can get more, in proportion to your shares relative to the others
who want more
Way to think about it: everybody who wants the resource buys lottery
tickets with their shares. The winner is picked at random from all
shares bid. If you don't need the resource, you don't buy tickets
So: under full demand by everyone, all pay the same price per page:
shares / pages granted. When not everybody has full demand, some with
fewer shares will get more pages
RECLAMATION: when pages are needed, search for the VM that is paying the
least for its memory (e.g. got some memory when others didn't want it.)
Algorithm: dynamic min-funding revocation.
Example (see code sketch below):
VM 1: 100 shares
VM 2: 100 shares
Total memory: 400 MB
VM1 starts running, acquires 256 MB for 100 shares
  price = 100/256 = 0.4
VM2 starts running, gets the remainder: 144 MB for 100 shares
  price = 100/144 = 0.69
When VM2 wants more memory, it comes from VM1
Now VM1 has 200 MB, VM2 has 200 MB, both pay the same price - in
equilibrium
NOTE: reclamation is kind of expensive; need to activate the balloon or
swap pages.
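CODE SKETCH (simplified, ignoring the idle tax introduced below): price
= shares / pages held, and min-funding revocation takes pages from the
VM paying the lowest price. The numbers reproduce the VM1/VM2 example
above.

    #include <stdio.h>

    struct vm {
        const char *name;
        double shares;
        double pages;     /* MB allocated, for this example */
    };

    /* Price a VM pays per page: shares spent per page held.
     * Lower price = first to be revoked. */
    static double price(const struct vm *v)
    {
        return v->pages > 0 ? v->shares / v->pages : 1e9;
    }

    /* Dynamic min-funding revocation (simplified): take pages from the VM
     * currently paying the lowest price and give them to the requester. */
    static void revoke(struct vm *vms, int n, struct vm *to, double amount)
    {
        struct vm *victim = NULL;
        for (int i = 0; i < n; i++)
            if (&vms[i] != to && (!victim || price(&vms[i]) < price(victim)))
                victim = &vms[i];
        victim->pages -= amount;
        to->pages += amount;
    }

    int main(void)
    {
        struct vm vms[2] = {{"VM1", 100, 256}, {"VM2", 100, 144}};
        printf("%s price %.2f, %s price %.2f\n",
               vms[0].name, price(&vms[0]), vms[1].name, price(&vms[1]));

        revoke(vms, 2, &vms[1], 56);   /* VM2 wants more; comes from VM1 */
        printf("after revocation: %s=%.0f MB (%.2f), %s=%.0f MB (%.2f)\n",
               vms[0].name, vms[0].pages, price(&vms[0]),
               vms[1].name, vms[1].pages, price(&vms[1]));
        return 0;
    }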
QUESTION: is this the right policy? It doesn't guarantee timeliness,
just a minimum.
NOTE: the real problem is not the minimum guarantee, but how to
efficiently use memory above that.
---------
PREVENTING UNDERUSE OF MEMORY:
Problem: the OS may have memory it is not using, e.g. its free list or
pages not being referenced (e.g. non-pageable kernel pages that aren't
referenced). Could be better used in another VM.
QUESTION: does this problem arise in a normal OS?
ANSWER: yes, but it's handled by the normal working set or clock
algorithm - unreferenced pages get replaced
QUESTION: why is it different in a VMM?
ANSWER: don't want to take a specific page (leave that to the OS), but
want to measure memory usage and reclaim any page (with the balloon)
SOLUTION: Tax on idle memory
Concept: charge more for unused memory than for used memory -- idle
memory represents a lost opportunity
Tax rate = t
Normal cost = 1
Taxed cost = 1/(1-t)
t = 0 --> taxed cost = 1, same as normal cost (idle page counts as one page)
t = 0.5 --> taxed cost = 2, one idle page counts as two used pages
t = 1 --> taxed cost = infinite (counts as infinitely many pages)
Shares per page = shares / (pages * (frac-used + taxed-cost * frac-idle))
Example: 100 shares, 100 pages, 50% of pages idle, varying tax rate:
t = 0:   rate = 100 / (100*(0.5 + 1*0.5))   = 1
t = 0.5: rate = 100 / (100*(0.5 + 2*0.5))   = 0.66
t = 1:   rate = 100 / (100*(0.5 + inf*0.5)) = 0
Result: when pages are needed, VMs holding lots of idle pages look like
they are paying less per page, so pages are reclaimed from them first
VMware choice: t = 0.75 --> taxed cost = 4 (one idle page counts as
four used pages)
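CODE SKETCH (straight from the formula above): adjusted shares-per-page
with the idle-memory tax; reproduces the example numbers plus the
t = 0.75 default.

    #include <stdio.h>

    /* Adjusted shares-per-page ratio with an idle-memory tax: idle pages
     * cost 1/(1-t) times as much as actively used pages, so a VM hoarding
     * idle memory looks like it is paying less per page and loses pages
     * first under min-funding revocation. */
    static double shares_per_page(double shares, double pages,
                                  double frac_idle, double tax)
    {
        double taxed_cost = (tax >= 1.0) ? 1e18 : 1.0 / (1.0 - tax);
        double frac_used  = 1.0 - frac_idle;
        return shares / (pages * (frac_used + taxed_cost * frac_idle));
    }

    int main(void)
    {
        /* Example from the notes: 100 shares, 100 pages, 50% idle. */
        double taxes[] = {0.0, 0.5, 0.75, 1.0};
        for (int i = 0; i < 4; i++)
            printf("t = %.2f -> rate = %.2f\n", taxes[i],
                   shares_per_page(100, 100, 0.5, taxes[i]));
        return 0;
    }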
----------
Detecting idle fraction:
QUESTION: how is this different from determining idle pages in an OS?
ANSWER: want to avoid bad interactions with the OS's page management
SOLUTION: randomly scan 100 pages every 30 seconds per VM, use as a
statistical estimate. Can detect accesses by making a page invalid and
catching the trap in the VMM
QUESTION: from the scan results, how do you compute idleness?
QUESTION: how do you handle fluctuation?
ANSWER: Exponentially Weighted Moving Average:
new value = x*last sample + (1-x)*old value
How to choose x?
- high x weights towards recent values, responds quickly
- low x weights history more, takes a while to respond
QUESTION: what do you want in this situation?
ANSWER: quickly respond to needs for more memory, slowly handle a
decrease.
HOW DONE: pick the max of a slow and a fast average (plus one estimate
for the current period)
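CODE SKETCH (illustrative; the smoothing constants and the sample values
are assumptions, not the paper's parameters): slow and fast EWMAs over
the sampled active fraction, taking the max of both plus the current
period's sample so the estimate rises quickly and decays slowly.

    #include <stdio.h>

    /* EWMA update: new = x*sample + (1-x)*old.  Higher x responds faster. */
    static double ewma(double old, double sample, double x)
    {
        return x * sample + (1.0 - x) * old;
    }

    int main(void)
    {
        /* Each entry: fraction of the ~100 randomly sampled pages that were
         * touched during a 30-second period (toy numbers). */
        double samples[] = {0.2, 0.2, 0.9, 0.9, 0.3, 0.3, 0.3};
        double slow = 0.0, fast = 0.0;       /* smoothed "active" fraction */
        double x_slow = 0.2, x_fast = 0.7;   /* illustrative constants */

        for (int i = 0; i < 7; i++) {
            slow = ewma(slow, samples[i], x_slow);
            fast = ewma(fast, samples[i], x_fast);
            /* Max of slow, fast, and the current period's sample: respond
             * quickly when activity rises, decay slowly when it falls. */
            double active = slow;
            if (fast > active)       active = fast;
            if (samples[i] > active) active = samples[i];
            printf("period %d: active=%.2f idle=%.2f\n", i, active, 1 - active);
        }
        return 0;
    }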
----------------------
Other policies
- Admission control: ensure a VM only runs if there is enough memory
  (memory for all mins, memory + swap for all maxes)
- Usage levels to trigger behavior (see sketch below):
  - ballooning
  - swapping pages
  - suspending a VM
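CODE SKETCH (illustrative thresholds and state names, not ESX's real
values): map the fraction of free machine memory to a reclamation
action.

    #include <stdio.h>

    /* Map the fraction of free machine memory to a reclamation action.
     * The thresholds below are made up for the example. */
    static const char *reclaim_action(double free_frac)
    {
        if (free_frac > 0.06) return "none (plenty of free memory)";
        if (free_frac > 0.04) return "balloon guests";
        if (free_frac > 0.02) return "balloon + swap pages";
        return "swap aggressively / suspend a VM";
    }

    int main(void)
    {
        double levels[] = {0.10, 0.05, 0.03, 0.01};
        for (int i = 0; i < 4; i++)
            printf("free %.0f%% -> %s\n", levels[i] * 100,
                   reclaim_action(levels[i]));
        return 0;
    }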