LFS
i. Q1
1. A: needed to mention threads
2. B: any reasonable answer. Large difference: address spaces vs. per-page protection
ii. Q2:
1. A layer of indirection is not the same as just another layer; indirection means you have some choice, so you can add policy
2. Caching …
3. Policy/Mech separation: not really in Pilot; it split things along a different boundary (kernel/manager) but mechanism was in both places
iii. Q3
1. Scheduler activations: can’t reuse a thread on the server; the request comes up on an activation
2. U-Net: put A-stack in endpoint, need to handle thread scheduling on server
iv. Q4
1. A: reference + access bits. Implemented as a type-safe pointer or an index in a table (not tied to a system that implemented it) – see the sketch after this list
a. NOTE: Opal did not use random numbers alone; they include a portal in the number plus random check digits.
2. How: 2 users with different access w/o root involvement
3. Access control: need to control propagation
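NOTE: a minimal sketch in C of one way a capability could be represented as a table index plus rights bits with a check field; the struct and field names are illustrative assumptions, not the exact scheme of Opal or any other paper.

    /* Hypothetical capability layout: an index into a kernel object table,
       rights bits that control access, and an unguessable check field so
       the value cannot be forged. */
    #include <stdint.h>

    enum cap_rights { CAP_READ = 1, CAP_WRITE = 2, CAP_GRANT = 4 };

    struct capability {
        uint32_t object_index;   /* index into the kernel's object table */
        uint32_t rights;         /* subset of cap_rights */
        uint64_t check;          /* random check digits validated by the kernel */
    };

Two users with different access then simply hold two capability values for the same object_index with different rights bits, with no root involvement.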
v. Interfaces
1. Changed
a. Scheduler Activations
b. Opal
c. RPC
d. ExoKernel
e. Mach
2. Didn’t change but should have
a. VMware: ask OS for memory
b. Mach: could have used LRPC for communication instead of ports / messages
i. Fast, but subsequent re-read not as good
i. A: sequential writes -> sequential layout
i. A: what if push to app, let app ask for durability (e.g. fsync)
i. 99% of live data is blocks + indirect blocks
ii. 13% of data written is metadata blocks (inode, inode map, summaries, etc.)
i. Heuristic: when disk has been idle for 2 seconds
i. There is a cost – it may take extra blocks to update indirect blocks – but generally (if more than one block of a file is in the segment) it is not too bad
i. Replace free list with bitmap; can be done because of faster CPU
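NOTE: a minimal sketch in C of the kind of free-block bitmap scan this refers to; the function and names are made up for illustration, and the point is only that a faster CPU makes this linear scan cheap.

    /* Hypothetical free-block bitmap scan: return the first free block
       number, or -1 if the disk is full.  One bit per block; 0 = free. */
    #include <stdint.h>

    long find_free_block(const uint8_t *bitmap, long nblocks) {
        for (long b = 0; b < nblocks; b++) {
            if ((bitmap[b / 8] & (1u << (b % 8))) == 0)
                return b;                 /* bit clear => block is free */
        }
        return -1;                        /* no free blocks */
    }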
i. Has a reliability problem with crashes: moving a file between directories changes multiple things, and the system may crash in the middle
ii. At low disk utilization
i. CPU speeds getting faster relative to disk
1. QUESTION: What is implication? Can do more work per disk block to make good decisions
ii. Memory sizes increasing with CPU speed
1. More of the read traffic is satisfied in the cache
2. Writes still go to disk for reliability
3. QUESTION: Is this true?
a. On dept Linux machines, 30 reads/sec, 40 writes/sec
iii. Interesting workloads have lots of small files
i. Why synchronous I/O?
1. For metadata
2. Preserves consistency of directories, inodes, etc.
3. e.g. update free block bitmap before inode
ii. Decouple CPU from disk speed by removing need for programs to wait for disk
iii. Move the burden of ensuring durability to the application, which must request it via sync() and fsync() instead of it being the default (see the sketch after this list).
iv. QUESTION: What can be async?
1. Reads are hard to make asynchronous – the program has to wait for the data, so this would require new programs
2. Can buffer write requests
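NOTE: a minimal sketch in C of an application requesting durability itself with write() + fsync() rather than relying on synchronous file-system writes; the file name and function are made up for illustration.

    /* Buffer the write in the OS, then force it to disk only at the point
       where the application actually needs it to be durable. */
    #include <fcntl.h>
    #include <unistd.h>

    int save_record(const char *buf, size_t len) {
        int fd = open("/tmp/example.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* durability point */
        return close(fd);
    }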
i. Much cheaper than random I/O – order of magnitude more efficient
ii. FFS spreads data around
1. Inode separate from file
2. Directory separate from file
i. Lots of small files (< 8 KB)
ii. Sequential, complete (in entirety) access
iii. Average file lives < ½ day
iv. Other workloads can be handled by other mechanisms
i. Creating a file in each of two directories causes 8 random writes; half of them (the metadata) are synchronous
i. Cleaner removes duplicate, dead data from log
i. How to choose segment length? Time to write a segment should be much larger than the time to seek, so seek costs between segments can be ignored (worked numbers below)
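NOTE: worked numbers (assumed, not from the notes): with an average seek + rotation of about 10 ms and a transfer rate of about 10 MB/s, a 1 MB segment gives

    \text{overhead} = \frac{t_{\text{seek}}}{t_{\text{seek}} + t_{\text{transfer}}}
                    = \frac{10\ \text{ms}}{10\ \text{ms} + 100\ \text{ms}} \approx 9\%,

so almost all of the disk's time goes to useful transfer.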
i. Finding data on disk
ii. Making space for data as log grows
i. Inode map gives the log location (segment number, and possibly offset) for an inode number (see the sketch after this list)
ii. Gives version number (so can detect if file deleted / overwritten)
iii. Gives last access time
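NOTE: a minimal sketch in C of what an inode map entry and lookup might look like; the field names are assumptions for illustration, not LFS's exact on-disk format.

    /* Hypothetical inode map: indexed by inode number, it records where
       the current copy of the inode lives in the log, plus the version
       number and last access time mentioned above. */
    #include <stdint.h>
    #include <time.h>

    struct imap_entry {
        uint32_t segment;   /* segment that holds the current inode */
        uint32_t offset;    /* offset of the inode within that segment */
        uint32_t version;   /* bumped when the file is deleted or recreated */
        time_t   atime;     /* last access time */
    };

    /* Find where inode `ino` currently lives in the log. */
    struct imap_entry *imap_lookup(struct imap_entry *imap, uint32_t ino) {
        return &imap[ino];
    }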
i. Lots of log entries will be invalid; files deleted or overwritten
ii. Threading: leave valid data in place (like FFS) and write the log into the holes
1. Problem: degrades performance
iii. Copy and compact (like a GC): coalesce data from partially used segments into a smaller number of new segments
1. Problem: finding segments to clean?
2. Problem: when do you clean?
iv. LFS solution: break log into segments, thread between segments, copy & compact within segments
i. How to clean:
1. Read N segments, each utilization u
2. Write out N*u segments of data
3. Have N*(1-u) clean segments
ii. Record statistics about each segment in segment summary
1. Identifies each piece of information in a segment (e.g. file number and block number)
2. Summary used to determine liveness of blocks – see if the latest inode / indirect block still references this block (see the sketch after this list)
3. Version number reduces overhead for overwritten / deleted files
4. Result: no free list or block bitmap, so no consistency problems from these during recovery
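NOTE: a minimal sketch in C of how the segment summary plus the inode map could be used to decide liveness during cleaning; names and signatures are assumptions for illustration.

    /* A block recorded in the segment summary is live only if the file
       still exists at the same version and its inode (or indirect block)
       still points at this copy of the block. */
    #include <stdbool.h>
    #include <stdint.h>

    struct summary_entry {
        uint32_t ino;       /* owning file's inode number */
        uint32_t blockno;   /* which block of that file this is */
        uint32_t version;   /* file version when the block was written */
    };

    bool block_is_live(const struct summary_entry *e,
                       uint32_t cur_version,  /* current version from the inode map */
                       uint64_t cur_addr,     /* where the inode says blockno now lives */
                       uint64_t this_addr)    /* log address of the copy being examined */
    {
        if (e->version != cur_version)
            return false;              /* file was deleted or overwritten */
        return cur_addr == this_addr;  /* inode still references this copy */
    }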
iii. Policies: which segments should you clean?
1. Framework: evaluate based on write cost
a. How much is the disk busy for writing a byte of data, including cleaning overheads?
i. 1.0 == data is written exactly once, with no cleaning overhead
ii. 10 == 10x the bytes / operations are moved per byte of new data (about where FFS is)
iii. write cost == # of bytes moved to and from disk / # of bytes of new data written (see the derivation after this list)
2. Result: lower-utilization systems have lower overhead – don’t have to clean as much
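NOTE: the standard derivation behind this ratio (the accounting follows the read/write steps above):

    \text{write cost}
      = \frac{\text{segments read} + \text{live data rewritten} + \text{new data written}}
             {\text{new data written}}
      = \frac{N + Nu + N(1-u)}{N(1-u)}
      = \frac{2}{1-u} \quad (u > 0)

So u = 0.8 gives a write cost of 10, while completely empty segments (u = 0) can be reused without reading, for a cost of 1.0.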
iv. Idea 1: clean segments with low utilization
1. Get lots of free space out, because there is not much live data.
2. E.g. u=0.2: read 10 blocks, write out the 2 live ones, get 8 free
3. Result: want bimodal distribution
a. High utilization in some places for efficient storage
b. Low utilization elsewhere for efficient cleaning
v. Idea 2: hot and cold
1. Hot blocks written frequently
2. Cold blocks written infrequently
3. Goal: coalesce cold blocks into a few segments that don’t need to be cleaned
vi. Policy 1: greedy
1. Clean the lowest-utilization segments
2. Problem: utilization of cold segments drops slowly until it hovers just above the cleaning threshold
vii. Idea 3: hot vs cold
1. QUESTION: What is benefit of cleaning a cold segment?
a. High – long term space is retained, don’t need to move things around again. Once cleaned, stays clean for a long time
2. What is the benefit of cleaning a hot segment?
a. Low – it will need cleaning again soon; time spent cleaning it is wasted because its blocks were going to die anyway – might as well let more of them die first
viii. Policy 2:
1. Weight utilization by the age of the data (most recent modification time of any block in the segment) – see the formula after this list
2. Result:
a. Clean cold segments up to 75% utilization
b. Clean hot segments only at 15% utilization
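NOTE: the selection rule behind Policy 2 in the LFS paper picks the segment with the highest

    \frac{\text{benefit}}{\text{cost}} = \frac{(1-u)\cdot\text{age}}{1+u}

where u is the segment's utilization (cost 1 to read the segment plus u to rewrite its live data) and age is the most recent modification time of any block in it; this is what produces the 75% / 15% behavior above.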
i. QUESTION: How long does it take?
1. Not long – a second or so to clean a few segments
i. State-of-the-art research?
ii. Commercial products
i. QUESTION: When use synthetic micro-benchmarks?
1. For understanding performance
ii. QUESTION: When use benchmarks like LADDIS, PostMark, Andrew?
1. For comparing across published results
iii. QUESTION: When use traces?
1. For getting more representative workloads
i. Must fill a segment to be efficient; a single user may not fill it
ii. Need lots of cache to avoid slow reads
i. 34% performance drop on TPC
ii. WHY?
1. delay to do cleaning
2. lack of DB locality leads to many fairly full segments
i. Reduces cache size
ii. Can’t evict a page to make more space; need additional pages to hold its metadata for a while
i. New data layout when cleaning disks
ii. Hole-plugging at high utilization instead of cleaning
i. Writes can be hidden by delayed writes
i. People don’t use LFS; cleaning costs make disk performance unpredictable and can make it really suck
ii. Big win comes from metadata updates, not data
1. Modern OSes journal metadata and write data in a good place
iii. Big win comes from async writes; commonly used
iv. Big win comes from better choice of inode location