LFS
i. Q1
1. A: needed to mention threads
2. B: any reasonable answer. Large difference: address spaces vs. per-page protection
ii. Q2:
1. A layer of indirection is not the same as just another layer; indirection means you have some choice, so you can add policy
2. Caching …
3. Policy/Mech separation: not really in Pilot; it split things along a different boundary (kernel/manager) but mechanism was in both places
iii. Q3
1. Scheduler activations: can’t reuse a thread on the server; the request comes up on an activation
2. U-Net: put A-stack in endpoint, need to handle thread scheduling on server
iv. Q4
1. A: reference + access bits. Implemented as a type-safe pointer or an index in a table (not tied to a system that implemented it) – see the sketch after this list
a. NOTE: Opal did not use random numbers alone; they include a portal in the number plus random check digits.
2. How: 2 users with different access w/o root involvement
3. Access control: need to control propagation
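NOTE: a minimal sketch in C of one way a capability could be represented as a table index plus rights bits with a check field; the struct and field names are illustrative assumptions, not the exact scheme of Opal or any other paper.

    /* Hypothetical capability layout: an index into a kernel object table,
       rights bits that control access, and an unguessable check field so
       the value cannot be forged. */
    #include <stdint.h>

    enum cap_rights { CAP_READ = 1, CAP_WRITE = 2, CAP_GRANT = 4 };

    struct capability {
        uint32_t object_index;   /* index into the kernel's object table */
        uint32_t rights;         /* subset of cap_rights */
        uint64_t check;          /* random check digits validated by the kernel */
    };

Two users with different access then simply hold two capability values for the same object_index with different rights bits, with no root involvement.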
v. Interfaces
1. Changed
a. Scheduler Activations
b. Opal
c. RPC
d. ExoKernel
e. Mach
2. Didn’t change but should have
a. VMware: ask OS for memory
b. Mach: could have used LRPC for communication instead of ports / messages
i. Fast, but subsequent re-read not as good
i. A: sequential writes -> sequential layout
i. A: what if push to app, let app ask for durability (e.g. fsync)
i. 99% of live data is blocks + indirect blocks
ii. 13% of data written is metadata blocks (inode, inode map, summaries, etc.)
i. Heuristic: when disk has been idle for 2 seconds
i. There is a cost – it may take extra blocks to update indirect blocks – but generally (if more than one block of a file is in the segment) it is not too bad
i. Replace free list with bitmap; can be done because of faster CPU
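NOTE: a minimal sketch in C of the kind of free-block bitmap scan this refers to; the function and names are made up for illustration, and the point is only that a faster CPU makes this linear scan cheap.

    /* Hypothetical free-block bitmap scan: return the first free block
       number, or -1 if the disk is full.  One bit per block; 0 = free. */
    #include <stdint.h>

    long find_free_block(const uint8_t *bitmap, long nblocks) {
        for (long b = 0; b < nblocks; b++) {
            if ((bitmap[b / 8] & (1u << (b % 8))) == 0)
                return b;                 /* bit clear => block is free */
        }
        return -1;                        /* no free blocks */
    }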
i. Has a reliability problem with crashes: moving a file between directories changes multiple things, and the system may crash in the middle
ii. At low disk utilization
i. CPU speeds getting faster relative to disk
1. QUESTION: What is implication? Can do more work per disk block to make good decisions
ii. Memory sizes increasing with CPU speed
1. More of the read traffic is satisfied in the cache
2. Writes still go to disk for reliability
3. QUESTION: Is this true?
a. On dept Linux machines, 30 reads/sec, 40 writes/sec
iii. Interesting workloads have lots of small files
i. Why synchronous I/O?
1. For metadata
2. Preserves consistency of directories, inodes, etc.
3. e.g. update free block bitmap before inode
ii. Decouple CPU from disk speed by removing need for programs to wait for disk
iii. Move the burden of ensuring durability to the application, which must request it via sync() and fsync() instead of it being the default (see the sketch after this list).
iv. QUESTION: What can be async?
1. Reads are hard to make asynchronous – the program has to wait for the data, so this would require new programs
2. Can buffer write requests
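NOTE: a minimal sketch in C of an application requesting durability itself with write() + fsync() rather than relying on synchronous file-system writes; the file name and function are made up for illustration.

    /* Buffer the write in the OS, then force it to disk only at the point
       where the application actually needs it to be durable. */
    #include <fcntl.h>
    #include <unistd.h>

    int save_record(const char *buf, size_t len) {
        int fd = open("/tmp/example.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* durability point */
        return close(fd);
    }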
i. Much cheaper than random I/O – order of magnitude more efficient
ii. FFS spreads data around
1. Inode separate from file
2. Directory separate from file
i. Lots of small files (< 8 KB)
ii. Sequential, complete (in entirety) access
iii. Average file lives < ½ day
iv. Other workloads can be handled by other mechanisms
i. Creating a file in each of two directories causes 8 random writes; half of them (the metadata) are synchronous
i. Cleaner removes duplicate, dead data from log
i. How to choose segment length? Time to write a segment should be much larger than the time to seek, so seek costs between segments can be ignored (worked numbers below)
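NOTE: worked numbers (assumed, not from the notes): with an average seek + rotation of about 10 ms and a transfer rate of about 10 MB/s, a 1 MB segment gives

    \text{overhead} = \frac{t_{\text{seek}}}{t_{\text{seek}} + t_{\text{transfer}}}
                    = \frac{10\ \text{ms}}{10\ \text{ms} + 100\ \text{ms}} \approx 9\%,

so almost all of the disk's time goes to useful transfer.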
i. Finding data on disk
ii. Making space for data as log grows
i. Inode map gives the log location (segment number, and possibly offset) for an inode number (see the sketch after this list)
ii. Gives version number (so can detect if file deleted / overwritten)
iii. Gives last access time
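NOTE: a minimal sketch in C of what an inode map entry and lookup might look like; the field names are assumptions for illustration, not LFS's exact on-disk format.

    /* Hypothetical inode map: indexed by inode number, it records where
       the current copy of the inode lives in the log, plus the version
       number and last access time mentioned above. */
    #include <stdint.h>
    #include <time.h>

    struct imap_entry {
        uint32_t segment;   /* segment that holds the current inode */
        uint32_t offset;    /* offset of the inode within that segment */
        uint32_t version;   /* bumped when the file is deleted or recreated */
        time_t   atime;     /* last access time */
    };

    /* Find where inode `ino` currently lives in the log. */
    struct imap_entry *imap_lookup(struct imap_entry *imap, uint32_t ino) {
        return &imap[ino];
    }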
i. Lots of log entries will be invalid; files deleted or overwritten
ii. Threading: leave valid data in place (like FFS) and write the log into the holes
1. Problem: degrades performance
iii. Copy and compact (like a GC): coalesce data from partially used segments into a smaller number of new segments
1. Problem: finding segments to clean?
2. Problem: when do you clean?
iv. LFS solution: break log into segments, thread between segments, copy & compact within segments
i. How to clean:
1. Read N segments, each utilization u
2. Write out N*u segments of data
3. Have N*(1-u) clean segments
ii. Record statistics about each segment in segment summary
1. Identifies each piece of information in a segment (e.g. file number and block number)
2. Summary used to determine liveness of blocks – see if the latest inode / indirect block still references this block (see the sketch after this list)
3. Version number reduces overhead for overwritten / deleted files
4. Result: no free list or block bitmap, so no consistency problems from these during recovery
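NOTE: a minimal sketch in C of how the segment summary plus the inode map could be used to decide liveness during cleaning; names and signatures are assumptions for illustration.

    /* A block recorded in the segment summary is live only if the file
       still exists at the same version and its inode (or indirect block)
       still points at this copy of the block. */
    #include <stdbool.h>
    #include <stdint.h>

    struct summary_entry {
        uint32_t ino;       /* owning file's inode number */
        uint32_t blockno;   /* which block of that file this is */
        uint32_t version;   /* file version when the block was written */
    };

    bool block_is_live(const struct summary_entry *e,
                       uint32_t cur_version,  /* current version from the inode map */
                       uint64_t cur_addr,     /* where the inode says blockno now lives */
                       uint64_t this_addr)    /* log address of the copy being examined */
    {
        if (e->version != cur_version)
            return false;              /* file was deleted or overwritten */
        return cur_addr == this_addr;  /* inode still references this copy */
    }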
iii. Policies: which segments should you clean?
1. Framework: evaluate based on write cost
a. How much is the disk busy for writing a byte of data, including cleaning overheads?
i. 1.0 == data is written exactly once, with no cleaning overhead
ii. 10 == 10x the bytes / operations are moved per byte of new data (about where FFS is)
iii. write cost == # of bytes moved to and from disk / # of bytes of new data written (see the derivation after this list)
2. Result: lower-utilization systems have lower overhead – don’t have to clean as much
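NOTE: the standard derivation behind this ratio (the accounting follows the read/write steps above):

    \text{write cost}
      = \frac{\text{segments read} + \text{live data rewritten} + \text{new data written}}
             {\text{new data written}}
      = \frac{N + Nu + N(1-u)}{N(1-u)}
      = \frac{2}{1-u} \quad (u > 0)

So u = 0.8 gives a write cost of 10, while completely empty segments (u = 0) can be reused without reading, for a cost of 1.0.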
iv. Idea 1: clean segments with low utilization
1. Get lots of free space out, because there is not much live data.
2. E.g. u=0.2: read 10 blocks, write out the 2 live ones, get 8 free
3. Result: want bimodal distribution
a. High utilization in some places for efficient storage
b. Low utilization elsewhere for efficient cleaning
v. Idea 2: hot and cold
1. Hot blocks written frequently
2. Cold blocks written infrequently
3. Goal: coalesce cold blocks into a few segments that don’t need to be cleaned
vi. Policy 1: greedy
1. Clean the lowest-utilization segments
2. Problem: utilization of cold segments drops slowly until it hovers just above the cleaning threshold
vii. Idea 3: hot vs cold
1. QUESTION: What is benefit of cleaning a cold segment?
a. High – long term space is retained, don’t need to move things around again. Once cleaned, stays clean for a long time
2. What is the benefit of cleaning a hot segment?
a. Low – it will need cleaning again soon; time spent cleaning it is wasted because its blocks were going to die anyway – might as well let more of them die first
viii. Policy 2:
1. Weight utilization by the age of the data (most recent modification time of any block in the segment) – see the formula after this list
2. Result:
a. Clean cold segments up to 75% utilization
b. Clean hot segments only at 15% utilization
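NOTE: the selection rule behind Policy 2 in the LFS paper picks the segment with the highest

    \frac{\text{benefit}}{\text{cost}} = \frac{(1-u)\cdot\text{age}}{1+u}

where u is the segment's utilization (cost 1 to read the segment plus u to rewrite its live data) and age is the most recent modification time of any block in it; this is what produces the 75% / 15% behavior above.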
i. QUESTION: How long does it take?
1. Not long – a second or so to clean a few segments
i. State-of-the-art research?
ii. Commercial products
i. QUESTION: When use synthetic micro-benchmarks?
1. For understanding performance
ii. QUESTION: When use benchmarks like LADDIS, PostMark, Andrew?
1. For comparing across published results
iii. QUESTION: When use traces?
1. For getting more representative workloads
i. Must fill a segment to be efficient; a single user may not fill it
ii. Need lots of cache to avoid slow reads
i. 34% performance drop on TPC
ii. WHY?
1. delay to do cleaning
2. lack of DB locality leads to many fairly full segments
i. Reduces cache size
ii. Can’t evict a page to make more space; need additional pages to hold its metadata for a while
i. New data layout when cleaning disks
ii. Hole-plugging at high utilization instead of cleaning
i. Writes can be hidden by delayed writes
i. People don’t use LFS; cleaning costs make disk performance unpredictable and can make it really suck
ii. Big win comes from metadata updates, not data
1. Modern OSes journal metadata and write data in a good place
iii. Big win comes from async writes; commonly used
iv. Big win comes from better choice of inode location