LFS

 

  1. Exam
    1. Mean/median: 67
    2. Many people skipped question 3 (composing mechanisms)
    3. Notes on questions

      i. Q1
        1. A: needed to mention threads
        2. B: any reasonable answer. Large difference: address spaces vs. per-page protection
      ii. Q2
        1. A layer of indirection is not the same as a layer; indirection means you have some choice and can add policy
        2. Caching …
        3. Policy/mechanism separation: not really in Pilot; it split things along a different boundary (kernel/manager), but mechanism was in both places
      iii. Q3
        1. Scheduler activations: can't reuse a thread on the server; the request comes up on an activation
        2. U-Net: put the A-stack in the endpoint; need to handle thread scheduling on the server
      iv. Q4
        1. A: reference + access bits. Implemented as a type-safe pointer or an index into a table (not a system that implemented it); see the sketch after this question
        2. How: two users with different access without root involvement
        3. Access control: need to control propagation
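
A hedged sketch of the Q4 answer (a capability as a reference plus access bits, implemented as an index into a table); the table, rights bits, and names below are my own illustration, not any of the systems discussed:

```python
from dataclasses import dataclass

READ, WRITE = 0x1, 0x2                     # access bits

@dataclass
class Capability:
    index: int                             # slot in a kernel-held object table
    rights: int                            # bitmask of allowed operations

object_table = {0: "file-A", 1: "file-B"}  # hypothetical protected objects

def invoke(cap: Capability, op: int):
    if not (cap.rights & op):
        raise PermissionError("capability lacks this right")
    return object_table[cap.index]         # only the kernel dereferences the index

# Two users can hold capabilities for the same object with different rights,
# without root involvement: Capability(0, READ) vs. Capability(0, READ | WRITE).
```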

      v. Interfaces
        1. Changed
          a. Scheduler Activations
          b. Opal
          c. RPC
          d. ExoKernel
          e. Mach
        2. Didn't change but should have
          a. VMware: ask OS for memory
          b. Mach: could have used LRPC for communication instead of ports / messages

  2. Review notes:
    1. Everybody doing fine. Will get credit proportional to the number of reviews you do.
  3. Questions
    1. Cleaning makes it hard to mount on other systems? (Aaron)
    2. Impact on middle-of-file writes?
      i. Fast, but subsequent re-reads are not as good
    3. How could perf. be as good as AFS?
      i. A: sequential writes -> sequential layout
    4. Assume crashes are infrequent?
      i. A: what if we push it to the app and let the app ask for durability (e.g. fsync)?
    5. Capacity overhead to store extra inodes, etc.?
      i. 99% of live data is data blocks + indirect blocks
      ii. 13% of data written is metadata blocks (inodes, inode map, summaries, etc.)
    6. When should cleaning take place?
      i. Heuristic: when the disk has been idle for 2 seconds
    7. Write cost of updating inode pointers?
      i. There is a cost; it may take extra blocks to update indirect blocks, but generally (if more than one block of a file is in a segment) it is not too bad

  4. Reminder on project
  5. Context
    1. UFS -> FFS:
      i. Replace the free list with a bitmap; can be done because of faster CPUs (see the sketch below)
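
A minimal sketch (my own illustration) of why the bitmap is worth the extra CPU work: unlike a free list, it lets the allocator pick a free block near a desired location.

```python
NUM_BLOCKS = 65536
free = bytearray(b"\x01") * NUM_BLOCKS     # 1 = free, 0 = allocated

def alloc_near(goal: int) -> int:
    # Scan outward from the goal for the nearest free block; this scan is the
    # CPU cost that faster processors made affordable.
    for dist in range(NUM_BLOCKS):
        for b in (goal + dist, goal - dist):
            if 0 <= b < NUM_BLOCKS and free[b]:
                free[b] = 0
                return b
    raise OSError("disk full")
```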

    2. Standard mechanism: static layout, update in place, fixed inode locations
      i. Has reliability problems with crashes, because moving a file between directories causes multiple things to be changed; the system may crash in the middle
      ii. At low disk utilization

  6. Opportunities:
    1. QUESTION: What were the technology trends enabling this?
      i. CPU speeds getting faster relative to disk
        1. QUESTION: What is the implication? Can do more work per disk block to make good decisions
      ii. Memory sizes increasing with CPU speed
        1. More of the read traffic is satisfied in memory
        2. Writes still go to disk for reliability
        3. QUESTION: Is this true?
          a. On dept. Linux machines: 30 reads/sec, 40 writes/sec
      iii. Interesting workloads have lots of small files
    2. Asynchronous I/O
      i. Why synchronous I/O?
        1. For metadata
        2. Preserves consistency of directories, inodes, etc.
        3. E.g. update the free-block bitmap before the inode
      ii. Decouple the CPU from disk speed by removing the need for programs to wait for the disk
      iii. Move the burden of ensuring durability to the application, which must request it via sync() and fsync() instead of it being the default (see the sketch after this list)
      iv. QUESTION: What can be async?
        1. Reads are hard: the program has to wait for the read to complete, so making reads async requires new programs
        2. Write requests can be buffered
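
A minimal sketch of the "durability on request" model above (my own illustration; the file name is hypothetical): writes go into the buffer cache asynchronously, and only the explicit fsync() forces them to disk.

```python
import os

fd = os.open("journal.dat", os.O_WRONLY | os.O_CREAT, 0o644)
for record in [b"update-1\n", b"update-2\n"]:
    os.write(fd, record)   # asynchronous from the disk's point of view: buffered
os.fsync(fd)               # the application chooses the durability point
os.close(fd)
```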

    3. Sequential I/O
      i. Much cheaper than random I/O; an order of magnitude more efficient
      ii. FFS spreads data around
        1. Inode separate from file
        2. Directory separate from file
    4. Engineering workloads
      i. Lots of small files (< 8 KB)
      ii. Sequential, complete (in their entirety) access
      iii. Average file lives < ½ day
      iv. Other workloads can be handled by other mechanisms
    5. FFS/UFS bad for metadata operations
      i. Creating a file in two directories causes 8 random writes; half of them (metadata) are synchronous

  7. LFS Storage Layout
    1. Treat the disk as an infinitely long log
      i. The cleaner removes duplicate, dead data from the log
    2. Write data in units of a segment
      i. How to choose the length? Time to write >> time to seek, so seek costs between segments can be ignored
    3. Buffer data in memory into a segment, then write it all at once (see the sketch after this list)
    4. Issues:
      i. Finding data on disk
      ii. Making space for data as the log grows
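
A minimal sketch of segment buffering (my own illustration, not Sprite LFS code). With an assumed ~10 ms of seek + rotation and ~10 MB/s of transfer, a 1 MB segment takes ~100 ms to write, so the per-segment seek is only about 10% overhead.

```python
SEGMENT_SIZE = 1 << 20          # 1 MB, the segment size used in the paper

class SegmentWriter:
    def __init__(self, disk_file):
        self.disk = disk_file   # an open binary file standing in for the disk
        self.buffer = bytearray()

    def append(self, block: bytes):
        self.buffer.extend(block)
        if len(self.buffer) >= SEGMENT_SIZE:
            self.flush()

    def flush(self):
        if self.buffer:
            self.disk.write(self.buffer)   # one large sequential write per segment
            self.disk.flush()
            self.buffer = bytearray()
```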

  8. Finding Data
    1. FFS: can calculate an inode's position from its inode number
    2. LFS: inodes are not at fixed locations; they show up in log segments when the file is written
    3. Solution: a layer of indirection mapping inode numbers to segments (see the sketch after this list)
      i. The inode map gives the log location (segment number, maybe offset) for an inode number
      ii. Gives a version number (so deleted/overwritten files can be detected)
      iii. Gives the last access time
    4. When reading, can read whole segments or just blocks (segments act as prefetching)
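
A minimal sketch of the inode-map indirection (my own illustration; the field names are mine):

```python
from dataclasses import dataclass

@dataclass
class ImapEntry:
    segment: int    # segment that currently holds the inode
    offset: int     # offset of the inode within that segment
    version: int    # bumped when the file is deleted or overwritten
    atime: float    # last access time

inode_map = {}      # inode number -> ImapEntry, cached in memory

def locate_inode(inum):
    e = inode_map[inum]            # one extra level of indirection,
    return (e.segment, e.offset)   # vs. FFS's fixed, computed address
```
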
  9. LFS Cleaning
    1. What do you do when you run out of space?
      i. Lots of log entries will be invalid: files deleted or overwritten
      ii. Threading: leave valid data in place (like FFS) and write the log into the holes
        1. Problem: degrades performance
      iii. Copy and compact (like a GC): coalesce data from partially used segments into a smaller number of new segments
        1. Problem: finding segments to clean
        2. Problem: when do you clean?
      iv. LFS solution: break the log into segments, thread between segments, copy & compact within segments
    2. Choosing what to clean
      i. How to clean:
        1. Read N segments, each with utilization u
        2. Write out N*u segments of live data
        3. Now have N*(1-u) clean segments
      ii. Record statistics about each segment in a segment summary (see the sketch after this item)
        1. Identifies each piece of information in a segment (e.g. file number and block number)
        2. The summary is used to determine the liveness of blocks: check whether the latest inode / indirect block still references this block
        3. The version number reduces overhead for overwritten/deleted files
        4. Result: no free list or block bitmap, and no consistency problems from them during recovery
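
A minimal sketch of the liveness check the cleaner performs (my own illustration; the data structures are hypothetical stand-ins for the segment summary, inode map, and inode):

```python
def block_is_live(summary_entry, inode_map, read_inode):
    # The summary records, for each block: owning inode number, offset of the
    # block within that file, the file's version when written, and the block's
    # disk address.
    inum, file_block, version, disk_addr = summary_entry
    imap = inode_map.get(inum)
    if imap is None or imap.version != version:
        return False                    # file deleted or recreated since: dead
    inode = read_inode(inum)            # walk the inode / indirect blocks
    return inode.block_address(file_block) == disk_addr   # still referenced?
```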

      iii. Policies: which segments should you clean?
        1. Framework: evaluate based on write cost (see the sketch below)
          a. How long is the disk busy per byte of new data written, including cleaning overhead?
            i. 1.0 = data is written exactly once
            ii. 10 = read/write 10x the bytes/operations (about where FFS is)
            iii. Write cost = (# of bytes moved to and from disk) / (# of bytes of new data written)
        2. Result: lower-utilization systems have lower overhead; they don't have to clean as much
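
A small worked version of the write-cost formula above for copy-and-compact cleaning (the function name is mine): the cleaner reads N segments at utilization u and writes back N*u segments of live data, and the new data fills the N*(1-u) freed segments, so write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1-u).

```python
def write_cost(u: float) -> float:
    # Bytes moved to and from disk per byte of new data written.
    assert 0.0 <= u < 1.0
    return 2.0 / (1.0 - u)

# write_cost(0.2) == 2.5; write_cost(0.8) == 10.0, about where FFS sits
```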

      iv. Idea 1: clean segments with low utilization
        1. Get a lot of free space back, because there is not much live data
        2. E.g. u=0.2: read 10 segments, write out 2 segments of live data, get 8 free
        3. Result: want a bimodal distribution of segment utilization
          a. High utilization in some segments for efficient storage
          b. Low utilization elsewhere for efficient cleaning
      v. Idea 2: hot and cold
        1. Hot blocks are written frequently
        2. Cold blocks are written infrequently
        3. Goal: coalesce cold blocks into a few segments that don't need to be cleaned
      vi. Policy 1: greedy
        1. Clean the lowest-utilization segments
        2. Problem: the utilization of cold segments drops slowly until it hovers just above the cleaning threshold
      vii. Idea 3: hot vs. cold
        1. QUESTION: What is the benefit of cleaning a cold segment?
          a. High: the reclaimed space is retained long term and doesn't need to be moved around again; once cleaned, it stays clean for a long time
        2. QUESTION: What is the benefit of cleaning a hot segment?
          a. Low: it will need to be cleaned again soon; the cleaning is wasted because the data was going to die anyway, so might as well let more blocks die first
      viii. Policy 2: cost-benefit (see the sketch after this list)
        1. Weight utilization by the age of the data (the most recent modification time of any block in the segment)
        2. Result:
          a. Cold segments are cleaned at up to ~75% utilization
          b. Hot segments are cleaned only at ~15% utilization
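
A minimal sketch contrasting the two policies (my own illustration; the cost-benefit ratio (1 - u) * age / (1 + u) is the one from the LFS paper, where u is the segment's utilization and age is the time since the most recent modification of any block in it):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    utilization: float   # fraction of live bytes, 0.0 .. 1.0
    age: float           # time since the newest block in the segment was written

def pick_greedy(segments):
    # Policy 1: always clean the emptiest segment.
    return min(segments, key=lambda s: s.utilization)

def pick_cost_benefit(segments):
    # Policy 2: prefer old (cold), fairly empty segments; hot segments must get
    # much emptier before they are worth cleaning.
    return max(segments, key=lambda s: (1.0 - s.utilization) * s.age / (1.0 + s.utilization))
```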

    3. When to clean? (see the sketch after this item)
      i. QUESTION: How long does it take?
        1. Not long; a second or so to clean a few segments
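
A minimal sketch of the idle-triggered cleaner (my own illustration; the 2-second threshold is from the notes, the low-water mark and names are hypothetical):

```python
import time

IDLE_THRESHOLD = 2.0        # seconds of disk idleness before cleaning
MIN_CLEAN_SEGMENTS = 8      # hypothetical low-water mark on clean segments

def maybe_clean(last_disk_activity, clean_segments, clean_one_segment):
    idle = time.monotonic() - last_disk_activity
    if idle >= IDLE_THRESHOLD or clean_segments < MIN_CLEAN_SEGMENTS:
        clean_one_segment()   # on the order of a second for a few segments
```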


  10. Big picture ideas
    1. Use known locality, at write time, to drive layout, rather than predicted locality (within a file and within a directory) as FFS does
    2. Separate writing to disk from long-term layout (e.g. cleaning)
    3. Take advantage of idle cycles for cleaning so large bursts can be handled
    4. Summarize information (e.g. segment summaries) for performance
    5. Take advantage of dynamic (run-time) locality instead of static (file-system layout) locality
    6. Lay out metadata contiguously with data (e.g. the inode next to its data and indirect blocks)
  11. Evaluation
    1. QUESTION: What should you compare against?
      i. State-of-the-art research?
      ii. Commercial products?
    2. QUESTION: What workloads?
      i. QUESTION: When to use synthetic micro-benchmarks?
        1. For understanding performance
      ii. QUESTION: When to use benchmarks like LADDIS, PostMark, Andrew?
        1. For comparing across published results
      iii. QUESTION: When to use traces?
        1. For getting more representative workloads


  12. Issues
    1. LFS is very good for metadata operations: create/delete (3-4x FFS)
    2. Truly random access has bad performance
    3. LFS has high CPU utilization due to the extra data structures and cleaning
    4. Depends on lots of memory and multiple users
      i. Must fill a segment to be efficient; a single user may not fill it
      ii. Needs a large cache to avoid slow reads
    5. The cleaner can cause problems for a busy system at 80% utilization; cleaning is synchronous and blocks other work
      i. 34% performance drop on TPC
      ii. WHY?
        1. Delay to do the cleaning
        2. Lack of DB locality leads to many fairly full segments
    6. Generally the cleaner can run in the background whenever there are 2 seconds of idle time; it pretty much never causes a disruption for engineering workloads
    7. LFS uses a lot of memory: the cache + 4 segments for moving data & cleaning
      i. Reduces cache size
      ii. Can't evict a page to make more space; additional pages are needed to hold its metadata for a while
    8. Segment size: why 1 MB? The best size depends on both seek time and transfer speed
    9. Opportunities:
      i. New data layout when cleaning disks
      ii. Hole-plugging at high utilization instead of cleaning
    10. Long-term trend: disks are pretty cheap. Disk performance is dominated by disk read times even with a cache (because reads take so much longer)
      i. Writes can be hidden by delaying them
    11. Overall impact:
      i. People don't use LFS; cleaning costs make disk performance unpredictable and can make it really suffer
      ii. The big win comes from metadata updates, not data
        1. A modern OS journals metadata and writes data in a good place
      iii. A big win comes from async writes; commonly used
      iv. A big win comes from a better choice of inode location