« The Design and Implementation of a Log-Structured File System | Main | Design and Implementation of the Sun Network Filesystem »

A Fast File System for UNIX

Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A Fast File System for UNIX. ACM Trans. on Computer Systems 2(3), August 1984, pp. 181-197.

Reviews due Thursday, 3/22

Comments

Summary:

The paper presents a faster reimplementation of the traditional file system (circa 4.2 BSD) which offers a magnitude order of improvement due to flexible allocation policies that allow better locality of reference and adapt to peripheral and processor characteristics. Certain features like advisory locks, long file names, quotas etc are added.

Problem:

The original 512-byte Unix file system offered very poor performance (2% of the disk throughput) due to the small block size, limited read-ahead and mostly non-consecutive (essentially random) allocation of blocks for files and directories. There was a need for a faster file system which could atleast saturate the disk capacity.

Contributions:

Essentially the first working implementation of a file system that closed in on the hardware limits by intelligent block allocation policies. The authors used specialized knowledge about the hardware to improve performance.
The knowledge of the disk geometry was used to guide the design of the block allocation policy for both performance and resilience. The superblock was replicated across platters and cylinders.
Rotational layout tables were a clever way to allocate blocks so that the seek time was closely co-ordinated with processor capacity.
The upper bound on getting all the inodes for a particular cylinder group is very attractive and the clever design made it possible.
The increase in block size offered better performance but it was offset by the wastage in the small files case. The new idea here was fragments - basically variable block sizes. (Fragments present in ext2/ext3).

Flaws:

File system parameterizations are very hardware specific and evolution in hardware may prompt design changes. Disks with rotationally independent platters etc. They also briefly touch upon DMA which may potentially make some of the optimizations useless.

The preallocation of a few blocks at a time is discussed but not implemented. Also, the kernel buffers chaining for contiguous reads. (The authors point this out).

Performance:

The paper is on the vastly improved performance benefits of this file system over the traditional Unix file system. It is amazing that better algorithms improved the performance over 10 fold! The users of the time would have aggressively shifted to this file system. The contributions outlined are the primary reasons for the improved performance.

Paper Review: A Fast File System for UNIX [McKusick, et. al.]

Summary:

This paper presents the 4.2 BSD UNIX Fast File System, which includes
many improvemnts on its predecessor, the 512-byte block UNIX file system.

Problem:

The problem was that the "old" file system performed poorly, especially
as the files and directories that resided on it became very fragmented
over time because of the small block size combined with the lack of a
disk-aware layout policy, nor any reserve space set aside for locality
in the face of future allocations.

Contributions:

* FFS includes a number of performance improvements by locality; it uses
cylinder groups to keep file metadata (inodes) and data near each
other. It also attempts to keep inodes referred to in the same
directory near to each other.

* By setting aside a percentage (typically something like 5-20%) to
leave working space for future allocations, FFS maintains better
locality by reducing file fragmentation.

* FFS' increased block size had a number of benefits: dramatically
increased read performance on large by making much more of the file's
content contiguous, increased disk size support, and decreased amount
of work necessary to find all of a file's blocks.

Flaws:

* The optimizations are tuned to some characterisitics of hard disks
at the time, such as multiple platters with multiple arms and heads
positioned over each other.

* The increased block size is fixed and so big (8KB) that it
practically necessitated the introduction of a complication: the use
of "fragments" (block sub-allocations) to keep from wasting too much
space.

Performance Relevance:

The FFS improvements are suitable for improving read latencies for many
applications. Directories and file blocks can both be read at rates
closer to hardware capacity and more predictably over time by not
degrading as much because of fragmentation. It's primary competitor is
ext2, which lacks only the block sub-allocations, presumably because
ext2's varying initial block size and increased disk size (adn reduced
cost) were a reasonable, simpler alternative, some 8-10 years later.

SUMMARY
"A Fast File System for UNIX" Marshall et al. discuss the implementation of the "traditional" UNIX file system, its shortcomings, and how those were addressed in Fast FS.


PROBLEM
What features should a file system provide. How to make a file system
efficient.


CONTRIBUTIONS
* Description of the structure of traditional UNIX file system

* Identification of flaws in design and implementation of traditional UNIX FS

* Taking into account hardware specifics for when designing new FS. Offsetting cylinder group bookkeeping information to avoid storage on one platter is an especially insightful mechanism

* Taking into account typical FS usage in organization. E.g. putting files in the same directory on the same cylinder group

* New features: long file names, file locking, symbolic links that
(potentially) cross volumes, atomic rename, quotas

* Enormous performance improvement


FLAWS
* Some new features, like locking, were not convincingly motivated

* FFS still delivers only 50% of disk bandwidth

* Strives to produce a jack-of-all-trades FS. E.g. argument for improving
traditional FS performance because it doesn't work well for swapping doesn't sit well with me. Clearly a dedicated FS would be optimal for that (hindsight 20/20)

* FFS seems to alter the user interface, thus its deployment may not be
transparent

* The authors did not seem to identify the "common case". It seems that their data of typical FS usage came from a single machine.

* FS parameterization, while a cool idea, may not be that useful in practice. It would also be nice if parameter optimization would happen automatically.

* I find it very weird that CPU and memory are the bottleneck for disk
transfers.


PERFORMANCE
Storage efficiency, performance overhead, access time, read, write
performance.

Summary:
The paper talks mainly about changes made to the original unix file system in 4.2BSD version to improve throughput and reliability. It also talk about some features such as file locks, quotas, symbolic links etc that were introduced with the new file system.

Problem:
The original unix file system used a single superblock per partition, and used 512-bytes disk blocks for allocation. The storage layout used is superblock followed by inode area followed by data area. These design constraints mean 1) corruption on one superblock destroys the file system 2) data read size per transaction is not optimal 3) extreme fragmentation could result in inode and data area as files are created and deleted, resulting in unnecessory long jumps and hence less throughput. New file system was mainly aimed at solving these problems.

Contributions:
- Improved throughput by use of larger block size and intelligent storage allocation. Use of global data (including processor speed, controller capabilities) to make storage allocation decisions
- Improved reliability by replication of superblock in multiple cylinder groups. This improves the chance of recovery in the event of a drive failure
- Use of fragments to reduce the impact of wastage of disk space due to bigger block size for smaller files

Flaws:
These may not really be flaws, I may just havent understood them well.
- There hasnt been much of an explantion for choosing 4096 as the block size. Paper mentions it allows storing 2^32-byte file with 2 levels of indirection. Where did that number come from?
- There hasnt been much explanation about the decision 1 inode per 2048 bytes. Shouldnt this be based on block size and fragment size?
- 16 times performance improvement over current file system - these experiments were done with different bus types. But then it may be OK, because the old file system does not take into account the hardware capabilities.
- Book keeping information is stored at varying offsets from the beginning of cylinder group and the gap between them is used for data storage. Is it guaranteed to be multiple of block size? Or will it be a wasted space?

Relevance:
I think the paper describes simple but very efficient ways of improving file system throughput and reliability. The fact that most of the concepts are still used in file systems 20+ years later says something. The only concept that probably is not so relevant today is the use of fragments - with the disk space costing nothing, savings provided by fragments isnt worth the complications it introduces.

Summary
The paper presents several improvements to the UNIX filesystem that were released with BSD 4.2. These modifications change the implementation without changing the underlying abstraction presented to the user. The changes are such that the physical properties of disks are taken into account when creating files and when laying out the file system on disk.

Problem
The main problem here is that because of the previous policy of ignoring disk performance issues in the filesystem design, the UNIX filesystem was generally only able to provide about 2% of a disk's maximum transfer speed to applications.

Contributions
* Use of larger block size with performance justification as well as the ability to split a block into several fragments when their are several smaller files in order to avoid lots of wasted space due to the larger block size.
* Exploiting the physical properties of the disk so that blocks in the same file are stored on the same cylinder and are stored in a rotationally optimal order in terms of the spacing between blocks.
* Techniques for both exploiting localization and for spreading out unrelated data so that localization continues to be available.

Flaws
This paper was difficult to have serious problems. Even if they didn't fully solve all the problems they took an existing system and provided substantial improvements with few drawbacks.

Performance and Relevance
The performance tests that were performed show substantial justified performance. Disk access performance is still a relevant topic today and the popularity of "flash-drives" raises questions as to how best to exploit newer storage technology.

Summary
This work presents several extensions to the original UNIX file system that improve the overall performance of the system and add a few new properties such as symbolic links and long file names.

Problems Addressed
Using the original file system the disk interface was highly under utilized and high throughput was unachievable due to long seek times and data fragmentation. The file system also had no knowledge of the physical properties of the underlying hardware resulting in mismatches between the requested data and the most efficient way of accessing it from disk. A major concern of the authors was that of compatibility with old software and thus the new extensions had to support all the same functionality but with better performance.

Contributions
To increase the performance of accessing large files the minimum block size was increased to 4KB however a mechanism for allowing files to use as little as the sector size was also implemented to reduce the amount of wasted space in a system with a bunch of small files. To reduce the amount of seeking from an i-node to the data, cylinder groups were created and the i-nodes for the data contained in the cylinder group were also stored within the group. The traditional free block list was replaced by a bitmap that eliminated the problem of the free list becoming very unordered resulting in many seeks to access a large file. To improve sequential performance the cylinder group would also store information about how fast the disk spins and would use that information to determine how many blocks need to be skipped in order to provide fast sequential access. Two physically sequential blocks on disk can not always be accessed sequentially because by the time the request arrives for the next block the head has already passed over the block and the disk must make a full revolution.

Flaw
Though a performance analysis was done there was not a lot of hard data to back up their claims. It would be nice to see a few graphs or charts.

Performance
Throughput for large files and latency for small files seemed to be the main aspects of performance the authors were concerned about. These objectives were achieved by studying the problems with the original UNIX file system and modifying the parts that appeared to have the largest bottleneck.

Summary
This paper is an implementation of a Unix file system that attempts to address several well-known flaws of the existing Unix file system. (Personally), the paper was particularly interesting due to its thorough concern for acting in harmony with available hardware.

Problem
The "main" problem that this paper addresses was that the existing Unix file system could only deliver ~2% of the maximum disk bandwidth. The authors attribute this poor performance to three implementation problems: there was no real concept of "disk locality," the block size (512 bytes) was simply too small, and the free list would eventually become randomly ordered.

Contributions
Cylinder groups are a pretty nice idea. Each cylinder group contains a list of the available blocks in a certain number of consecutive cylinders. This leads to easy ways to exploit disk locality and limits the number of disk seeks while accessing a file. A nice feature of this implementation is the way redundant superblock copies are stored. Instead of storing everything at the beginning of each cylinder group, the superblock/bookkeeping is stored one track further away from the previous superblock/bookkeeping information. The data is effectively "spiraled" in the disk and makes the loss of any tracks, cylinders, or platters not as catastrophic.

The introduction of file fragments. Since it is easy to observe that disk throughput increases with block size, the obvious thought is to have large blocks. However, most files are quite small, leading to increasing amounts of wasted space as the block size increases. Fragments attempt to solve this problem by allowing small files to occupy portions of a block.

File system parameterization are also a pretty interesting idea. Informing the FS of hardware capabilities allows for different techniques or algorithms which suit the physical characteristics of a disk to be specified.

The idea behind global layout policies is also neat; attempt to spread unrelated data into different cylinder groups while attempting to localize similar/concurrently accessed data. The authors say that they accomplished this by placing inodes for all the files in a directory into the same cylinder group. The global layout routines call local allocation routines to provide free blocks that try to satisfy as much of the global policies as possible.

Flaws
I'm having a pretty hard time finding flaws in this paper! There is the problem that the authors admit to (expanding a file one fragment at a time), but other than that, I can't think of anything besides the fact that while this FS is a nice improvement, it still doesn't mitigate fragmentation.

Performance/Relevance
The performance numbers pretty much speak for themselves. The new file system was able to realize at least twice the reading and writing rates of the old file system. CPU utilization shot up, though.

Another nice thing about this file system were the functional enhancements. While perhaps along the lines of syntactic sugar, the enhancements make the new file system seem as though it's not too terribly different from modern Unix and its derivatives.

Summary
The authors introduce a file system that trades CPU overhead for increasing disk transfer rates by improving locality of similar data on the disk. A block fragmentation scheme is used to keep the file system disk overhead low.

Problem
The current file system used a free block list, which causes performance to decrease after time, because adjacent blocks in the free list are not adjacent on disk.

Contributions
The major improvement of the article was to increase the spatial locality of similar blocks on disk. This was accomplished by introducing the idea of a cylinder group. Summary information about the state of the cylinder group is kept and when possible cylinder groups are left with space to allow files within the cylinder group to grow. Bitmaps instead of free lists are used to indicate the free space in a cylinder group. The paper discusses the tradeoff of keeping similar objects in the same location, but spreading dissimilar objects over the disk. For example, inodes of files within a directory are stored in the same cylinder group, but new directories are stored in separate cylinder groups to allow them to grow.

The paper also introduced the idea of fragmenting a block in order to reduce file system internal fragmentation. A file uses as many full blocks as possible, but uses a partial block to hold the information that did not fit in the full blocks. The cylinder groups� bitmap allocation is at the granularity of fragments to allow for allocation.

A number of other improvements were also added to the file system: long file names, file locking, symbolic links, easy renames, disk quotas.

Possible Improvements
With the popularity of multi-taking today, it would have been nice to see the results of the system if more than one process is contending for the disk. The system gets its benefits from locality, but when multiple processes are contending for the disk the locality will decrease. It would be nice to see how much of the improvements remain in a contention situation.

I see increased CPU overhead as a cost of the new file system, because now the file system must search bitmaps and compute hashes to determine where to write pages as opposed to simply taking the first pages from a free list. The articles analysis of CPU simply counts the overall utilization. This is not sufficient to capture the overhead, because it is expected that utilization CPU increases as disk bandwidth increases. The processor spends less time blocked waiting for disk so utilization is bound to increase. A more precise analysis of file system CPU overhead would help me better understand the costs of the improvements.

Performance
Disk throughput was the major concern of the article. By increasing throughput the system responsiveness is also increased.

Summary:
This paper describes a replacement for the traditional Unix file system that takes into account the disk access and file creation properties in order to get increased performance and recoverability.

Problem:
The main problem that they were trying to solve was that the "old file system" was very slow. The reason was that it didn't do very intelligent allocation; it just put new data in the first block that it knew was available. Over time, this led to a very fragmented file system which mean lots of seek time. (A secondary goal was to design the system for easier recovery, and a third was to introduce a couple new features. For instance, I was extremely surprised to find that the existing system -- built for a time sharing system -- didn't support quotas. To me this seems like an incredibly glaring omission.)

Contributions:
* FFS became part of many Unixes, and I believe is still the "native" file system for Free/Open/NetBSD. Wikipedia says that it was adopted by commercial Unixes too.
* Again, it's hard to know which ideas were originally theirs, but it would be almost unthinkable to design a new file system without taking into consideration properties of the disk and file accesses, how to reduce fragmentation, etc.
* Ideas on how to group related data (file + inode in the same cluster group as the directory that file is in)
* The idea to try to keep cluster groups so that they have free space by spilling files to other cluster groups while the one they start in are still far from full
* Nowadays it would also be rare to design a filesystem that didn't try to be recoverable. There's been a lot of development since FFS in this area (journaling for instance), but it's clear that they gave this some thought with, for instance, the redundant superblock copies that are deliberately placed on separate platters or the rename system call
* Implementations of other important features such as long file names and FS locks
* Some other ideas, like having fragments, are still used by some filesystems to reduce internal fragmentation. (Even if they don't take the same approach, other still, such as NTFS, take other approaches to avoid allocating whole blocks for small files.)

Flaws
* There's still a philosophical question here as to whether the Unix bag-of-bytes model for a file is "correct". Even if it isn't though, it's still hard to get away from historical issues like this.
* There's also a philosophical question about whether advisory locks are the right choice. They claim that they chose them instead of mandatory locks because they didn't want to either introduce a new security model or provide locks that programs running as the superuser would ignore. But if that's their only reason, it's not hard to imagine other implementations; for instance when you lock a file specify if it's a hard or advisory lock, but just have root ignore the flag that says it's a hard lock. Now it's not clear that mandatory locks are actually better than advisory locks either, so there could be more here.

Performance
There were a couple things they were interested in. The main one was throughput; this is what their key measurement statistics were. They increased bandwidth by two methods. First, they read in larger blocks; up to 8KB instead of 1K. Second, they grouped related together in physically close areas of the disk platters. This second method also leads to another benefit, which is decreased latency for some operations. Finally, they were also concerned with the overhead that their filesystem imposed on the contents of the disk. They addressed this by allocating more than one file to a given block with their fragment mechanism.

Post a comment