
The Zettabyte File System


Reviews due Tuesday, 11/11.

Comments

Sun's ZFS is a from-scratch rewrite of the filesystem that claims to complete the transition from "legacy" file systems to modern ones. Better user-level tools, a more flexible storage management system, and a clever atomic update method all make availability the primary concern, with performance relegated to an uninteresting requirement (not sacrificed so much as assumed). Interestingly (and perhaps uniquely), the people who benefit most from this filesystem are not end-users but the administrators who are directly affected by the manageability of the system. Compared with the writing in the FFS paper, the authors here seek a completely different audience: not the end-user, but the system administrator, who is freed from the tangle of file system maintenance utilities that are taken for granted today.

Nothing in this filesystem is particularly inventive (the authors say so explicitly). It is, however, the first combination of these features in a single (and very compact) implementation, deployed on general-purpose machines. Perhaps a more interesting contribution of this paper is the author's philosophy and goals for the new system: he cites a change of mind in the systems community, arguing that a new file system should make greater availability, rather than performance, its primary contribution. At the same time, Bonwick regards performance as a "requirement," which makes sense given that we do need fast access to data. His writing suggests to me that there is a budget of performance the world can afford to pay in exchange for greater availability, because performance in general is quite reasonable today. So long as a new system's performance is neither significantly better nor significantly worse, there is no reason to report on it.

Bonwick is so confident that availability is more interesting than performance, in fact, that he presents his evaluation in the form of a conversation with the file system. This format has long since proven effective at communicating "what is exciting" to a programmer or systems administrator, as evidenced by the number of blog posts on the internet that follow it. Being someone who both hates and fears systems administration tasks (if they don't work right, I have little idea what to do), I find this demonstration compelling, and most people should too, so long as they accept that this kind of result is appropriate to a published paper. The easiest argument to the contrary is that one can always wrap bad tools in better ones. I share Bonwick's position: such a wrapped tool will eventually come undone and expose, or require, the bad tools beneath it. Better tools are the result of a better underlying design, in my opinion, and so I stand by the validity of this paper.

There is at least one technical issue created by the trade-offs in ZFS's design. Bonwick points out that the copy-on-write tree is used to ensure consistency and atomicity of change at the expense of update time (perhaps "a few seconds"). I think there is an additional, much more problematic effect of this trade-off: it creates a larger synchronization problem in the filesystem. Consider what happens when two files are updated in different branches descending from the root. These changes ripple up the tree of nodes until their paths converge, at which point both changes must be preserved without a race and without losing either one. For two files, fine, this works. For two hundred files at each collision (on a very busy machine), I would imagine the contention over this sort of change would bear down on the system. One might argue this isn't a problem today, but massively parallel systems are coming to the desktop (so it is generally agreed in the community) and have already arrived in the high-performance-computing world. I worry that this revolution will be a problem for ZFS because of its tree structure.
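To make the concern concrete, here is a toy model (my own sketch in C, not ZFS's code) of copy-on-write path updates: every leaf update copies its ancestors up to the root, and all updaters ultimately serialize on publishing a new root pointer.

```c
#include <stdlib.h>
#include <pthread.h>

typedef struct node {
    struct node *child[2];
    /* ... block payload, checksum, etc. ... */
} node;

static node *root;                /* stands in for the uberblock pointer */
static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy the path from n down to the leaf selected by `path`, sharing
 * every untouched subtree, and return the new copy of n. */
static node *cow_update(node *n, unsigned path, int depth) {
    node *copy = malloc(sizeof *copy);
    *copy = *n;                                  /* share untouched children */
    if (depth > 0) {
        int dir = (path >> (depth - 1)) & 1;
        copy->child[dir] = cow_update(n->child[dir], path, depth - 1);
    }
    /* depth == 0: this is the leaf; the caller fills in new data */
    return copy;
}

void update_leaf(unsigned path, int depth) {
    /* Every concurrent update funnels through here: read the current
     * root, copy one path, atomically publish a new root. */
    pthread_mutex_lock(&root_lock);
    root = cow_update(root, path, depth);
    pthread_mutex_unlock(&root_lock);
}
```

In fairness, the batching of many changes into a single transaction group (mentioned elsewhere in the paper) amortizes exactly this cost: the root is rewritten once per group rather than once per file update.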

In fact, as much as I support the measurement of availability rather than performance, I think the next work to come out of ZFS should be a study and explanation of its behavior under contention from parallel threads. It isn't clear to me that anyone has produced an excellent way of measuring this, so that in itself would be a valuable contribution and an interesting read. Lastly, there is the question of how ZFS fits into the increasingly popular idea of "cloud computing," where local storage becomes less important. Having spoken to him, I know Bonwick himself is skeptical about the possibility of good distributed file systems on all points, including their correctness and their performance. I believe, however, that given ZFS's existing concepts of a storage pool and a manager, the same techniques could be lifted to treat nodes of storage on a network rather than just disks within one connected machine.

THE ZETTABYTE FILE SYSTEM

SUMMARY
This paper introduces ZFS, a new file system whose basic characteristics are strong data integrity, simple administration, and large capacity. The authors propose a different structure for file systems based on pooled storage, where different file systems can share the same storage devices in a virtualized way. File system sizes are not fixed and can be modified dynamically through the SPA, and the pool's storage capacity can also be modified dynamically without reformatting. The state of the storage device is always consistent thanks to transactions, and error detection and correction are based on checksums.

PROBLEM
The problem they are trying to solve is how to make it easier for the personal user to administer his file systems and storage devices, and how to make file systems more reliable and flexible.

CONTRIBUTIONS
They separate block allocation from the file system; this task now belongs to the Storage Pool Allocator (SPA). The SPA translates blocks on the storage devices into virtual addresses, dynamically allocating and deallocating physical blocks in response to malloc()- and free()-style requests for virtual blocks. With the SPA, different file systems can share one pool of storage, file system sizes can change dynamically, and the set of storage devices in the system can change dynamically as well.
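A minimal sketch of that interface, with invented names and structures (not ZFS's real API): the file system sees only virtual block addresses, while the allocator privately decides which device and offset back each one.

```c
#include <stdint.h>

typedef uint64_t vaddr_t;                 /* virtual block address */

struct phys_loc { int dev; uint64_t off; };

struct spa {
    struct phys_loc map[1 << 20];         /* vaddr -> (device, offset) */
    vaddr_t next;                         /* naive bump allocator */
    int ndevs;
};

uint64_t dev_alloc_block(int dev);        /* device-local allocation */

/* "malloc" a block: pick a physical home, hand back only a vaddr. */
vaddr_t spa_alloc(struct spa *spa) {
    vaddr_t v = spa->next++;
    int dev = (int)(v % spa->ndevs);      /* any placement policy works */
    spa->map[v] = (struct phys_loc){ dev, dev_alloc_block(dev) };
    return v;
}

/* "free" a block: physical space is reclaimed; the file system above
 * never learns (or cares) where the data actually lived. */
void spa_free(struct spa *spa, vaddr_t v) {
    spa->map[v].dev = -1;
}
```

Because the file system sees only virtual addresses, devices can be added (growing ndevs) or data migrated (rewriting map entries) without the layers above noticing.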

Strong data integrity is obtained through an error detection mechanism based on maintaining checksums of all the data on the device, together with a disk-consistency mechanism based on transactions. This avoids mechanisms such as fsck, which performs very badly on large file systems.

The Data Management Unit (DMU) exports objects, which generalize traditional files. A file system is a dataset of objects. Each object is identified by a 64-bit number and can hold up to 2^64 bytes of data. The ZPL makes these objects look like a traditional POSIX file system, giving compatibility with UNIX systems.

FLAWS
Does this really make sense for the personal user? Is the personal user ever going to use such big storage capacities? Is always-consistent data that interesting for the personal user, and does it matter if a few bytes get corrupted? Is a personal user going to have multiple file systems on his disks and make lots of modifications to them? The use of quotas also seems unnecessary for this audience.

Performance is assumed, but never compared to any other file system currently available. How do the transactions affect performance? How does the dynamic growing and shrinking of file systems affect performance? How long does it take to adjust the file system after a storage device is removed? What is the difference in time between creating a regular file system and a ZFS?

Always keeping the on-disk data consistent by using transactions for all operations on the file system will make you lose information in the case of a system crash. To solve this problem they use a log at the ZFS POSIX Layer (ZPL), but they do not specify what recovery mechanisms are used with this log or how long recovery would take.
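For what it's worth, intent-log recovery usually takes the following shape; this is an assumption about how the ZPL log could be replayed, not a description of ZFS's actual mechanism.

```c
#include <stddef.h>
#include <stdint.h>

struct log_entry {
    uint64_t txg;      /* transaction the operation belonged to */
    int      op;       /* create, write, rename, ... */
    /* ... operation arguments ... */
};

struct log_entry *next_log_entry(void);   /* illustrative log iterator */
void apply(const struct log_entry *e);    /* redo the operation */

void replay_intent_log(uint64_t last_committed_txg) {
    struct log_entry *e;
    while ((e = next_log_entry()) != NULL)
        if (e->txg > last_committed_txg)  /* never made it to disk: redo */
            apply(e);
}
```

Under a scheme like this, recovery time is proportional to the amount of log written since the last committed transaction, not to the size of the file system.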

PERFORMANCE
They obtain performance by using vdevs, which can implement dynamic striping, speeding up writes back to disk, the most expensive operations in a file system. Blocks on the physical devices are allocated by the SPA using a round-robin algorithm (a sketch follows below).
They sacrifice some performance for consistency, in the checksums and in the transactions. They also accept implementation complexity in ZFS to reduce the complexity of using it.
Transactions are a popular mechanism when multiple threads access shared memory. Virtualization is also used in main memory: it is the way address spaces are separated from physical memory.
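A sketch of the round-robin placement idea, with invented names; a real allocator would presumably also weigh free space and device bandwidth, which this toy ignores.

```c
struct pool {
    int ndevs;        /* grows when a disk is added */
    int next_dev;     /* round-robin cursor */
};

/* Choose the device for the next block write. */
int pick_device(struct pool *p) {
    int dev = p->next_dev;
    p->next_dev = (p->next_dev + 1) % p->ndevs;
    return dev;
}

/* Adding a disk just widens the rotation: subsequent writes stripe
 * onto the new device with no reconfiguration by the file system. */
void add_device(struct pool *p) {
    p->ndevs++;
}
```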

Summary

Jeff Bonwick et al. describe "The Zettabyte File System," a new approach to local file systems that attempts to simplify administration, improve reliability, deliver good performance, and allow for immense capacity.

Problem

While modern file systems tend to provide a reasonably acceptable level of service, they are by no means perfect. Configuration requires an oracle to ensure that future usage is properly planned for, and if a change is ever required, the reconfiguration can be excruciating to accomplish. If corruption occurs, repairing the file system may take considerable time, a problem that is likely to be exacerbated as file systems grow in size, assuming, of course, that current systems can support this new size.

Contributions

· Making configuration and management of the file system more dynamic: additional storage can be added to a pool (via the storage pool allocator) and dynamically used by all the file systems assigned to that pool, making setup and management easier.
· Using checksums to detect and correct errors by checking data against its hash and, if an error is detected, reading a duplicate of the data from a different location (a sketch of such a self-healing read follows after this list).
· Minimizing the potential for corruption by organizing the system to always maintain a consistent state on disk, thereby minimizing the need for repairing the file system.
· Remembering that, although other aspects such as reliability and ease of use are increasingly important, the performance of a file system cannot be sacrificed. To this effect, the researchers were wise to make use of prior research to ensure that writes, usually a constraining performance factor, were performed quickly.
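A minimal sketch of the self-healing read from the second bullet, assuming a two-way mirror; the function names and signatures are illustrative, not ZFS's interfaces.

```c
#include <stddef.h>
#include <stdint.h>

uint64_t checksum(const void *buf, size_t len);                /* e.g. Fletcher */
int read_copy(int copy, uint64_t blk, void *buf, size_t len);  /* 0 on success */
int write_copy(int copy, uint64_t blk, const void *buf, size_t len);

/* Read a block, verifying it against the checksum its parent stores.
 * If the first mirror copy is bad, try the other; if that one
 * verifies, use it to repair the bad copy ("self-healing"). */
int read_block(uint64_t blk, uint64_t parent_cksum, void *buf, size_t len) {
    for (int copy = 0; copy < 2; copy++) {       /* two-way mirror */
        if (read_copy(copy, blk, buf, len) == 0 &&
            checksum(buf, len) == parent_cksum) {
            if (copy > 0)
                write_copy(0, blk, buf, len);    /* heal the bad copy */
            return 0;
        }
    }
    return -1;   /* no copy matched the checksum: unrecoverable error */
}
```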

Flaws

· Even though the ZFS system is now used in Solaris, it is not yet possible to remove storage from a pool. [1] This appears to be detrimental to the ZFS concept of making administration easy and flexible.
· Accepting the presence of "nonintuitive out-of-space errors" in their system when disk usage is very high. Ease of use usually requires that the user be capable of understanding why a failure occurred. Nonintuitive errors seem to imply that the system would show that space was available but that writes would fail regardless. This is not a good concession to make in a system; it would have been preferable to place some upper limit on disk usage to prevent nonintuitive errors.

Performance

Although performance was not the primary topic of this paper, the researchers realized that it was important to maintain at least comparable performance to other modern file systems. To do this, the authors realized that file systems tended to suffer most in the area of writes and they thus attempt to write larger portions of data. Additionally, the authors attempt to improve performance by striping data across disks.

[1] http://bugs.opensolaris.org/view_bug.do?bug_id=4852783

Summary
This paper describes several changes to the high-level design of a file system that allow for greater capacity, data integrity, and easier administration in the Zettabyte File System (ZFS).

Problem
Given goals of larger capacity, data integrity, and easier administration, and subject to the constraints of maintaining POSIX compliance and maintaining performance, what would a file system look like if designed from scratch?

Contributions
Previous file systems associate one file system per volume, physical or logical. ZFS instead uses a storage pool. This allows for the easy creation of new file systems, as well as dynamically sized file systems whose size changes with the amount of data in the system. The authors use 128-bit block addresses, allowing a total amount of storage that should not be exceeded for quite some time, whereas 64-bit address spaces could potentially reach their limits within a decade.

All data is part of a tree of indirect blocks, where the leaves are data blocks and the root is called the überblock. A block's checksum is stored in its parent for error detection purposes. Because each block's checksum is stored in its parent, a block that has been validated can be trusted to hold correct checksums for its children, and so on down the tree. An interface is provided to allow access to one "object" at a time.

Copy-on-write transactions are used to maintain consistency. The data blocks and indirect blocks to be changed are written out first, and only at the end is the überblock written. A POSIX layer is also introduced. Notably, since file metadata is also stored in objects, the consistency and error detection guarantees apply to file metadata as well.
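The commit ordering might look roughly like this (a sketch with invented names; the real DMU batches many changes per transaction group):

```c
#define BLOCK_SIZE 4096

struct block { unsigned char bytes[BLOCK_SIZE]; };   /* data or indirect */

void write_to_free_space(const struct block *b);     /* never overwrites live data */
void flush_writes(void);                             /* barrier before the commit */
void write_uberblock(const struct block *new_root);  /* the one atomic update */

/* Commit: new leaves and indirect blocks first, root last. A crash
 * anywhere before the final write leaves the old tree fully intact. */
void commit_transaction(struct block *new_blocks[], int n,
                        struct block *new_root) {
    for (int i = 0; i < n; i++)
        write_to_free_space(new_blocks[i]);
    flush_writes();              /* make the new tree durable */
    write_uberblock(new_root);   /* the commit point */
}
```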

Techniques/Tradeoffs
With their system of allocating pooled storage, the authors decouple file systems from physical storage. This extra virtualization, not present in most other file systems, allows file systems to use the pooled storage dynamically, similar to what has been done with virtual memory.

Design choices, especially checksumming, involved a tradeoff with performance. However, many known performance optimizations, such as using asynchronous writes, are still applied, presumably allowing for comparably good performance, although this is not discussed here in detail.

The Problem
Jeff Bonwick et al. open this paper with a humorous note on how system administrators view many chores associated with using and maintaining filesystems (crash recovery, modifying partition schemes, etc.) as intrinsic to the jobs they perform. They also note that the capacity of HDDs is exploding, outgrowing what many existing filesystems can address. Wishing to address these issues, they created ZFS.

The Solution
ZFS makes use of layered abstractions; Figure 3 is an overview of these. One or more device drivers expose an interface to their physical blocks to the SPA. The SPA maps these blocks to virtual blocks (much like virtual memory), adding them to a storage pool. The SPA also contains the logic to handle block creation, deletion, etc. The virtual addresses for the blocks are then exposed to the DMU, which organizes the contents of the filesystem and provides transactional operations on data. Here, as in many filesystems before ZFS, data blocks are addressed through indirect blocks. ZFS uses copy-on-write to add or edit blocks, allowing it to remain consistent if a failure occurs during the write process. On top of the DMU is the ZPL, which provides a POSIX interface to the OS (providing it with traditional file abstractions).

These abstractions make it possible to add disks to an existing filesystem, using vdevs to accomplish striping, mirroring, etc. within an SPA. These vdevs sound like they add a useful new level of flexibility for swapping out components within a filesystem.

ZFS uses long 128-bit block addresses (which gave it its name). To put this in perspective, it is estimated that only 161 exabytes of digital information were created/captured/replicated worldwide by 2006 (http://www.emc.com/about/destination/digital_universe/pdf/Expanding_Digital_Universe_IDC_WhitePaper_022507.pdf).
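As a back-of-the-envelope check (my arithmetic, not the paper's):

$$161\,\mathrm{EB} \approx 1.6\times10^{20}\,\mathrm{bytes} \approx 2^{67}\,\mathrm{bytes}, \qquad 2^{128}\,\mathrm{bytes} \approx 3.4\times10^{38}\,\mathrm{bytes},$$

so the entire 2006 "digital universe" would fill roughly one part in $2^{61}$ (about $2\times10^{18}$) of a single maximal 128-bit address space.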

ZFS is also designed to make it easy to create/grow partitions. In one of their examples they mount a separate partition for each user's home directory: this illustrates the ZFS partitioning mentality.

Omissions
They did not discuss or illustrate the performance of ZFS much, outside of noting a trade-off between CPU time and having block checksums to detect and correct problems.

One can also imagine copy-on-write adding noticeable performance overhead, as each indirect block above an "edited" (new) data block in the tree must be copied before being updated. Batching could probably reduce the overhead involved in this operation, though, by merging many changes to a low-depth block into a single copy-on-write operation.

Interesting performance problems might also arise from the virtual addressing scheme, which presumably renders the DMU unable to reason intelligently about where to place new blocks to take advantage of data locality.

Also unmentioned is what happens when an existing storage pool must be reconfigured rather than just grown. Does ZFS magically move the data around? Flexible partitions are nice, but become less interesting if the storage pool itself cannot be easily changed.

Summary
ZFS is a file system designed by Sun Microsystems that provides strong data integrity guarantees, simple administration, and immense capacity.

Problems:
The administrative tasks of partitioning a disk, creating a logical device, and creating a new file system are painful and error-prone. The interface between the file system and the volume manager makes it difficult to grow or shrink file systems, share space, or migrate live data. Static file system sizes prevent the creation of new file systems on a fully partitioned disk, and users can waste a lot of space. File systems pass through states of inconsistent on-disk data and require crash recovery, checkpointing, log recovery, and fsck, all of which are very expensive. Considering the rate at which file systems grow, 32-bit and even 64-bit addressing does not seem to be enough. Firmware can corrupt data, and file systems have no check against that. The volume manager cannot make assumptions about higher abstractions, as in the case of mirroring, and ends up managing data under strict consistency.

Contributions:
• The file system layering has been changed: the allocation of blocks has been moved out of the file system and into the storage pool allocator.
• With ZFS all storage enters a common pool, called a zpool. Every disk or array added to ZFS disappears into this common pool. ZFS characterizes this storage pool as being akin to a computer's virtual memory.
• ZFS uses a 128-bit addressing scheme and can store 256 quadrillion zettabytes. ZFS's capacity limits are so far away that they should suffice for decades to come.
• ZFS stores the checksum of each data block in its parent indirect block.
• ZFS allows dynamic striping and uses a slab allocator.
• A file write is treated as a transaction, an event that is atomic and must complete before it is confirmed or committed. The ZFS dataset interface allows many file systems to share the same storage without dividing up the disk space.
Flaws:
The paper lists performance as one of the initial design principles, but it doesn't say much about read/write performance; the authors probably should have built a test setup to evaluate it. The paper doesn't discuss removal of a device. The transactional writes/updates can lead to loss of data during crashes, and using intent logs for such cases again seems expensive. They don't mention a way to overcome corruption without mirroring.

Performance:
The copy-on-write transactional model keeps the on-disk state consistent at all times, so there's no need to perform a lengthy file system check after forced reboots or power failures. Using pooled storage makes creating file systems as easy as creating a new directory: you can efficiently have thousands of file systems, each with its own quotas, reservations, and properties.

Tradeoffs: They traded off some amount of performance for data integrity (checksum all on-disk data).
There is another tradeoff: the file system is made robust at the cost of some data loss during transactional failures.

Another area: the same technique could be used in an NFS server, batching several write requests for the same file and replying only once after committing. The transactional model is already used in TxLinux.
Instead of referencing each device separately, devices such as printers could perhaps sit in a single pool, addressed by a single device driver.

Summary: The authors present the Zettabyte File System as an alternative to current file systems, which focuses on increasing data consistency, making administration easy and allowing for larger capacity drives. They propose a clean-slate approach which focuses on the bottleneck of current systems: writes to disk.
Problem to Solve: The problem the authors are trying to solve is that the structure of current filesystems is out of date: the problems targeted previously are not the performance problems seen today. There is an increasing need to support a large number of disks, which combined can hold a terabyte or more of data. The need to ensure consistency of the disk while the system is running is also more important. And as hardware speeds have increased, writes have turned out to be a bottleneck in current file systems.
Contributions: First, the authors introduce 128-bit block addresses for data storage. This is especially important to support the larger disk drives and hardware configurations in use today. Another nice feature of their implementation is that it supposedly takes no more time to use these larger block addresses than 32-bit ones. Secondly, they introduce the SPA and DMU, the building blocks of a system that allows for virtualization of blocks on disk. This is important because one of the crippling parts of current file system administration is the inability to add new partitions, without adding more drives, once all physical space has been partitioned. Thirdly, since they create this new storage model, they are able to divide the necessary work among different parts of the system, creating a balance in which no part is held up by the others. Another contribution is the notion of always-consistent data on disk. Although this had been developed in other systems, they made a point of integrating it in a way that allows quick recovery from corruption without user intervention.
Flaws: One flaw in the paper is that they don't explain whether data corruption is fixable by any method other than having a storage pool mirrored elsewhere. Mirroring large amounts of data on disk seems like it would carry a large overhead over time, which is not mentioned either. Another flaw is that they don't discuss the performance overhead associated with the checksums they compute, only that there is one. It would be beneficial to see how this affects performance as disk space fills up, or a comparison for larger versus smaller files.
What tradeoff is made: One tradeoff is that since blocks are now virtualized, files will be spread throughout the disk. So in exchange for the ability to write quickly, you're increasing your read time, especially for large files. Another tradeoff is that in order to keep a consistent on-disk state, they need a fairly complex, smarter block allocation algorithm that can still force writes out to disk in batches, but quickly.

The Zettabyte File System

Summary
The paper discusses how the goals of strong data integrity, simple
administration, and immense capacity were achieved in the Zettabyte File
System.

Description of the problem being solved

The authors want to solve the current and upcoming problems of file
systems by taking a fresh look at file system design and development from
scratch. They focus on a new file system that is easier to administer by
non-experts and has strong data integrity. They use techniques like pooled
storage, a transactional copy-on-write model, and self-validating checksums to
achieve this.

Contributions of the paper

1. Pooled storage: Decoupling file systems from physical storage makes it
easy to dynamically grow and shrink the file system.

2. Strong data integrity: Using checksums and transactional copy-on-write
guarantees always-consistent data. Also, intent logs can be used to achieve
completeness of data during crashes in the middle of a transaction.

3. Immense capacity: ZFS can support a huge 2^128 bytes of storage, which will
be sufficient for the next couple of decades.

4. Object-oriented storage model: ZFS provides an object-oriented view of
the storage through the dataset and object abstractions, which makes it intuitive
and easy to write and extend the file system code.

5. Simple administration: ZFS is very easy to administer, with simple commands,
features to add or remove disk devices from the system automatically, and
easier creation of file systems (as easy as creating directories).

Flaws in the paper

1. More details on performance are definitely needed: the paper does not give
any detail on the performance of ZFS compared to other file systems.
Though the paper claims excellent performance, there is no proof in the paper.

However, I looked it up on the internet, and it seems that ZFS performs better
than the conventional Sun UFS file system in most cases and as well as UFS in
the worst cases.

Techniques used to achieve performance

1) Proper division of labor between interacting components: storage
allocation is completely handled by a separate component (the storage pool
allocator), unlike in traditional file systems.

2) Recovery-oriented computing: the authors cite David Patterson et
al.'s advice about sacrificing some performance for reliable systems. This
principle shows up in the design as checksums for all blocks.

Tradeoff made

1) Performance vs. reliability and features: the authors have chosen
techniques like copy-on-write, which requires a smart allocation algorithm like
the slab allocator. This consumes more resources but provides more features,
like allocation at any unallocated place. Also, checksummed blocks provide
integrity at the cost of performance.

Another part of OS where this technique could be applied

1) Division of labor could be used in any system with interacting
components. For example, in a grid computing environment,
the central allocator can assign jobs to nodes, but a component on the node
should be in full control of how to efficiently complete the job.

2) Recovery-oriented computing can be applied in any system where
communication takes place. For example, communication over a wire can use
checksums for all data to guarantee integrity of the data.

Summary
The paper describes the Zettabyte File System, a filesystem with a primary focus on data integrity, recoverability, and ease of administration. Self-validating checksums provide error detection. Transactional copy-on-write provides always-consistent on-disk data. The Storage Pool Allocator simplifies administration.

Problem attempted

The main objectives of the paper are automating storage administration, decoupling file systems from storage, supporting dynamic changes to file system size, and ensuring always-consistent on-disk data.

Contributions

1) ZFS implements a Storage Pool Allocator (SPA) which exports a virtual address interface instead of a block device interface. The layers above the SPA are unaware of the actual location of a particular block, so addition or removal of a device is easy. Besides, multiple filesystems can share the same storage pool.

2) There is no need to statically determine the amount of metadata that must be allocated to a filesystem. A filesystem can, by default, use as much storage as necessary from its storage pool.

3) The functionality of the filesystem has been separated into two parts, the ZFS POSIX Layer (ZPL) and the Data Management Unit (DMU). Data blocks are maintained as leaves of a tree with indirect blocks as their parents. The uberblock is the root of the tree.

4) The checksum for every block is maintained in its parent indirect block (except for the uberblock, which contains its own checksum). This arrangement reduces the probability of losing both data and checksum. Besides, since all data is checksummed, self-healing of data is possible if we maintain mirrors of the storage pool.

5) Transactions are implemented by writing all data blocks and indirect blocks and finally rewriting the uberblock for the entire transaction. If the write of the uberblock fails, the DMU will read a backup uberblock from another location.
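A sketch of how that fallback could work, assuming an invented on-disk layout in which several uberblock copies are kept and the newest one that verifies wins:

```c
#include <stddef.h>
#include <stdint.h>

struct uberblock {
    uint64_t txg;      /* transaction group that wrote this copy */
    uint64_t cksum;    /* the uberblock checksums itself */
    /* ... root block pointer ... */
};

int cksum_ok(const struct uberblock *ub);   /* illustrative verifier */

/* Pick the newest copy that verifies. A torn write of one copy simply
 * means the pool opens at the previous consistent transaction. */
const struct uberblock *choose_uberblock(const struct uberblock ubs[], int n) {
    const struct uberblock *best = NULL;
    for (int i = 0; i < n; i++)
        if (cksum_ok(&ubs[i]) && (best == NULL || ubs[i].txg > best->txg))
            best = &ubs[i];
    return best;
}
```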

Flaws

1) In spite of batching many transactions together, the cost involved in making a change to a leaf block is very high, requiring changes to all the ancestral indirect blocks up to the uberblock. The system gives an option to turn off checksumming, but since this is the only mechanism for detecting corrupted data, turning checksums off is not a good choice.

2) Abstracting a device always comes with the problem of making certain optimizations impossible. In this case, the Data Management Unit does not have any control over where data should be placed.

3) Performance results are not provided, though this was probably not a paper focusing on performance.

Trade-offs

1) Performance vs. error detection and data consistency: computing checksums and propagating changes through indirect blocks bring down performance, but they can detect and often correct on-disk data errors.

Techniques used

1) Batching - batching many transactions together before they are written to disk (a sketch follows after this list).

2) Layering and division of labor - moving out the task of allocation from file-system into separate storage allocator.
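A sketch of the batching technique from point 1, with invented names: dirty blocks accumulate in an in-memory transaction group, and the whole group is committed with a single copy-on-write pass and one uberblock update, amortizing the cost of rewriting blocks near the root.

```c
#define GROUP_LIMIT 1024

struct txg {
    void *dirty[GROUP_LIMIT];        /* blocks modified since last sync */
    int   ndirty;
};

void commit_group(struct txg *g);    /* one COW pass + one root write */

void note_write(struct txg *g, void *block) {
    g->dirty[g->ndirty++] = block;
    if (g->ndirty == GROUP_LIMIT) {  /* or periodically, e.g. every few seconds */
        commit_group(g);
        g->ndirty = 0;
    }
}
```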

In this paper the authors present ZFS, a general-purpose file system whose basic goals are to provide strong data integrity guarantees, simple administration, and immense capacity. ZFS is implemented from scratch and is subject only to the constraint of POSIX compliance. The authors first describe the design principles behind ZFS and then present its architecture.

In existing file systems, operations like partitioning a disk, creating a logical device, or creating a new file system are slow. Because these operations were performed only by system administrators, there wasn't much pressure to simplify them. As more and more people become their own system administrators, the need to simplify and automate these operations arises. ZFS's goal is to achieve that by providing simple administration, pooled storage, dynamic file system sizes, consistency, immense capacity, and error detection and correction.

One of the main contributions of ZFS is that it is decoupled from physical storage. Allocation is moved out of the file system and into a Storage Pool Allocator (SPA), which handles block allocation and I/O and exports virtually addressed blocks. Unlike in other systems, there is no association between a file system and a particular storage device. In this way, dynamic addition and removal of devices without interrupting service is possible. Another benefit of this abstraction is that it simplifies system administration: there is no need to create logical devices and partitions. Another contribution of ZFS is the use of virtual devices (vdevs). Each vdev is a small set of routines that implements a particular feature (mirroring, striping, etc.). As far as striping is concerned, the SPA balances the write load across all the disks in a pool, maximizing throughput by using the total disk bandwidth. ZFS keeps data consistent at all times by treating all blocks as copy-on-write: when ZFS writes new data, the blocks containing the old data are retained. It uses checksums to protect blocks against data corruption, and can sometimes repair the damage. ZFS has a transactional, object-based interface which helps retain consistency and simplifies the allocation of metadata (by allocating an object and writing the data into it). Finally, it is a 128-bit file system, so it has a very large capacity.

One flaw of the system is that some operations can add overhead. For example, the checksums and the copy-on-write operations are time-consuming and may decrease the performance of the system. The authors don't present experimental results, so we can't confirm that. Moreover, one of the main stated goals of the system is excellent performance, yet that is the only principle not tested. In my opinion, the authors should have compared and contrasted ZFS with other file systems and presented experimental results, so that we could have an idea of how well the system performs.

As for the techniques used in the system, ZFS uses copy-on-write to keep data consistent, pooled storage, immense capacity, a mechanism for error detection and correction, and a transactional object-based interface. As for performance, it uses dynamic striping to balance the write load and thereby increase throughput. Moreover, similar to LFS, it batches write operations and groups many transactions together so that a group of blocks is rewritten only once for many data block writes. The trade-off here is that in order to keep the system simple for users and administrators, performance is sometimes decreased (e.g., by checksums). Another trade-off is that copy-on-write consumes more space. ZFS could also be used for database applications because it can facilitate transactions.


Summary
The paper brings a radical change of approach to the local storage problem by redesigning the storage system to integrate the volume manager and the filesystem, which allows pooling multiple storage devices. Also incorporated are a transactional copy-on-write model and self-validating checksums.

Problems addressed
- The 1-1 association of filesystems to partitions/disks requires each new partition or disk to be "set up" (albeit once) before it can be used. Also, increasing the size of a filesystem is tedious.
- Consistency of data on disk: Contemporary filesystems allow windows of inconsistency. Further, recovery from crash requires O(data) processing. A related problem is error detection.
- Huge storage capacities available: Contemporary filesystem data structures and algorithms are not scalable to work on such huge capacity storage pools.

Contributions
- Integrated volume manager cum filesystem design: This allows for pooling a large and dynamic set of disks into one volume onto which multiple filesystems can be mapped.
- The storage pool allocator exports virtual addresses to the DMU, thereby hiding the exact mapping of blocks onto devices. This allows the storage pool to be dynamic. Block allocation is also decided by the SPA.
- Storing on-disk data and metadata as a tree of blocks with self-validating checksums allows scalability to huge data volumes and inexpensive error detection and correction. Also notable is the use of 128-bit block addresses.
- The transactional copy-on-write model on the tree of blocks allows for greater consistency of on-disk data, and corner cases are mitigated by intent logs.

Flaws
- The filesystem in ZFS appears to be closer to a directory than a filesystem per se.
- The rippling up of COW updates for refreshing checksum information looks to be an expensive aspect of writes.
- The filesystem cannot exploit any knowledge of the storage medium to optimize for performance. This is the cost of the strict separation introduced here.
- The paper furnishes no performance measurements.

Techniques used
Pooling of storage by mapping multiple filesystems across multiple storage media. Abstraction for providing simplicity of interface and implementation: for example, the SPA completely abstracts away the storage pool. Self-validating checksums for error detection and correction.

Tradeoff
One major tradeoff is performance against scalability and simplicity. The traditional logical-locality layout policy is left out for simplicity, and the tree structure of blocks forgoes performance optimizations, such as storing some direct block pointers in the inode, for the sake of scalability.

Other areas
A comparison can be made between pooling filesystems onto storage media and hybrid threading, where multiple user-level threads are pooled onto multiple kernel-level threads to improve flexibility.

This paper presents ZFS, the Zettabyte File System. This filesystem is a complete redesign, motivated by the current state of the art in storage devices. ZFS provides data integrity, improved administration, and practically infinite capacity.

The main contribution of the paper is the integrated error detection mechanisms that detect silent data corruptions. Another contribution is the pooled storage architecture which decouples blocks from physical storage and allows for administrative simplifications. Also, an interesting idea is the dynamic nature of many file system parameters, like partition size and block size.

The authors praise the transactional copy-on-write scheme as the ultimate solution to high-speed write performance. I can't see, however, how the transactional semantics can benefit existing applications running over the POSIX API, as the filesystem must keep the image of the file both before and after the modification. Commutativity and batching of write operations would speed things up, but both require application-specific knowledge which cannot be communicated to the filesystem. Finally, the authors use "scalability" in an interesting way: they claim that ZFS is scalable in terms of the disk space attached to a single node. My understanding of scalability refers to the number of independent tasks that can be carried out as the system size increases. Achieving the first is easy: you can use more bits for each address, as they do in ZFS. Achieving the second is hard, but it is the one that matters most to users. Unfortunately they fail to show how fast ZFS really is compared to existing filesystems.

The paper presents an interesting design trade-off: The improvements you can achieve by doing a clean design versus the reusability of existing components if you stick to a popular, well-defined API. The observation that storage systems have changed over the last 25 years and therefore file systems have to be redesigned is very appealing to researchers, but unfortunately the industry and the users judge a system by whether it's "good enough". In this respect, a brand new file system which can store zettabytes of data is not very appealing.

Summary:
This paper presents ZFS, developed at Sun, which redesigns existing file systems by providing a new interface between the FS and the volume manager, pooled storage, and a transactional COW model, thereby architecting a new storage stack. At the same time, ZFS avoids trading away implementation simplicity.

Problem:
Conventional file systems usually suffer from problems like complicated administration, poor data integrity guarantees, and limited capacity. This paper presents ZFS, which solves these problems with a clean-slate FS design. The major features of ZFS that help solve these problems are pooled storage, object-based storage, block-level checksumming, and 128-bit virtual block addressing.

Contributions:
ZFS presents a clean-slate FS design. Though most of the techniques used in ZFS are already known in the context of databases, cryptography, etc., ZFS integrates them together. One of the most important contributions of ZFS is decoupling the block allocation mechanism out of the file system and into the storage pool allocator. This helps virtualize the physical disks away from the file systems.

Other major contributions are extended (128-bit) block addresses, which provide larger FS sizes (though still constrained by the OS in some cases, e.g., the 32-bit dev_t in the stat structure). Error detection and correction adds integrity guarantees to ZFS via self-validating checksums. The transactional COW semantics of the DMU layer free the file system to carry out atomic operations in any order (e.g., for deleting a file). Finally, the ZPL takes away the administrator's burden of creating new file systems using mkfs.

Flaws:
Though the paper does mention the performance trade-off of checksum validation and computation, it does not quantify it. The same holds for copy-on-write of data blocks. Leaf-block changes, whether for checksum updates or COW, require cascading updates to the indirect blocks along the branch of the tree. This may result in high performance overheads and needs to be quantified for ZFS.

Techniques used:
Copy-on-write semantics in the DMU layer and transactional interfaces provide atomicity of operations. Allocation of blocks is decoupled from the FS into the SPA (separating mechanism). The DMU has an object-based design. On-disk consistency comes from consistent snapshots of the block tree and self-validating checksums. The implementation is kept simple (following Occam's razor).

Tradeoffs:
COW of every block provides consistent on-disk data but at the same time consumes more space, which may itself generate out-of-space errors when the disk is nearly full. Another tradeoff is between performance and consistency (from self-validating checksums on all blocks).

Alternative uses:
ZFS is a great fit for database servers because of its built-in transactional semantics (DMU). Another use could be file servers, for which dynamic file system size is an added advantage under ever-increasing workloads.

Summary
In this paper, the authors present a 128-bit file system designed to reduce the administrative overhead of maintaining a file system and to increase its reliability. The target audience is not the average desktop/laptop user but system administrators who have to maintain huge file systems, partition them, add and remove disks, etc.

Problem
Current file systems have huge maintenance overheads and don't provide the flexibility or reliability needed for modern systems. Adding a new disk, creating a partition, or mounting a new file system is a painful process and, as a result, subject to mistakes. Also, file systems provide no run-time reliability, and recovery requires a scan of the entire disk (fsck), which is very time-consuming (e.g., the CS AFS outage).

Contributions
- They have increased the addressable space from 2^64 to 2^128 bytes, which should easily take care of our disk space requirements for many years to come.
- The abstraction provided by the storage pool and virtual devices allows easy management of hardware devices and file system partitions. Adding a new disk, increasing the file system size, mirroring the file system, and other traditionally complicated tasks are much simplified by this layer of abstraction.
- Since inodes are created dynamically, creating a new file system is much simpler. It also doesn't put any limit on the number of files (beyond the 2^32 imposed by POSIX compatibility) or waste unnecessary space on inodes that are never used.
- Checksums on every block ensure reliability, and corruption can be corrected easily with mirrored file systems.
- Since everything in this system is an object and all objects are stored on disk, any change in the system can be made transactional. Writes to disk are not visible until the final change to the uberblock, so the order of operations no longer matters.

Flaws
- Though the entire tree of indirect blocks is a novel idea, it has a huge cost associated with it: all these writes cause multiple copies. Without any performance numbers, we can only assume this would reduce write performance. It also means you need free space for all the copies of each block (approximately the depth of the tree times the number of block writes buffered at the leaves).
- There is also no mention of how blocks no longer referenced by the new uberblock are freed; wouldn't a crash at that point cause these unused blocks to go unreclaimed?

Technique
- The storage pool allocator decouples the relation between the file system and the hardware. This abstraction allows the underlying hardware to change without any upper layers even knowing about it.
- There is a trade-off between performance and reliability of the system (checksums, copy-on-write transactions).
- As pointed out by the authors, the lower layers of this implementation can be directly used by a DBMS.

Summary:

The Zettabyte File System develops a way to virtualize how disks are presented to the user. This includes a means of using multiple physical drives as one file system and the ability to dynamically change or add disks to the set. To accomplish this, the device drivers talk to a Storage Pool Allocator, which presents virtualized addresses to the Data Management Unit, which in turn creates the objects used by the ZFS POSIX layer for system calls.

Description of Problem:

File systems present a number of user-visible issues, such as difficulty in configuring and manipulating a file system, waiting for fsck, or dealing with fixed partition sizes.

Contributions:

-Decoupling physical storage from file systems similar to memory banks.

-Implementation of a database-style commit for ensuring data write integrity through a system crash.

-Implementation of CRC-style checking on all data of a file system, with the checksum attached to the parent node to save an additional disk read.

-Increasing the addressing to 128 bits

Flaws

-The paper does not contain any performance statistics for the ZFS file system. How does ZFS compare to ext3, XFS, and other file systems on different benchmarks? Their statement is that "performance should be excellent," but this is not backed up with reasoning.

-A major reason for replacing a physical storage device is a disk crash. In that case, mirroring is possible, as shown in the "ZFS in action" section; however, ZFS's goals of maximizing capacity and strong data integrity call for a level of data redundancy without waste, such as RAID-5.

Techniques

Techniques used by ZFS include redesigning the file system layers from scratch, using a database-style write model for integrity and for grouping transactions, increasing the address size, and organizing data and integrity information in a tree structure. The tradeoff ZFS makes is processing power for increased structure: the tree structure and additional file system layers require more processing overhead and memory, but the system benefits from grouped writes and cached reads with a stronger level of data integrity. The technique of pooling storage into one file system could be abstracted to pooling processors (local or remote) into a single entity, scheduling tasks as work builds up so as to minimize I/O and network contention.

Summary:
The paper describes a new file system, ZFS, that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. ZFS changes the traditional high-level file system architecture through a redesign of the interface between the file system and the volume manager, pooled storage, a transactional copy-on-write model, and self-validating checksums, and thereby achieves its improvements.
The goal the paper addresses:
Traditional file systems have no protection against silent data corruption, cannot guarantee correct data, are hard to manage, lack flexible storage-space management, and split too much complexity between the FS and the VM. The goal of ZFS is therefore to make the file system scalable, manageable, dynamic, safe, and easy to use.
Contributions
1. ZFS proposes a pooled storage model that completely eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth and stranded storage.
2. Fast I/O: ZFS has a pipelined I/O engine, similar in concept to a CPU pipeline. The pipeline operates on I/O dependency graphs and provides scoreboarding, priority, deadline scheduling, out-of-order issue, and I/O aggregation.
3. ZFS provides end-to-end data integrity, and has the following characteristics: (1) Checksum checked after block is in memory (2) Checksum and data block stored separately (3) Protects from accidental overwrites
4. ZFS provides unlimited constant-time snapshots and clones. A snapshot is a read-only point-in-time copy of a file system, while a clone is a writable copy of a snapshot. Clones provide an extremely space-efficient way to store many copies of mostly-shared data such as workspaces, software installations, and diskless clients.
Flaws:
(1) ZFS doesn't support per-user or per-group quotas. Instead, it is possible to create user-owned file systems, each with its own size limit. (2) ZFS is a local file system, not a native cluster, distributed, or parallel file system, and cannot provide concurrent access from multiple hosts. (3) It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev; this feature appears very difficult to implement. (4) We cannot mix vdev types in a zpool. For example, if we had a striped ZFS pool consisting of disks on a SAN, we could not add the local disks as a mirrored vdev.
The techniques used to achieve performance:
ZFS applies the technique of pooled storage: a ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer, not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. When the data and checksum disagree, ZFS knows the checksum can be trusted because the checksum itself is part of some other block one level higher in the tree, and that block has already been validated.

ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy. ZFS also introduces a new data replication model called RAID-Z. It is similar to RAID-5 but uses variable stripe width to eliminate the RAID-5 write hole.

ZFS backup and restore are powered by snapshots. Any snapshot can generate a full backup, and any pair of snapshots can generate an incremental backup. In summary, ZFS has the following features: (1) a pooled storage model, (2) always-consistent on-disk state, (3) protection from data corruption, (4) live data scrubbing, (5) instantaneous snapshots and clones, (6) fast native backup and restore, and (7) high scalability, and so on.
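One detail worth spelling out: under copy-on-write, a snapshot can be taken in constant time because committed blocks are never overwritten. A sketch (my illustration, not the paper's code):

```c
#include <stdint.h>

struct block;                        /* tree node: data or indirect block */

struct snapshot {
    struct block *root;              /* the block tree as of some txg */
    uint64_t      txg;               /* its transaction group number */
};

struct snapshot take_snapshot(struct block *current_root, uint64_t txg) {
    /* No data is copied: the snapshot and the live tree share every
     * block until copy-on-write diverges them, block by block. */
    struct snapshot s = { current_root, txg };
    return s;
}
```

Presumably this is also what makes incremental backup between two snapshots cheap: only blocks written between the two transaction groups need to be sent.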
Tradeoffs:
(1) ZFS is designed with a focus on data integrity, recoverability, and ease of administration rather than performance. (2) Copy-on-write of every block provides always-consistent on-disk data, but it requires a much smarter block allocation algorithm and may cause non-intuitive out-of-space errors when the disk is nearly full. (3) ZFS gives up some performance in order to checksum all on-disk data.
Another part of the OS where the technique could be applied:
The pooled-storage technique used in this paper could also be applied at the Storage Area Network level in an operating system; pooling storage at the SAN level provides a number of benefits, including more flexible storage provisioning, easier quality-of-service adherence, storage consolidation, centralized storage management, extreme scalability, and lower overall costs.
