Frangipani: A Scalable Distributed File System
Chandramohan Thekkath, Timothy Mann, and Edward Lee. Frangipani: A Scalable Distributed File System. Proc. of the 16th ACM Symposium on Operating Systems Principles, October 1997, pages 224-237.
Review due Tuesday 3/29
Comments
Summary:
This paper describes Frangipani, a distributed file system running atop a single virtual disk provided by Petal, a distributed block store developed by the same authors. Because of this architecture, it has a number of features that are noticeably different from other distributed file systems, like NFS and AFS, including multiple private logs and a careful locking protocol that provides strong consistency.
Problem:
Having previously designed Petal, the authors of the paper presumably wished to develop a file system that would work well with the single virtual device that Petal provides. This underlying architecture provides a number of unique characteristics--a large, sparse address space; the ability to dynamically add and remove storage; and the ability to make quick backups, among others--all of which provide opportunities for tailoring a specific file system to work well.
Contributions:
While there have been similar file systems developed (the paper cites xFS), Frangipani provides a strong focus on simplicity, much of which is permitted via its use of Petal as an underlying store. Frangipani servers (as opposed to the underlying Petal devices) do not have to communicate with each other, relying on a lock system integrated with Petal to maintain consistency. It also employs a relatively novel system of multiple logs (one per file server) to improve performance both during operation and recovery. The ease of scalability and back-up creation that it provides are also unusual, but provided almost entirely by the underlying Petal architecture; the file system has to do little to support this.
Flaws:
The overall system as designed seems reasonably solid. In general, analysis of performance is also good, though lacking in a few areas. Ideally, some comparison to another network file system would be desirable; while the AdvFS comparisons are informative, that seems to be a purely local file system, and may not work for all direct comparisons. In addition, the paper proposes using clients separate from the servers; I would have appreciated some analysis of how this would work and the tradeoffs involved therein.
Applications:
To some extent, this file system seems tightly designed for its domain; while it addresses its problem well, I’m not sure how applicable its methods are outside of cluster file systems based on a large virtual device. That said, many of its concepts are synthesized from other file systems (cache coherence, multiple logs, etc.), and this provides a good example of how to combine these concepts for a specific environment. In addition, the locking service seems to be fairly unique, and could be appropriated for other distributed file systems that desire to have strong consistency. Finally, if nothing else, this file system would likely serve as a useful template for file systems designed to run in cloud environments, which, like Petal, offer highly scalable, near-infinite storage.
Posted by: Chris Dragga | March 28, 2011 11:51 PM
Summary:
The paper presents 'Frangipani', a scalable distributed file system built on top of Petal, a distributed block storage system. Frangipani uses a distributed lock service to provide coherent, shared access to files through multiple-reader/single-writer locks.
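The multiple-reader/single-writer semantics mentioned above can be sketched roughly as follows. This is an illustrative in-process sketch, not Frangipani's actual (distributed) lock service; the class and method names are my own.

```python
# Sketch of multiple-reader/single-writer lock semantics: any number of
# readers may hold the lock at once, but a writer requires exclusive access.
import threading

class MRSWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of active readers
        self._writer = False   # whether a writer currently holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:            # readers wait out any writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()          # wait until no readers or writer
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

In the real system these locks are handed out by a separate lock service and cover on-disk file-system objects rather than in-memory state.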
Problem:
A shared data repository serving a large amount of data needs to be a distributed system in order to be available and reliable. Such systems must address issues like fault tolerance, consistency, and load balancing, which together complicate the design of such storage systems and make it hard to comply with standard policies in providing the service. Using a distributed block store that takes care of most of these problems significantly simplifies the design of a shared data repository.
Contributions:
Frangipani shows how a simplified distributed storage service can be built above Petal's fault-tolerant, scalable virtual-disk layer. Petal relieves Frangipani of handling failures and load balancing in the distributed physical storage system. With Petal's support for consistent backups, Frangipani provides archiving trivially.
Frangipani performs write-ahead redo logging for all metadata changes to simplify failure recovery and improve performance. Per-server logs, with automatic failure detection and recovery through daemon processes that take exclusive access to a dead server's log, simplify recovery and administration of Frangipani. Frangipani uses a scalable, distributed, fault-tolerant lock service to provide synchronized, coherent access to the shared data store at the granularity of a file-system object. The distributed lock service performs automatic failure detection and recovery through leases and majority consensus.
A read-ahead mechanism and concurrent streaming of 64KB chunks from multiple Petal servers allow Frangipani to achieve satisfactory read/write performance in spite of its layered, distributed structure, which involves locking and logging overhead and lacks optimizations at the disk level.
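The concurrent chunked streaming mentioned above can be sketched as follows. This is an illustrative sketch only: it assumes a simple round-robin mapping from chunk index to server, which is my simplification, not Petal's actual placement policy.

```python
# Sketch of reading a byte range as 64 KB chunks fetched concurrently
# from several storage servers (round-robin chunk placement assumed).
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024

def read_range(servers, offset, length):
    """servers: list of callables fetch(chunk_index) -> bytes of len CHUNK."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    with ThreadPoolExecutor() as pool:
        # each chunk is fetched from servers[i % len(servers)] in parallel
        futures = [pool.submit(servers[i % len(servers)], i)
                   for i in range(first, last + 1)]
        data = b"".join(f.result() for f in futures)
    start = offset - first * CHUNK   # trim to the requested byte range
    return data[start:start + length]
```

Spreading consecutive chunks across servers is what lets aggregate bandwidth scale with the number of servers rather than being limited by any single disk.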
Flaw:
Frangipani does not perform any exclusive caching on the client side. Because it relies solely on the client's buffer cache, when the working-set size exceeds the size of the buffer cache, Frangipani would perform an excessive amount of redundant network communication to fetch data from Petal. Such redundant network communication may also be triggered by the buffer-cache eviction policy of the client OS.
Applications:
Frangipani can be used as a shared data repository when concurrent updates to individual files are not required but the file system as a whole is shared. Also, when the shared data repository is dynamic in size, Frangipani uses physical disk space proportional to the data stored on it; Petal avoids having to exclusively allocate disk space for such a repository up front.
Posted by: Sandeep Dhoot | March 29, 2011 03:17 AM
Summary:
This paper describes "Frangipani", a distributed file system that uses the services of Petal (a distributed storage system) to provide uninterrupted access to user files irrespective of changes in the hardware configuration of the storage servers.
Problem:
The authors observe that large-scale distributed file systems are hard to administer (due to the size of the installation and the number of components involved) and more often than not require some manual intervention. Frangipani attempts to solve this problem by giving all users a consistent view of files, allowing transparent addition and deletion of Frangipani servers, dynamic backup support, and fault tolerance (though most of these are provided by Petal).
Another motivation for Frangipani could have been that the authors had already designed a disk interface (namely Petal) and now needed a file system on top of it that could leverage Petal's features.
Contributions:
The Petal/Frangipani system provides a solid layered architecture that made it easy to build (compare this with xFS!) and even modular (paving the way for component reusability). The file system resides on a single, large (2^64-byte) virtual disk provided by Petal, which redirects I/O requests to a set of Petal servers and handles physical storage allocation and striping. This layered architecture simplifies metadata management in the file system to some extent.
Frangipani implements elaborate locking protocols using lock servers and lock tables. There are multiple instances of the lock server in the system, eliminating any single point of failure. It addresses the distributed deadlock problem by globally ordering locks and enforcing two-phase locking.
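The deadlock-avoidance discipline described above (acquire all needed locks in a single global order, hold them for the whole operation, then release) can be sketched as follows. This is an illustrative sketch, not Frangipani's code; the sorted-by-ID ordering and the `with_locks` helper are my assumptions.

```python
# Sketch of two-phase locking with a global acquisition order: sorting
# the lock IDs before acquiring ensures no two operations ever wait on
# each other in a cycle, which prevents deadlock.
import threading

def with_locks(locks_by_id, needed_ids, operation):
    ordered = sorted(needed_ids)          # global order prevents cycles
    acquired = []
    try:
        for lid in ordered:               # growing phase: acquire only
            locks_by_id[lid].acquire()
            acquired.append(lid)
        return operation()                # run with all locks held
    finally:
        for lid in reversed(acquired):    # shrinking phase: release only
            locks_by_id[lid].release()
```

Because every operation acquires locks in the same order, a cycle of operations each waiting for a lock the next one holds cannot arise.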
The 'discard' primitive (block-layer discard requests, i.e., the ability of file systems to tell low-level block drivers about unneeded sectors) seems to have originated in the Petal/Frangipani world. Articles on LWN.net discuss how this feature, used by Frangipani, made its way into Linux file systems and many low-level drivers for solid-state, flash-based storage devices.
Faults:
Frangipani is not locality-aware! Clients have no sense of physical locality because Petal exposes only a large virtual disk. (I am not sure it needs to be aware, but one could imagine applications that would benefit from clients writing to nearby disks.)
Frangipani provides caching support below the FS layer, which may not be needed for workloads with little reuse (e.g., GFS's target workloads have little reuse within a single application run because they either stream through a large data set or randomly seek within it, reading small amounts of data each time).
Frangipani's locking mechanism seems too coarse. It implements whole-file locking and does not allow concurrent writes to the same file from multiple nodes.
IMO, there could be security issues in the system. Frangipani servers seem to trust one another, the underlying Petal system, and even the locking service. One could easily modify the kernel and boot it to operate a malicious node.
I am not sure why the authors chose to implement it at the kernel level; that makes it not at all portable.
Applications:
Frangipani could be used for specific applications wherein concurrent writes are not required. The two-phase locking system could also be used in other distributed systems where strong consistency is required.
Aside: since inodes and data blocks are completely separated into different regions on Petal, is there a chance that these will be written to different Petal servers, thereby causing some performance hit during lookups?
Posted by: Rohit Koul | March 29, 2011 07:09 AM