Frangipani: A Scalable Distributed File System
Chandramohan Thekkath, Timothy Mann, and Edward Lee. Frangipani: A Scalable Distributed File System. Proc. of the 16th ACM Symposium on Operating Systems Principles, October 1997, pages 224-237.
Review this or Petal for Thursday, March 18th.
Comments
Summary:
This paper describes Frangipani, a two-layer distributed file system built on top of Petal. It preserves Petal's scalability, fault tolerance, and easy administration, and extends Petal into a cluster file system. The paper describes the administration techniques and implementation of Frangipani.
Problem Description:
Frangipani is designed to work with Petal, which is a highly available and scalable storage system. It provides a consistent view of a set of files to all users and behaves like a local file system. It addresses issues such as scalability, performance, load balancing, easy administration, and failure recovery.
Contributions:
Because Petal already provides a large virtual disk address space, data block management, easy scaling, consistent replication, and fault tolerance, building this two-layer system on top of it is much simpler. The issues addressed by this paper include:
-Failure recovery, via write-ahead redo logging for metadata.
-A distributed, coarse-grained file/directory locking system. Lock state is replicated across lock servers using the Paxos algorithm, and lock leases are introduced to deal with client failures (see the sketch after this list).
-The standard system call interface provided to users.
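A minimal sketch of how such a lease might look on the client side (the class names and the 30-second term are illustrative assumptions, not taken from the paper):

```python
import threading
import time

# Illustrative lease term; the actual timeout is a parameter of the real
# lock service, not something this sketch takes from the paper.
LEASE_TERM = 30.0  # seconds

class LeasedLock:
    """Client-side view of a lock held under a lease.

    If the holder fails to renew before the lease expires, the lock
    service is free to treat the holder as failed and reclaim the lock.
    """
    def __init__(self, name: str):
        self.name = name
        self.expires_at = time.time() + LEASE_TERM

    def renew(self) -> None:
        # In the real system this would be an RPC to the lock server.
        self.expires_at = time.time() + LEASE_TERM

    def is_valid(self) -> bool:
        return time.time() < self.expires_at

def renewal_loop(locks: list, stop: threading.Event) -> None:
    """Periodically renew all held leases, well before they expire."""
    while not stop.is_set():
        for lock in locks:
            lock.renew()
        stop.wait(LEASE_TERM / 3)
```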
Applications:
The paper assumes that Frangipani, as well as the underlying Petal, runs in a trusted network environment, which is not very practical. The coarse-grained locks and the pre-determined locking order can hurt performance. I also wonder what drove the authors to implement such a system on top of Petal, other than to explore a different approach.
Posted by: chong | March 18, 2010 01:41 PM
Frangipani (hereafter FP) was designed as a network file system with scalability in mind, leveraging the Petal virtual network disk system. It uses a few existing distributed computing techniques to accomplish this, such as Paxos, logging, and leases.
The Petal back-end gives FP many features that greatly simplify the implementation, especially for some advanced features. Each FP server stores its necessary state on Petal, so that if any FP server becomes unresponsive, the actions it attempted can be recreated. Similarly, the state of the locks in the system is stored on Petal. Backups can easily be made using Petal's snapshot functionality. Another interesting feature is that Petal virtual disks can be indexed arbitrarily with 64-bit keys, so FP can address any of 16 exabytes without that much physical storage needing to exist. The layout can be as sparse as desired, providing room for easy expandability.
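To make the sparse addressing concrete, here is a rough sketch of how a file system might carve regions out of such a 64-bit virtual disk; the region names and boundaries are illustrative assumptions, not the paper's actual layout:

```python
TB = 2 ** 40  # one terabyte of virtual address space

# Illustrative partitioning of a sparse 2^64-byte virtual disk into fixed
# regions. The offsets are assumptions for illustration only; the point is
# that regions can be reserved far apart without any physical storage
# existing until blocks in them are actually written.
REGIONS = {
    "config":       (0 * TB,   1 * TB),   # shared configuration parameters
    "logs":         (1 * TB,   2 * TB),   # one redo log slot per server
    "bitmaps":      (2 * TB,   5 * TB),   # allocation bitmaps
    "inodes":       (5 * TB,   6 * TB),   # fixed-size inodes
    "small_blocks": (6 * TB, 134 * TB),   # 4 KB data blocks
    # everything above the last region is reserved for large files
}

def region_of(addr: int) -> str:
    """Map a 64-bit virtual disk address to the region that owns it."""
    for name, (start, end) in REGIONS.items():
        if start <= addr < end:
            return name
    return "large_files"
```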
Locality sort of gets turned on its head with FP. One of the big advancements made with FFS was increasing locality of related files, and at the same time distributing the load over the disk. Petal virtual disks are distributed unconditionally across many disks in many servers, so you get the distribution for free that FFS had to earn. With the distribution, operations can be easily parallelized. Locality would only degrade performance. Filesystems like FFS just aren't applicable, so filesystems like FP become necessary.
One thing I don't particularly like is that an inode is the same size as a block. Blocks used to be the same size as sectors: 512 bytes. Now blocks are written 4096 bytes at a time, so individual sectors cannot be written independently. It probably wouldn't be such an awful thing to have to lock an entire block to update a single inode, so it would make sense to reduce inodes to a more realistic size like 64 bytes.
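For a sense of the packing being suggested, a quick back-of-the-envelope calculation using the sizes mentioned above:

```python
# Inodes per 4 KB block at the sizes mentioned above: one block-sized inode
# today versus eight 512-byte or sixty-four 64-byte inodes packed together.
BLOCK_SIZE = 4096
for inode_size in (4096, 512, 64):
    print(f"{inode_size:4d}-byte inodes -> {BLOCK_SIZE // inode_size:2d} per block")
```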
Although FP is not appropriate in all cases due to its weak security properties, reasonable security is not too difficult to achieve. A Petal cluster could exist on a private network, with FP front-ends exposing an interface to the outside. The servers running FP can enforce whatever policy they want, and the result is multiple entry points into a huge and reliable data store.
Posted by: Markus Peloquin | March 18, 2010 01:21 PM
Summary:
Frangipani is a cluster file system designed specifically to run on top of the Petal distributed virtual disk service. By cluster file system, the authors mean multiple instances of a file system driver on different physical machines sharing a single storage address space and file system namespace. Additionally, clusters like this tend to require a single administrative domain due to the large amount of shared writable data.
Problem:
Previous work by the authors, as well as a number of hardware solutions, can present a single logical block device to multiple clients. Given such a block device, how do we export it to many users in a useful and consistent manner?
Contributions:
Because Petal and Frangipani share a common goal and were designed with interoperability in mind, together they form a very elegant system. By making no assumptions about block storage aside from an efficient representation of a huge sparse address space, Frangipani is able to defer allocation decisions to Petal. As a storage manager, Petal is perfectly situated to make intelligent choices regarding block placement and replication.
In exchange for giving up allocation-level optimizations, Frangipani is able to do amazing block and file allocation tricks by essentially statically allocating storage for all structures and data in advance. Large files have a highly efficient representation (probably similar to extents when you consider the translations done by Petal). In addition, control and metadata structures are easier to partition using locking because all locations are preallocated in the virtual address space. This largely removes the possibility of collisions, assuming a correct locking discipline.
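As a rough illustration of what preallocating everything in the sparse address space buys, here is a sketch of mapping a (file, offset) pair directly to a virtual disk address; the base address and per-file stride are assumptions, not the paper's actual constants:

```python
# Sketch: if every large file is given its own fixed-size slice of the
# sparse virtual address space, translating a file offset into a virtual
# disk address is plain arithmetic -- no per-file extent map is needed.
# The constants below are illustrative assumptions, not the paper's layout.
LARGE_FILE_BASE = 2 ** 47     # start of the large-file region
LARGE_FILE_STRIDE = 2 ** 40   # 1 TB of address space reserved per file

def virtual_address(file_index: int, offset: int) -> int:
    """Return the virtual disk address backing byte `offset` of a file."""
    if not 0 <= offset < LARGE_FILE_STRIDE:
        raise ValueError("offset outside the file's reserved region")
    return LARGE_FILE_BASE + file_index * LARGE_FILE_STRIDE + offset
```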
Practical Applications:
Although cluster file systems have weak security properties (across administrative domains) due to shared write access to the entire block device, this is not a huge drawback. Most applications for cluster file systems fall within the sheltered backend of a large service or private computing facility. GFS could be considered one of the great successes among cluster file systems that share many of these properties.
Posted by: Joel Scherpelz | March 18, 2010 05:00 AM
Summary:
Frangipani is a distributed file system that attempts to meet all needs and expectations. Frangipani is built upon Petal, a distributed storage system, and provides fully coherent data sharing, incremental scalability, and very high availability. The paper discusses the overall structure of the system, the logging and recovery systems, the cache coherence system, and the backup system in detail. The paper also provides an empirical evaluation.
Problem description:
In my opinion, Frangipani is more of a solution looking for a problem. Many distributed file systems have been proposed before, and none of the ideas in this paper seem particularly new. However, Frangipani does uniquely leverage Petal, and the layered implementation approach allowed the authors to rapidly develop something that typically takes many times longer to prototype and evaluate. That does make their effort seem rather impressive.
Contributions:
Frangipani is a distributed file system that does many things rather well, which is somewhat special. Frangipani appears to be among the first (if not the first) distributed file system to be (a) fully coherent, (b) highly fault tolerant, and (c) relatively scalable. In some sense, Frangipani is like a distributed shared memory system, but where the unit of addressing is the file. The other clear contribution is Frangipani's elegant layered architecture, which is simple to understand and was apparently relatively easy to develop.
Applicability:
Frangipani seems like an excellent distributed file system architecture for moderately sized clusters. It is probably not for 1000-node clusters, but the system would probably scale to clusters with tens of nodes. The authors do not attempt to evaluate more than 10 nodes, and this is probably wise, as performance would likely suffer. Unfortunately, it's hard to imagine a cluster environment where you would not want your system to scale to more than 10-100 nodes. Data center clusters are certainly much larger, so Frangipani is probably not well suited to them.
Posted by: Marc de Kruijf | March 18, 2010 04:38 AM
The paper "Frangipani: A Scalable Distributed File System" is about the Frangipani file system, implemented on top of Petal, which is a storage system that attempts to be globablly accessible, always available, have good performance (through load balancing and replication) and allow adding or removing disks without needing much administration. Frangipani has many of the same goals as Petal, but on a higher level: instead of being a storage system, it is a file system, but it still wants to have good performance, easy administration, global accessibility, and fault tolerance.
The problem addressed in the paper is how to make a file system that is accessible for concurrent reads and writes from different clients and that maintains consistency while still being available. The system needed to be able to survive the loss of some nodes without losing data or becoming inoperable (it needed to degrade gracefully). They also wanted it to be scalable to many servers and many clients.
The paper contributed the idea of separate lock servers and Frangipani servers, which could reside on the same machines but did not have to. They used extensive logging to ensure consistent operations, and locks and leases to manage who was writing to a file at any given point. They also showed that it was possible to build a file system on top of Petal's virtual disk interface that extended and added to the benefits of Petal rather than masking them, and they used Petal's snapshot ability to make backups of the file system. Adding and removing servers (whether for Petal, locking, or Frangipani) is easy and requires changing only the machine being added or removed; because the system tolerates the permanent loss of a server, removing one is as simple as shutting it off.
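A minimal sketch of the write-ahead redo logging idea, assuming a simple append-only log of metadata update records (the record format and function names are illustrative, not the paper's):

```python
import json

class RedoLog:
    """Append-only metadata redo log, written before updates are applied.

    In this sketch the log is an in-memory list; in the real system it
    would live on shared storage so a surviving server can read it.
    """
    def __init__(self):
        self.records = []
        self.next_seq = 0

    def append(self, update: dict) -> int:
        record = {"seq": self.next_seq, "update": update}
        # The record must be durable *before* the update touches metadata.
        self.records.append(json.dumps(record))
        self.next_seq += 1
        return record["seq"]

def recover(log: RedoLog, apply_update) -> None:
    """Replay a crashed server's log in order, re-applying each update."""
    for raw in log.records:
        apply_update(json.loads(raw)["update"])
```

A real recovery path would also need to avoid re-applying updates that are already reflected on disk; the sketch omits that check.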
The authors did run some tests of the system and found that it performed comparably to other file systems with up to seven Petal nodes and six Frangipani nodes. They were next planning to put it into their own day-to-day use. It would be interesting to see whether this worked out for them, or whether they discovered problems with scalability or performance.
Posted by: Lena Olson | March 18, 2010 04:35 AM
- Summary
This paper describes the design, implementation, and evaluation of Frangipani, a two-layer distributed file system that provides availability, scalability, shared access, and minimal human administration.
- Problem
Building an ideal distributed file system is hard because of its *distributed* nature. Some of the challenges are: how to ensure a consistent view of the data, how to deal with failures, how to provide reasonable performance, and so forth.
- Contributions
The idea of Frangipani is to leverage the Petal virtual disk layer and implement the file system interface on top of Petal. This approach relieves the implementer from worrying about reads and writes to physical blocks, since the storage is virtualized. It also makes the addition or removal of servers transparent, which eases administration. The main concerns left for Frangipani are dealing with failures and providing consistency. To deal with these problems, Frangipani uses write-ahead redo logging, a multiple-reader/single-writer locking model, and a distributed lock service that reaches agreement using the Paxos consensus algorithm.
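A small sketch of the multiple-reader/single-writer model, using local threading primitives to stand in for the distributed lock service (names are assumptions):

```python
import threading

class ReadWriteLock:
    """Many readers may hold the lock at once; a writer holds it alone."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```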
- Comments/flaws/questions
I disagree with Frangipani's design choice of placing the user programs and the file server module on the same machine. This limits its use, in the sense that only trusted/verified machines can be connected to the system.
Also, the locking service is implemented with multiple lock servers that reach agreement using the Paxos consensus algorithm. I think this could become a bottleneck of the system under highly concurrent workloads.
Finally, I don't like the idea of locking an entire file or directory, for the same reason as my second point.
Posted by: Thanh Do | March 18, 2010 04:26 AM
The "Frangipani" paper presents the design of a distributed file system ensuring data coherence while allowing shared access to the same set of data.
Frangipani shares much of its motivation with its storage provider, Petal: to provide a scalable, high-performance system with a large amount of storage space. The main challenges in these types of systems are ensuring data coherence, making the system fault tolerant, and maintaining data integrity. Frangipani leverages many features of the Petal storage layer to achieve its design goals more easily.
The system is built on top of Petal, a distributed virtual storage system. Thanks to Petal's inherent capacity, Frangipani can administer around 2^64 bytes of address space. Data integrity and fault tolerance are achieved through logging and recovery: changes to metadata are recorded in a redo log, so that even if a server crashes, another server can read the log and bring the system back to a consistent state. Data coherence is ensured by synchronization primitives at file granularity: whenever a lock is granted for a request, it is guaranteed that the previous lock holder has flushed any dirty data to disk (in the case of lock contention between readers and writers, or vice versa). The lock service acts as the lock manager and keeps the system free of deadlock. Petal's snapshot mechanism is used to provide backups.
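A sketch of the flush-before-release step behind that coherence guarantee; the class and callback names are assumptions, with a plain dict standing in for the Petal virtual disk:

```python
class CachedFile:
    """One server's cached view of a file, protected by a write lock."""
    def __init__(self, name: str, store: dict):
        self.name = name
        self.store = store         # stands in for the shared Petal virtual disk
        self.dirty_blocks = {}     # block number -> data not yet written back

    def write_block(self, block_no: int, data: bytes) -> None:
        self.dirty_blocks[block_no] = data

    def on_lock_revoked(self) -> None:
        """Called when another server requests this file's write lock.

        Dirty data is flushed first, so the next lock holder always sees
        the latest contents in shared storage.
        """
        for block_no, data in sorted(self.dirty_blocks.items()):
            self.store[(self.name, block_no)] = data
        self.dirty_blocks.clear()
```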
An interesting result is that the system's performance degrades because of prefetching (though this is expected in systems of this type that try to ensure coherence). It could also be mitigated by reducing the granularity of the locks.
Pros: (1) Servers can be added to or removed from the system very easily; a server does not need to know about its neighbours or any other state of the system. (2) Since Petal provides redundancy schemes, these can be used to increase the reliability of the system. (3) Because data is striped at the storage level, the overall throughput of the system is really good, since multiple requests can be served together.
The system looks a lot like a parallel database system. The two types of systems have much in common, and there is plenty of overlapping work between the database and systems communities.
Posted by: Sankaralingam Panneerselvam | March 18, 2010 02:59 AM