Petal: Distributed virtual disks
E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proc. 7th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pages 84--92, October 1996.
Review this or Frangipani for Thursday, March 18th.
Comments
This paper presents Petal, a distributed virtual disk implementation designed to come as close as possible to the ideal storage system: globally accessible, always available, offering unlimited performance and capacity to a large number of clients, and requiring no management. Petal approximates this ideal through a novel combination of features. A set of network-connected servers and disks cooperate to provide a highly available virtual disk to the user. The service is able to tolerate and survive disk, server, and network failures. They evaluate the performance of their system on what are now very old machines (225 MHz DEC 3000/700 workstations).
The design of petal includes virtual to physical translation of addresses, support for backup, incremental reconfiguration, and data access and recovery. All of these features are to make the system as close to ideal as possible as mentioned above.
I think this paper is applicable to distributed systems. The idea that the user can create a virtual disk whenever they need one is a little bit strange, because I think of disks as something that a user doesn't understand and doesn't want to understand. They simply want to create files, edit them, and have them stored somewhere where they are accessible the next time they want them. But distributed file systems are widely used today at universities and in industry. Petal performs its abstraction at the disk level by acting like there are multiple disks, but it really is just a distributed file system.
Posted by: Elizabeth Soechting | March 18, 2010 02:00 PM
This paper presents Petal, a replicated network file storage system that is easy to maintain as it grows.
The authors wanted to create a system that is 1) easy to manage, 2) scalable, 3) available, 4) fault tolerant to single failures, 5) fast, 6) performs load balancing, 7) easily expandable, 8) geographically distributed, and 9) available to heterogeneous clients.
The majority of Petal involves gluing existing protocols and ideas (such as Paxos) together to form a system that meets the criteria above. However, they did make some design decisions that were helpful in achieving their goal:
1) Virtual disks hide the actual storage details and allow the data to be stored on multiple servers.
2) A disk-like interface that works with files on the block level. This allows files from various types of file systems to be stored here.
3) Backups are virtual snapshots of the data that appear like any other virtual disk. They can be created efficiently due to copy-on-write techniques.
4) Chaining data placement replicates the data on neighboring nodes and can help lower the read load.
While this paper makes some great claims, it is apparent that there are some disadvantages to working with the system. The block-level interface means heterogeneous clients wouldn't be able to correctly read one another's data (apparently this is where Frangipani comes into play).
There are also limitations as to which nodes can crash. For example, if a server node and its neighbor both go down, we lose access to the data on the server node (even though its other neighbor is still up).
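To make the two-neighbor failure case concrete, here is a minimal sketch of chained placement. It assumes blocks are striped round-robin and that each block's second copy sits on the next server in the chain; the function names `replicas` and `is_available` are made up for illustration and are not taken from the paper.

```python
def replicas(block: int, num_servers: int) -> tuple:
    """Return (primary, secondary) server indices for a block.

    Assumption: round-robin striping, with the second copy stored on
    the next server in the chain (wrapping around at the end).
    """
    primary = block % num_servers
    secondary = (primary + 1) % num_servers
    return primary, secondary


def is_available(block: int, num_servers: int, failed: set) -> bool:
    """A block stays readable while at least one of its two copies is up."""
    primary, secondary = replicas(block, num_servers)
    return primary not in failed or secondary not in failed


if __name__ == "__main__":
    n = 4
    # Servers 1 and 2 are adjacent: blocks whose primary is server 1 are lost.
    print([b for b in range(8) if not is_available(b, n, {1, 2})])
    # Servers 0 and 2 are not adjacent: every block keeps one live copy.
    print([b for b in range(8) if not is_available(b, n, {0, 2})])
```

With four servers, losing the adjacent pair {1, 2} makes only the blocks whose primary is server 1 unreachable, while losing the non-adjacent pair {0, 2} loses nothing.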
It would be interesting to see how the performance of this system compares to other types of network attached storage (e.g. AFS) instead of just comparing to local disk performance.
Posted by: Sean Crosby | March 18, 2010 01:59 PM
The Petal paper gives the design and implementation of a distributed storage system. Unlike a distributed file system, Petal manages the data as raw blocks, giving clients the abstraction of virtual disks. These virtual disks can then be consumed by clients such as a distributed file system or a database. The paper then addresses how Petal goes about solving the common problems of distributed systems such as node failures, reliability and load balancing.
The contribution of this paper is a new approach to the design of distributed storage: cleanly separating the system into a block-level storage system and a file system. This approach has the advantage of serving heterogeneous client applications and solving the issues of distributed storage at a lower level, transparent to the clients. A virtual disk is addressed by a virtual disk identifier, which is then translated into a physical location (server, disk offset) using global information shared among the Petal servers. Petal performs snapshot backups by maintaining immutable versions of updates so that a client can fall back to an earlier one in case of crashes. Addition of servers to the system is handled in a phased manner through redistribution of data from the existing ones. Maintaining a global map for each virtual disk makes the redirection easier to handle among the servers, transparently to clients. For data recovery, Petal uses chained declustering to mirror and stripe the data blocks of virtual disks. Two copies of each data block are maintained, at a primary and a secondary server. Reads are served by either of them, while updates originate at the primary and are propagated to the secondary.
This system manages the reliability of data at the block level through kernel device drivers. Additional hardware support for this kind of automatic backup to the secondary could be used to increase performance. Another approach would be to perform block-level backup without involving the Petal servers: a dedicated backup server could move the data over high-speed dedicated links without using the client network. However, since the system replicates at the block level, hints from client applications cannot be used to determine the importance of data. For example, the system backs up a database's metadata and its temporary files with the same importance, even though that is unnecessary.
Posted by: Rajiv | March 18, 2010 01:56 PM
Summary:
Petal is a distributed storage system that can be used to emulate physical disks accessed by file system clients. Virtual disks are used to replicate data, to allow scaling and redistribution of data for load balancing, and to support backups. It can be used with modified existing file systems or with new distributed file systems such as Frangipani.
Problem:
Distributed file systems present many difficult challenges, such as trying to maintain availability, consistency, reliability, transparency, etc. Many solutions that we have seen so far sacrifice at least one of these. Petal attempts to maintain availability, consistency, and transparency by implementing the distributed storage at the data block level, which allows further abstraction from the user.
Contributions:
The idea of distributing at the data block level is rather novel compared to most of the distributed storage systems we have encountered so far in this course. Other solutions implement the distributed characteristics at the file system level. Instead, Petal implements them at the storage level, with the file system acting as a layer of abstraction between the user and the distributed nature of the system. One of the most interesting aspects of their design is the use of Paxos to maintain a global state of the system across each node. This allows for fault tolerance in the case of network and server failure. Another interesting system is the virtual-to-physical mapping system. Petal maintains three layers of mapping: virtual directories, global maps, and physical maps. Each layer is responsible for translating to the next layer, and the global map translation is performed on the server that contains the data block (which means the physical map translation also occurs there). This is one of multiple optimizations included in the design to reduce the amount of network communication required. Petal also supports a system of checkpoint backups. Each checkpoint maintains a version or epoch number, which allows for data recovery. One of the issues with checkpointing at the time of the paper is that it required pausing Petal for ~1 second in order to be fault tolerant. Data in Petal can be replicated or striped.
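The three mapping layers described above can be sketched as three lookups. This is only an illustration under stated assumptions: the class names, the dictionary layouts, the 64 KB translation unit and the round-robin choice of server are all made up for the example and are not Petal's actual data structures.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64 * 1024  # assumed translation granularity, for illustration only


@dataclass
class Server:
    """One storage server with its own local physical map (hypothetical)."""
    name: str
    # (gmap_id, virtual block number) -> (disk number, byte offset on that disk)
    physical_map: dict = field(default_factory=dict)


@dataclass
class Cluster:
    virtual_directory: dict  # vdisk_id -> gmap_id (globally shared)
    global_maps: dict        # gmap_id -> ordered list of server names (globally shared)
    servers: dict            # server name -> Server

    def translate(self, vdisk_id: str, offset: int) -> tuple:
        """Map (vdisk_id, offset) to (server name, disk, offset on disk)."""
        # Layer 1: the virtual directory names the global map for this disk.
        gmap_id = self.virtual_directory[vdisk_id]
        # Layer 2: the global map picks the responsible server
        # (assumed round-robin over blocks here).
        block = offset // BLOCK_SIZE
        names = self.global_maps[gmap_id]
        server = self.servers[names[block % len(names)]]
        # Layer 3: that server's local physical map gives disk and offset.
        disk, base = server.physical_map[(gmap_id, block)]
        return server.name, disk, base + offset % BLOCK_SIZE


if __name__ == "__main__":
    s0 = Server("s0", {("g1", 0): (0, 0)})
    s1 = Server("s1", {("g1", 1): (2, 131072)})
    cluster = Cluster({"vd7": "g1"}, {"g1": ["s0", "s1"]}, {"s0": s0, "s1": s1})
    print(cluster.translate("vd7", 65536 + 100))
```

Only the last lookup touches per-server state, which is why the first two layers can be shared globally while the third stays local to the server holding the block.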
Application/Thoughts:
The other paper for today, about Frangipani, shows an application of Petal for implementing a fault tolerant distributed file system. I had a few questions about the Petal paper, however. What are the storage overhead costs of implementing the distributed system at the block level? Does distributing at the block level cause large differences between random and sequential reads/writes?
Posted by: Ryan Johnson | March 18, 2010 01:55 PM
Goal:
To build a fault tolerant distributed file system that achieves at least the performance of a local disk file system and presents a "block device" interface to clients, so that it works seamlessly with existing FS interfaces.
Summary:
To provide the right level of abstraction for a distributed file system, one that tolerates faults and takes advantage of an increasing number of servers.
Contribution:
(1) The idea of a "virtual disk" that gives clients the same block-device view as a local disk, in contrast to other systems that present a higher-level abstraction. A page-table-like translation is used to map virtual disk addresses to physical addresses.
(2) An incremental reconfiguration algorithm is proposed so that reads can continue while data redistribution occurs.
(3) A chained-declustering algorithm is used for reliable recovery and load balancing.
(4) A quorum-like protocol is used to achieve consistency.
Application:
The paper mentions that this work could be used as a building block for distributed file systems and database systems, and later work such as Frangipani did make use of it. I also believe that database systems should use it, but I have heard that DB people want to get away from the OS file system interface as much as possible: OS people have done too much general-purpose optimization for it, whereas DB people know what kind of optimization they need, and the OS preempts such opportunities. Power has become a hot topic now, but it seems that 15 years ago it was not a big concern; everyone envisioned "scalable" systems, but the deployments were not really big enough to hit the heat wall.
Posted by: Wei Zhang | March 18, 2010 01:54 PM
Petal aims at providing a highly available and high performance block storage system to clients by using distributed storage servers. Petal's design is scalable, tolerant to server faults, and is also manageable.
It consists of a set of distributed servers providing block level storage. Clients communicate with these servers via RPC. Clients see the system as a collection of virtual disks. A client-issued virtual address needs to be translated to the physical location where the data is stored. This indirection makes it easy to add or remove hardware, and handle faults efficiently. Petal also supports backups of virtual disks. These backups have a distinguishing epoch field in their ID, which makes them unique. Petal uses chained declustering to organize its data, which helps in cascaded load balancing and fault tolerance.
The main takeaways from the paper are the idea of distributed virtual disks, the virtual-physical translation mechanism, incremental reconfiguration and chained-declustering. Petal operates at block-level which keeps things simple, and its support for heterogeneous clients running different filesystems and applications is a huge advantage.
People say many problems can be solved using more levels of indirection, and this paper is faithful to that quote. I liked the overall design of Petal; I think it's simple and practical, and it offers a wide range of useful storage features. The other paper describes Frangipani, a distributed filesystem built over Petal. Frangipani machines share a common Petal virtual disk, and access to this Petal disk is regulated.
Posted by: Chitra | March 18, 2010 01:54 PM
The paper describes Petal, a distributed storage emulator that adds a degree of indirection to the normal operation of storage systems so as to transparently allow access to a distributed network of disks on multiple servers to a set of clients.
The problem that the paper is trying to address is one of creating a network of storage systems and servers that can behave coherently as a single large storage device with multiple drives partitioned in it, just as any physical device would be. It works at a level lower than a distributed file system and hence can be used to implement such a file system more easily by pushing down failure handling, replication and consistency checking to the lower level.
In summary, the paper describes many interesting aspects of this distributed virtual disk system. While the translation mechanism is interesting by itself, the addition of epochs, allowing the system to maintain multiple versions (at least 2) simultaneously with all reads/writes defaulting to the latest one, is a very elegant design. Secondly, the mechanism used to create these backups, which relies only on copy-on-write, is extremely fast and has almost no overhead, at the risk that an imminent failure of that very disk wipes out the data even though it had been marked for a snapshot. Very simple yet useful redundancy schemes are provided. Another concept introduced is that of chained declustering. While the similarities to RAID-style striping are obvious, what this scheme also allows is load balancing during failure, which is a not so obvious advantage, with the term chaining referring to the ability of the load from any server to be transferred to any other server.
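The epoch idea can be illustrated with a toy versioned block store. This is a sketch under assumptions, not Petal's implementation: the class `EpochDisk` and its methods are hypothetical, and the store keeps everything in one in-memory dictionary keyed by block number and epoch.

```python
class EpochDisk:
    """Toy virtual disk whose blocks are versioned by epoch number."""

    def __init__(self):
        self.epoch = 1
        self.blocks = {}  # (block number, epoch) -> data

    def write(self, block: int, data: bytes) -> None:
        # Writes always land in the current epoch; older versions stay untouched.
        self.blocks[(block, self.epoch)] = data

    def snapshot(self) -> int:
        """Freeze the current epoch and return it as the snapshot's id."""
        frozen = self.epoch
        self.epoch += 1  # no data is copied here, only the counter moves
        return frozen

    def read(self, block: int, epoch=None):
        """Return the newest version of `block` with epoch <= the requested one."""
        wanted = self.epoch if epoch is None else epoch
        for e in range(wanted, 0, -1):
            if (block, e) in self.blocks:
                return self.blocks[(block, e)]
        return None  # block was never written at or before that epoch


if __name__ == "__main__":
    disk = EpochDisk()
    disk.write(0, b"v1")
    snap = disk.snapshot()
    disk.write(0, b"v2")
    print(disk.read(0), disk.read(0, snap))
```

Taking the snapshot costs a counter increment; the copy happens lazily, only for blocks that are written afterwards, which is why the operation is cheap.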
The usefulness of such a system, working at a level lower than a distributed file system, is surprisingly large. While it seems to do what many distributed file systems would do, what it doesn't enforce is the naming and data access schemes of a file system. As a result, such a storage system can be used by any sort of data access service, whether a key/value store, a database or a file system, thus making this system ideal for data-center-based applications.
Posted by: Sanjay Bharadwaj | March 18, 2010 01:49 PM
Summary:
Petal is a distributed storage system, which aims to provide the abstraction of virtual disks over a pool of physical storage disks. This abstraction gives us some nice properties, and makes several online operations, such as load distribution, possible.
Problem:
Given a bunch of disks, is it possible to provide an abstract, reliable, reconfigurable storage system on top of them? Is it feasible to consider a distributed _storage_ system instead of a distributed _file_ system? What kind of performance could be obtained and what optimizations would be needed to make performance comparable to local systems?
Solution:
Petal cobbles together standard techniques for various sub-problems, such as Paxos, and epoch numbers. The chained declustering, and the "fencing" were the most interesting parts to me [ maybe simply because we haven't come across these techniques yet ]. The idea of incremental disk reconfiguration also seems very nice [ apart from the mechanism of fencing used to solve it ]. I would be very interested in how today's storage area networks and other data center technologies handle this.
Contributions:
Pushing reliability and other distributed system headaches down into the storage layer instead of the file system seems an innovative approach. If this was really the first paper that introduced this idea, I think it can be forgiven for the lack of originality in the rest of the paper, considering how important the idea is now for data centers and every huge-scale company.
Comments/Questions:
1. I fail to get an intuitive understanding of why Petal might give the same performance as a local disk, considering the replication that needs to be done and the indirection inherent in every access. If not for the other paper, Frangipani, where a system was actually built using Petal (seemingly), I would be skeptical about whether it actually worked.
2. The performance evaluation seems very weak. Any system that makes claims of scalability should be evaluated on at least a dozen nodes, not just 4. Also, I am wondering how much the fact that they used an ATM network matters to us. Will it make operations faster? Slower? Easier? All I can remember is that ATM uses fixed-size cells, which seems kind of irrelevant.
3. Was this actually meant as a full paper? The size of the paper and lack of detail in many areas makes me think this was a poster or an idea paper or something. In this case, my criticism above about evaluation is null.
Posted by: Vijay Chidambaram Velayudhan Pillai | March 18, 2010 01:49 PM
Problem Statement
Achieving high availability, fault tolerance, dynamic reconfiguration, load balancing and good performance in Storage Area networks using Virtual containers/Disks
Petal is a collection of network-connected servers that manage a pool of physical disks. It appears to the client as a highly available block-level storage system providing virtual disks. Petal has some nice properties: it can tolerate and recover from component failures, thereby providing high availability; it can be reconfigured on demand; it supports heterogeneous clients; and it provides uniform load and capacity throughout the system. The idea of virtual disks spanning multiple nodes is a powerful concept that allows Petal to have the above properties.
A Petal server consists of a global state manager which coordinates with a liveness module that follows a heartbeat-like protocol. The global state manager uses Paxos to maintain the global state of which servers are alive. Each Petal server also contains a virtual-to-physical translator that uses three data structures: the virtual directory, the global map and the physical map. A global map is a list of the physical servers across which a virtual disk's data is distributed. New global maps are created at reconfiguration. On reconfiguration, load is dynamically balanced by incrementally moving fenced data (a fence is a part of the disk) from the old servers to the new ones while continuing to serve requests.
Petal lets a virtual disk be configured with any kind of redundancy/replication scheme. Hence separate data access and recovery modules exist for each redundancy scheme. Periodically, virtual disk snapshots are taken and stored for recovery purposes. Read requests are routed to the replica with the shortest queue length. Write requests are sent to the primary replica, and the write to both primary and secondary is completed before the primary replies to the client. If the primary is down, writes are done at the secondary replica and marked as stale so that the primary can sync up when it comes back.
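The read and write routing described above can be sketched with a two-replica model. Everything here is an assumption made for illustration: the `Replica` class, the in-memory queue counter and the `stale_on_peer` set are hypothetical stand-ins, and real servers would do this over the network with persistent state.

```python
class Replica:
    """Hypothetical in-memory stand-in for one copy of a block range."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.queue_length = 0        # outstanding requests, used for read routing
        self.blocks = {}             # block number -> data
        self.stale_on_peer = set()   # blocks the other replica missed while down


def choose_read_replica(primary: Replica, secondary: Replica) -> Replica:
    """Route a read to the live replica with the shortest queue."""
    live = [r for r in (primary, secondary) if r.up]
    if not live:
        raise RuntimeError("both replicas are down")
    return min(live, key=lambda r: r.queue_length)


def write(primary: Replica, secondary: Replica, block: int, data: bytes) -> str:
    """Apply a write; return the name of the replica that acknowledges it."""
    if primary.up and secondary.up:
        # Normal case: both copies are updated before the primary replies.
        primary.blocks[block] = data
        secondary.blocks[block] = data
        return primary.name
    survivor = primary if primary.up else secondary
    if not survivor.up:
        raise RuntimeError("both replicas are down")
    # Degraded case: write one copy and remember that the peer is now stale.
    survivor.blocks[block] = data
    survivor.stale_on_peer.add(block)
    return survivor.name


def resync(recovered: Replica, survivor: Replica) -> None:
    """Bring a recovered replica up to date from the stale-block list."""
    for block in survivor.stale_on_peer:
        recovered.blocks[block] = survivor.blocks[block]
    survivor.stale_on_peer.clear()
    recovered.up = True
```

Tracking only the stale blocks means recovery copies just what changed during the outage rather than the whole range.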
Contributions
1. Petal was the first distributed block-level storage system with virtual containers or disks. Virtual disks can span multiple nodes.
2. Petal was the first system that supports transparent addition and deletion of nodes to existing storage containers in the face of arbitrary component and network failures. The system level performance of a single container scales gracefully when nodes are added.
3. It can survive any kind of single component failure: at the node level, at the disk level and at the network level.
Applicability to real world systems
Since this storage system provides the required distributed systems properties such as availability, fault tolerance and dynamic load balancing in the face of failures and reconfiguration of the network of servers, it can be used underneath a distributed file system. Virtual disks can be configured with any kind of redundancy scheme and can support heterogeneous clients, making it widely applicable in any storage area network.
Posted by: Ramya Olichandran | March 18, 2010 01:49 PM
Summary:-
----------
The paper presents the design of Petal, a distributed block-based storage system.
The main goals of Petal's design are scalability, availability, dynamic adaptation to failure, load balancing and fault tolerance. A Petal system is organized into a set of Petal servers, each of which handles some physical disks. As with many systems, Petal uses indirection to achieve its goals. Petal exposes virtual disks to users. Virtual disks are configured with the required level of redundancy and the set of servers over which they are to be placed.
Petal servers use a heartbeat protocol and majority consensus for group membership. A consensus mechanism based on the Paxos algorithm is also used to maintain global state (virtual-to-physical mapping tables, current virtual disks). Petal achieves backup by COW-based snapshotting using an epoch number in the translation tables. Petal allows dynamic virtual disk reconfiguration and the addition or removal of servers and disks. Currently, Petal supports a chained-declustered replication mechanism between servers. This allows more read bandwidth, chained load offloading, and good load sharing in case of failures. However, writes need to update both copies and so incur overhead. Petal servers also maintain a log and use journaling to achieve durability.
Problem Description:-
----------------------
While research has focussed on distributed file systems, block-based distributed storage provides an interesting alternative that readily integrates with heterogeneous environments and file systems and provides the benefit of distributed systems like scalability, fault tolerance, easier management.
Contributions:-
----------------
1) Idea of virtualized distributed storage.
2) Chained-declustered mechanism of block replication.
Relevance:-
------------
Petal feels like RAID formed across a distributed system. Petal allows easier design of distributed file systems like in the case of Frangipani. However, Petal incurs overhead in exposing virtual disks and blocks the propagation of disk-level information to the application.
My thinking is that it makes sense in case of a distributed system made of heterogeneous components but clusters are generally homogeneous. If so, Petal might not be used because of the additional overhead.
Posted by: Laxman Visampalli | March 18, 2010 01:49 PM
The problem targeted by Petal is to provide a distributed storage system that provides availability, reliability and consistency in the face of individual component failure. Petal achieves this by providing the clients an abstraction of virtual disks while it manages the underlying set of physical disks.
The system consists of a set of storage servers, each of which manages a set of disks. Clients are presented with virtual disks that they can configure (size, redundancy profile, etc.) according to their needs. The system internally manages the mapping from the virtual disks to the physical disks while providing load balancing, snapshot-based backup, mirror-based availability and fault tolerance. The idea of using chained declustering to increase availability and to provide load balancing seemed interesting to me. The system allows incremental addition of disks and servers. On addition of new hardware, it dynamically redistributes the data and the load onto the new set of disks by partitioning the data into old, new and fenced regions. It also mentions the use of the Paxos algorithm to maintain global state.
The system presented in the paper seemed to be a bunch of existing ideas stitched together in a novel way. However, the idea of virtualizing the disk itself seemed to be an interesting concept. Since Petal provides a distributed block-storage system, it could be used as the substrate on which other distributed file systems are built.
Posted by: Deepak Ramamurthi | March 18, 2010 01:31 PM
Problem Statement:
The problem that this paper addresses is that of providing a highly available block-level storage system that is fault tolerant and manages load distribution.
Summary:
PETAL, the block-level storage system described in the paper, consists of a set of servers and physical disks connected over the network that present any PETAL client with a view of a single highly accessible virtual disk that can tolerate disk, server and network failures. It employs a number of distributed computing techniques to achieve this, such as replication, load balancing to improve the response time of the system, and a consensus algorithm to agree upon which of the physical disks are currently available. The main difference between PETAL and the distributed file systems we have read about so far is that PETAL provides a block-level interface. So any file system that uses block-level calls can use PETAL.
Contributions:
Implementing a virtual disk is a novel idea. The paper clearly describes the processes and the state each server maintains to identify the actual physical disk for any operation.
-The chained-declustering algorithm replicates any piece of data between adjacent disks. Hence the backup servers can be easily identified for any piece of data. With this data replication mechanism the load can be distributed effectively in the event of a disk failure. To some extent the clients also help with the load balancing. Another advantage of using the chained-declustering technique is the geographic separation of the physical disks for additional fault tolerance.
-The snapshotting technique used to create the back ups.
-Incremental reconfiguration to add/remove the servers from the system.
Applicability:
The PETAL system offers a block-level interface to heterogeneous clients. While the results presented look convincing, I do not see how it is useful in a general deployment scenario.
Posted by: Neel kamal M | March 18, 2010 01:16 PM
Petal is a software approach to setting up a distributed file server. What sets it apart from other solutions is its use of virtual disks as a layer of abstraction to build features onto.
They describe the problem as follows: managing and setting up storage systems is expensive, time consuming, and complicated. They go on to say that failures often render the system unusable and take time to recover from.
Pegging cost-effective scalable networks as the main medium with which to construct a system that meets their design goals, they save on hardware by creating Petal as a software solution instead of designing any hardware. A pool of distributed storage servers works together to provide a virtual disk view to all the clients involved. My favorite part of Petal has to be the virtual-to-physical translation idea. I mean, this is already proven to be an effective way to manage memory, and disks are just non-volatile storage.
There are no glaring problems with real-world applications of Petal; I mean, they even demoed it themselves. However, their self-proclaimed security issue between clients would certainly need to be addressed. Writing it off as trivial is fine so long as it makes it into a real implementation. As a side note, reading about the hardware used in their implementation is a little weird. It's hard to relate to 4.3 GB SCSI drives and 225 MHz DECs.
Posted by: Jeff Roller | March 18, 2010 01:16 PM
Problem
The problem is to provide a highly available block-level storage system to clients in the form of virtual disks, while providing easy maintenance to administrators. The virtual disk gives clients the freedom to choose their own file system.
Summary
The system automates management of multiple servers and disks. There is a clear indirection layer: the system gives the client a virtual disk ID and hides everything else in the system. In the backend, it uses epoch numbers to create snapshots. Thanks to the indirection layer, it can support incremental reconfiguration, so that new servers and disks can join the network. It also provides high availability by storing two copies of the data on neighboring servers.
Contribution and comment
Petal took a radical approach at a time when most researchers were focused on distributed file systems such as AFS, NFS and Coda. Other than that, the design and implementation mix and match existing ideas.
One of the interesting things is that they evaluated on an ATM network, which was usually used in backbones, not in lab environments. If they had evaluated Petal on 10 Mbps Ethernet, they might have concluded that a distributed file system is better than theirs for practical usage. So I think it is useful for systems research to think outside of the current environment.
Applicability
SAN and iSCSI seem to be intellectual descendants of Petal. They share the same idea of providing a virtual disk to the client. Having a virtual disk instead of a file system has a clear advantage in today's virtualized server environments. However, I guess it was far less useful when Petal was first released than it is nowadays. Providing block-level storage has higher overhead than a network file system.
Posted by: MinJae Hwang | March 18, 2010 11:38 AM
Summarization:
This paper introduces the design and implementation of Petal, a block-level distributed storage system that provides large abstract virtual disks. Petal contains five different modules: a liveness module based on majority consensus that tracks the liveness of each server; a global state module based on the Paxos algorithm for ensuring consensus and progress of system state; a virtual-to-physical translator that translates virtual disk addresses to physical addresses on each server while reducing unnecessary communication; and a data access module and a recovery module that implement the chained-declustering algorithm for fault tolerance and load balancing.
Problem Description:
The problem this paper tries to solve is to design a highly available block-level distributed system that is fault tolerant and incrementally expandable, while its performance should be no worse than that of a single local disk system. Before Petal, there were numerous file systems that addressed some of these goals, as discussed in the related work section, but it seems that Petal is the first system that addresses all of them. Its fault tolerance, disaster tolerance, incremental reconfiguration, load balancing and backup/recovery support made Petal an important piece of file system work at that time.
Contributions:
One of the most appealing ideas to me is the incremental reconfiguration algorithm. Since the design of Petal carries the epoch number implicitly, it is easy to find which block of data is newer. Therefore, data redistribution can be achieved by moving data chunk by chunk, strictly in reverse order of epoch number. However, this is likely to double the communication time if a client wishes to read data during the redistribution phase. The improved algorithm refines the granularity of the address space and marks each range as old, new, or fenced. Only accesses to fenced data can incur the extra communication time. This modification solves the reconfiguration problem while maintaining acceptable performance.
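The old/new/fenced split can be sketched as a lookup routine. This is a simplification under assumptions: the function `lookup` and its parameters are hypothetical, the maps are plain dictionaries, and the fenced region is taken to be one contiguous range, which keeps the example short.

```python
def lookup(offset, fence_start, fence_end, old_map, new_map):
    """Resolve `offset` during a reconfiguration.

    Assumptions: data below fence_start has already been moved (new region),
    data at or above fence_end has not been touched yet (old region), and
    the half-open range [fence_start, fence_end) is currently being moved.
    Returns (location, number of translations tried).
    """
    if offset < fence_start:
        return new_map[offset], 1      # new region: the new map is authoritative
    if offset >= fence_end:
        return old_map[offset], 1      # old region: the old map is authoritative
    # Fenced region: try the new translation first, fall back to the old one.
    if offset in new_map:
        return new_map[offset], 1
    return old_map[offset], 2


if __name__ == "__main__":
    old_map = {0: "A", 1: "A", 2: "A", 3: "A"}
    new_map = {0: "B", 1: "B"}  # blocks 0 and 1 have been relocated so far
    for off in range(4):
        print(off, lookup(off, 1, 3, old_map, new_map))
```

Only a block that is fenced and not yet moved needs the second translation, so keeping the fenced range small bounds how many requests pay the extra cost.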
Another contribution is its chained-declustering algorithm for data recovery. The clever idea lies in the chaining data placement strategy. Such data placement not only allows disaster recovery (by putting even-numbered and odd-numbered servers in different geographic locations), but also allows load balancing. Each server can offload part of its load to its direct neighbors, and the cascading effect causes the load to be almost evenly distributed.
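The cascading effect can be worked out with a little arithmetic. The sketch below assumes an idealized model that is not taken from the paper: data set D_i has its primary copy on server i and its second copy on server i+1 (mod n), every D_i receives one unit of read load, and exactly one server has failed. The function name `read_shares_after_failure` is made up for the example.

```python
from fractions import Fraction


def read_shares_after_failure(num_servers: int, failed: int) -> dict:
    """Split read load evenly over the survivors of one failed server.

    Returns {server: (share of its own primary data it serves,
                      share of its predecessor's data it serves from the
                      second copy)}.
    """
    n = num_servers
    shares = {}
    for k in range(1, n):  # k = distance along the chain from the failed server
        server = (failed + k) % n
        primary_share = Fraction(k, n - 1)      # reads for D_server
        backup_share = Fraction(n - k, n - 1)   # reads for D_(server-1)
        shares[server] = (primary_share, backup_share)
    return shares


if __name__ == "__main__":
    for server, (own, pred) in sorted(read_shares_after_failure(4, 1).items()):
        print(server, own, pred, own + pred)
```

In this model the server right after the failed one must serve all of the failed server's data, so it hands most of its own primary reads to the next server, and so on around the chain; every survivor ends up with n/(n-1) units instead of one neighbor carrying double load.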
A third contribution is the idea of virtual disk itself. Using a virtual disk could easily allow applications to share the capacity of disks, and it has the ability to create volumes larger than a single disk.
Real Applications/Problems:
As indicated by this paper, Petal could be deployed in places where a fault-tolerant, globally accessible file system is needed. The incremental reconfiguration ability makes Petal very easy to maintain.
However, I still have several concerns about the performance of Petal. First, in order to guard against fencing off a heavily used range of the disk, the refined reconfiguration algorithm actually chooses non-contiguous ranges to copy each time. It seems that this will create a lot of fragmentation, which will probably degrade performance. Second, the primary and secondary copies need strict synchronization whenever a write is issued, which will to some extent lower write performance. Third, the design goal of Petal is to keep as much state on the servers as possible, while clients maintain only a small amount of mapping state. But other papers (like Dynamo) argue that in order to build a scalable system, clients need to do as much work as possible instead of the server. So in this respect, is Petal really scalable?
Posted by: Shengqi Zhu | March 18, 2010 10:26 AM
Problem Description:
This paper presents the design and implementation of the PETAL storage system. The main aim of the PETAL system is to provide an abstraction to the clients that each client has its own private disk to access and to provide high availability, better performance and easier reconfiguration of the storage system.
Summary:
The PETAL distributed storage system supports heterogeneous clients and provides the abstraction of a private virtual disk for each client, whereas internally it manages multiple disks to support multiple clients. PETAL stores all the state information on the servers and only hints on the clients, which makes the clients stateless. It uses a global state module and a liveness module to handle the challenges of distributed systems, and uses separate data access and recovery modules to handle data and system crashes. It also performs virtual-to-physical address translation using a virtual directory, a global map and a physical map. It supports backup using an epoch-based algorithm and provides incremental reconfiguration of disks. Internally it uses chained declustering, in which each data block is replicated on different servers so that it can tolerate server failures; by cascading the offloading of existing requests among all the other available servers, it also provides better performance.
Contributions
1) By separating the storage system into block-level storage system and filesystem, the scalability and maintenance can be improved
2) The complete state information is maintained on the servers, which keeps the client thin and makes failures and recovery easy to handle
3) By using chained declustering, it provides high availability of data and better performance
4) The incremental virtual disk reconfiguration scheme provides better performance eventually without affecting the current performance.
5) PETAL uses separate modules each performing its own function separately which is better in terms of maintenance.
Comments and applicability:
The things that I felt were novel in the paper were providing a distributed storage model with a filesystem on top of it, rather than a distributed file system, and the idea of chained declustering to provide better performance than a simple replication model. Other than these two ideas, the others were pretty obvious. Also, I feel that having two levels of redirection might affect performance. The proposed system appears to be better in terms of maintenance and scalability, but I am not sure whether it can perform better.
Posted by: Raja Ram Yadhav Ramakrishnan | March 18, 2010 09:10 AM
Goal:
Petal provides distributed fault-tolerant storage through a block-level interface. This allows any file system on a client machine to utilize this storage. Additionally, chained-declustering is used so that load can be balanced when a failure or reconfiguration occurs.
Problem:
Distributed file systems are normally built by developing a complex file system that uses replication on top of local disks. Additionally, these file systems have different policies that suit different kinds of tasks, and most sites can only adopt one of them due to administration overhead.
Contribution:
Petal exposes its distributed storage through a block-level interface. This allows a client machine to use any file system to manage Petal storage. The client storage driver contacts Petal servers using RPC over TCP/IP.
Internally, Petal uses Paxos to maintain consistency for its global state information. It separates the mapping into two levels. Clients only access data through a virtual disk and offset. The global table maps a virtual disk to physical servers. The local mapping on each server provides an additional mapping onto the physical disk.
Chained-declustering is a simple replication technique, but it has the good property that adjacent nodes hold overlapping data. This allows the system to propagate and distribute load when a node fails. A read request can be serviced by the primary or secondary replica, but a write must go through the primary in order to obtain the lock.
Storage performance and consistency are achieved by using a write-ahead log on a dedicated disk. It also provides snapshot capability by using a copy-on-write mechanism. Performance on small reads/writes is comparable to using a local disk. The major overheads of the system are due to transfer time and checksum calculation.
Application:
Although Petal is distributed storage, its architecture is a bit centralized because it requires dedicated servers to act as storage machines. However, this kind of setting is very useful in a data center environment because it is flexible in terms of management and usability. Additionally, the network will not be a bottleneck in a LAN setting. The follow-up work, Frangipani, also shows that a distributed file system can be built more easily on top of Petal than a traditional distributed file system. However, it still needs to be deployed in a trusted environment because of the lack of a security model below virtual disk granularity.
Posted by: Thawan Kooburat | March 18, 2010 08:37 AM
Problem :-
To provide a scalable, distributed storage system that is tolerant to individual failures while providing a highly available and easily manageable set of virtual disks on the top of several physical storage devices.
Summary :-
The paper presents a distributed storage system that provides virtual disks to clients using a collection of physical disks. The paper aims to provide a system that tries to minimize the complicated issues faced while managing large storage systems. The proposed system automates the device management process for the user while providing load balancing, fault tolerance to failures and high availability through replication. Physical devices can be easily added/removed from the system while trying to ensure gradual performance change in such scenarios.
Contributions :-
The system provides a scalable storage system in the form of virtual disks on which multiple file systems can be implemented transparently without bothering about the underlying physical devices. So, it provides a useful building block to system designers. I partly felt that it was like the Dynamo paper where a bunch of interesting ideas were combined to form a novel system. It provides an example of an application of the Paxos protocol for maintaining the global state. It discusses techniques for virtual to physical mapping while minimising the network communication. It uses epoch numbers to create snapshots incrementally while ensuring that writes occur on the latest copy. The use of "old, new and fenced" regions for reconfiguration was interesting. It uses chained-declustering for load balancing and tolerance to failures.
Applicability to real systems :-
The Petal concept is useful for cases where storage needs to be scaled flexibly according to changing requirements. Petal can be used as a building block for distributed file systems that are easy to scale and maintain. The Frangipani system uses Petal as a basis to build a scalable distributed file system. But, most of the individual techniques presented in the paper have been discussed elsewhere too.
Posted by: Ashish Patro | March 18, 2010 08:04 AM
This paper presents the design and implementation of PETAL, a storage system designed specifically for distributed systems. PETAL offers a uniform view of virtual disks to all users, with fault tolerance and dynamic load balancing.
PETAL is designed for distributed storage. The problem PETAL tries to solve is how to hide the heterogeneity of clients’ file systems and provide a global view of virtual disks with high availability and fault tolerance. To be scalable, the system needs to make it easy to add or remove storage. Meanwhile, users also prefer simple administration and easy configuration.
In PETAL, the servers maintain most of the state and clients only maintain hints. If a majority of the servers are up and able to communicate, the system is assured to recover from random server failures and network disconnections. PETAL maps a virtual ID to a global ID. The server is identified through the global map. The server translates the global ID into a physical disk and real offset using its physical map. PETAL offers fault tolerance through chained-declustered data access. It separates odd and even servers to tolerate site failures. The two neighboring servers of a server together hold all of its contents, so it is easy to recover from server failures.
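The majority condition mentioned above can be written as a one-line check. This is only a sketch of the counting rule, not of Petal's liveness or consensus protocol, and the function name `has_majority` is hypothetical.

```python
def has_majority(reachable, total_servers):
    """True if the reachable servers form a strict majority of the cluster."""
    return len(reachable) > total_servers // 2
```

With four servers, a partition of two on each side leaves neither side with a majority, so under this rule neither side could update the global state until connectivity returns.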
This distributed storage system fits the demands of real-world applications. It offers a uniform block-level interface for heterogeneous machines with all kinds of file systems. PETAL keeps redundant data for availability and recovery. I like the general idea; however, I would have liked the authors to say more about how the different functionalities work together. If the authors had explained the approach to scaling more clearly, such as how to add or remove components, the work would have done better at convincing me that it is applicable.
Posted by: Lizhu | March 18, 2010 07:39 AM
Summary:
This paper introduces Petal, a block-level virtualization platform which provides clients with large, scalable, fault-tolerant virtual disks. Petal has a three-layer structure: clients only see virtual disks; the virtual disks are created from a storage pool; the pool consists of physical disks. Besides, Petal also provides COW snapshots to support consistent backup.
Problem:
I think this block-level virtualization is a typical storage device in a SAN environment. This separation of the block-level storage system and the file system enables it to support different types of file systems and heterogeneous clients and applications.
Contributions /novel idea
1. Delayed allocation of physical resources: this lazy allocation mechanism could improve the utilization of disk space.
2. A virtual disk can provide higher capacity than a single physical disk. The adjustment of capacity can also be more flexible.
3. A simple block-level interface that is easy to use and highly scalable.
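The lazy allocation in point 1 can be illustrated with a small in-memory Python sketch. It assumes fixed-size chunks that are allocated on the first write and that unwritten regions read back as zeros. The class name, the default chunk size and the zero-fill behaviour are assumptions made for illustration, not details confirmed by the paper.

```python
class SparseVirtualDisk:
    """Toy virtual disk that allocates backing chunks only when written."""

    def __init__(self, chunk_size=64 * 1024):  # hypothetical chunk size
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk index -> bytearray, created on first write

    def write(self, offset, data):
        for i, byte in enumerate(data):
            index, within = divmod(offset + i, self.chunk_size)
            chunk = self.chunks.get(index)
            if chunk is None:  # first write to this chunk: allocate it now
                chunk = self.chunks[index] = bytearray(self.chunk_size)
            chunk[within] = byte

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            index, within = divmod(pos, self.chunk_size)
            chunk = self.chunks.get(index)
            out.append(chunk[within] if chunk is not None else 0)
        return bytes(out)

    def allocated_bytes(self):
        return len(self.chunks) * self.chunk_size
```

A two-byte write allocates one whole chunk, so a larger `chunk_size` wastes more space per small write. That is the space versus I/O trade-off raised in the comments below.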
Practical Applications/ Comment:
1. Compared with LVM, I could only find two advantages for Petal: A. The delayed allocation mechanism saves space. B. It provides a choice of chunk sizes for virtual disks. Since disks are now much cheaper, it seems only the second point still matters. I think “Frangipani” has used this functionality to optimize the disk layout. But the authors should provide more test results with different chunk sizes, because a larger chunk means higher I/O performance but may waste space, while a smaller one saves space but reduces I/O performance.
2. I am wondering how Petal would solve this problem: suppose we have created some virtual disks and stored some data in them. After some time we delete one of them and create a new one. The deleted virtual disk frees some space on the physical disks, and since the other virtual disks are still in use, the physical space allocated to the new disk will be heavily fragmented. If the user keeps creating and deleting, all the virtual disks’ mapped physical chunks will become fragmented, which causes a significant loss of I/O performance. Does Petal have a solution for this problem?
3. I am also curious about the implementation in Petal: if we want to fake a block device in the Linux kernel, we need to create a gendisk structure with a queue. The queue is used to convert the logical address to a physical device address by modifying bio->bi_bdev and bio->bi_sector and forwarding the bio to the real block device. This work is done in the generic block layer. Then the bio reaches the elevator algorithm for that physical device. Since different virtual disks may forward their bios to the same physical disk at the same time, I’m wondering whether there is any mechanism in Petal to optimize the I/O by enhancing spatial locality before submitting to the elevator. Otherwise, the elevator algorithm may not work very well.
Posted by: aoma | March 18, 2010 07:14 AM
The paper describes Petal as a distributed system with fault tolerance and scalability in data storage. The main motivating force for Petal’s designers might be the challenges of administering large-scale distributed file systems and the added complexity of scaling servers as well as storage space.
Petal design is aimed at providing a distributed storage system that can tolerate component failures in a manageable manner with balanced distribution at geographical locations along with load balancing.
Petal implements the virtual disk concept with a large and sparse virtual space. Disk storage is provided on demand, and the system makes it accessible to all file servers over the network. A scalable interconnection network is created in the system, and virtual disks are implemented by cooperating processing units attached to ordinary disks. This sort of design makes it easy to incrementally expand Petal’s underlying storage, and data can be mirrored over multiple servers when required. Snapshots are organised by epoch number, and the server keeps track of this vital information since the client does not have any information about it. Petal uses a lazy allocation scheme in which space is not allocated until a write operation. Petal provides a block-level interface, so existing software that works at the block level can be used rather than developing new software.
The general observation is that the system works well for small reads and writes, but performance degrades for larger block sizes. Write operations have much higher latency than reads, and the wait time for logs is also considerable.
Most contemporary file systems assume that they have sole access to disk volumes, while Petal considers the case of multiple clients accessing the same disk volume concurrently. Petal’s design gives the image of a distributed storage system rather than a distributed file system. It appears to the client as a collection of virtual disks. The system is easier to model, design and implement due to the flexibility associated with a block-level interface in comparison to file systems. Snapshots act as immutable copies of virtual disks. Petal’s mechanism for transparent addition and removal of servers is quite commendable.
Posted by: y,Kv | March 18, 2010 06:29 AM
Summary
The authors implemented a distributed storage system which abstracted a set of physical disks into one or more virtual disks. The virtual disks appear as a regular block writable storage space but have many beneficial attributes due to the distributed implementation, such as fault tolerance, expandability and scalability, and load balancing, among others. The authors later developed a distributed file system on top of Petal (Frangipani) discussed in a later paper.
Problem
The aim was to develop a highly available, fault tolerant storage system with the capacity for “unlimited” performance and capacity. Usefulness was also a key consideration: competitive latency and bandwidth, and ability to use it without requiring application changes.
Contributions
Petal provided one of the first (if not the first) distributed storage systems that handled failures at the disk, network, and node level. Most other distributed storage/file systems handled a subset of failures and/or assumed some higher degree of reliability (e.g. reliable networks). Noteworthy is the online scalability where Petal servers and disks could be added without taking the system down and with (presumably) minimal impact to the clients through incremental redistribution.
Applicability
I think there are some choices made that are still relevant today and/or were good choices in general. The fact that the storage system is block writable means it can be used with most existing file systems and virtually all applications unmodified (often an important consideration if the technology would like to gain widespread acceptance). Petal provides numerous benefits transparently (e.g. fault tolerance) much like RAID, without exposing complexity to the user. The choice to do copy-on-write snapshots can be seen in some newer file systems (Sun’s ZFS), which I believe to be a useful tool.
Posted by: Jesse Benson | March 18, 2010 06:26 AM
Summary:
The paper points out that the maintenance of large storage systems is complicated and requires a lot of manual monitoring, moving, and partitioning of files and directories. Petal is their solution, and its virtual disks separate the clients' view of storage from the physical resources used to implement it. This makes replication easier and thus achieves fault tolerance and high availability. The system is implemented using several modules. The virtual-to-physical translator module maps the clients' view to the physical view, the data access and recovery modules control data distribution, and the global state module, with the help of the liveness module, guarantees continuous and consistent operation of the system even in case of transient failures. Petal provides a block-level interface, and this helps it handle heterogeneous systems gracefully. Software solutions are preferred because of the fault tolerance they provide, although hardware solutions provide better performance. Read and write latencies of the system are slightly larger than those of a locally attached disk, but the system scales linearly with the addition of servers and degrades gracefully in case of server failures.
Contributions:
Support for heterogeneous environments. A clear distinction is made between the storage and application levels. The lower level (storage) abstracts away the complexities involved in the distributed system and makes the job of the upper levels (applications) easier. It supports all kinds of applications: file systems or databases.
Availability vs reliability trade-off: Chained de-clustering helps load balancing and isolates failure as opposed to reliable mirroring techniques.
Support for incremental scalability by splitting address range into old, new and fenced.
Applicability:
Implementing fault tolerance in software offers good fault tolerance and makes maintenance easier. Features such as incremental scalability, graceful degradation and support for heterogeneous systems look impressive. But I think it forces some of its design choices on upper levels and is not very flexible. For example, it forces block-level access control, and so, as the Frangipani paper points out, implementing file-level locking requires the extra overhead of implementing a distributed lock server, which requires communication across servers and can increase network contention. Also, I think performing two lookups (one in the global map and one in the local physical map) for every access is inefficient, although it helps reduce the size of the global map. Maybe a TLB-like data structure could be used to speed up access to recently visited blocks. But there would be the complication of invalidating TLB entries on reconfiguration or failures.
Posted by: Satish Kotha | March 18, 2010 06:23 AM
Summary:
This paper describes a fancy distributed storage architecture: Petal. Built from a pool of physical disks, Petal appears as a highly available block-level storage subsystem that provides large abstract containers called virtual disks. It can both exploit the entire capacity and performance of the underlying physical resources and offer the flexibility and scalability of incorporating additional servers and disks into Petal.
Problem:
The problem which Petal addresses in the paper is how to provide highly available block-level storage, partitioned into virtual disks, by coordinating a set of physical disks and servers.
Contribution:
- Petal maintains all state on servers, and hints on clients.
- It uses a liveness module to ensure that all servers agree on the system’s operational status. It uses majority consensus and periodic exchanges of “I’m alive”/“You’re alive?” messages.
- Data access and recovery modules: Control how client data are distributed and stored. Data can be simply striped without redundancy, or mirrored using chained declustering, which distributes the mirrored data in a way that balances load in the event of a failure.
- The authors implemented an address translation module which converts virtual addresses to physical addresses.
- Backup: Petal uses snapshots for backup. Snapshots are immutable copies of virtual disks and are created using copy-on-write.
- Incremental reconfiguration: Used to add or remove servers and disks
- Petal provides a set of block-level storage interfaces which makes the system easy to model, design, implement and tune. In addition, block-level interface is useful for supporting heterogeneous clients and applications.
- Chained-declustering: Instead of a simple mirrored redundancy scheme, Petal uses chained-declustering, which distributes the data to the neighboring nodes in the chain.
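The copy-on-write snapshot item in the list above can be sketched as follows. The Python toy assumes that every block version is tagged with the epoch in which it was written and that taking a snapshot just advances the epoch. The class and method names are hypothetical, and this is an illustration of the idea rather than Petal's actual data structures.

```python
class SnapshotDisk:
    """Toy block store with epoch-tagged, copy-on-write block versions."""

    def __init__(self):
        self.epoch = 0
        self.versions = {}  # block number -> {epoch: data}

    def write(self, block, data):
        # A write after a snapshot creates a new version under the current
        # epoch and leaves the versions of older epochs untouched.
        self.versions.setdefault(block, {})[self.epoch] = data

    def snapshot(self):
        """Freeze the current epoch and return it as the snapshot id."""
        frozen = self.epoch
        self.epoch += 1
        return frozen

    def read(self, block, epoch=None):
        """Read the newest version whose epoch is <= the requested epoch."""
        if epoch is None:
            epoch = self.epoch
        history = self.versions.get(block, {})
        visible = [e for e in history if e <= epoch]
        return history[max(visible)] if visible else None
```

Reading with the id returned by `snapshot()` keeps returning the frozen contents, while reads without an epoch see the latest write.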
Benchmark:
The authors measured the latency and throughput of a virtual disk using some micro-benchmarks. In addition, they used a modified version of the Andrew Benchmark to test the performance of some basic operations at the file system level. I think the benchmarks and results make sense. That said, I think it would be more persuasive if they added some benchmarks involving real workloads such as a mail server or OLTP.
Questions:
- Building a block-level storage system is simpler and more flexible than an FS-level system. However, some companies like IBM and EMC already provide high-performance block-level storage subsystems. I didn’t see any way in which Petal is superior to those commercial products. Could the authors make some comparison here?
- The Petal system uses an ATM network to connect clients to the virtual disks. What’s the difference between ATM and SAN for such a connection?
Posted by: Deng Liu | March 18, 2010 05:48 AM
Problem addressed:
This work tries to provide the illusion of a highly available, globally accessible block-level storage device that can survive failures.
Summary:
The authors propose to build a single block-level storage system out of a pool of distributed storage servers. The client is presented with a view of a collection of virtual disks. These virtual disks are then mapped onto physical disk(s) via a three-level mapping. The first level of indirection (called the virtual disk directory) maps the virtual disk identifier to a global map identifier and the associated redundancy level. The global map then determines the server responsible for translating the given offset. There are generally two such servers, a primary and a secondary, to tolerate failures. Finally, the selected server uses its local physical map to translate the global map identifier and offset to a physical disk and offset. The virtual disk directory and global map are system-wide data structures and are kept consistent in the presence of faults via a Paxos-like distributed quorum-based protocol. It uses epoch numbers to tag the versions of snapshots to support taking quick backups. Another interesting aspect of the system is the chaining of data placement on the servers, where neighboring servers replicate data, and thus on a server failure its nearby servers (in terms of logical numbering) share the responsibility of serving the data of the failed server.
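The three lookups described above can be sketched as one Python function. This is a simplified illustration that assumes a fixed translation granularity and round-robin assignment of chunks to servers. `CHUNK_SIZE`, `translate` and the dictionary layouts are hypothetical, and epochs and failure handling are left out.

```python
CHUNK_SIZE = 64 * 1024  # hypothetical translation granularity, in bytes


def translate(vdisk_id, offset, vdir, global_maps, physical_maps):
    """Resolve (virtual disk, offset) to (primary, secondary, disk, offset).

    vdir:          vdisk id -> (global map id, redundancy level)
    global_maps:   global map id -> ordered list of servers
    physical_maps: server -> {(global map id, chunk): (disk, disk offset)}
    """
    # Level 1: the virtual disk directory names the global map to use.
    gmap_id, _redundancy = vdir[vdisk_id]
    # Level 2: the global map picks the servers responsible for the offset.
    servers = global_maps[gmap_id]
    chunk, within = divmod(offset, CHUNK_SIZE)
    primary = servers[chunk % len(servers)]
    secondary = servers[(chunk + 1) % len(servers)]
    # Level 3: the chosen server's physical map gives the disk location.
    disk, disk_offset = physical_maps[primary][(gmap_id, chunk)]
    return primary, secondary, disk, disk_offset + within
```

Only the first two tables are global state. The third is local to each server, which is why the global map can stay small.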
Short summary:
This paper proposes to build virtual disks out of a distributed cluster of physical disks to provide better availability in the presence of failures. Basically, it uses replication and global indirection to read/write data in the presence of possible failures. It makes use of the Paxos protocol to keep a globally consistent view of the mappings from a user-supplied virtual disk id and offset to the actual physical disk and offset.
Relevance:
It may be of some relevance for providing a reliable block storage device. Different applications like a distributed file system can be built on top of Petal (e.g. Frangipani). I think applications like databases can also treat it as a reliable block storage device for the purpose of having reliable persistent storage. But I found the paper itself not that novel or interesting to read. It provides only limited reliability: because of the chained placement, if either of the two neighboring servers of a failed server also fails, there is a possibility of data loss.
Posted by: Arkaprava Basu | March 18, 2010 04:26 AM
Summary and Description:
The paper presents a distributed storage system. The system restricts itself to providing a view of virtual disks, and file system abstractions are expected to be built on top of the virtual disks. The system achieves fault tolerance and load balancing through replication, and each virtual disk can be configured to provide a different level of replication. Adding new hardware to expand the system and reconfiguring the degree of replication can be achieved transparently. Along with the usual goals of fault tolerance, load balancing and heterogeneity that inspire many distributed systems, the system proposed in the paper tries to reduce the level of manual intervention required to manage different components.
Contributions:
The paper does not provide any novel ideas. Rather, the main contribution of the paper is combining a set of existing implementations to provide a solution to the problem of virtual disks over a distributed system. In addition, I believe that the level of abstraction the paper tries to introduce (that of distributed virtual disks, instead of file systems) was innovative.
Thoughts:
The system in the paper has been used in some distributed file systems (Frangipani), and so can be considered useful. The design presented in the paper can be used as a driving guideline for other similar systems built in the future. However, it seemed to me that the paper did a pretty bad job of explaining how the different components connect together to form the entire system.
Posted by: Thanumalayan Sankaranarayana Pillai | March 18, 2010 01:53 AM
Petal describes a distributed file system that has performance scalable with the hardware. As such it distributes the meta data and provides a bunch of useful features. Other than that this paper just describes the implementation problems and optimizations.
Problems with complexity and administration costs make distributed file systems a pain to deal with. Petal tries to automate most of the administration to solve this problem. Simply handling the physical file systems as a part of the virtual petal file system solves the problem of complexity. Petal therefore solves these problems fairly well.
The main contribution of this paper is in setting up a system that is not as fragile as AFS. The data is stored in a redundant manner, so that failures can be tolerated. And the meta data is distributed so nothing in the system is centralized, allowing for great scalability. The performance measurements are weak, so it is hard to tell whether they accomplished something miraculous. Further measurements could show that the performance at scale is also a great contribution.
I see the four server test as very non-indicative of what the system should support. This paper lays out the idea that the system will scale well, but without even trying eight nodes they may not realize a problem with the design. I see that this is an affront to our own project where we will be simulating a distributed system with but four nodes, but that is for a class project that will most likely be whipped up just several nights before the due date. Petal is trying to pass off their results as worthy without a solid test. Maybe I'm being harsh, but at least Frangipani used seven and plans to go to 100.
Posted by: Jordan Walker | March 18, 2010 01:37 AM
Summary:
This paper presents Petal, a distributed block-level storage system. Petal enables fault tolerance, load balancing, and incremental reconfiguration for heterogeneous environments. It is by providing a block-level interface rather than a file-level interface that Petal is able to gracefully handle heterogeneity in the system. The Petal authors also argue that a distributed file system can be built on top of this network of distributed disks.
Problem:
I think the problem that Petal really tries to solve is enabling cost-effective and scalable storage systems in the presence of heterogeneity. The virtualization concepts come in handy when separating the client's view of storage from the physical resources that are used to implement it. Effectively, this layer of abstraction allows the physical resources to change (fail, grow, shrink) without affecting the clients.
Contributions:
The authors claim that their contribution is a novel combination of already existing ideas for providing a storage system. Certainly none of the components presented in this paper are novel by themselves, and even the combination I thought was pretty standard. I thought chained-declustered data access and recovery modules were the most interesting. I was not sure for what duration the read access locks the block and whether read accesses to the same block are serialized.
Comments:
I am torn between saying that this paper is reinventing the wheel and saying that this paper really solves the heterogeneity problem of distributed systems by simply enforcing distribution at the right level of abstraction. It would be nice if the authors described how they envision a distributed file system being built on top of this distributed virtual disk network, and how this compares to the existing state of the art distributed file systems. This would have certainly made their case more plausible. Other than that, I think that the concepts and techniques used are very similar to what we have already seen for distributed file systems.
Posted by: Polina | March 18, 2010 12:23 AM