Petal: Distributed Virtual Disks
E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proc. 7th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pages 84--92, October 1996.
Reviews due Thursday, 3/22.
Comments
Summary:
Petal provides block-level storage to upper-layer clients such as file systems and databases in the form of virtual disks. It comes with a set of nice features: global (concurrent) accessibility, high availability, scalability, tolerance of any single component failure, support for heterogeneous clients, dynamic load balancing, etc. These are achieved by its virtual-to-physical mapping mechanism, copy-on-write snapshots, and, most importantly, its chained-declustering data placement scheme.
Problem:
Prior file and disk systems achieve only a subset of the following goals: a globally accessible storage system that can tolerate any single component failure, balance load, be deployed incrementally, span geographically distributed sites, and remain easy to administer. Petal tries to approximate all of these goals in practice.
Contributions:
+ A software-RAID-like storage system that does not rely on any specialized hardware; node and network failures are handled in software.
+ Clients need no changes other than loading the Petal device driver. From the client's point of view, Petal provides a standard block-interface storage system.
+ Petal can support heterogeneous clients on top of it.
+ Petal provides a mix of nice features such as single-component fault tolerance, fast snapshots, hot addition/removal of physical resources, and reconfiguration of virtual disks.
Flaws:
- Clients such as file systems have no control over the actual physical disk layout, and the paper says little about the mapping mechanisms used.
- Scalability is evaluated on only up to four nodes.
- Write latency could be high, since both the primary and the secondary copy must be written.
- In the chained-declustering replication scheme, every block has two replicas. When sites are geographically distributed to tolerate site failures, it would be better to take the distance from the client to the server into account when selecting the replica for reading. Writes may then be more expensive, since a request may travel a long way to reach the primary replica first.
- The paper does not come up with a good dynamic load-balancing scheme or data redistribution mechanism. It may be helpful to consider consistent hashing and the LARD ideas to see whether they could be adapted here.
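To make the consistent-hashing suggestion concrete, here is a minimal sketch under my own assumptions (the server names, vnode count, and hash choice are all hypothetical); the point is that adding or removing a server only remaps the blocks nearest to it on the hash ring, rather than reshuffling everything:

    import hashlib
    from bisect import bisect

    def _hash(key: str) -> int:
        # Any uniform hash works; md5 is just convenient here.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    class ConsistentHashRing:
        """Maps block IDs to servers so that churn only remaps nearby keys."""
        def __init__(self, servers, vnodes=64):
            # Each server gets several virtual points to smooth the load.
            self.ring = sorted(
                (_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
            self.keys = [h for h, _ in self.ring]

        def server_for(self, block_id: int) -> str:
            # Walk clockwise to the first virtual point at or past the hash.
            i = bisect(self.keys, _hash(str(block_id))) % len(self.ring)
            return self.ring[i][1]

    ring = ConsistentHashRing(["srv0", "srv1", "srv2", "srv3"])
    print(ring.server_for(42))  # one of the four servers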
Applicability:
The idea of using software solutions on ordinary hardware whenever possible, instead of expensive specialized hardware, is widely used today. This reduces overall cost, makes off-the-shelf components easier to use, and makes systems easier to scale. For example, in cluster computing, cheap commodity machines are connected to form a powerful virtual machine instead of relying on a huge supercomputer or mainframe. At the same time, failures are taken into account in the design of software and application frameworks. For instance, tasks in MapReduce are relatively short and independent of each other, so a failed task is not a big deal: re-executing it will not significantly prolong the job completion time.
The paper also analyzes Petal's performance bottlenecks, e.g., throughput is mostly limited by CPU overhead. With up-to-date hardware, it would be interesting to see what the bottleneck is now.
Posted by: Yizheng Chen | March 22, 2012 12:45 PM
This paper describes a distributed storage system developed at DEC called
Petal. Petal provides a SAN-like network-attached block-layer storage
interface which offers features including fault tolerance, load balancing,
flexible configuration, and efficient creation of read-only volume snapshots;
it also performs at a level quite comparable to a (contemporary) local disk,
in both access latency and throughput.
Petal employs a clever data layout called chained-declustering to provide
fault tolerance and load balancing. The servers are arranged in a virtual
ring with each block of data stored at two neighboring nodes on the ring, such
that half of each node's data is shared with its left-hand neighbor and the
other half is shared with its neighbor to the right. Thus when a node fails,
its load can be split evenly between its two neighbors; the excess load on
those neighbors can then in turn be shared with *their* neighbors, and so on
round the ring, ultimately leading to an evenly-distributed load across the
system (assuming data blocks are accessed in an evenly-distributed fashion,
which is a bit unrealistic).
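To make the layout concrete, here is a minimal sketch of chained-declustered placement (my own reconstruction; the paper gives no code): block b's primary copy lives on server b mod N and its secondary on the next server around the ring.

    def chained_declustering(block: int, n_servers: int):
        """Return (primary, secondary) server indices for a block in a
        ring of n_servers, following the layout described above."""
        primary = block % n_servers
        secondary = (primary + 1) % n_servers  # right-hand neighbor
        return primary, secondary

    # With 4 servers, block 6 maps to primary 2, secondary 3. If server 2
    # fails, reads of its primary blocks fail over to server 3, while the
    # blocks it held as secondaries are still served by server 1, so the
    # extra load splits between both neighbors.
    print(chained_declustering(6, 4))  # (2, 3)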
One of Petal's biggest weaknesses from my perspective is reliability (or
perhaps, more precisely, durability). While the chained-declustered
organization is clever and has very nice load-balancing properties, as they
point out it increases the likelihood of a two-node failure leading to data
loss, doubling it from 1/(N-1) to 2/(N-1) for a system of N nodes. At their
scale of N=4, a two-node failure doesn't seem hugely likely, but the 2/3 odds
of it leading to data loss are still somewhat unnerving. Additionally, Petal
does not seem to support replicating a given data block to any more than two
nodes, which would become more problematic at larger scales and thus higher
failure rates (I believe systems like HDFS usually replicate all data to three
nodes by default). And finally, though I don't see any specific bottlenecks
that would prevent this, declaring "it'll scale linearly" based on experiments
with configurations ranging from two to a whopping *four* nodes isn't quite as
convincing as one might like.
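As a quick sanity check on those odds (a throwaway calculation, not from the paper): after one node fails, a second failure loses data only if it hits one of the first node's two ring neighbors, versus the single mirror partner in a simple mirrored-pairs scheme.

    from fractions import Fraction

    def loss_odds_chained(n: int) -> Fraction:
        # The second failure must hit one of the 2 ring neighbors of the
        # first failed node, out of the n - 1 survivors.
        return Fraction(2, n - 1)

    def loss_odds_mirrored(n: int) -> Fraction:
        # Only the single mirror partner causes data loss.
        return Fraction(1, n - 1)

    print(loss_odds_chained(4), loss_odds_mirrored(4))  # 2/3 vs. 1/3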
Posted by: Zev Weiss | March 22, 2012 09:56 AM
Victor Bittorf, Igor Canadi, David Capel, Seth Pollen
Petal is a distributed block storage system that provides a foundation on which a distributed file system can be transparently constructed. It is a key-value store in the sense that it maps short keys to values (e.g., block numbers to blocks), but it provides a different set of expectations and priorities than a system like Dynamo. Its goals are high performance, scalability, strong consistency, availability, and reliability in the context of a file system. To some extent, it accomplishes all of these things. It differs from other distributed file systems in that it does not implement the actual file system logic itself and instead allows any file system to be run transparently on top of it.
The basic building block and primary contribution for Petal is the virtual disk -- this is another layer of indirection that allows them to offer other features and abstract away some of the realities of disks and machines. Specifically, it allows transparent addition and deletion of disks and it can survive the failure of a node. Furthermore, it allows reconfiguration of system parameters on the fly and the movement of data between physical machines and disks. Additionally, virtual disks allow the system to stripe the data across machines, giving it higher performance characteristics and, at least in their tests, a near-linear speedup. Virtual disks also give significant power to management tools that would not otherwise be available; some large portion of their API is dedicated to that specifically. Snapshots are also efficiently supported by virtue of copy-on-write.
Consistency differs from other systems we have looked at in that Petal must act like a physical disk, which is very strongly consistent. Petal uses Paxos to keep its global management state consistent, and it blocks writes until both replicas have been updated. This synchrony is useful in some ways, but, at least in a modern system with modern bottlenecks, it would strongly decrease performance, especially when a write-ahead log is already in use. Forcing two synchronous writes is especially bad because you must wait for the slower of two random seeks on disk. It seems that committing to a write-ahead log and a single server (out of the two replicas) before returning success to the client would be sufficient to make the necessary guarantees, even in the presence of failure.
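For reference, here is my reading of the primary/secondary write path as a hedged sketch (the server stubs and method names are invented); the acknowledgment waits on both replicas, which is exactly the cost objected to above:

    def replicated_write(block_id: int, data: bytes, primary, secondary):
        """Sketch of a write that blocks until both replicas are updated.
        `primary` and `secondary` are hypothetical server stubs."""
        primary.set_busy(block_id)           # recovery marker for crashes
        primary.write(block_id, data)        # local (primary) replica
        try:
            secondary.write(block_id, data)  # remote replica, synchronous
        except ConnectionError:
            primary.mark_stale(block_id)     # bring the secondary up to date later
        primary.clear_busy(block_id)
        return True                          # ack only after both attempts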
Petal is based on a scalable design, and although it was tested on only 4 machines, the design seems like it would scale to 10 or even 50 nodes. The limiting factor would be the GMap and VDir, which are fully replicated. As total cluster size and capacity grew, these would quickly become a huge overhead to both store and keep up to date.
Petal’s data movement algorithm is interesting: while a block is being moved, reads are redirected as the bookkeeping metadata is changed, but writes go directly to the new copy, since blocks are always written completely and never partially updated.
Petal seems flawed in that there appears to be no way to shrink the cluster, and load balancing is managed by the clients. Clients balance load by approximating server load with their own request queues. This only very weakly reflects the actual load on the servers, but it would work decently assuming a uniform distribution of load and a graceful performance degradation curve. Additionally, it seems that returning hints about which server the data is on would improve load balancing and performance with only a small addition of complexity in the client and no loss of correctness.
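The queue-length heuristic described above is easy to state as code (a sketch; note that the counts are purely client-local, which is why they only weakly reflect true server load):

    class PetalClient:
        """Sketch of client-side load balancing by shortest local queue."""
        def __init__(self, replicas):
            # Outstanding-request counts per server, tracked locally.
            self.pending = {server: 0 for server in replicas}

        def read(self, block_id):
            # Pick the replica with the fewest of *our* pending requests.
            server = min(self.pending, key=self.pending.get)
            self.pending[server] += 1
            try:
                return server.read(block_id)  # hypothetical RPC stub
            finally:
                self.pending[server] -= 1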
As for applicability to today's systems, many of the key ideas in the paper, such as its additional layer of indirection and its interesting reconfiguration properties, would translate directly, but the details would need to be reworked. Specifically, the paper mentions the system was CPU-bound, and Petal notably spent a lot of time computing checksums. Today the CPU would not be the bottleneck, especially with offloaded network checksumming; depending on the workload and network, the system would be disk- or network-bound. With disk scheduling and a write-ahead log, the system's latency could plausibly be reduced to the point where throughput is the issue (i.e., the network would dominate). As for existing systems, Amazon's Elastic Block Store provides the same block-level storage abstraction as Petal and provides some replication to tolerate component failure.
Posted by: David Capel | March 22, 2012 08:57 AM
Review: Petal: Distributed Virtual Disks
Petal is a distributed storage system designed to provide highly available block-level storage. It provides this service in large abstract containers called virtual disks. Higher-level systems such as distributed file systems and database systems can be built on top of Petal. Petal is designed to be incrementally scalable. It also replicates data so that it can tolerate single-node failures, and it can perform dynamic load distribution. According to the experiments in the paper, it provides all this with acceptable overhead compared to local storage systems.
The design of Petal is modular. It has a module for each responsibility of the system, such as maintaining global state, providing recovery, and accessing data. Because storing and accessing data is a separate module, the replication scheme can be chosen independently. I think this is a good idea, because Petal is supposed to provide low-level services for other distributed systems; with this flexibility, different clients can use different schemes depending on the trade-offs. Petal can also be scaled incrementally by adding heterogeneous nodes to the system.
The way Petal scales has both advantages and disadvantages, in my opinion. A disadvantage is that the mappings that translate virtual addresses to physical addresses must be changed, and data stored on disk must be moved around, every time a new server is added. That does not look like a scalable solution to me, because the cost of adding a new disk grows as the number of disks increases; it would be better if the change affected only a bounded number of servers. This approach does have some good properties, however. The addition of a new server is handled such that the performance of the virtual disks increases gradually during the remapping process, although the whole process could be long: the authors expect the procedure to take a few hours. I therefore think it is a reasonable system if servers are added and removed infrequently, but it would not be able to handle abrupt changes in the workload.
One area that needs further development in the paper is how Petal handles dynamic load balancing. The current prototype puts this responsibility on the clients: a client chooses which server to send a request to based on the number of requests it already has pending at each server (it sends to the server with the shorter queue). I think this should not be the client's responsibility. In addition, we could achieve locality if this were done correctly on the server side. As the paper points out, this is still an area of active research.
I think the basic idea of the paper to separate block-level
distributed storage system from higher level systems is a great
abstraction. The design of Petal is also modular, so that the clients
can implement their own data storage and access scheme. In addition,
the overhead of Petal is acceptable, in my opinion, compared to local
storage systems. I think it is a good system if we do not require
rapid scalability.
Posted by: Min Thu Aung | March 22, 2012 08:56 AM
This paper presents a multiple-server system capable of creating multiple virtual disk images, each of which can be replicated across the servers. The data is accessible to the client block by block, so different file systems can be used on a virtual disk.
The goal of this paper was to create a simple system for storing data on a remote distributed system, spread across the available hard disks of the servers. With this in place, many different clients could store data and replicate it on multiple servers as needed. Another goal was to give clients as much flexibility as possible, such as letting them decide on the format used for storing files and what type of redundancy to use for the data.
The Petal design allows for many virtual disks spread across the servers. The data is stored as individual blocks of hard disk data. Some global data is kept at every server, but individual blocks are kept on (usually) one or two servers. The virtual disk directory and the global map are two structures that are replicated to all servers and kept up to date; Paxos is used to keep this global data synchronized. The virtual disk directory maps information provided by the client to a global map, and the global map identifies the server that actually has the data. The client chooses which server to contact, preferably one that already has the data and can do all the virtual-to-physical address translation without contacting another server (except for servers with secondary copies, in the case of writes). Usually this is the server that received the initial request, if the client has kept track of the "most appropriate" server hints provided. For reads, the secondary server can also handle the request; on a write, the secondary server is not allowed to handle the request unless the server holding the primary copy is unreachable.

Another value stored with each disk block is an epoch number representing the version of that block. This allows a snapshot of a virtual disk to be taken by keeping a static version of the disk from an earlier time: blocks with older epoch numbers are preserved rather than overwritten. Two bits are also stored with each block for recovery purposes: the busy bit indicates that a write is in progress, and the stale bit indicates that a write was sent to the secondary copy while the primary was unavailable.

When a change in the configuration of a virtual disk is requested, the system applies the change gradually so as not to block read/write requests that arrive during reconfiguration. The disk is divided into three regions during reconfiguration: "old" is data still in its original location, "fenced" is data that may already have been moved, and "new" is data that has definitely been moved to its new location. As for load balancing, the paper mentions that the authors had begun work on a scheme in which each client tries to track which servers are busier; they list identifying good load-balancing methods as future work.
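A rough sketch of that two-step translation, with dictionaries standing in for the virtual disk directory and global map (the structure is my paraphrase of the paper's description, not its actual data layout):

    # virtual disk directory: virtual disk ID -> global map ID
    # global map: decides which servers hold a given offset
    # (each server additionally keeps a private physical map from
    # offset to an actual disk and disk offset)

    def translate(vdir, gmaps, vdisk_id, offset):
        """Resolve (virtual disk, offset) to its (primary, secondary) servers."""
        gmap_id = vdir[vdisk_id]
        return gmaps[gmap_id](offset)

    vdir = {"vdisk-7": "gmap-3"}
    gmaps = {"gmap-3": lambda off: (off % 4, (off % 4 + 1) % 4)}
    print(translate(vdir, gmaps, "vdisk-7", 6))  # (2, 3)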
One thing I wondered about is data that is moved from one block to another by a file system. It seems that file systems that perform a large number of block moves could be made much more efficient if the data did not have to be deleted and rewritten across the network. Another thing I wondered about is the snapshot feature, which could consume a lot of storage if used often; limiting the space used could be a concern depending on the usage conditions. The ideas in this paper could be applied to various networked storage applications, and some of the recovery and reconfiguration methods could be applied to other kinds of storage services. The general idea of keeping things simple, like storing blocks rather than implementing a file system, is a good design choice and is certainly applicable elsewhere as well.
Posted by: Daniel Crowell | March 22, 2012 08:56 AM
1. One or two sentence summary of the paper
This paper presents the design of a distributed block-level storage system, Petal, which presents physical devices as a highly available set of blocks organized into virtual disks. Petal attempts to approximate the ideal distributed storage system (global accessibility, high availability, unlimited performance and capacity, and little manual management) by combining several new features.
2. A description of the problem they were trying to solve (and how the problem solved)
The ideal storage system would have properties such as global access, high availability, incremental scalability, and easy management. However, achieving all of these goals is difficult due to real-world issues such as component failures and network partitions. This paper aims to provide a highly available and efficient storage solution with global access for many clients. It focuses on providing greater tolerance of component failures, better configuration and system administration options, and load balancing among the servers managing the physical disks.
3. A summary of the contributions of the paper
First, Petal abstracts the physical disks on multiple servers as a set of virtual disks. Existing file systems or databases can run on a Petal virtual disk, or a file system can be designed specifically to take advantage of Petal. In addition, since Petal gives clients block-level access, it is useful for supporting heterogeneous clients and client applications.
Second, Petal separates the client's view of storage from the view of the real physical disks. This makes it possible to share physical resources more flexibly among many clients, and to offer important services such as "snapshots" (sketched after this list) and incremental expandability in an efficient manner.
Third, Petal's reconfiguration mechanisms are simple to use. Nodes may enter and leave dynamically, and the layout of data is automatically reconfigured within the system to account for changes in group membership, which is tracked in global state kept consistent via Paxos. In addition, chained declustering provides dynamic load balancing and tolerance of site failures.
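To illustrate the epoch-based copy-on-write snapshots mentioned above, here is a toy block store (my own illustration, not Petal's actual data structures): a snapshot freezes the current epoch, and later writes create new versions under a newer epoch.

    class EpochStore:
        """Toy copy-on-write store: blocks are keyed by (id, epoch)."""
        def __init__(self):
            self.epoch = 0
            self.blocks = {}               # (block_id, epoch) -> data

        def write(self, block_id, data):
            # Never overwrite an older epoch's data; tag the write with
            # the current epoch instead.
            self.blocks[(block_id, self.epoch)] = data

        def snapshot(self) -> int:
            frozen = self.epoch
            self.epoch += 1                # future writes use a new epoch
            return frozen

        def read(self, block_id, epoch=None):
            # Return the newest version at or before the requested epoch.
            epoch = self.epoch if epoch is None else epoch
            for e in range(epoch, -1, -1):
                if (block_id, e) in self.blocks:
                    return self.blocks[(block_id, e)]
            return None

    s = EpochStore()
    s.write(1, b"v1")
    snap = s.snapshot()
    s.write(1, b"v2")
    print(s.read(1), s.read(1, snap))      # b'v2' b'v1'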
4. The one or two largest flaws in the paper
Petal provides block-level rather than file-level access and storage. Although this abstraction enables certain advantages and functionality, it loses information about the locality of data, which makes locality-based optimization difficult in Petal.
In addition, the replication scheme used by Petal may not be robust in the face of component failures because of its deterministic placement: if two consecutive servers on the ring fail at the same time, certain blocks become unavailable. More sophisticated mechanisms could be used to make it more reliable.
5. A discussion of how the ideas in the paper are applicable to real systems.
Unlike other distributed file (storage) systems, Petal provides a block-level rather than a file-level interface. This type of interface is useful for supporting heterogeneous clients and client applications; that is, many different types of file systems and databases can be supported.
Posted by: Junyan Chen | March 22, 2012 06:58 AM