
Petal: Distributed virtual disks

E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proc. 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 84-92, October 1996.

Review due Tuesday, 3/29.

Comments

Summary: Petal is a distributed block storage system with a straightforward design. The system relies on three mapping structures, two of which are globally replicated, to translate virtual disk addresses into physical disk addresses for performing reads and writes.

Problem: Network file systems provide storage from a files and directories perspective, but there is a potential need for a distributed system that provides storage from a low-level perspective of blocks. The challenge in implementing such a system lies in mapping between virtual disk addresses and physical server and disk addresses. Furthermore, the system must be able to maintain replicas of data according to the desired redundancy scheme.

Contributions: Petal's contribution is relatively simple. Petal relies on three structures to map virtual disk addresses to physical disk addresses: VDir, GMap, and PMap. The VDir, which maps virtual disk identifiers to global maps, and the GMaps, which map virtual disk offsets to the servers responsible for them, are globally replicated on all servers using Paxos. A local PMap is kept at each server to map these addresses to the server's local physical disks. Adding servers requires updating the VDir and GMap structures, and data must be migrated to the new server to balance load.
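This three-level translation can be sketched as a toy model. The names VDir, GMap, and PMap come from the paper, but the dictionary layouts, the modulo placement, and the `translate` function are invented here for illustration:

```python
# Toy model of Petal's three mapping structures (data layouts invented).

# VDir: virtual disk id -> global map id (globally replicated via Paxos)
VDir = {"vdisk0": "gmap0"}

# GMap: global map id -> the servers spanned by that virtual disk
# (a simple modulo placement stands in for the real mapping)
GMaps = {"gmap0": ["server0", "server1", "server2"]}

# PMap: per-server map from (global map id, virtual offset)
# to (local disk, physical offset)
PMaps = {
    "server0": {("gmap0", 0): ("disk0", 512)},
    "server1": {("gmap0", 1): ("disk1", 1024)},
    "server2": {("gmap0", 2): ("disk0", 2048)},
}

def translate(vdisk, voffset):
    """Resolve a virtual disk address to (server, disk, physical offset)."""
    gmap_id = VDir[vdisk]                               # 1. VDir lookup
    servers = GMaps[gmap_id]
    server = servers[voffset % len(servers)]            # 2. GMap picks a server
    disk, poffset = PMaps[server][(gmap_id, voffset)]   # 3. local PMap lookup
    return server, disk, poffset
```

Only the VDir and GMaps need global agreement; each PMap stays private to its server, which is what keeps the shared state small.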

Flaws: First, the need for Petal is not well motivated. The authors argue that managing large storage systems is expensive and complicated, but they do not explain why network file systems are insufficient, or why a low-level distributed block storage system is needed.

Second, when a primary or secondary fails, Petal simply relies on the surviving server to serve reads and writes. No attempt is made to create a new backup copy of the data in case the remaining server also fails. While keeping only a single backup is similar to a local RAID configuration, creating new backups would increase the availability of the system in the presence of multiple failures.

Applicability: Petal primarily seems applicable in the case of creating a virtual disk for use by a single client server. Building a network filesystem on top of Petal seems redundant; why not just build the network file system without using Petal? Recently, ideas similar to Petal have been proposed in the context of cloud storage systems.

Summary: Petal is a network storage system that provides a block-level interface to clients. Containers that act like block devices are replicated over multiple network servers to provide high capacity and reliability.

Problem: Storage systems prior to Petal required a significant amount of manual administration, largely because they required manual load balancing. In AFS, for example, files are organized into volumes that can exist on only a single server at a time. Even though users typically interact with such a system as a single entity, for administrators there is a strong connection between files and the physical resources used to store them. Furthermore, this correlation of logical and physical resources leads to poor fault tolerance. When a server fails, all the volumes hosted on that server become inaccessible. The level of transparency provided by systems such as AFS limits the logical capacity and reliability of the system, because manual intervention is needed to increase capacity and to recover from many failures.

Contributions: While many storage systems store files, Petal instead provides a block-level interface to clients. Under this simplified environment, Petal implements a series of management features not seen in some distributed file systems. First, Petal replicates blocks over multiple servers using a technique called chained declustering. Servers are organized into a loop, and blocks stored at one server are replicated at the next server in the loop. Second, resources can be incrementally added to the system to increase capacity. Performance testing suggests Petal performs nearly as well as local disks; however, only four client machines were used, and it is not clear how the system would perform under heavily concurrent workloads.

Flaws: The authors argue that a block-level interface is the correct approach for its simplicity, though it seems that the performance of this approach would inherently suffer from the absence of file system metadata. For example, AFS improves read performance by implementing whole-file caching on the client. This is not possible with Petal itself, as Petal has no concept of files. Another potential weakness is that Petal's replication scheme is highly deterministic. It would not be hard for pathological server failures to occur in a way that leads to high unavailability of data. For example, a failure of two adjacent servers makes the blocks they share unavailable. In contrast, a scheme that chooses replication sites using some sort of hashing mechanism could provide more flexibility and resilience in such cases: the blocks on one server could be replicated on a set of different servers instead of just one.
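The hashing alternative suggested above can be sketched with rendezvous hashing. This is a hypothetical illustration of the reviewer's suggestion, not anything from the paper; all names here are invented:

```python
import hashlib

SERVERS = [f"server{i}" for i in range(8)]

def chained_replicas(block_id):
    """Chained declustering: the copy of server i's block always sits on
    server i+1, so two adjacent failures lose the blocks they share."""
    i = block_id % len(SERVERS)
    return [SERVERS[i], SERVERS[(i + 1) % len(SERVERS)]]

def hashed_replicas(block_id, k=2):
    """Rendezvous (highest-random-weight) hashing: each block's replicas
    land on pseudo-random distinct servers, so a failed server's blocks
    are backed up across many peers rather than one fixed neighbor."""
    def weight(server):
        return hashlib.sha256(f"{block_id}:{server}".encode()).hexdigest()
    return sorted(SERVERS, key=weight)[:k]
```

With hashed placement, losing any two servers loses only the blocks whose replica set happens to be exactly that pair, rather than everything two neighbors share.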

Applicability: Petal provides a base platform that file systems could be built upon, but the decision not to encapsulate file system logic inside this distributed system leaves some doubts as to how a real system would actually benefit from the scalability and manageability of Petal. Any real system would still have to implement a file system which, to enforce security policies, would probably need to be located on a trusted server and not on individual clients. Furthermore, some shared file systems use locks at the file level to maintain consistency of files. Since Petal only locks at the block level, inconsistencies could easily result if file systems do not implement an additional layer of locks on files. The lack of any serious concurrency testing in the evaluation of Petal also leaves doubts as to the applicability of Petal to large systems with many concurrent users.

Summary
Petal is a distributed storage system that provides block-level access for clients. Petal combines physical resources into virtual disks to provide this highly-available abstraction to clients.

Problem
Large storage systems can be complicated to manage. Ideally, the mapping between client-visible addresses and the physical disks is flexible to allow reorganization; however, many current systems support only a static, pre-specified configuration. Because these systems are subject to failures, it is necessary to provide some fault tolerance, both in terms of reconfigurable layouts and also data redundancy.

Contribution
Petal provides a disk-like interface with block-level access. Clients view the storage system as a collection of virtual disks; RPC is used to access the service. When a client accesses a virtual address in Petal, the virtual disk directory translates this into a global map identifier (an epoch number is also included to provide snapshot capabilities). Based on this id, a particular global map determines the server responsible for the data; the address is then translated into a physical disk address by a physical map at the specified server.

When the system is reconfigured, data is migrated for better load balancing. A new global map that incorporates the new server is created, and virtual disk directory entries are updated to refer to this new map. Data is then distributed based on these translations. This process happens incrementally so as not to interfere with client tasks. In particular, small portions of the virtual disk are migrated at a time, using a “fenced” region to denote the portion currently being moved.
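The fenced migration can be sketched as a simple routing rule. This is a toy model, assuming the fence sweeps from low to high virtual addresses; the names are invented for illustration:

```python
# Toy routing rule for incremental reconfiguration.
OLD, NEW, FENCED = "old", "new", "fenced"

def region(voffset, fence_start, fence_len):
    """Classify a virtual address while migration is in progress."""
    if voffset < fence_start:
        return NEW      # already migrated: use the new global map
    if voffset < fence_start + fence_len:
        return FENCED   # in flight: writes must update old and new maps
    return OLD          # not yet migrated: use the old global map
```

Because only the small fenced region needs special handling, requests to the rest of the disk can be served deterministically by the appropriate map during migration.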

Petal uses chained-declustered data access to provide highly-available data and also redundancy. Within a virtual disk, data is replicated on a neighboring server. This replication can be used to perform load balancing, either after a single server failure or dynamically during normal operation.

Flaws
Petal can only tolerate a single server failure. If either neighbor of this failed server goes down, data will be lost. The authors make no argument about these failures being transient, so it seems that some form of re-replication would be useful here to tolerate additional failures.

Relevance
Distributed storage systems are an attractive solution for large-scale storage needs. By providing a moderately fault-tolerant, easily reconfigurable system with block-level abstractions, Petal addresses common issues in these types of systems. Although the authors address differences between Petal and current distributed file systems in the related work section, it’s a little unclear what the actual implications are of using Petal underneath a file system versus using a distributed file system.

Summary

The paper describes Petal, a distributed system which presents clients with "virtual disks", which are block devices that are transparently implemented by many physical disks residing on remote servers. Petal is designed to tolerate failures of disks or entire servers, and can scale to more disks and servers in a manner that is transparent to clients.

Problem

There is a need to provide persistent networked storage which is resilient in the face of disk and node failures. Networked storage should have performance which is competitive with locally attached disks, but which adds the redundancy and scalability usually associated with distributed file systems. Support for efficient backup and recovery is an additional design goal of Petal.

Contributions

Petal achieves good scalability and performance by separating data which must be shared and globally consistent amongst all servers from data which is local to particular servers. A virtual disk directory (which maps IDs of client-visible virtual disks to IDs in the global map) and a global map (which maps offsets in a virtual disk to the physical server responsible for hosting them) are replicated globally and maintained via an implementation of Lamport's Paxos algorithm (in the prototype system, these global data structures were only about a megabyte in size). Each individual server keeps a local physical map which translates (global map ID, offset) pairs to a physical disk and offset. Because physical disks are managed entirely locally, adding a new physical disk requires no global synchronization, as the individual server can use it to fulfill new storage allocation requests and transparently relocate data to it in the background for load balancing.

Adding a new server requires creating a new global map that includes the new server and redistributing some data to the new server. Data redistribution is accomplished by moving small "fenced" chunks of a virtual disk at a time, with the rest of the disk in known "old" or "new" regions that can be deterministically serviced by the correct version of the global map.

Data is stored redundantly in a chained-declustering scheme. Each individual block of data has 2 copies stored on "neighboring" servers. Each pair of adjacent servers has some blocks in common. If a server fails, its 2 neighbors can service requests for all its data. Furthermore, those 2 nodes can offload some of their requests to their own neighbors, with this cascading offloading redistributing load relatively evenly in the event of server failure. The authors note that by placing even-numbered and odd-numbered servers in different data centers, the system can be made to tolerate the failure of an entire data center without data loss.
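The cascading offload can be illustrated with a toy simulation. The placement rule (block b on servers b mod n and b+1 mod n) follows the chained-declustering idea; the unit-cost reads and greedy replica choice are assumptions made for illustration, not the paper's actual protocol:

```python
def balanced_reads(n_servers, n_blocks, failed):
    """Toy model of read load after one failure under chained
    declustering: block b lives on servers b % n and (b+1) % n, and each
    read goes to the currently less-loaded live replica. The failed
    server's load spreads roughly evenly over the survivors instead of
    doubling one neighbor's load."""
    load = {s: 0 for s in range(n_servers) if s != failed}
    for b in range(n_blocks):
        replicas = [b % n_servers, (b + 1) % n_servers]
        live = [s for s in replicas if s != failed]
        target = min(live, key=lambda s: load[s])
        load[target] += 1
    return load
```

Running this with one failed server out of four leaves the surviving three within one request of each other, rather than concentrating the failed server's load on a single mirror partner.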

Flaws

Data on failed servers is not automatically re-replicated when a failure is detected. As such, the failure of 2 adjacent servers will cause data to become unavailable. Administrator intervention is necessary to restore the safety of the system in the case of a non-transient failure.

Also, it is not at all clear that a block device is a good abstraction for networked storage. Firstly, multiple clients attempting to access the same virtual disk concurrently are bound to cause corruption of the filesystem, unless the FS is specifically designed for concurrent access and some out-of-band synchronization protocol is adhered to (this is certainly not the case with the conventional local filesystems the authors use in example use cases). Conventional filesystems also tend to be tuned for good performance on a particular type of local storage hardware (e.g. arranging data for spatial locality on hard disks, or doing log-structured writes on flash). Because all block addresses used by the client filesystem are virtual and reveal no information about their physical layout on server disks, any efforts to lay data out in a certain way are wasted and may even be counterproductive to performance. An actual distributed filesystem is able to address both of these issues by incorporating file-level synchronization and controlling physical layout on servers.

Applicability

Petal's main contribution appears to be an intelligent separation between a minimal consistent global state and a larger per-server local state. This keeps sync overhead low and allows individual servers to make locally optimal decisions without having to involve other servers. On the other hand, providing clients with a virtual block device instead of a virtual filesystem seems unnecessary and counterproductive for both performance and concurrency.

Summary
Petal is a system that provides a virtual disk to clients, which is backed by a network of servers that store the actual blocks.

Problem
Managing large storage systems is difficult. Complicated systems fail more easily, so replication and fault tolerance are important. For many workloads, demand is also high, so load balancing is important. Administrative tasks, such as adding space and doing backups, are also more difficult for large storage systems.

Contributions
The main concept of Petal is to provide the abstraction of a virtual disk, which is actually backed by several Petal servers on a network. Since the abstraction is a disk, or just a block device, it is very flexible. Existing file systems or databases can run on a Petal virtual disk, or a filesystem, such as Frangipani, can be designed to specifically take advantage of Petal. Petal exposes a sparse address space; the full 64-bit address space can be addressed. This simplifies the task of adding or replacing storage; Petal handles all mapping of virtual addresses to physical blocks on nodes. This abstraction, with a little additional support from Petal, also allows for fast snapshots of the virtual disk. The Petal servers only need to create new mappings and use copy-on-write to create a snapshot. Petal is also homogeneous; there is no master server, all that is required is that a majority of the Petal servers be accessible, and the system can still operate. Finally, Petal uses chained declustering for replication. This scheme has the same characteristics as mirroring, except that the mirrored blocks are offset so that if a node fails, its load is shared between both its neighbors instead of all going to one node.
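The epoch-based copy-on-write snapshot idea can be sketched as follows. This is a simplified toy model, not Petal's actual mechanism; the class and method names are invented:

```python
class VirtualDisk:
    """Toy copy-on-write snapshot model: a snapshot just records the
    current epoch; later writes land in a new epoch, and reads walk
    backwards to the newest epoch that holds the block."""

    def __init__(self):
        self.epoch = 0
        self.store = {}  # (epoch, block) -> data

    def write(self, block, data):
        self.store[(self.epoch, block)] = data

    def read(self, block, at_epoch=None):
        e = self.epoch if at_epoch is None else at_epoch
        while e >= 0:  # fall back to older epochs (copy-on-write sharing)
            if (e, block) in self.store:
                return self.store[(e, block)]
            e -= 1
        return None

    def snapshot(self):
        """O(1): no data is copied at snapshot time."""
        snap = self.epoch
        self.epoch += 1
        return snap
```

Taking a snapshot costs nothing up front; only blocks written afterwards consume new space, since unmodified blocks remain shared with the older epoch.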

Flaws
The main problem with choosing the block layer as the abstraction is the loss of information. Locality information is lost. A distributed filesystem like GFS tries to keep files close on the network to where they are used, often storing the files on the same node that accesses them. This would be difficult to do in Petal. Since Petal does not know anything about the filesystem running on it, Petal snapshots are crash consistent. If the snapshot mechanism is built into the filesystem instead, as in ZFS, snapshots can be fully consistent. Petal’s data redistribution seems a bit crude in comparison to a system like AutoRAID. Since data is moved immediately, the redistribution takes a long time and cannot be done on the fly; however, Petal’s fencing mechanism alleviates this problem somewhat. It seems that there are at least three different concerns when dealing with distribution and replication of data: write load balance, read load balance, and protection against failure. Different workloads may have different requirements for these three issues. For example, GFS and ZFS allow the replication level to be set, even at the granularity of individual files.

Discussion
Today, systems generally either use a primary fileserver with some kind of RAID system for storage, or, if the system deals with a huge amount of data, a highly distributed filesystem like GFS. Petal seems like a replacement for a RAID backend since it provides the same abstraction of a block device (perhaps with the exception of the sparse address space). However, it’s not clear that Petal is a good alternative; in this paper it was set up on expensive machines, used expensive disks, and ran on an expensive network. The authors even suggested using NVRAM for better behavior. All these expensive components are what make big RAID systems undesirable. The push today is to use big, cheap disks, cheap commodity servers, and inexpensive Ethernet. Further, since disks are so large today, it may be possible for one server to hold all the storage that an organization or department needs.

Summary
This paper presents Petal, a distributed storage system that presents physical devices as a highly-available set of blocks organized into virtual disks. Petal attempts to approximate the ideal distributed storage system by combining various new features.

Problem
Large, distributed storage systems can be difficult to implement. Ideally, they would have properties such as global access, high availability, incremental scalability, and easy management. However, given real world problems such as imperfect networks and partial failures, achieving all of these is a difficult problem.

Contributions
Petal provides a distributed block-level storage system. It is implemented on top of some number of physical disks, and abstracts them into a number of virtual disks. These virtual disks are accessed by clients via RPCs. Petal provides a simple method of mapping virtual addresses to physical addresses, and this process can be done by anyone in the system – there is no central server that deals with locating resources.

Petal allows nodes to enter and leave the system dynamically, and automatically reconfigures the layout of data within the system to account for any changes in group membership. Membership is tracked in a global state, and is kept consistent via Paxos.

Petal also provides a mechanism to optionally replicate data in order to improve availability and perform backup. It does so using chained-declustering. In this system, data on one server is replicated on a neighboring server, so if a server dies, requests for its data can still be handled by the neighbor.

Flaws
I like some of the services Petal is able to provide with this abstraction. However, it introduces some flaws. As the authors mention, with this kind of abstraction you lose information about locality when storing data, so it would be difficult to optimize for locality in this system. On a larger scale, I thought they did a poor job of motivating the use of this abstraction. While it can be used to provide a simple 2-layer file system with Frangipani, I'm not convinced the simplicity it provides makes up for its disadvantages.

Application
I was skeptical about Petal's applicability in today's computing world, for several reasons. Their design makes a few assumptions that I don't agree with. For example, they assume a highly reliable network, which isn't really valid in most networks. For security, they must assume that Petal is only running on trusted nodes and operating systems. This belies a larger problem; it seems that today distributed systems are trending towards large networks of heterogeneous, commodity PCs, which are often not under the direct control of the system providing the service. In these situations, you can't make assumptions like “trusted operating system” without greatly limiting your options.


This paper presents the design of a distributed block-level storage system, Petal, that tolerates component failures, dynamically balances load between different underlying servers, and transparently expands in performance and capacity when new nodes are added, without significantly increasing the cost of managing the system.

The main advantage of a block-level storage system is that it can be integrated into existing computers and transparently support most existing file systems. There are many block-level storage systems already on the market, but most of them have a specific view of the underlying system (e.g., a static configuration fixed at system start that cannot be changed later, or an inability to distribute data across multiple nodes), or they make assumptions about the underlying infrastructure (e.g., that the communication interconnect is reliable). Petal is designed to address these problems by providing a virtual disk interface over the underlying storage system. Petal's performance equals local disk access performance for small data reads. However, large data reads or writes are slower in Petal than local disk accesses.


Features of Petal:

1. Maintains a global view of the data at every server, which makes every server ready to address every client request
2. Since it provides a virtual view of the system, it can transparently deal with the addition or removal of nodes, and with heterogeneous clients.
3. Provides a generic framework that can be tuned to application requirements (e.g., redundancy level and data placement scheme)
4. Its use of chained declustering helps with dynamic load balancing in times of disk or component failures
5. Separating the translation data structures into global maps and local physical maps keeps the bulk of the mapping information local, which minimises the amount of information that must be kept in global data structures. The amount of global data maintained at each node is around 1 MB.
6. Petal's modular structure (liveness module, global state module, recovery module, data access module, virtual-to-physical address translator) gives applications using it scope to change modules according to their requirements.

Flaws

Since it virtualizes the underlying storage system, many application-level optimizations cannot be done.
For example, a file system views data at the file level, whereas Petal sees it at the block level. A file system using Petal as its underlying storage cannot expect the data of a single file to be stored on a single node (locality of data is lost). Frangipani uses its own locking service rather than relying on locking at the Petal level.

One of the main assumptions behind the load balancing is that the client knows the load on the servers and routes each request to the less loaded server. Maintaining this information at the client is time consuming, and it is only suitable for clients that remain in the system for a long time.
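One hypothetical way a client might track server load, assuming it uses observed response times as a proxy (the paper does not specify such a mechanism; all names here are invented):

```python
class ReplicaPicker:
    """Hypothetical client-side load tracking: keep an exponential
    moving average of each server's response times and send each
    request to the replica with the lower estimate."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.est = {}  # server -> estimated latency (seconds)

    def pick(self, replicas):
        # Unknown servers default to 0 so they get tried at least once.
        return min(replicas, key=lambda s: self.est.get(s, 0.0))

    def record(self, server, latency):
        prev = self.est.get(server, latency)
        self.est[server] = (1 - self.alpha) * prev + self.alpha * latency
```

As the review notes, such estimates only become useful after the client has issued enough requests, which favors long-lived clients.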

Global data has to be replicated at each node, so maintaining it requires a lot of communication.

Applications

GFS uses a similar idea, viewing data as a set of chunks rather than the blocks this paper presents. The idea of virtualization has been in use for years: memory virtualization, the LOCUS distributed operating system, network file systems, and virtual machines.

Summary:

This paper describes a storage system called ‘Petal’. This system is a collection of network servers that manage a pool of physical disks and provide the abstraction of a block level virtual disk to a client accessing the system.

Problem:

This paper aims to solve the problem of providing a highly available and efficient storage solution with global access for several clients. It focuses on providing greater tolerance of component failures, better configuration and system administration options, and load balancing between the servers managing physical disks.

Contributions:

One of the important contributions of this paper is the separation between the client’s view of storage and the view of real physical disks. This design feature helps in sharing resources more conveniently and provides for incremental expandability. The system provides block-level access to clients, thereby allowing the use of heterogeneous clients. This also allows a distributed filesystem to be built on top of the interface provided by Petal.

Another important contribution provided by Petal is a mechanism to automate the process of backup and recovery of a client’s data stored in the system. It does this by making an efficient snapshot of virtual disks based on a copy on write mechanism.

The other important features of this paper are easy reconfiguration mechanisms for disk additions and the use of chained declustering for distributing data. Chained declustering provides for dynamic load balancing and tolerance of site failures. The load balancing provided is better than that provided by mirroring or striping.

Applications:

1. The use of a block-level interface in the Petal system is very useful, making it flexible. The Google File System's chunk-based interface is similar in spirit.

2. Petal could be a base framework on which reliable distributed file systems could be built.

Flaws:

1. While Petal's chained declustering provides for dynamic load balancing, it provides less reliability than mirroring: if either neighbor of a failed server also fails, data is lost, whereas with mirroring loss occurs only if the specific mirror partner fails.

Summary: Petal is a network-based distributed block-level storage system that creates an abstraction of virtual disks over a set of servers with physical disks. It provides this block-level access with fault tolerance and load balancing.

Problem Statement: Managing many gigabytes of data across multiple disks can be a big hassle due to the limitations in per-disk capacity and increasing rates of data growth. One way to solve this is to use big RAID systems, which themselves pose limitations in load balancing and scalability. Another solution is to create an abstraction of a really big disk that behaves very much like a small disk. Such an abstraction can be made fault tolerant by using ideas from write-ahead logging and RAID, and careful partitioning of data can provide a highly available system.

Contributions:

  • abstracting multiple physical disks on multiple servers as a set of big virtual disks using two layers of indirection: first a global map that identifies the server, then a local map that identifies the actual block on disk
  • providing block-level rather than file-level access, which allows easy portability of existing file systems and applications, although file-level access has its own set of advantages if applications are built on top of such an abstraction
  • incremental reconfiguration to add servers/disks to the existing system. The basic design makes this feature slightly slow/inconsistent during the reconfiguration process, but the refined algorithm solves this
  • chained declustering, which provides higher availability and the ability to partition data across different data centers, giving access to data if one center fails. But this feature reduces robustness by mirroring to only two servers and also increases write latencies

Flaws: The choice of block-level access looks good since it directly supports existing file systems, but it can easily suffer from internal fragmentation. File-level access can eliminate many such issues but would require an extra translation layer to keep existing file systems unmodified. Amazon Elastic Block Store, built over S3, provides this feature, since S3 stores data only at the file level. File-level access can support file versioning more naturally than the block level, since we create a version of an entire file rather than of multiple blocks. This also helps support incremental snapshots, where the next snapshot only needs to store information about the modified files and can reuse the remaining part of the older snapshot. The paper also fails to describe any local caching that could improve performance, as in AFS.

Relevance: Really big storage containers are becoming more and more important with the increase in the rate of data creation. Petal provides a really good way of creating an abstraction of large containers that is both fault tolerant and available. Amazon S3 and Google Storage are examples of large distributed containers that let application developers use virtually limitless space in the cloud.

Summary:
This paper describes Petal, which aims to be an 'ideal' distributed storage system: always globally accessible, transparent, scalable, and requiring little management. It achieves this by exposing abstract containers called virtual disks rather than physical disks, and by providing block-level access, which opens up a lot of interesting opportunities.

Problem statement:
Networks had gotten cheaper, which made distributed storage solutions preferable to centralized storage because of all the nice properties of distributed systems, such as fault tolerance and scalability. But managing large-scale storage systems still needed a lot of effort for setup and management, and also posed consistency issues. Petal aimed to fix these problems.

Contributions:
1. Abstracting physical disks behind virtual disks is a great idea. Also, providing block-level access to these virtual disks, rather than access at a higher level, decouples the application/FS implementation specifics from the system itself, thereby making it easily extensible. This also makes the storage system effectively manageable and configurable.

2. The use of epoch numbers along with GMaps makes backup and snapshotting easy. I liked the distinction between consistent and crash-consistent snapshotting.

3. Incremental reconfiguration of the vdisks and request redistribution upon addition of new physical disks, which make the system scalable. Making the client responsible for load balancing was a bummer though.

4. The idea of chained declustering, which provides replication but avoids potential hotspots and achieves balanced load.

5. Pluggable redundancy and replication schemes based on the storage system's use cases.


Problems:
1. No way to provide 'custom' replication for certain files, since Petal operates at a much lower level.
2. I wonder how scalable the security system is. The authors briefly mention this, but this can become a nightmare as the system scales to 1000s of users.
3. How crash-consistent are the crash-consistent snapshots anyway? If an entire vdisk takes a long time to snapshot, we might not have a consistent snapshot.
4. It's not clear from the paper whether the global maps are replicated or not. If they are not, then the GMaps become potential single points of failure.

Applications:
Amazon's EC2 cloud platform uses machine images to create new instances. It also provides the Elastic Block Store (EBS) for non-volatile, backed-up storage. Both of these concepts seem very much inspired by Petal. Also, I think ideas from Petal are applicable to almost all distributed storage solutions.
