
Disconnected Operation in the Coda File System

J. J. Kistler and M. Satyanarayanan, Disconnected Operation in the Coda File System, Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, October 13-16, 1991, pages 213-225.

Reviews due Thursday, 3/10.

Comments

Summary
The Coda File System is a distributed file system in which clients are provided a high level of access in the disconnected state. By hoarding (caching) both user-designated important files and also recently used files, Coda attempts to provide the user with a local copy of the files they might need during the disconnect. Once connectivity has been restored, Coda reintegrates the local changes made during disconnect with the global state on the servers.

Problem
In a traditional distributed file system, users often do not have access to necessary files during a disconnect. While files are often cached locally for performance reasons, there is no guarantee that the local files are those that a user may need to access during a disconnect. Ideally, the file system would operate such that most, if not all, of the files the user will request during disconnect are present in the local cache. Further, the file system will perform seamless integration once a connection has been reestablished.

Contributions
The Coda FS allows work to continue in a disconnected state, and performs reintegration once the connection is re-established. Whole files are cached locally. In the standard state, called hoarding, Venus (the cache manager) hoards files in the cache in anticipation of disconnection. When the cache needs to evict a file, low-priority objects--based on user-defined hoard profiles and recent usage--are evicted first. To maintain equilibrium, Venus periodically does a “hoard walk” to ensure that no uncached object has a higher priority than a cached object--this helps ensure that Venus will be able to service the client’s requests should a disconnection happen.
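The priority scheme and hoard walk described above can be sketched in a few lines. This is a toy model under stated assumptions: the class and method names, and the additive priority formula blending hoard priority with recency, are illustrative choices, not Coda's actual interfaces.

```python
class HoardCache:
    """Toy sketch of Venus-style priority caching; all names and the
    additive priority formula are illustrative, not Coda's design."""

    def __init__(self, capacity, hoard_profile):
        self.capacity = capacity            # max number of cached files
        self.hoard_profile = hoard_profile  # {path: user-assigned priority}
        self.cache = {}                     # {path: last-access tick}
        self.tick = 0

    def priority(self, path):
        # Blend explicit hoard priority with recency of use.
        return self.hoard_profile.get(path, 0) + self.cache.get(path, 0)

    def access(self, path):
        # On a fetch into a full cache, evict the lowest-priority object.
        if path not in self.cache and len(self.cache) >= self.capacity:
            victim = min(self.cache, key=self.priority)
            del self.cache[victim]
        self.tick += 1
        self.cache[path] = self.tick

    def hoard_walk(self, namespace):
        # Restore equilibrium: no uncached object may outrank a cached one.
        uncached = [p for p in namespace if p not in self.cache]
        for p in sorted(uncached, key=self.priority, reverse=True):
            if self.cache and self.priority(p) > min(map(self.priority, self.cache)):
                self.access(p)  # "fetch" p, evicting the lowest-priority object
            else:
                break
```

In this sketch, a hoard walk after a few ordinary accesses pulls a high-priority profile entry back into the cache, evicting a recently used but low-priority file.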

During disconnect, Venus attempts to service client requests using the local cache, and records mutation information in the replay log. Once the connection is reestablished, Venus sends the replay log to all available servers on which each volume is stored. It’s possible that reintegration fails, however, if conflicts cannot be resolved.

Flaws
The authors suggest, using actual traces, that conflicts are rare because two users seldom modify the same file. However, it seems like they are poorly set up to resolve conflicts if they do arise, especially in the case of trying to store a file for which there are conflicting storeids. Although manual repair is possible (and is common in other systems we have studied), it goes against the transparent approach and may make the reintegration process quite time consuming.

Additionally, the authors comment on the possibility of exhausting non-volatile storage, either with modified files in the file cache, or with replay logs. This exhaustion leads to a state in which the user is no longer allowed to mutate files. Although they suggest avenues to explore, this seems like a large issue--in a long-duration disconnect, exhausting these resources would lead to a complete stop of (non read-only) work.

Relevance
The Coda FS addresses a critical issue in distributed file systems: how to maintain services for the client in the case of disconnect from the server. In the voluntary disconnect situation, Coda allows the user to prepare for disconnect using the hoard profile; the authors suggest that users are capable of accurately predicting their file access needs in this case. In the involuntary situation, Coda attempts to maximize the likelihood that the “right” files will be present in the local cache. Coda provides a good example of a relatively simple solution to continual access and demonstrates how user input can be utilized to solve these predictive access issues.

Summary: Coda explores the usability of shared network file systems from portable workstations that are often disconnected from the network. The system revolves around a strategy for caching relevant files while the workstation is online and then sending any changes back to the network servers when connectivity resumes.

Problem: Transparently providing shared file services on a portable workstation is hard because usage patterns on the workstation can be unpredictable. The workstation may lose network connectivity at any moment, for reasons both within and beyond the user's control. Once disconnected, the user and system processes may access arbitrary files and expect them to be available. Perhaps more worrisome, users on independent disconnected workstations may modify the same files but expect the server to eventually store a single, combined version of the files. This problem is important to power users who want their work, and files, available everywhere. If disconnected operation was both important and challenging in the early 90's, it is perhaps even more so today given the diversity of possible network connections and mobile devices.

Contributions: Unlike some other related systems, Coda strives to provide an integrated cache for both on- and offline operation that is transparent to the user. Due to the difficulty of predicting user and system behavior, there is some lack of transparency. To address unpredictable disconnection behavior, Coda periodically updates its cache but also allows the user to manually trigger an update before a disconnection. To handle user file access behavior, users can manually specify files they want cached, while Coda augments this list with a history-based caching algorithm. Coda attempts to reintegrate changes back into the server transparently to the user, but concurrent changes to the same files must be resolved manually. Coda was evaluated in a setting where most workstations were used for software development.

Flaws: Coda's largest flaw is that it is misguided in seeking a high level of transparency, particularly for the application on which it was evaluated. As a result of this goal, there is limited manual control over the caching and reintegration processes. Users cannot control when reintegration occurs, as this process is triggered as soon as physical connectivity is re-established. A user may be in the middle of a complex change and not want to be forced into the conflict resolution process just by physically reconnecting his machine. Furthermore, users cannot revert local changes they make, even though the information needed to do so is implicitly stored in the replay log.

Applicability: Given the complexities of collaborative software development, a system that tries to hide some of this complexity but fails to do so would likely only create more problems. Curiously, the authors observed a low occurrence of file conflicts even for development projects, but this could be a result of the way this particular group organizes their source code. A transparent caching scheme in the style of Coda may be useful, however, to manage collections of media files. Consumers want to access collections of movies and music on a variety of devices, but typically don't modify such files, which would eliminate the need to handle conflicts.

Summary
The paper describes a 'disconnected operation' mode for clients of a shared data repository, which enables them to operate even during temporary failures that render the repository inaccessible. This mechanism, demonstrated in the Coda file system, uses client-side transparent caching of files deemed critical for operation to mask failures in accessing the shared repository.

Problem
Distributed file systems and centralized storage systems improve data reliability besides simplifying collaboration and administration. But placing data remote from the client can impede the client's progress when the storage system is inaccessible. The paper presents a file caching mechanism that is transparent to the client and affords access to critical data when the shared repository cannot be contacted.

Contributions
The client-invisible whole-file caching mechanism presented in the paper improves both performance and availability of data, permitting efficient disconnected operation while providing the benefits of a shared data repository. Even in the absence of failures, the callback-supported caching policy provides a consistent, up-to-date copy of the data, potentially allowing collaboration among multiple users with high performance.

The paper uses multiple techniques to ensure that available cache space is used judiciously. An explicit hoard database (HDB) permits the user to control which files get cached, while implicit caching policies benefit from temporal locality by caching recently used files. The paper describes a priority-based cache management algorithm to use the limited cache space for the most useful data.

The implementation of 'disconnected operation' discussed in the paper uses transactional access to cache metadata and atomic replay of the replay log to achieve correctness of operation and consistency among all server replicas.

Faults
Disconnected operation with local caching only works when the pattern of data access can be predicted for a satisfactorily long duration of failure.
The client-side caching presented in the paper is only useful when the working set needed for successful operation is smaller than the cache space available on the client's machine.
It cannot be used for applications that exhibit highly concurrent, fine-grained data access, since Coda uses whole-file caching; caching at a finer granularity would only make acceptable conflict resolution impossible to achieve in all cases.

Applications
The disconnected-mode caching mechanism can very well be used in cases where clients of a shared repository have fairly independent workspaces and the central repository is used primarily to improve reliability. In such a setup, the availability and performance benefits of disconnected operation are easily realized.

Summary

This paper tackles the problem of allowing a client to continue accessing
critical data during temporary failures of a shared data repository, by caching
files on client machines.

Problem

How can users continue to work even when the remote repository is
temporarily inaccessible?

Contribution

- This paper exploits caching for both performance and high availability, while
preserving a high degree of transparency. To allow disconnected operation, it presents a three-state procedure: hoarding, emulation, and reintegration (with conflict detection).

- This paper adopts quite a few AFS techniques, and uses them appropriately. For
example, callback-based cache coherence, whole-file caching, and placing
functionality on clients.

Flaw

I would argue that providing a high degree of transparency is kind of a flaw. It
would be better if users could control the contents of the cache. Two
examples illustrate this.

Example 1: On a cache miss, the default behavior of Venus is to return an error
code. A user may optionally request Venus to block his processes until cache
misses can be serviced.

Example 2: It is possible for Venus to exhaust its non-volatile storage during
emulation.

Possible solution: If users can control what to cache, and what not to cache,
then they do not have to suffer from cache misses and resource exhaustion. For
example, as in Subversion, users can "checkout" a specific directory explicitly.

Applicability

The caching idea for disconnected operation can be found in modern
software. Two examples come to mind.

1. Dropbox, especially when using iPhone's Dropbox app ...
2. Version control tools, for example, git -- we "git-commit" locally, and
"git-push" later.

Summary:
This paper describes the systems for disconnected operation in Coda, a distributed file system descended from AFS. It focuses heavily on the caching policies needed to ensure a reasonable quality of experience when not directly connected to the file system.

Problem:
Ideally, when operating on a machine connected to a distributed file system, one should be able to continue working when disconnected from the server (either voluntarily or due to a remote failure). To properly allow this, the file system needs to carefully manage its cache to maximize the chance of having useful files available when disconnected.

Contributions:
The paper’s primary contribution is its examination of the functionality needed to operate offline, including its hoarding system, server emulation, and reintegration policies. Hoarding refers to the cache management policy used in Coda, which combines priorities specified by the user (in a “hoard profile”) and priorities based on recent access to determine which files to keep and which to evict. Coda manages the state of the hoard in a hoard walk, which ensures that the user’s cache is as current as possible in the event of a sudden disconnection by refreshing broken call-backs and fetching items in the hoard profile that had been previously evicted.

Server emulation and reintegration are both more straightforward. The former occurs when the client is disconnected, and involves careful logging of operations to facilitate reintegration. Reintegration occurs when the client reconnects to the distributed file system, and involves a straightforward replay of the client’s operation log and a transfer of data to the file servers. Both of these tasks employ database-like techniques--the client records all metadata and log updates using a transactional library to provide crash consistency, and the replay log resembles a simplified version of database recovery methods. Interestingly, the authors find that conflict resolution is seldom necessary during reintegration, since write-sharing is uncommon in the workloads they observe.
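The conflict check applied to each replayed store can be sketched roughly as follows. This is a hypothetical illustration of the storeid idea from the paper, not Coda's real code: each successful store tags the object with a fresh storeid, and a replayed store succeeds only if the server's storeid still matches the one the client's cached copy was based on.

```python
import uuid

class ServerFile:
    """Hypothetical server-side file object tagged with a storeid."""
    def __init__(self, data):
        self.data = data
        self.storeid = uuid.uuid4().hex  # renewed on every successful store

def replay_store(f, base_storeid, new_data):
    """Apply a logged store iff the file is unchanged since the client
    cached it; otherwise report a conflict for manual repair."""
    if f.storeid != base_storeid:
        return False  # someone else stored in the meantime: conflict
    f.data = new_data
    f.storeid = uuid.uuid4().hex
    return True
```

A store replayed against a stale base storeid is rejected, which is exactly the case the paper observes to be rare because write-sharing is uncommon.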

Problems:
The analysis of reintegration time is fairly cursory; I would have preferred that the paper base it on some form of trace replay, as it does with its cache size analysis. In addition, for extended periods of disconnection, some mechanism to explicitly cache files would be useful, allowing users to work on different things while offline (currently, the caching policies favor things recently worked on, which may not be desirable in all circumstances). The system also does little to handle conflicts; in heavily shared situations, explicit version control (like SVN) may be more desirable.

Overall Impact:
The paper does a good job of exploring the ramifications of a seemingly simple problem; it seems like it would be a good starting point for anyone seeking to increase availability in a system by allowing offline operation. That said, I would be interested in seeing how these policies cope in a modern environment, where the footprint of applications is significantly larger, touching both more files and larger amounts of data; at the very least, hoard profiles would probably have to be more carefully managed.

Summary
Coda is a network filesystem similar to AFS. Its caching mechanism on the client allows clients to completely disconnect from the network, continue to work on files in the cache, and then later reconcile with the server after reconnecting.

Problem
Many users have portable computers. These portable computers are powerful enough to work as stand alone computers. These users would like to be able to work with files stored on the network while they are disconnected from the network. Some users manually copy and reconcile files in order to work on them while disconnected.

Contributions
Coda uses Venus for caching whole files. Venus allows users to access files in the cache after the client is disconnected from the network. The user can also manually specify files to include in the local cache. Coda also tracks changes while disconnected, and automatically reintegrates them when possible. The authors also discuss the difference between storing files on a server compared to storing locally, what they call first class vs second class replicas.

Flaws
I thought the use of priorities with manually specified hoard profiles was a little strange. There seem to be several different goals, none of which are very well served by using priorities. As a user, if I specified files that I wanted to work with while offline, I would expect that those files would always be available, regardless of the priorities that are automatically assigned to other files. Second, there are dependencies: one file out of a large project may not be useful on its own, and all files are dependent on their parent directories (for this case Venus automatically assigns parent directories infinite priority). Finally there are system files, which are needed to make the client run. Perhaps priorities are less of a problem now that laptops have much larger storage capacity.

I also thought that the way they measured write-sharing between users was strange. They measure it by looking at the total number of mutations. It would be more meaningful to give the number of write-shared files out of the total number of modified files.

Discussion
Working offline is still difficult. It may be less of a problem today since internet access is more ubiquitous. As the authors noted, most of the time users are working on their own files. If users are working on the same files, they are probably using collaboration tools, like SVN, to manage changes and conflicts. Even though we want access to our own files, we also like having them on the network for better accessibility and backup. There are some services, like Windows offline files or Dropbox that provide syncing, which is very similar to Coda’s disconnected access.

Summary:
This paper describes disconnected operation in the Coda file system, which enables a client to continue accessing critical files during temporary failures of shared servers. Several replica control strategies are proposed, and a real prototype is introduced to show that disconnected operation is feasible and efficient in the Coda file system.

Problem:
The problem this paper focuses on is how to handle remote failures in a distributed shared file system environment. The paper proposes allowing disconnected operation during the disconnection interval. The big challenge is how to manage the cache, including pre-loading the client's cache with critical data, logging changes during disconnection, and replaying or integrating them when connected back.

Contributions:
1. The idea of "disconnected operations" is novel for distributed file systems. And this idea still makes sense in modern computing environments,
such as mobile platforms.

2. This paper proposes a transparent framework to achieve the goal of disconnected operation: hoarding, emulating and reintegrating. This makes it hard for clients to even notice the disconnection under remote failures.

3. Sound design and detailed implementation. For example, for scalability, it uses callback-based cache coherence and whole-file caching. First- and second-class replication and replica control strategies are proposed for different operations and environments, and logging is used to store updates for future replay.

Flaws:
1. The ideal environment for disconnected operation is one with few conflicts, such as when everyone updates their own directory in the shared file system. But for update-intensive shared environments, such as a source code repository, the conflicts may bring big trouble to clients when re-connected.

2. To avoid filling local storage with new updates during disconnected operation, the client could use online data deduplication. When reconnected, it could send the hash for each chunk first and send the whole chunk only if it does not match. This may improve the performance of reintegration.
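This hash-first suggestion (the reviewer's idea, not part of Coda) can be sketched with fixed-size chunks; the chunk size and function names here are arbitrary illustrative choices.

```python
import hashlib

CHUNK = 4096  # arbitrary fixed chunk size for this sketch

def chunk_hashes(data):
    """Hash each fixed-size chunk of a file's contents."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def delta_chunks(local, server_hashes):
    """Return only the (index, bytes) pairs the server is missing: the
    client ships hashes first and whole chunks only on a mismatch."""
    delta = []
    for idx, h in enumerate(chunk_hashes(local)):
        if idx >= len(server_hashes) or server_hashes[idx] != h:
            delta.append((idx, local[idx * CHUNK:(idx + 1) * CHUNK]))
    return delta
```

Only the chunks that actually changed while disconnected would cross the wire during reintegration.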

Applicability:
Disconnected operation is still desirable in modern computing environments. For example, for cloud storage services, multiple replicas may exist on a mobile phone, laptop and desktop. While disconnected, caching updates (new files especially) locally and later syncing to the cloud would be a good scalable and usable solution, rather than disallowing any updates. Representative products are Dropbox, MobileMe, and Ubuntu One. Email systems can also utilize this idea: you can write emails on a flight without a network, and these emails are sent when your computer is back online.

Summary: Kistler and Satyanarayanan describe the architecture of Venus, a network file caching application, within the Coda file system, which is a descendant of AFS. They present Venus as the Swiss-army knife of local caching.

Problem: The developers of this system are primarily concerned with presenting a practical, transparent and intuitive file system interface to the user while maintaining availability during disconnected operation. They set out to provide consistent, though not necessarily identical, operation of clients in the face of voluntary and involuntary inaccessibility of the first-class file servers.

Contributions: The approach taken by the designers is decidedly practical and user-oriented. Unlike many of the other examples of distributed file systems that we have seen, Coda adopts an "optimistic" (previously known as "unsafe") mode of operation in order to take advantage of the typical case in which files are not being concurrently updated by multiple users. A key realization that they make is the distinction between a client and a server, in that a client should not be expected to maintain a persistent connection to all other nodes in a system. They also rightly point out that due to local caching, there was already a latent potential for greater availability. Venus was implemented to take advantage of this potential and to increase it through optimizations and customizations for different use cases.

Flaws: The implementation of callback breaks, a process by which a primary file server contacts subscribed clients when the cached version of a file becomes stale, seems a bit ad hoc and counter to the idea of supporting disconnected operation. The goal of callback breaks to promptly maintain cache consistency is admirable, but it seems to work against the goal of disconnected operation because, as the authors explain, it can lead to unavailability if the timing of the disconnection doesn't work out fortuitously. This is bad from both an availability and transparency perspective.
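For readers unfamiliar with the mechanism, a callback break can be sketched with a toy model; the class and method names below are invented for illustration and are not the AFS/Coda RPC interface.

```python
class CallbackServer:
    """Toy model of callback-based coherence: the server promises to notify
    every client caching a file when that file changes."""
    def __init__(self):
        self.callbacks = {}  # {path: set of clients holding a callback}

    def fetch(self, path, client):
        # Granting a fetch establishes a callback promise for this client.
        self.callbacks.setdefault(path, set()).add(client)
        client.valid.add(path)

    def store(self, path, writer):
        # Break the callback of every other client caching this file; a
        # disconnected client simply never receives the break.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)
        self.callbacks[path] = {writer}

class Client:
    def __init__(self):
        self.valid = set()  # paths whose cached copy is still trusted

    def break_callback(self, path):
        self.valid.discard(path)
```

The availability problem the review raises is visible here: a client that disconnects right before the break is delivered keeps trusting a stale copy.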

Applicability: The designers of Coda/Venus take on the role of steering Computer Science towards a focus on the user experience and the importance of consistency in design and behavior. As more and more development teams gravitate towards an "always connected" model (the Google Chrome laptops come to mind), they would be wise to build in an escape hatch to allow for what could be called "desert island computing." Being connected to the Internet can often be as much of a burden as it is a resource, so a little disconnected operation every once in a while isn't a bad thing.

Summary:

This paper describes the Coda file system, which is designed to let a user access important data during temporary failures.

Problem:

This paper aims at solving the important problem of providing hosts in a distributed filesystem continued access to critical data during the period of temporary failures. Such a mode of operation is called disconnected operation. The paper aims at making this operation efficient and also focuses on providing a recovery mechanism once the failure period is over.

Contributions:

One of the important properties of the proposed file system is that it is a location-transparent file system built with commodity PCs. There are a large number of clients and a few data servers. Many of the clients are assumed to be low-end laptops. This helped in making the system scalable.

In order to provide disconnected operation and make the service more available, the system uses a caching mechanism at each client. During periods of failure the client can access data from the cache and continue its operations. If the data does not exist in the cache, the client gets an error message. To make this caching mechanism scalable, they propose whole-file caching and callback cache coherence. The caching mechanism also allows a client to assign a priority to a file, which is used in the cache eviction policy. A file with a higher priority stays in the cache for a longer time.

Another important design decision which interested me was the categorization of the quality of replicas. The replicas in data servers are given first class priority and are considered to be more persistent. Additional effort is taken to keep all server replicas consistent. Client replicas are less consistent, but they provide for greater availability. To further enhance the availability of the service the paper suggests using an optimistic replica control mechanism where replicas at clients are allowed to be updated during the periods of network partitioning.

In order to facilitate the recovery of the system from failure a logging mechanism is provided which logs all the changes made during disconnected operation. The changes in the log are replayed during recovery and are used to detect conflicts.

Applications:

1. This is a useful idea, this form of disconnected operation can be used in mobile/wireless networks where the endhosts are lightweight clients needing to work on important data.

2. The idea of allowing a user to provide a priority that could be assigned to a cached file could be used in places where user could purchase a preferential cache service in a shared cache.


Flaws:

1. It is suggested that server replication is done manually; this would not scale when the number of data servers is large.

Summary:

The paper talks about the Coda File System and disconnected operation in it. Similar to AFS, Coda does file-level caching; it also provides access to the file system while disconnected from the server, and the paper primarily discusses this functionality.
This functionality is attained by keeping three states: hoarding (when the system is connected to the server, caching actively used files and predicting the files a user might use later while disconnected), emulating (when the system is disconnected from the file system server and emulates an active file system), and reintegration (when the system comes back in connection with the server and reintegrates the changes made during the emulation phase with the file system server).


Problem description:

The primary problem is to use the local disk as a cache for a centralized file system, and to allow the disk to work as a regular file system even when disconnected from the file system server. This involves problems like using the cache efficiently while disconnected, efficiently logging updates for replay on reintegration, and estimating/predicting the user's future requests to decide which files to cache (to avoid misses during disconnection).


Contributions

The paper tries to solve an interesting problem and uses various techniques to do so. Although the techniques are often well known--transactional updates to metadata, locking during reintegration, logging and compression--the problem in itself is very useful and interesting. Applying solutions like prioritized cache management, compressed logging to save disk space, and transactional metadata updates was important to ensure the correctness and efficient working of a cached file system.


Flaws in the System:

The usual flaws stand here too, such as how conflicting updates are handled. Evaluations like the effect of disk size on file system performance, or of efficient cache utilization, would make the work more useful and practical. The paper provides various techniques/solutions to problems but usually lacks an exploration of faster/more efficient ways of performing these operations (e.g., is there a better way to do the transactional update?), or at least does not discuss each of these problems in detail.


Applicability:

I think the idea is really useful these days. One can always get on and off the internet, but such operation allows more automatic integration with centralized repositories while still allowing the freedom to work in internet-free zones. This is important especially with the rise of storage and everything being on the internet; it is not possible for a person to be on the network all the time. Disconnected operations/transactions are used in the database domain as well to allow offline transactions.

Summary
This paper presents Coda, a file system that allows nodes to continue to operate on the files they have cached when they become disconnected from the other nodes. The disconnection can be planned or unplanned, so Coda needs to be as prepared for disconnection as possible at every moment, and also needs to be able to resolve conflicts at reconnection.

Problem
Users have long wanted to share data with each other, and mobile computing is becoming an increasing trend. These two goals, sharing and mobility, are hard to reconcile since mobile systems are often disconnected. Even regular systems sometimes have to cope with anomalous disconnections. The challenge is to find ways to make files available to disconnected users and also prevent inconsistencies between copies on the disconnected nodes and the primary servers.

Contributions
The paper makes the following contributions:

(1) Analysis of real workloads and conflicts is provided. This has interest independent of Coda.

(2) The paper did a nice job of describing the policy challenges that balance the goals of keeping actively used files in the cache versus keeping the files that will be most needed during a disconnection.

(3) It is cool how Coda functions implicitly by observing access patterns but also allows explicit control of policies. This lets new users take advantage of Coda immediately while also giving advanced users the control they need to avoid frustration.

Flaws
The paper treats files as independent units. For instance, suppose a simple “database” consists of a separate file of sales for each separate product sold by a retailer. The database also contains a file with aggregate sales statistics. A disconnected computer updates the “laptop sales” file while a connected computer updates the “PDA sales” file. Each system updates the copies of the “aggregate sales” file. When the former system becomes reconnected, its local copy will be merged back in, but the aggregate sales file will have a conflict. Neither the current copy nor the remote copy agree with both the individual sales files. The user likely won’t be aware of the relation between the files, and the conflict resolution tools will be useless if the data is stored in a binary format. A hard-to-fix inconsistency will almost certainly be introduced. I think the writers recognize this problem as they claim that their system represents “an excellent opportunity for adding transactional support to UNIX”, but in reality, Coda does nothing to make adding transactional support to UNIX easier. Rather, adding transactional support to UNIX is necessary for Coda to not be broken.

Application to Real Systems
The idea of disconnected file access is often useful in today’s computing environment. However, for the reasons given in the flaws section, it makes more sense to provide this functionality at the application level rather than at the file system level. Indeed, this is what we have seen in industry. Calendar applications on PDAs have used these techniques, but as far as I’m aware, no commercial file system is based on the Coda design. Also, even though these techniques are used for some applications, I expect they will become less applicable in the future as more devices have cellular Internet access and WiFi hotspots become increasingly common.

Summary:
This paper describes a disconnected mode of operation that enables clients such as laptops to continue accessing critical data while disconnected from a distributed file system. It does this by caching data files while the connection is active, logging any changes made during a disconnection, and reintegrating those changes when the connection returns.

Description of Problem:
Distributed file systems allow users to share storage, but when a client loses its connection to the network, the files stored in the distributed file system usually become inaccessible until the connection is reestablished. With the increased use of laptops, users are more likely to become disconnected from the network for many reasons, such as wanting to work at home. Therefore, it would be nice to have a distributed file system that lets users access their files when they lose their network connection and synchronizes their changes when the connection is reestablished.

Summary of contributions:
Coda uses whole-file caching to allow offline file availability. Their implementation tries to provide transparency and involves three states: hoarding, emulation, and reintegration. Hoarding caches files, and also determines which files to cache and with what priority. Emulation happens when a disconnection occurs; it deals with file access control, logging changes, and handling cache exhaustion. Reintegration happens when the connection is reestablished and synchronizes the changes back to the distributed file system.

Flaws:
The authors wish to preserve transparency, but it is apparently too complicated for a program to determine which files a user wants and will need, so they provide manual commands to include specified files. They also allow priorities to be set in case the cache gets exhausted. This seems like too much configuration; it would be better to simply let the user specify what he wants cached and synchronized.
The conflict handling doesn't automatically resolve conflicting replicas. It might work if most files are only accessed by their creator, but if many users update the same file offline, an incorrect version of the file can be created on reintegration.

Application to real systems:
It would be great to have a technique that allows transparent offline availability, since connectivity is not always possible. I think the paper points out the main issues that need to be considered: caching, logging, and synchronizing. I recall watching an Adobe webcast that showed some sort of offline database access and synchronization feature while trying to learn about Adobe Flex, so these ideas do seem to be used in current real systems.

Summary:

The Coda file system provides disconnected operation for its users by utilizing an optimistic replication strategy that permits reads and writes in the presence of partitions and by using a client-side caching strategy that draws on user input and data reference patterns to determine the current policy.

Problem Description:

The aspect of Coda discussed in this paper addresses the problem of making a distributed file system available even when the client cannot contact the file servers that store the files. The authors state that this problem became important to solve with the advent of portable computers around the time Coda was first designed. They wanted to provide clients a way to use the file system even when the network connection to the file servers was unavailable, whether voluntarily or involuntarily. The authors cite other file systems, such as NFS and AFS, that cease to work if the client loses contact with the servers.

Contributions Summary:

A key contribution of Coda’s disconnected-operation support is its prioritized caching algorithm for caching content at the client. There are two facets to this algorithm. The first is determining what to place in a client’s cache. Two sources of input contribute to this decision. One is a user-specified hoard profile, which indicates user preferences for important files that should be given a high priority for existing in the cache. Coda also keeps track of which files are most recently referenced to construct a working set to place in the cache. The second facet of the algorithm is to periodically audit the cache to bring it closer in line with the desired makeup of the cache. This is referred to as hoard walking.
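A minimal sketch of this two-part idea, with invented weights and priority scales (the paper describes the inputs, but this is not Coda's actual formula): each object's priority blends its hoard-profile priority with recency of reference, and a hoard walk restores equilibrium by keeping only the highest-priority objects.

```python
# Illustrative weights; Coda's real parameters differ.
HOARD_WEIGHT, RECENCY_WEIGHT = 0.75, 0.25

def priority(obj):
    # obj = (name, hoard_priority in 0..1000, recency score in 0..1000)
    _, hoard, recency = obj
    return HOARD_WEIGHT * hoard + RECENCY_WEIGHT * recency

def hoard_walk(cached, uncached, capacity):
    """Restore equilibrium: after the walk, no uncached object
    has higher priority than any cached object."""
    everything = sorted(cached + uncached, key=priority, reverse=True)
    return everything[:capacity]

cached = [("scratch.o", 0, 900),      # recently used, but not hoarded
          ("paper.tex", 1000, 100)]   # hoarded, rarely touched lately
uncached = [("thesis.bib", 1000, 800)]  # hoarded AND recently used

cache = hoard_walk(cached, uncached, capacity=2)
print([name for name, _, _ in cache])   # ['thesis.bib', 'paper.tex']
```

The walk evicts `scratch.o` even though it was used recently, because the user's hoard profile outweighs pure recency in this (invented) weighting.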

Another key aspect of Coda’s disconnected operation is that while disconnected, the Coda client does its best to accept all user modifications. However, the server still validates all updates on reintegration. This follows Coda’s model that the servers can be trusted and the clients cannot be trusted.

Shortcomings:

The authors indicate that Coda is designed for usage patterns common to user-focused systems, where users drive most of the I/O with their actions. That is to say, Coda would not be the ideal system for a server setting. Specifically, it sacrifices some consistency for availability. If high consistency in a highly concurrent environment were required, Coda would not fit. Additionally, its cache management scheme is tailored to having user involvement, which may be infeasible in a server setting.

Application to real systems

Disconnected operation is particularly relevant today with the advent of many mobile devices. However, its applicability may be waning with highly available wireless connectivity. Regardless, it is conceivable that a wireless device may leave its home network and disconnected access to certain pieces of data from the home network may be desired. The ideas from Coda are then applicable to new systems to manage devices in the home such as Microsoft’s HomeOS.

Summary

The authors describe how the client-side caching mechanism of the Coda distributed file system was extended to allow for disconnected operation, where clients can be completely disconnected from Coda servers but still have access (including write access) to shared files. Mechanisms for "hoarding" data in anticipation of possible disconnection and logging updates while disconnected for eventual reintegration are described. Usage measurements of a Coda system deployed at CMU are provided, and seem to validate the "optimistic" approach to conflicts taken by Coda.

Problem

There are many advantages to using a shared data repository in the form of a distributed file system; however, one major disadvantage of such systems is that, when a client is disconnected from them, they typically stop working altogether, sometimes severely inconveniencing the user. Furthermore, it would be nice to support intentional disconnection of clients, which is a common case for mobile computers that do not always have an active network link.

Contributions

A key insight is that client-side caching, which is already used to improve performance in many distributed file systems, can also facilitate increased availability, including disconnected operation.

When Coda clients are connected to servers, they are usually in a "hoarding" state. While hoarding, the client hoards data in its cache in anticipation of possible disconnection (of course, it is also responsible for caching for performance while connected). Data is hoarded based on both implicit information (recent reference history) and explicit information provided by the user ("hoard profiles", which are lists of files and directories a user wants available while disconnected). Clients periodically use a process called "hoard walking" to maintain cache equilibrium (filling the cache with the highest-priority files possible).

When a Coda client is disconnected, it operates in an "emulation" mode. In this mode, the Coda client performs many actions usually handled by servers, such as access and semantic checks. It allows write operations, which can be either mutations of existing files or creation of new files (in which case temporary file identifiers are created, to be replaced by permanent identifiers once changes are synced to the servers). In case of a cache miss while disconnected, the client will return an error to the user (optionally, the user can configure their client such that it blocks until a cache miss can finally be serviced). All updates that modify the state of the shared file system are logged locally in a persistent, transactional log which contains all the information necessary to replay all updates.

Upon reconnecting with a server, the Coda client goes through a transitory "reintegration" stage. During this stage, replay logs are sent to the servers, and each operation in the log is validated and executed on the server. If no conflicts are detected, the reintegration succeeds and the client goes back to its usual hoarding state. If a write/write conflict with another client is detected, the contents of the replay log are written out to a local replay file, and users can use a tool to manually reconcile conflicting updates one by one.
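The validate-then-replay structure can be sketched roughly as follows. The version-number certification here is a simplification of the paper's storeid mechanism, and the record format is invented:

```python
# Hypothetical sketch of optimistic certification at reintegration: each log
# record carries the version the client last observed; the server accepts
# the replay only if its current version still matches, otherwise the whole
# reintegration fails (and in Coda the log would be saved to a local replay
# file for manual repair).

class Server:
    def __init__(self):
        self.files = {}   # name -> (version, data)

    def reintegrate(self, log):
        # Phase 1: validate every record before applying any (all-or-nothing).
        for name, seen_version, _ in log:
            current = self.files.get(name, (0, None))[0]
            if current != seen_version:
                return False   # write/write conflict detected
        # Phase 2: perform the replay.
        for name, seen_version, data in log:
            self.files[name] = (seen_version + 1, data)
        return True

srv = Server()
srv.files["a.txt"] = (1, "old")
ok = srv.reintegrate([("a.txt", 1, "from laptop")])
print(ok, srv.files["a.txt"])                            # True (2, 'from laptop')

# A second client replaying against the stale version 1 now conflicts.
print(srv.reintegrate([("a.txt", 1, "from desktop")]))   # False
```

The all-or-nothing validation pass mirrors the paper's atomic treatment of reintegration: either the whole log certifies cleanly, or none of it is applied.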

Flaws

Certain writes which Coda detects as conflicts have a semantic meaning which can be automatically resolved without user intervention. For example, two different clients may append records to a file, with the intent that both their appends be reflected in the file without any particular constraint on order (GFS supports this sort of append). Coda could be extended so that rules for reconciling such "compatible" writes to the same file can be defined based on application-level semantics.

Applicability

Mobile computing has exploded in popularity since it originally motivated adding support for disconnected operation to Coda. Although wireless data links are now often available to mobile devices, such links can be intermittent, and of poor quality or bandwidth (or simply overly expensive, as is the case for some cellular data plans). Access to shared data from intermittently-connected devices such as laptops, tablets, and smartphones is hugely useful today.

Summary
The authors describe the design and implementation of disconnected operation in the Coda file system.

Problem
In a distributed file system, servers store data, and users use clients to access that data and run applications. The authors define disconnected operation as the mode of operation that enables a client to continue accessing data during temporary failures of a shared data repository. They claim that a distributed file system with disconnected operation can improve performance and enhance availability.

The problem the authors solve in this paper is the design and implementation of disconnected operation in the Coda file system.


Solution

The key part of the Coda client is Venus, which handles interactions between clients and servers, such as remote access, disconnected operation, and server updates.

Venus has three states:

Hoarding: Venus hoards useful data in anticipation of disconnection. It uses recent reference history and a per-workstation hoard database to calculate the priority of objects, and it uses hoard walks to restore equilibrium among them.

Emulation: after disconnecting from the servers, Venus enters the emulation state. It records updates to data in this state, and it keeps cached data in non-volatile storage so that the cache can be recovered on the next restart.

Reintegration: after reconnecting to the servers, Venus enters the reintegration state. It uses a replay algorithm to submit updates to the servers and handles conflicts in this state.
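The three states and the transitions between them can be summarized as a toy state machine (the transition events follow the paper's description; the representation itself is illustrative):

```python
# Venus state transitions: hoarding -> (disconnect) -> emulation ->
# (reconnect) -> reintegration -> (done) -> hoarding. A failure during
# reintegration drops the client back into emulation.
TRANSITIONS = {
    ("hoarding", "disconnect"): "emulation",
    ("emulation", "reconnect"): "reintegration",
    ("reintegration", "done"): "hoarding",
    ("reintegration", "disconnect"): "emulation",
}

def step(state, event):
    # Unlisted events leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = "hoarding"
for event in ["disconnect", "reconnect", "done"]:
    state = step(state, event)
print(state)   # back to 'hoarding' after a full disconnect/reconnect cycle
```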


Flaw
1. Coda cannot merge concurrent updates from different clients. I think a simple last-writer-wins strategy would be good enough for updates to a single file.

2. I don’t see any benefit from voluntarily disconnecting clients from the servers. A more reasonable design is for clients to keep listening for update notifications from the servers and, when updates happen, to leave the decision of whether to update local data to the applications. More attention should be paid to failures of server access.

3. Files manipulated by other clients are assigned higher priority in Coda. Unless background information is provided, I don’t think manipulations from different clients are related.


Relevance

1. Local caches on clients improve the performance of data access for applications on those clients, and they can keep client applications working during a server failure.

2. Updates to data are merged on the client before being submitted to the servers. Old updates can be overwritten by new ones, which cuts useless updates and improves performance.

3. Transactions are used when submitting updates to the servers, which keeps the data consistent.

Summary: Coda is a file system designed to enable file access and modification while users are disconnected from the network. The system is especially useful in situations where users may be mobile, e.g., laptops, accessing files while away and synchronizing upon return to an enterprise network.

Problem: Being able to access files in the presence of a network partition, either purposeful or as a result of some failure, is the problem Coda seeks to solve. Systems like NFS and AFS require an active connection to the server to allow users to be able to read and write files. This approach is unfortunately subject to network failures and does not allow mobile users to access files while away without establishing a remote connection. The challenges in addressing this problem include: 1) deciding what to cache, 2) tracking changes, and 3) reconciling changes, including resolving conflicts.

Contributions: Coda's major contributions are reflected in some of its key design decisions. First, Coda takes an optimistic approach to enabling access to local replicas: if the local cache contains the file or directory, the user can read and modify it, regardless of their connection state. Users are not forced to hold a lock on the data prior to manipulating it during disconnected operation. To cope with this design decision, Coda maintains a log of all changes a user makes while disconnected. Furthermore, when the user reconnects to the network, the log must be replayed to synchronize changes to the Coda file servers. Coda's second primary contribution is the idea of a somewhat hierarchical replication. When connected to the network, changes are made to the files and directories residing on servers. However, in disconnected operation, changes are made to a second-class replica, namely a local copy of the data. One could imagine taking the idea even further and allowing tertiary replicas to be created from secondary replicas.

Applicability: An approach similar to Coda's disconnected operation is actively used in Microsoft Windows. A user can configure Windows to automatically cache some files offline, e.g., my documents, public shares, etc., so they can access them on their laptop from home. When they reconnect the laptop to the network, Windows automatically syncs any changes between the laptop and the file server. In the event of conflicts, users are prompted to resolve them manually.

Summary: Coda is a filesystem similar in many ways to the earlier AFS. It adds support that allows clients to continue working with data even when temporarily disconnected from the service network. A prototype implementation is presented and shown to be feasible and practical.

Problem: Even at the time this paper was written it was becoming clear that client devices were becoming more portable and independent, and yet also more powerful. It was, and still is, desirable for a device to continue to be useful even when disconnected from network services. Further, transient fault-related disconnections are also common and should be tolerated as well. Instead of manually copying files to a local namespace and reconciling once the connection is reestablished, a good filesystem should handle this transparently. The user should continue to have access to the same namespace with changes automatically updated on reconnect.

Contributions: Just as in AFS, Coda caches whole files on the client machine while working with a file. Cache consistency is supported using a callback mechanism: the server promises to inform all clients currently working with a file whenever changes occur on the server side. When disconnected, the client logs all changes locally, and when reconnected, the log is replayed to reintegrate the changes on the server. There is a large similarity to database recovery methods here.

Coda also has a novel mechanism for managing local client caches using hoard profiles. It combines information about recent usage of files with priorities specified by the user in order to decide which objects to cache and which to evict.

Limitations: The authors point out that this system is targeted at the common use case where concurrent access to data is infrequent. Thus complex reconciliation procedures when conflicts are discovered after a reconnect are rare and can be handled in a somewhat ad hoc manner. If you took a harder line with this assumption, it could simplify the design substantially: simply support single-user access only. You could then assume that no updates occur on the server side during disconnection. In the multiuser case, users can continue copying out files when they disconnect and copying them back when they return. This might more closely match people's mental model of how their device operates anyway.

Applications: This problem has become even more pervasive as we move toward an increasingly mobile computing world. There are some examples of services like this (Dropbox) that concentrate on the single user case with limited support for sharing. Another example is the online/offline support in Google Docs which also supports collaborative access.

Summary: This paper presents Coda, a DFS which supports disconnected operation.

Problem statement: Users of distributed systems would love to be able to work even when they are disconnected from the servers. Also, local operations are impeded by remote failures or network disconnections. Distributed file systems store files on servers; this server replication has many nice features, such as security, integrity, and ease of backup, which we do not want to lose. Is there a way to transparently provide offline access to certain files without compromising server replication? This paper shows how this can be accomplished.

Summary of contributions: I liked many ideas from this paper: having a bulky client-side component running in user space for portability (no shoehorning of disconnected operation into only a few OSes); optimistic replication allowing local modifications while operating offline; keeping track of the changes and replaying them at the server upon reconnection; hoarding based on file access patterns while the user is connected to the server; a hybrid cache management approach that also involves the user, letting the user decide which files need to be cached while the client component keeps track of access patterns to reassess retention priorities; and the optimizations to keep the replay logs compact (only the last store stands, merging of operations, etc.).

The system exploits the property that there is very little write-sharing among files in typical user workloads, which makes a lot of sense. Another interesting perspective this paper provides is how caching can improve availability in addition to performance.

Flaws: The system does not seem to perform any "merging" of conflicted files; it just leaves the files and replay logs for later inspection and manual merging. This might leave the user stranded, particularly with binary files (say, MS Word/PowerPoint files).

Storing modified files in the cache as whole objects along with the replay logs might lead to duplication of data and cause resource exhaustion sooner. This could be prevented by storing deltas.

Upon reintegration, all modified objects are locked before modification. I wonder if this might cause a server slowdown for all users when one particular user, who has a lot of local changes, reconnects.

The evaluation sounds too good to be true: virtually no conflicts at all, very modest local cache sizes, and a very low volume of sharing (most file mutations are caused by the same user). I wonder if a more thorough evaluation in a more realistic setting, such as an enterprise, would throw more light on vital characteristics of the system.

Applications: A wide variety of software systems and mechanisms come to mind that seem tangential to Coda's fundamental focus of disconnected operation: Google Web Accelerator (anticipatory prefetching, caching, managed connections), Dropbox, Evernote, the Git version control system, and the HTML5 LocalStorage feature for in-browser apps (which evolved from Google Gears and Adobe Flash's ability to provide local storage). This idea is very relevant these days, given the rise of mobile computing and cloud apps.

Disconnected Operation in Coda file system
Summary
This paper presents the Coda file system, a distributed file system like NFS and AFS, with the distinction that it performs well under disconnected operation through client-side caching.
Problem
Users of a distributed file system are put to severe inconvenience in the event of a remote failure: they are immediately exposed to it. Instead, it would be better if operation could continue as though the shared repository were still available.
Contributions of the paper
Deriving inspiration from AFS, Coda also uses whole-file caching, which simplifies the failure model by confining cache misses to the open call rather than other calls. This in turn works well when the connection drops, and it enhances scalability by pushing functionality to the client.
The workstations are assumed to be portable, and disconnection is even considered a normal event in this setting.
Two kinds of caching are maintained, at both the server and the client level. There is a server-level cache to give accurate global data, and server-level replication is used as a last resort for synchronization. Also, an optimistic replica strategy is followed for increased availability.
The paper describes the client cache manager Venus and its three stages: hoarding (anticipatory prefetching), emulation (a pseudo-server stage of logging, recovery, metadata management, etc.), and reintegration (synchronizing with the actual server state). Conflicts are resolved as in Locus.
Concerns
Writing Venus as a user program just because it is easier to debug, thereby compromising performance, is not desirable for the single most important component of a Coda client. I wonder if the evaluation was far too defensive. Also, this degree of transparency could have been traded off to provide some performance guarantees.
Relevance to modern day systems
With the offloading of files to remote locations and the increase in Internet speed and bandwidth, distributed file systems are becoming increasingly important, and at the scale at which systems are growing, the Coda file system gives an important insight into client-side caching and whole-file semantics.

Summary:

Coda is a distributed file system that allows clients to retain access to their files in disconnected mode using caching.

Problem Statement:

Clients in a distributed file system would lose access to their files if they became disconnected from the file servers that keep the most current versions of the files shared across many clients. Such a rigid design would render a horrible experience for portable clients whose connectivity to the servers is intermittent in nature. Disconnected operation allows the client to continue to have access to their files, but it has several challenges: what should be cached, when to cache it, and how to combine client state with server and other-client state.

Contributions:

The primary contribution of Coda, which it borrows from its predecessor AFS, is whole-file caching. This simple idea allows clients to use their local storage and perform most operations locally, thus allowing the whole system to scale. Coda uses optimistic replica control, which allows clients in different network partitions to continue accessing the files available to them. This is very important for keeping the system moving even in the case of failures. The challenge is that when partitions rejoin, changes must somehow be propagated back to everyone. Coda maintains a per-client replay log and uses it to propagate the updates. The next challenge is how to manage the limited cache. Coda uses hoarding (greedy) schemes to make sure that the most important files are always cached. Static information, which a client can configure ahead of time, along with dynamic information about file accesses, makes a good basis for deciding what is important to cache and what is not.

Flaws:

Coda's policy of selecting high-priority objects for caching is fine, but it is definitely not the best metric for selection. There are always other kinds of interdependencies between files that Venus cannot observe, which could render disconnected operation useless. Coda also does not have any elegant solution for resolving conflicts between different versions of the same file when partitions rejoin.

Relevance:

From the user's point of view, a disconnected mode of operation is imperative in a distributed file system. We have seen how applications that provide this form of support have become very popular. Most of us have used Dropbox, and we know that it performs whole-file caching. A notable difference in Dropbox's operation from Coda is that Dropbox does not intercept system calls; rather, it uses file timestamps to find modified files and then identifies which blocks of those files were modified by hashing blocks of 4 MB or less. This is actually much more transparent than Coda's approach, since it is easily OS-independent. Also, Dropbox does not have any prioritization scheme for whole-file caching; it simply stops syncing when the online store exceeds local space, which is a simple and reasonable policy. And unlike Coda, Dropbox creates multiple versions of files as they are being edited (on every flush), which provides great reliability in case of sudden failure.
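The block-hashing scheme the reviewer attributes to Dropbox can be sketched as follows. The 4 MB block size comes from the text above; the choice of hash function and the record format are assumptions for illustration.

```python
import hashlib

BLOCK = 4 * 1024 * 1024   # 4 MB blocks, per the description above

def block_hashes(data: bytes):
    """Hash a file's contents in fixed-size blocks."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old_hashes, new_data):
    """Return indices of blocks whose hash differs from the stored hashes."""
    new_hashes = block_hashes(new_data)
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]

old = b"a" * BLOCK + b"b" * BLOCK
new = b"a" * BLOCK + b"c" * BLOCK     # only the second block differs
print(changed_blocks(block_hashes(old), new))   # [1]
```

Only the changed blocks need to be uploaded, which is what makes this cheaper than re-sending the whole file while still operating purely on file contents rather than intercepted system calls.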

Summary
This paper presents the design and implementation of support for disconnected operation, which allows users to keep working on critical data transparently during temporary failures of a shared data repository. Disconnected operation provides second-class replication in the Coda file system.

Problems:
1. Remote failures may affect the availability of the system. How can a user continue working on critical data, with the help of caching, during temporary failures?
2. When the client reconnects, how can the system reintegrate the cached data efficiently?
3. How can transparency be preserved under failures, so that users won't notice the anomaly of being unable to access files?

Contributions:
It splits disconnected operation into three states: hoarding, emulation, and reintegration. It then introduces many strategies and techniques to balance availability, consistency, and scalability. Prioritized cache management takes advantage of both implicit and explicit information to determine the priority of a cached object, so that higher-priority objects are more likely to survive a failure, because the system caches those objects first. Hoard walking (periodically reevaluating the priorities and name bindings of HDB entries) is used to restore equilibrium and handle callback breaks. Logging is performed during the emulation state to keep track of operations, and the operations are compressed and optimized to reduce the length of the log. Finally, transactions are used to manipulate metadata and perform the replay algorithm, ensuring consistency.
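Two of the log optimizations mentioned above (only the last store of a file needs to survive, and operations that cancel each other out can be removed) can be sketched like this, with an invented record format:

```python
def optimize(log):
    """Compress a replay log of (operation, filename) records."""
    out = []
    for op, name in log:
        if op == "store":
            # A later store overwrites any earlier store of the same file.
            out = [r for r in out if r != ("store", name)]
        elif op == "delete":
            if ("create", name) in out:
                # create ... delete of the same file: both cancel entirely.
                out = [r for r in out if r[1] != name]
                continue
        out.append((op, name))
    return out

log = [("create", "tmp"), ("store", "tmp"), ("delete", "tmp"),
       ("store", "paper"), ("store", "paper")]
print(optimize(log))   # [('store', 'paper')]
```

The temporary file's entire history disappears from the log, and the twice-stored file is replayed only once, which is why the paper reports such compact replay logs in practice.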

Flaws:
1. Caching the whole file may be too large a granularity. Sometimes clients modify only a small piece of a large file, and loading the whole file may be expensive.
2. When a conflict is detected, the entire reintegration is aborted for a file. It doesn't give users a chance to resolve the conflict, even though users may have better knowledge of the file and could handle conflicts better than an abort.

Applications:
1. Logging and replay is a useful technique for file systems and databases. With logging, we can checkpoint the system; with replay, we can resume operations during recovery. In addition, log replay can be used to test code against realistic workloads.
2. Since the log is usually smaller than the cached objects, it may sometimes be more efficient to simply replicate the log instead of the objects during replication and delay the log replay.

Disconnected Operation in the Coda System

Summary

The paper explores the requirements and implementation of the Venus cache manager, which allows a user access to critical data during periods of disconnection in a distributed file system environment.
Problem

The high availability of data in a distributed file system (DFS) is critical to the success of the system. Unfortunately, a user is unable to continue working in a DFS if he/she disconnects (voluntarily or involuntarily) from the system. This problem is further exacerbated by the proliferation of portable workstations, which (at the time) underwent frequent voluntary disconnections and extended periods of isolation. Consequently, to increase the availability of data during disconnection/isolation, the authors explore a mode of operation termed ‘disconnected operation’, which allows users to continue working while a server is inaccessible.
Contributions

The major contribution made by the paper is the demonstration of the feasibility and efficiency of disconnected operation in a distributed environment by implementing and evaluating the Venus Cache Manager – a client system responsible for maintaining a file cache accessible to a user during periods of disconnection from the server group.

This system allows a client machine to exist in three phases: hoarding, emulation, and reintegration. During the hoarding phase, the cache manager attempts to meet the current as well as future needs of the user via a cache-prioritization algorithm, as well as a hoard database, which allows the user to specify files that may be of use in the future. Furthermore, cache coherence is maintained in the system via periodic hoard walks, which re-evaluate the contents of the cache and fetch the most recent updates of cached objects.
Flaws

One point brought up by the authors was that of scalability (offloading functionality from the server to the client). However, their decision to use whole files as the level of granularity (i.e., caching an entire file, thereby limiting cache misses to the open operation) contradicts this philosophy, since it provides no means of scaling as file sizes increase, and repeatedly moving large files to many users and replicas will be a performance bottleneck.
Applicability to Real Systems

In an always-connected world, the relevance of a disconnected mode of operation may seem a bit antiquated. Save for a few places (in a developed country), being disconnected is the exception rather than the norm when compared to the time of the paper's writing. That being said, however, involuntary disconnections/interruptions are still a problem today, and it would be prudent for developers to provide some mode of recovery from disconnected operation.

Summary
The authors present a methodology for the continued operation of disconnected nodes in their distributed file system, Coda. Nodes which lose network availability are still given the ability to access some information through the use of whole file caching. The design considerations are explored in detail, and the file system is benchmarked on its time to reintegrate the changes made by the Andrew benchmark.

Problem
Coda is a distributed file system, based on AFS, which already provides support for server replication and whole-file caching. Many of its users have “portable” computing devices, and would like to have continued use either during unexpected temporary network failures, or when they knowingly disconnect for some time. Of utmost importance to Coda is transparency: the user should not have to deal with network failure. Furthermore, the re-integration of changes made to the file system while disconnected should only involve the user when there is no other recourse. Coda also attempts to solve the issue that certain files are likely much more important for user productivity than others, and should be kept available.

Contributions
One main contribution is the usage model for the file system cache. It serves both to improve the system's performance during connected operation and to act as a buffer that keeps the file system available. The fact that files are cached only as complete objects allows a more consistent view of the available data and lets Coda implement a number of optimizations in logging.
Coda also provides a way for users to specify, in a hoard profile, which files they believe are important during disconnected operation. This, of course, introduces conflicting goals at the client: should it keep files that improve locality, or should it keep important files in anticipation of failure? The solution they propose is to periodically perform a "hoard walk," which restores the cache to equilibrium so that no uncached object has a higher priority than a cached one. This way they make a best effort toward disconnected operation while not completely sacrificing performance.
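The priority balancing described above can be sketched as follows. This is a toy illustration, not Venus's implementation: the names (`CacheEntry`, `hoard_walk`) and the additive priority formula are hypothetical; the real system combines hoard-profile priorities with recent reference history in a more refined way.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    name: str
    hoard_priority: int   # from the user's hoard profile (0 if absent)
    recency: int          # higher = referenced more recently

    def priority(self) -> int:
        # Combine explicit hoard priority with recent usage, so both
        # "important" and recently used files tend to stay cached.
        return self.hoard_priority + self.recency

def hoard_walk(cached: list[CacheEntry], uncached: list[CacheEntry],
               capacity: int) -> list[CacheEntry]:
    """Restore equilibrium: no uncached object outranks a cached one."""
    candidates = sorted(cached + uncached,
                        key=lambda e: e.priority(), reverse=True)
    return candidates[:capacity]

cached = [CacheEntry("scratch.o", 0, 1), CacheEntry("thesis.tex", 100, 5)]
uncached = [CacheEntry("Makefile", 50, 2)]
new_cache = hoard_walk(cached, uncached, capacity=2)
print([e.name for e in new_cache])  # thesis.tex and Makefile survive
```

Note that the hoarded `thesis.tex` displaces the recently used but unimportant `scratch.o`, which is exactly the trade-off between locality and anticipated failure that the review describes.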
In addition to the algorithms and policies outlined, the authors provide some supporting data for the limited occurrence of write sharing in distributed file systems. It is interesting to see quantitative data on the subject.

Limitations
It seems that the main limitation with Coda will be the difficulty of merging certain files. While it’s nice to see that write sharing is infrequent, it will still happen, and probably in unexpected ways. Perhaps instead of showing the occurrence of write sharing, it would be equally interesting to see the types and frequency of common write sharing patterns. Maybe if we understand the circumstances which create write conflicts, we can reduce the chances of these patterns somehow.

Applicability
It is certainly the case that we are in an age of increasing internet connectivity, where periods of isolation are increasingly short. This is not to say that network failures don't happen, so something like Coda could still be useful in certain applications. My main concern about using such a system is that I would be forced to take part in the re-integration process, which sounds like it could be painful, especially if I didn't personally make the changes to a file.

Summary
In this paper, the authors try to show that disconnected operation in a file system is feasible. They use caching of data to improve availability once a failure occurs. They design Coda, a location-transparent shared Unix file system. Coda runs on a collection of trusted and untrusted Unix workstations in a typical academic and research environment.
In this paper they focus on availability, which they pursue through both replication and mechanisms that mask disconnection from the file servers. In doing so, they place a lot of functionality on the client machines. Thus, once a file is retrieved from the server, all operations on the file are performed against the client cache, and the updates are propagated back once the file is closed.
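The open/close semantics described above can be sketched as a toy model (not Coda's actual API; the `Server` and `Client` classes here are hypothetical): open fetches a complete copy into the client cache, reads and writes hit only the local copy, and close is the point at which updates propagate back to the server.

```python
class Server:
    def __init__(self):
        self.files = {"a.txt": b"hello"}

    def fetch(self, name: str) -> bytes:
        return self.files[name]

    def store(self, name: str, data: bytes) -> None:
        self.files[name] = data

class Client:
    def __init__(self, server: Server):
        self.server = server
        self.cache: dict[str, bytes] = {}

    def open(self, name: str) -> None:
        # Whole-file transfer: a cache miss can only occur here.
        if name not in self.cache:
            self.cache[name] = self.server.fetch(name)

    def write(self, name: str, data: bytes) -> None:
        self.cache[name] = data  # local-only until close()

    def close(self, name: str) -> None:
        self.server.store(name, self.cache[name])  # propagate on close

server = Server()
client = Client(server)
client.open("a.txt")
client.write("a.txt", b"updated")
assert server.files["a.txt"] == b"hello"   # update not yet visible
client.close("a.txt")
assert server.files["a.txt"] == b"updated"
```

Because the server is contacted only at open and close, the same client code keeps working during a disconnection, as long as the needed files were cached beforehand.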
Problem statement
When we use distributed systems, we rely on resources that are not on our own machine. Once a remote failure occurs, our operations are affected, even though they could potentially be completed on the local machine. The problem is how to continue functioning once a remote failure occurs.
Contributions
- Design and implementation of the different phases of disconnected operation (Hoarding, Emulation, Reintegration)
- Design of cache manager (Venus) which dynamically obtains and caches volume mappings.
- Design of a mechanism for propagating the modifications to the servers and conflict resolution once a disconnected client comes back.
- Use of caching along with replication to maintain high availability and mitigate the cost of replication.
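The three phases listed above can be sketched as a small state machine. The state names and transition triggers come from the paper; the class itself and its method names are illustrative, not Coda's implementation.

```python
class Venus:
    def __init__(self):
        self.state = "hoarding"        # normal, connected operation
        self.replay_log: list[str] = []

    def disconnect(self):
        # Connection lost: serve requests from the cache and log updates.
        self.state = "emulation"

    def update(self, op: str):
        if self.state == "emulation":
            self.replay_log.append(op)  # replayed at reconnection

    def reconnect(self) -> list[str]:
        # Propagate logged updates; conflict detection would happen here.
        self.state = "reintegration"
        log, self.replay_log = self.replay_log, []
        self.state = "hoarding"         # back to normal operation
        return log                      # updates sent to the servers

v = Venus()
v.disconnect()
v.update("store a.txt")
print(v.reconnect(), v.state)  # ['store a.txt'] hoarding
```

A real reintegration would, of course, replay the log against the servers and surface any conflicts rather than simply returning it, but the sketch shows how the three phases connect.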
Critique
In Coda, voluntary and involuntary disconnections are treated the same. It might be useful to perform a graceful disconnection when the disconnection is voluntary. Some cooperation could be expected from the user in this case to save specific state before disconnecting (much like the "safely remove hardware" feature in operating systems).
This paper is similar to Locus in that both try to provide some sort of transparency to the user. We would still have the same problem that users' expectations of the system might become high, while full transparency cannot be achieved in some cases: some conflicts need manual intervention to be resolved.
Similar to Dynamo, they have to compromise consistency in order to achieve high availability, which is the central focus of this paper. This makes sense for the academic/research environment in which Coda operates, but the system cannot be used for services that require strict consistency guarantees.
Applications
An important application of disconnected operation is in supporting portable machines (laptops/mobile phones,…). If we could conceal the disconnection from the user, this would be very convenient. Another application is in disasters or other unexpected events where clients and servers become disconnected. In such cases, it is very useful to have a period during which some operations can continue and service is not interrupted until the issue is resolved.
