
Disconnected Operation in the Coda File System

J. J. Kistler and M. Satyanarayanan, Disconnected Operation in the Coda File System, Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, October 13-16, 1991, pages 213-225.

Reviews due Tuesday, 3/16.

Comments

Summary:
This paper describes the design and implementation of disconnected operation in the Coda File System, a successor to the AFS system.

Problem Description:
Disconnected operation allows users to access critical data during temporary failures. Coda is designed for an environment very similar to that of AFS: several trusted Unix servers and many untrusted clients, including mobile nodes. Concurrency is low and there is no need for fine-grained write sharing.
The issues considered in the design include scalability, availability, and consistency. The techniques and decisions used to achieve these goals include callbacks, whole-file caching, optimistic replication, and caching on servers as well as clients. Although the central idea, caching, is simple, the implementation is complicated. The paper aims at addressing the various issues of implementing such a caching-based system.

Contributions:
The caching management:
There are three states: hoarding, which prepares for possible disconnection; emulation, which handles requests from the local cache; and reintegration, which propagates changes back to the server. Hoarding uses a prioritized algorithm and maintains cache equilibrium. Emulation uses logging for future reintegration.
The authors implemented a real system and tested it. The results show that disconnected operation can be feasible and efficient.
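The hoarding/emulation/reintegration cycle described above can be sketched as a tiny state machine. This is a minimal illustrative sketch, not Coda's actual implementation; the class and method names are invented, and transitions are simplified to connect/disconnect events.

```python
class VenusStateMachine:
    """Toy model of the three client states the paper describes."""

    HOARDING = "hoarding"            # connected: prefetch high-priority objects
    EMULATION = "emulation"          # disconnected: serve from cache, log updates
    REINTEGRATION = "reintegration"  # reconnected: replay the log at the server

    def __init__(self):
        self.state = self.HOARDING
        self.replay_log = []

    def disconnect(self):
        # Disconnection (voluntary or involuntary) moves Venus into emulation.
        if self.state == self.HOARDING:
            self.state = self.EMULATION

    def update(self, path, data):
        # While emulating, mutating operations are appended to the replay log.
        if self.state == self.EMULATION:
            self.replay_log.append(("store", path, data))

    def reconnect(self):
        # On reconnection, logged updates are propagated, then hoarding resumes.
        if self.state == self.EMULATION:
            self.state = self.REINTEGRATION
            self.replay_log.clear()  # stand-in for replaying the log at the server
            self.state = self.HOARDING
```
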

Applications:
Just as the paper points out, the Coda system is useful in certain environments where shared writes are rare. Indeed, the experimental results showed very few conflicts in the reintegration state. This mode of file access is common now. Also, the number of mobile users is increasing these days, and wireless networks are not reliable enough.

Summary and description:
The paper describes how "disconnected operation" could be used in the Coda file system. Disconnected operation is a method of providing clients access to the file system data even after there has been a partition between the client and all relevant servers actually hosting the data. The method works by having a replica of the data in the client, and using the replica to manage client requests when a partition happens. The client replica is not a full replica, and is implemented on a cache, with the data being replicated determined by usage patterns and explicit user information.

Contributions:
The basic idea behind the paper is quite simple - caching file system data on the client so that it can be used when the servers are disconnected. The major contribution of the paper is how various design challenges were overcome in the implementation of the idea. The ones I liked were:
1. Using optimistic replication in a real-world, practical setting to maintain concurrent updates of data during disconnected operation.
2. Transactional logging during emulation to achieve this.
3. Determining which data should be replicated in the cache through hoarding.

Thoughts and real world usage:
I believe that the paper was written very nicely. It was easy to understand, and the authors had done a very good job of justifying the motivation and evaluating the implementation. The authors also identify real-world situations where disconnected operation is useful - for applications which do not exhibit concurrent, fine-grained accesses.

This work adds the ability for disconnected operation in the Coda file system. They use optimistic replica control, allowing clients without a connection to the server to continue to operate over cached data.

The problem here is twofold: 1) client machines may lose their connection to the server at some point, and 2) portable computers may intentionally be disconnected from the network yet still need access to some of their files. This is intended for applications that are not highly concurrent.

The contributions of this work include:
1) The ability to perform disconnected operations through client-side caching.
2) An approach that focuses on availability and the ability to tolerate network partitioning (consistency is sacrificed).
3) A realization of the similarity between voluntary and involuntary disconnections from the servers.
4) The ability to decide which files/directories to cache for voluntary disconnections.
5) A replay log that helps resolve the changes between the client and the server after reconnection.
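Point 1, client-side caching, hinges on Coda's whole-file model: a miss can only occur at open time. Below is a toy sketch of that behavior under the assumption that a dict stands in for the server and file contents are plain strings; real Coda caches whole files on the client's local disk.

```python
class CachingClient:
    """Toy whole-file cache: fetch on first open, serve locally afterwards."""

    def __init__(self, server):
        self.server = server      # stand-in for the file server: path -> contents
        self.cache = {}           # local whole-file cache
        self.connected = True

    def open(self, path):
        # A miss can only happen at open time; once the whole file is
        # cached, reads and writes are purely local.
        if path not in self.cache:
            if not self.connected:
                raise FileNotFoundError(f"{path}: not cached while disconnected")
            self.cache[path] = self.server[path]  # fetch the entire file
        return self.cache[path]
```

A cached file thus remains readable after disconnection, while an uncached one surfaces an error, mirroring how a cache miss during emulation appears as a failure to the user.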

This work appears to be most appropriate for a single person operating over their own files, as concurrent operations will be a challenge. Conflict resolution during log replay seems quite primitive and probably requires manual intervention. A system like this could easily be compared to version control software such as Subversion.

The paper presents Coda, a distributed file system that handles disconnected operation transparently to its users. It is essentially the AFS system enhanced with persistent caching behaviour at clients, giving users at those clients uninterrupted access to at least some of their files even during disconnection from the servers.

The problem that the paper addresses is one of providing additional capabilities to the file system so that portable computing devices that may be using the distributed file system can continue to do so even if they go out of range of the network or have access to only a subset of the replicated file servers. And of course as a side effect, this also helps all the traditional users during times of network outages or server failures.

The paper presents one major aspect of Coda - disconnected operation - and defers details of its other main aspect - replicated servers - to a different paper. In summary, the operation of the Coda client on users' machines is split into three modes: hoarding, emulation, and reintegration. The first state covers normal operation with uninterrupted access to at least a non-null subset of the file servers. During this time, Coda handles updates to files as any other distributed file system would, in addition to caching some of the files recently accessed by a user on his local persistent storage; updates are reconciled as soon as possible with the servers holding replicas used by other users. During disconnected operation, Coda emulates the file system by servicing requests from the local cache and logging any modifications so that they can be reconciled with the servers when eventually reconnected. Reintegration needs to handle conflicting writes and sometimes defers to manual intervention.

The system is very interesting since it was created at the beginning of the emergence of mobile devices, but for some reason it is not as widely adopted as AFS or NFS are today. Also, given that most present-day systems have much larger local capacities than are allocated per user on a distributed file system (which may not have been the case 20 years ago), Coda would probably work even better today than it did then. Some correspondence does exist with other applications, like web mail clients that have started allowing offline operation, as well as with versioning systems that allow multiple update branches. Possibly the confluence of these two technologies can go a long way toward reintroducing this file system.

The paper talks about the Coda file system, which has the capability to operate in a disconnected fashion (disconnected from the network), thereby improving availability in the presence of network failures. The clients use a cache manager, Venus, which does whole-file caching and callback-based cache coherence. Callbacks are provided by servers to clients when the cached copies are no longer valid. It follows an eventual consistency model in which local modifications made in the disconnected state are eventually propagated to all the server replicas.

A cache miss can occur only on an open, and this improves scalability. User assistance is used to decide what to cache. Users can specify cache object priorities using hoard profiles, and cache equilibrium is maintained by making sure all objects of the same priority have the latest copy stored in the client. There are two kinds of replicas - server replicas and the cached objects in the clients. For consistency purposes, Coda maintains server replicas in addition to the cached objects used during disconnected operation.
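The prioritized cache management mentioned above combines an object's hoard-profile priority with how recently it was used. A minimal sketch follows; the weights ALPHA/BETA and the 0-to-1 recency score are illustrative assumptions, not the paper's exact formula.

```python
# Assumed weighting between explicit (hoard profile) and implicit (recency)
# information; Coda's actual parameters differ.
ALPHA, BETA = 0.75, 0.25

def cache_priority(hoard_priority, recency):
    """Combine a hoard-profile priority (e.g. 0-100) with a recency
    score in [0, 1] into a single eviction priority."""
    return ALPHA * hoard_priority + BETA * recency * 100

def victim(cache):
    """cache: path -> (hoard_priority, recency).
    Return the lowest-priority path, the candidate for eviction when
    restoring cache equilibrium."""
    return min(cache, key=lambda p: cache_priority(*cache[p]))
```

With these weights, a hoarded system binary that has not been touched recently still outranks a scratch file that was used moments ago, which is the point of mixing explicit and implicit information.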

Replication can be pessimistic or optimistic, based on whether or not the object is locked when updates happen. Optimistic replication uses conflict resolution strategies to handle multiple updates to the same object. A log-and-replay mechanism is used to propagate updates to the global state.

Contributions
1. Usage of the cache for availability rather than the conventional usage for performance
2. User-assisted cache object selection via hoard profiles
3. Hoard walking, to continuously keep the cache in an updated and useful state (storing objects by priority and maintaining cache equilibrium)
4. Updating the global copy using a log-and-replay mechanism

Relevance to distributed systems
A disconnected state can be treated as a kind of failure, and Coda can thus be thought of as a system which remains available even during failures. The availability is achieved using local caches. Synchronizing the local cache with the global copy maps to hoarding, and propagating updates from the local copy to the global state maps to reintegration. Replica control strategies (optimistic/pessimistic) and conflict resolution strategies for such a distributed system are also discussed. User-assisted conflict resolution and a cache object selection policy are implemented.

Summary:-
---------
With the increased use of portable devices, the paper looks at the design of the Coda distributed file system, which supports disconnected operation for seamless use of these devices over intermittent or broken networks.

Coda borrows design principles from AFS and trades off consistency for achieving high availability. The two techniques of server replication and disconnected operation are used to achieve high availability.

To allow disconnected operation, Coda clients (Venus) cache required portions of data in the hoarding (connected) state. Users can specify the content to be cached using hoard profiles, which list the most important directories, files, etc. Coda uses prioritized cache management based on hoard profiles and usage patterns. Upon a disconnection, the Coda client moves to the emulation state and acts as a pseudo-server, logging and persisting changes to the file system on local disk. Finally, the reintegration state is entered when the contents of the log are replayed at the server.
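Hoard profiles are essentially lists of paths with priorities. A toy parser for profile-like entries is sketched below; the line format here ("add <path> <priority> [d+]", where "d+" marks descendants) is an illustrative assumption, not Coda's exact hoard tool syntax.

```python
def parse_hoard_profile(text):
    """Parse hypothetical hoard-profile lines into (path, priority, descend)
    tuples. '#' starts a comment; blank lines are ignored."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        # parts: ["add", path, priority, optional "d+" for all descendants]
        path, prio = parts[1], int(parts[2])
        descend = len(parts) > 3 and parts[3] == "d+"
        entries.append((path, prio, descend))
    return entries
```
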

As we can see, Coda uses optimistic replication of data. Hence, we need conflict resolution mechanisms and users may need to be bothered to resolve conflicts.

Contributions:-
----------------
1) Seamless integration of portable devices with disconnected operation.
2) Experimental evidence that file system users typically share files carefully and not in an ad-hoc manner.
3) Mechanism of specifying hoard profiles.

Relevance:-
------------
Disconnected operation is very useful. And in today's world with more portable devices with good storage capacity, it makes sense to maintain required personal content on mobile phones which may not be connected to the internet all the time.

I think optimistic replication works successfully for file systems because file accesses are typically controlled through permissions and are mostly single-user accesses (personal content) or multi-user read accesses. Less frequent are files (project files) modified by multiple users, and in that case people are careful enough to coordinate access to these files through some other means.

But given that the network is more prevalent today, disconnected operation may not be required for doing offline programming. And the job of maintaining personal data can be pushed to the application domain; e.g., the web browser can cache frequently accessed pages, and the mail client can cache the most recent/important mails.

This paper discusses the Coda file system, which allows for disconnected operation in order to maintain high availability. The central idea behind maintaining high availability is to cache data on the client machine. When a user is working on some files, they have a local copy in their machine's cache. This is generally done for performance, but it can also allow the user to continue working if they become disconnected from the remote server that stores the files. This problem is important because there are situations where files a user needs are stored on a distributed system (such as a distributed file system), all or part of that system is unavailable, and the user therefore cannot access their files.

To solve this problem, the authors implement Coda - a file system designed to work well with portable clients that disconnect from the system periodically. They do this by providing a cache manager, Venus, on each user's machine that is responsible for maintaining the cache. Coda keeps its goal of high availability by using server replication of stored files and by allowing Venus to take over upon disconnection and provide the user with the files currently available in their cache until they are reconnected. When the user is reconnected, Venus merges the user's files with those currently on the server.

The idea of still being able to work when your files are stored on a remote server and you are currently disconnected from that server is a good one. I can remember multiple times when I was unable to reach the server holding my files and was thus unable to do anything, so I think this problem is certainly one shared by many users. I'm not certain that their technique would work well with the larger systems we have today, because users want to access more data. Also, in terms of keeping a local copy and merging it with the stored copy, it sounds like they invented an early version of Subversion.

This paper describes CODA, a filesystem targeted at providing continuous operation even when the client is disconnected from the server. The basic idea is to cache data at the client side, and serve data from that cache in the event of a failure.

CODA is much like AFS, with the additional functionality of disconnected operation. The client-side cache manager in CODA (Venus) can exist in one of three states: hoarding, emulation, or reintegration. When in the hoarding state, Venus foresees disconnection and starts caching important data from the server. Users can set up hoard profiles indicating which files would be of interest to them. Venus goes into the emulation state when disconnected. In the emulation state, Venus behaves like a proxy server, handling requests just as the server would, and logs the operations it performs. When the connection comes up again, Venus reintegrates with the server. It is possible to have conflicting copies at different clients, and CODA has ways to handle them.

The main contribution of this work is the design of a scalable network filesystem that continues to operate when the server is unreachable. They choose client side replication as a way of enabling this. The authors have also implemented CODA and have deployed it in their lab.

Overall, I liked the paper. It presented the problems involved with disconnected operation very clearly. However, the idea of local caching and updating changes is very simple and straightforward. The authors say that their system is transparent, but I think it's only location-transparent. The user still has to provide hints about which files will be accessed in the future. I am not sure how predictable file accesses are, and this is key to the effectiveness of CODA. I also wish conflict resolution had been discussed in more detail.

Summary:
This paper discusses the disconnected operation implemented in the Coda file system. The data is made available to the user even in the event of temporary failures.

Problem Description:
Distributed file systems offer a number of advantages to users. When they fail temporarily, users cannot access the data even though their personal systems have adequate resources. This paper attempts to solve this problem by caching files on the user systems and serving users from this local cache during failures.

Contributions:
1. Extending the local caching concept employed by some file systems to support disconnected operation by caching the entire file instead of just a part of it.
2. A clearly defined state machine at Venus, the file system client. It operates in hoarding, emulation, and reintegration phases.
3. Using storeids to resolve conflicts, and the four-phase replay algorithm during the reintegration phase.
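The storeid check in point 3 can be sketched as follows. This is a hedged simplification: storeids are modeled as plain integers (real Coda storeids identify the last update to an object), the server is a dict, and only write/write conflicts on stores are considered.

```python
def replay(server, log):
    """Replay logged stores at the server, certifying each with a storeid.

    server: path -> (storeid, data)
    log entries: (path, base_storeid, new_storeid, data), where base_storeid
    is the storeid the client saw when it cached the object.
    Returns the list of conflicting paths, which are left for the user.
    """
    conflicts = []
    for path, base, new, data in log:
        current = server.get(path)
        if current is not None and current[0] != base:
            # The object changed at the server since the client cached it:
            # a write/write conflict, deferred to manual resolution.
            conflicts.append(path)
        else:
            server[path] = (new, data)
    return conflicts
```
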

The paper does a good job of explaining the difference between the first class replication using servers and second class replication by caching at the clients.

Applicability:
Many interesting ideas for improving the availability of a distributed file system are suggested. However, as noted in the paper, this model is not effective for applications needing highly concurrent data access.

Continued operation in the face of network failures or unreachable server nodes is essential for progress and a good user experience. This is the problem being addressed by the disconnected mode of Coda filesystem.

The fact that all the clients are equipped with disks motivates the use of a larger cache at the client side. The cache, usually used to improve system performance and throughput, is used here to increase the availability of data. The key idea is to cache data of interest to the users while the system is connected to the network and serve data from this cache in the case of a network failure. Like AFS, the clients run Venus, which is responsible for all communication with the server. The paper discusses many design choices that were available and justifies each of the decisions made. It talks about first-class replicas (server replication) vs. second-class replicas (caches at the clients' end), optimistic vs. pessimistic replica control, etc. The system operates in three main states: hoarding (actively fetching data of interest, pointed to by the hoard database and profiles, from the server while maintaining a balance with the data required for the completion of the current operation), emulation (providing data from the local disk in a transparent way in the wake of unreachable nodes), and reintegration (making the network-wide replicas consistent with the local cache updates, on reconnection, using the log created in the emulation phase).

The paper presents a practical design to show that disconnected mode of operation is possible and can be implemented with reasonable efficiency and consistency guarantees. The idea seems to be very applicable when considering the existence of all kinds of portable devices. Users could be under the illusion of continued operation / connectivity in the presence of disconnections owing to their mobility.

Problem Description:
In a shared data repository the user is often affected by remote failures that last for a short duration. This paper aims at achieving availability during temporary failures (what is called disconnected operation) by caching data on the client side and updating the server once it is available again. The authors try to achieve this aim without losing scalability, performance, or consistency.

Summary:
Coda is a file system suited for research and academic purposes where high concurrency is not needed. It uses replication to achieve availability, but for cases like portable clients that detach from the network, the data should be available even during the detached phase. Such disconnected operation is highly desirable on portable machines. One observation is that on portable workstations the user performs manual caching of the data he needs during disconnected periods, which can be automated to ensure high availability. This is the technique used by the paper. For scalability, techniques such as whole-file caching and callback-based cache coherence are used. To implement disconnected operation, the cache manager, Venus, operates in one of three states: hoarding, where the client is connected to the network; emulation, where the client is disconnected; and reintegration, where the client synchronizes with the server. Venus uses prioritized cache management based on recent reference history and user preferences in the form of data from the hoard database. Venus performs hoard walking to ensure that the client is in equilibrium (the cached objects are in accordance with priority) with the server. This technique also addresses callback breaks. During the emulation phase, Venus becomes a pseudo-server and caches the updates. It stores the updates in the replay log in an optimized fashion to minimize the cache space used for storing the logs. Also, to store cache metadata persistently, Venus uses Recoverable Virtual Memory, for which transactional access is available, ensuring the system is taken from one consistent state to another. During the reintegration phase, Venus obtains permanent file ids and replays the changes in four phases. To resolve write-write conflicts, Venus uses the storeid that indicates the last update on a file and takes decisions appropriately.
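The "optimized fashion" of keeping the replay log small can be illustrated with one of the cancellation rules the paper describes: a later store to a file makes earlier store records for it redundant. The sketch below applies only that rule, under the assumption that log entries are simple (op, path, payload) tuples.

```python
def optimize_log(log):
    """Drop store records that are superseded by a later store on the
    same path; all other operations are kept in order.

    log entries: (op, path, payload) tuples, e.g. ("store", "/f", "v2").
    """
    last_store = {}
    for i, (op, path, _payload) in enumerate(log):
        if op == "store":
            last_store[path] = i   # remember the final store per path

    return [entry for i, entry in enumerate(log)
            if entry[0] != "store" or last_store[entry[1]] == i]
```

Because whole files are cached, only the final contents matter at reintegration, so dropping superseded stores saves both log space and replay work.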

Contributions:
This was a really interesting paper and an easy read too. The following points are worth noting:
1) The paper provided good analysis between first class and second class replication and analyzed their tradeoffs
2) The state machine for Venus to adapt to the situation of connectivity
3) The Hoard walking technique to stay in equilibrium with the server
4) Prioritized cache management considering both explicit and implicit sources
5) The four phase replay algorithm followed during reintegration

Applicability to real systems:
The usage of portable devices is increasing rapidly, and short-term network failures are unavoidable. Hence this technique for ensuring high availability is well applicable to real systems involving portable devices where high concurrency is not the primary goal.

Problem
To support disconnected operation in the distributed file system called Coda. The problem involves conflict resolution and replication after the connection is re-established.

Summary
Coda is a descendant of the AFS distributed file system. The file system consists of multiple servers that replicate each other. Using a client-side cache, optimistic replica control, and a replay log, Coda can support disconnected operation. Furthermore, it adopts prioritized cache management and hoard walking, which allow users to choose the files to be cached based on priority. Thus, the user can access the file system as if he were connected to the servers, as long as there is sufficient storage space. To simplify the error model, it uses whole-file caching. To resolve conflicts that arise during a disconnection, Coda stores a replay log and leaves resolution to the user.

Contribution
It provides nice transparent access even in a disconnected environment. For system files stored in the distributed system, it would be scary if the user suddenly lost access to them. Because of hoard walking and prioritized cache management, the Coda file system can behave like rsync. For example, if you give system files a sufficiently high priority, you will most likely retain access to them; it looks as if they had been copied into local storage. Thus, one can store system files in Coda, which is an advantage for system maintenance.

The adoption of a replay log is a logically clean concept with various uses. Crash recovery and reintegration can both be implemented with the replay log. Also, in the case of a conflict, the user can later resolve it by looking at the replay log. Thus, the replay log is a safety net for the distributed file system.

Using a write-back cache as a quick fix for temporary failures is also interesting. It loses some consistency in certain cases, but it definitely improves the system's interactivity.

Applicability and Question
It seems the system eventually failed, as even CMU does not use Coda anymore. I see that Coda is still implemented in the Linux 2.6 kernel. However, unlike AFS, which successfully became an open-source project, Coda's development has stalled. So Coda itself is, I guess, a dead product.

The concept of disconnected operation seems similar to that of distributed version management systems. Subversion and Git treat disconnected operation as a first-class citizen, and synchronization occurs rarely. Even though Coda and version management systems rest on fundamentally different assumptions, Coda's hoard walking and replay log can emulate the behavior of a version management system: checking out files and committing.

Summary:
This paper presented a distributed file system, Coda, which provides continuous data access even in the face of remote failures. Coda actively caches high-priority files to the client's local disk while the connection is normal, and emulates a server itself when the connection fails. Finally, when the connection recovers, Coda reintegrates the client's modifications to the server for synchronization.

Problem Description:
The problem this paper tries to solve is to design a distributed file system that is capable of disconnected operation. Before Coda, distributed file systems like AFS had already become popular, but none of them could continue to provide an available file system when either a voluntary disconnection (such as unplugging a computer) or an involuntary disconnection (such as a network failure) happened. This paper argues that such disconnections are common and that the unavailability during a disconnection brings much inconvenience to users. Therefore, the system this paper proposes has great importance.

Contributions:
Unlike theoretical research, there are many design choices when building a practical distributed system, and Coda is a typical example. We cannot argue whether Coda's choices are correct or not, but one of the contributions of Coda is that it discusses such tradeoffs based on OS properties. One example is the choice between optimistic and pessimistic replication. Files are not automatically locked on open in UNIX, and the design of UNIX systems and programs follows a style in which write sharing is typically low. Therefore, optimistic replication enhances availability while not inducing many conflicts. Other examples include returning an error code and letting the OS handle a file cache miss, different strategies for file and directory conflicts during the reintegration stage, and the use of a transactional library to manipulate metadata.

Another merit lies in its (fairly complicated) prioritized algorithm for determining which files to hoard. The list of files to be hoarded is important for the success of solving the disconnection problem, since a missing file in the local cache when disconnected is semantically equal to a failure in terms of user satisfaction. Coda provides a database for users to specify their personal files, and leverages a recent-use metric for other files. It also has mechanisms for temporarily escalating the priority of modified files during the emulation stage. This can (probably) result in an accurate hoarding file list.
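One consequence of escalating the priority of modified files during emulation is that dirty objects, whose updates exist only in the replay log, should never be evicted before reintegration. A hedged sketch of such an eviction policy (the representation of cache entries is an assumption):

```python
def choose_victim(cache):
    """Pick an eviction victim from a disconnected client's cache.

    cache: path -> (priority, dirty). Dirty objects carry updates that
    exist only in the replay log, so they are never eligible victims.
    Returns the lowest-priority clean path, or None if every cached
    object is dirty (nothing can be safely evicted).
    """
    clean = {path: entry for path, entry in cache.items() if not entry[1]}
    if not clean:
        return None
    return min(clean, key=lambda path: clean[path][0])
```
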

Real Applications:
Coda is a distributed file system intended to minimize work interruption when a remote failure occurs. However, the evaluation does not show an example of a real conflict occurring in an application (it only reports the likelihood that an application could lead to a conflict). Basically, during the reintegration stage, Coda replaces the existing version with the new version with respect to a logical timestamp (version number). Whenever more than one newer version exists, Coda replaces the original version with the one that arrives first, and records the remaining new versions on disk to let users manually resolve the results. But such user intervention might cause other problems, especially if the conflicted file is a system file.

The ultimate goal of Coda seems to be providing a non-interrupted environment for the user. But sometimes a consistent file system by itself is not sufficient; users may experience other significant changes, such as moving from a work machine to a home machine. Therefore, some other techniques (such as virtual machine migration) should be used in addition to Coda to provide a truly seamless environment.

This paper looks at disconnected operation in distributed file systems. It is an extension of AFS that adds caching for greater data availability. The main point is that during connection there is a hoarding phase, after being disconnected there is an emulation phase, and upon reconnection there is a reintegration phase. These three steps make up the majority of the Coda concept.

After reading this paper, I found that there was a solid distributed model, but nothing much to go on. I discussed the paper with Tack, and he said the only major contribution was the measurements the paper made. Clearly this was the first in this area of research, but nothing proposed was more than obvious. As such, the measurements are what is interesting. They show that this type of system allows for availability while being disconnected. What is lacking is the user experience problem when there is a conflict between the cached version and the live version.

This work proposes something that allows for disconnected work, all without the need for locks. The idea of disconnected changes comes out of the need for progress without access to the centrally stored data. Any changes made against the central data must then be reintegrated before the copy on the main AFS host is changed, otherwise they may conflict. Where this paper excels in separating consistency from the object, it fails to advance how the user interacts with this reconciliation. While Coda may work in production environments, it does not necessarily scale to a level where it will work well.

Overall, I liked the design of disconnected Coda, but I was left wondering how it would work in practice. The idea that a user can disconnect and reconnect without having to deal with issues is wonderful. The problem is that if there are changes in the meanwhile, these are not easy to reconcile and must be manually overcome. To me, this is the main failure of this paper.

Coda allows operations to be done on files that are maintained by remote servers, yet are being cached on possibly disconnected machines. It is the successor to AFS, but doesn't seem to do much beyond disconnected operations.

Certainly we have disconnected operation now in the form of SCM tools like CVS and Git. They shift a lot of burden onto the user, forcing users to actively participate in every transaction and in any conflicts that arise. Though we have great SCMs, they are trivial in many respects. They do suggest that disconnected filesystem operations are worthwhile to consider.

One key observation made by the Coda developers is that the Unix file model is successful even when unsynchronized. Most files are used directly by a single user. Collaboration on shared files is rare, so the probability of conflict is low. Administrators rarely perform operations simultaneously. This leads to their use of optimistic replica control (eventual consistency).

So Coda seems useful, but it doesn't seem like they really did a whole lot of interesting work. The primary feature was hoarding: caching files that are likely to be used while there is still time. It seems rather straightforward, though. I would have liked to know what kind of overhead this has on the network. It wasn't compared to anything!

It mentions transparency multiple times, but it does not achieve it successfully as is. Users must specify hoard profiles, a listing of files or directories that should be cached. Another solution is for profiles to be provided. In any case, the transparency is gone.

A user cannot know what they are likely to need, or what the hoarded files may require. It seems a profiler of some sort could do a better job analyzing usage patterns, considering it requires no effort on the part of the user. In the case of binaries, it should be easy to determine what dynamic libraries were linked to at compile time. Any other runtime requirements might be noticed by a good profiler.

As an aside, what does 'stashing' mean? At first it just sounds like a different word for 'caching', but the authors say 'the FACE filesystem uses stashing but does not integrate it with caching'.

Summary:
The main idea of the Coda file system is to improve availability by serving data from the client cache when the system encounters disconnections. It can be regarded as an extension of AFS, which stores whole files at the client and uses a callback mechanism to maintain coherence. Coda's refinement of AFS is adopting an optimistic replica control strategy, prioritized cache management, and reintegration via a replay algorithm.

Problem:
While a distributed file system provides an efficient method for data sharing, clients still want to retain a certain independence, especially to avoid being blocked by remote failures or network disconnections.
To achieve this goal, the system designers need to address how to fetch required data from the server, efficiently manage that data in the client cache, and reintegrate divergent modifications of a file when the client reconnects to the server.

Contributions /novel idea
1. Use an optimistic replica control strategy, which emphasizes availability over consistency and is based on the assumption that write sharing is rare.
2. Efficient client cache management, which combines a static pre-determined priority (explicit information) with dynamic adjustment (implicit information). Perhaps the idea was inspired by process scheduling.
3. The reintegration method.

Practical Applications/ Comment:
Perhaps an illustration of the Coda idea is the SVN tool. The checkout operation is like opening a file; the commit operation is like reconnection and reintegration. The main difference is that Coda must manage the client cache efficiently, while SVN needs no such complicated mechanism.
The use of SVN also makes me wonder whether Coda could be implemented like SVN, in that the server keeps a master copy of each file (like trunk in SVN) and each client just writes back its modifications (like a branch in SVN). The server maintains the master copy and separately stores recent updates by each client. Only after an update is validated by the server is it transferred to trunk and readable by other clients. The advantage is that the network cost is much smaller and the reintegration process should be much easier.
Another consideration is that Coda relies heavily on the client, which is a little different from current trends; in cloud computing the client is often very simple.

Goal:
Coda uses the whole-file caching mechanism inherited from AFS to provide mobility by allowing users to operate on cached files while disconnected. The system uses both user-specified configuration and usage patterns to predict which files should be cached, and it tries to resolve conflicts when synchronizing with the main servers.

Problem:
When most services are provided by centralized servers, a client machine becomes useless when the network is disconnected. However, allowing clients to operate while disconnected leads to synchronization problems that require conflict resolution.

Contribution:
AFS uses whole-file caching to improve file access performance. Coda extends this mechanism to support offline operation, so brief and long-term disconnections are treated similarly. Coda's client has three modes of operation: hoarding, emulation, and reintegration.

In hoarding mode, Coda uses a configuration file and file access patterns to generate file priorities. The client cache is considered to be in equilibrium when no uncached file has a higher priority than a cached one.
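The equilibrium condition above can be made concrete with a toy sketch. The weighted combination of explicit (hoard database) and implicit (recent usage) information mirrors the paper's description, but the weights, names, and tuple layout here are invented for illustration.

```python
# Toy sketch of Coda-style hoard priority. Weights and names are
# illustrative assumptions, not taken from the paper's implementation.
HOARD_WEIGHT = 0.75    # weight of the explicit hoard-database priority
RECENCY_WEIGHT = 0.25  # weight of the observed recent-usage metric

def priority(hoard_priority, recency):
    """Combine explicit (hoard database) and implicit (usage) information."""
    return HOARD_WEIGHT * hoard_priority + RECENCY_WEIGHT * recency

def in_equilibrium(cached, uncached):
    """Cache is in equilibrium when no uncached object outranks a cached one."""
    if not cached or not uncached:
        return True
    return min(priority(*c) for c in cached) >= max(priority(*u) for u in uncached)

# Example: an uncached object with a high hoard priority breaks equilibrium,
# which is what a periodic hoard walk would detect and repair by refetching.
cached = [(10, 0.9), (50, 0.1)]
uncached = [(80, 0.0)]
print(in_equilibrium(cached, uncached))  # False
```

A periodic "hoard walk" in this model would simply refetch the highest-priority uncached objects until the equilibrium predicate holds again.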

When disconnected, emulation mode allows the user to operate on files that are in the cache. Since a whole file is either entirely present or entirely absent, cache misses occur only when the open() system call is invoked. Each client is pre-allocated a unique set of file ids to assign to new files so that synchronization is faster. Operations on files are kept in a per-volume replay log; however, a later operation can cancel an earlier one that it renders obsolete.
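The cancellation idea above can be sketched in a few lines. The class and record layout are invented for illustration; the real Venus keeps its log in recoverable storage, but the principle is the same: only the final store of a file matters at reintegration.

```python
# Minimal sketch of a per-volume replay log with the cancellation
# optimization: a new store on a file supersedes any earlier store
# record for the same file. Record layout here is an assumption.
class ReplayLog:
    def __init__(self):
        self.records = []  # (op, fid, payload) tuples in log order

    def append(self, op, fid, payload=None):
        if op == "store":
            # An earlier store of the same file is obsolete: only the
            # final contents reach the server at reintegration.
            self.records = [r for r in self.records
                            if not (r[0] == "store" and r[1] == fid)]
        self.records.append((op, fid, payload))

log = ReplayLog()
log.append("create", "fid-1")
log.append("store", "fid-1", "v1")
log.append("store", "fid-1", "v2")   # cancels the "v1" store record
print([r[:2] for r in log.records])  # [('create', 'fid-1'), ('store', 'fid-1')]
```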

In reintegration mode, the client sends its replay logs to the server to resolve conflicts. The integration fails if any file in the log conflicts with the server's copy; however, the user can inspect the operation log to resolve the conflict manually. Conflict resolution on some objects, such as directories, is done automatically by Coda.

Application:
Mobility is a challenging problem, especially when most of today's services are hosted on the Internet. We cannot expect a client device to be connected to the hosting servers all the time, so allowing client applications to operate on cached data increases the usability of the device. There is a growing trend to include a local database in the web browser, so it is possible that in the future users will not need to connect to the Internet to use certain features of a web application.

Summary
The authors developed a file system that provides availability to a remote resource during disconnected operation through intelligent caching. The file system, Coda, caches commonly/recently used files to speed up access and in anticipation of network disconnection. On network reconnection, updates to files are synchronized with the remote server, with attempts to resolve write conflicts.

Problem
Sufficiently powerful mobile computers are becoming much more common (true in 1991 and even more so today). The nature of mobile computers means they will become disconnected from the server from time to time. Yet mobile computers are powerful enough to handle much of the file access themselves (in fact, users often manually copy files locally, modify them, and copy them back later when disconnections are anticipated). The authors wanted to support this use case transparently in the file system.

Contributions
The authors observed access patterns of AFS users for a year. Their observations showed a very low level of potential for conflicts in a distributed file system meant for sharing. This gives credence to the usefulness of the proposed disconnected operation. Though likely not their invention, the use of optimistic replication and hoarding is a novel choice to make the disconnected operation (mostly) transparent. Finally the authors contributed the actual implementation of a file system that does all this, a notable contribution.

Applicability
I think this file system is good for its intended use: maintaining availability to remote resources that are largely conflict free (low amount of concurrent updates during disconnected operation). For this one, very targeted use case, it performs very well. I know I have copied files from my workstation onto my laptop, only to copy them back later. Thus I could imagine an automated handling of this as useful to me!

Summary:
This paper describes the disconnected operation mechanism in the Coda File System, a descendant of AFS. Beyond exploiting local caching to enhance performance, it discusses and implements disconnected operation, another mechanism for achieving high availability in a distributed filesystem.

Problem:
Although there are plenty of benefits to placing large amounts of data in a remote, distributed repository, a remote failure (referred to as a disconnection) at a critical juncture seriously counteracts these benefits. Coda aims to let users continue critical work when that repository is inaccessible. The key idea behind Coda is caching data on the client.

Contributions:
To achieve high availability, Coda uses two distinct but complementary mechanisms: server replication and disconnected operation. This paper focuses mainly on disconnected operation. Server replication is persistent and secure but expensive; client replication is of relatively lower quality but cheap. The fundamental design point in Coda is to find a good balance between consistency and availability. This is not a novel idea, but the paper gives a complete and comprehensive discussion of it.
The design takes into account many rationales, such as scalability, portable workstations, first- vs. second-class replication, and optimistic vs. pessimistic replica control. This is what I find fabulous about Coda.

Comment:
The design of a distributed system should depend on practical requirements. However, I didn't see a detailed, specific application that Coda aims for (I might be wrong). I would say there is no omnipotent design that satisfies all real-world requirements. Nevertheless, we have learned a good lesson about design choices from Coda.

Summary:

With the advent of portable workstations and distributed file systems, the authors point out the need for disconnected operation. Academic environments, where the amount of sharing is minimal, are the main target, and the key idea is that better availability can be provided by taking advantage of client-level caching. This approach also has performance benefits, but the main problems are choosing which files to cache and reintegrating after temporary network failures. The first problem, choosing which files to cache, is solved by using a hoard database and monitoring file access patterns. The hoard database can be updated interactively by the user to identify critical files. Reintegration is achieved using emulation and replaying of log records. Each client runs a user-level process, Venus, that implements a state machine with Hoarding, Emulation, and Reintegration as states and drives disconnected operation. The system is evaluated in terms of client-side memory required, time taken for reintegration, and number of conflicts; it is observed that the number of conflicts and the client memory required are very low, and that reintegration takes on the order of seconds even though disconnected sessions last on the order of hours.

Contributions:

Using client-level caching with a priority database to provide better availability and performance in a distributed file system.

Choosing open-to-close consistency for higher scalability. Using optimistic replication for better availability and user experience. Using replicated file servers for better reliability.

Using a mini-cache at the kernel level for better performance.

Applicability to real systems:
As the authors point out, disconnected operation does not work well for applications that need high concurrency and fine-grained data access, because of reintegration issues. The system also prefers availability and scalability over consistency, so systems with stronger consistency needs cannot use the technique.

Summary:
This paper talks about the Coda File system's disconnected operation, their motivation for it, and the choices they made in designing it. They present several optimizations that they made, and an evaluation of their system.

Problem:
Considering distributed systems where clients have disks, how can availability be increased in the distributed system? If the client is powerful enough to continue operating for a short while without any contact with the servers, can this be masked for the user?

Contributions:
The primary contribution is that they take the idea of disconnected operation and show that it is feasible to implement and reasonably practical to use in a distributed system.

Applicability to real systems:
As more and more devices connect to the internet, the idea of clients and disconnected operation makes more and more sense.

In today's systems, it is almost a necessity: when your iPhone downloads your email or the latest weather data, you expect it to be there the next time you open your phone, regardless of whether you have connectivity.

If this was the paper that introduced the idea of disconnected operation, then we owe much of our applications today to this idea.

Comments and Questions:
1. I didn't realize that the benefit of transactions and rollback were understood back in 1991. From my understanding, transactional memory is still a hot field, with plenty of work going on, and the basic problem itself does not seem solved. How is this possible if the main idea was realized over a decade ago?

2. After Dynamo, this is the second system that we read about that has a "manual" consistency resolution mechanism. Why don't systems build some complex time-vector or round trip based estimation scheme to order updates and decide which one to commit to disk? I understand that if two disconnected clients make conflicting updates, we can't decide which is right, but in other cases?

This paper proposes a disconnected mode of operation for clients in the Coda file system, intended to let portable computers continue operating for a period of time without connectivity to the file servers. Coda clients are more intelligent, employing various mechanisms to provide transparent access to files during such disconnected operation.

The contribution of the paper is the design and evaluation of a client system that leverages data caching, originally for performance, to increase availability. Three important design requirements for such a client (Venus) presented in the paper are: prioritized cache management and pre-fetching, emulation of the file server, and reintegration of data with the server upon reconnection. Venus maintains a hoard database (HDB) in which users record the objects required for continued operation. The client cache used during normal operation is maintained in an equilibrium state, where the data of objects in the HDB is re-fetched to reflect updates by other Coda clients. During a disconnection from the server, Venus functions as a pseudo-server limited to the contents of its cache. Reintegration is done by a replay algorithm using Venus's replay log, and conflict detection is based on the storeid (an identifier recording the last update). Each replica server compares its storeid with the one in the log to detect conflicts. If the storeids match, integration is performed and the replica is updated; updates to other replica servers are then done asynchronously. If the storeids do not match, the entire integration is aborted and an error is sent back to the client.
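The storeid comparison just described can be sketched roughly as follows. All names and data structures here are illustrative assumptions, not Coda's actual server code; the point is only the abort-on-mismatch behavior.

```python
# Hedged sketch of a storeid-based conflict check at reintegration.
# A replica accepts a replayed update only if the storeid it holds
# matches the storeid the client cached before disconnecting.
def check_replay(server_objects, replay_log):
    """Return True if the whole log can be reintegrated, else False."""
    for fid, cached_storeid, new_storeid in replay_log:
        if server_objects.get(fid) != cached_storeid:
            return False  # fid was updated meanwhile: conflict detected
    return True

def reintegrate(server_objects, replay_log):
    if not check_replay(server_objects, replay_log):
        return False  # entire integration aborted, error sent to client
    for fid, _, new_storeid in replay_log:
        server_objects[fid] = new_storeid
    return True

server = {"fid-1": "s0", "fid-2": "s0"}
ok = reintegrate(server, [("fid-1", "s0", "s1")])        # storeids match
conflict = reintegrate(server, [("fid-2", "stale", "s9")])  # mismatch
print(ok, conflict)  # True False
```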

One application of the disconnected mode of Venus would be compute-intensive work where the data to be operated on is limited, or user-facing response systems where service can be provided from independent data. It would also suit applications such as a text editor, where each local file buffer is independent. However, the increased complexity of the client could be a limitation: the authors mention that the clients are typically portable computers, and for them to act as a pseudo-server with limited hardware could be challenging. Furthermore, the client's local cache contents depend heavily on its local applications, and hence the mechanism is not fully transparent; applications are obliged either to register objects in the HDB or to prefetch in anticipation of failures. It would have been more interesting if the clients were also involved in primary replica management and coordinated among Venus instances; in that case, limited user requests could be serviced from the data available in replicas, independent of the client's past behavior.

Coda is the successor of AFS, which claims to be the first distributed file system aimed at a university-sized user community. The paper discusses Coda's handling of disconnected operation, which is very much a scenario for file hoarding.

Though Coda inherited much of its development impetus from AFS, the main motivating force behind the present work is disconnected operation. It is always challenging in a distributed environment to continue critical work when the repository is inaccessible. The immediate ideas that come to mind are caching and server replication, which enhance performance and availability.

In its general organization Coda assumes little sharing, as it is tailored to access patterns in academic and research environments. The design considerations of Coda include the need for scalability, the advent of portable workstations, and the balance between availability and consistency; the authors also argue for their chosen implementation strategies. Clients cache on local disks, and callbacks maintain cache coherence. The evolution of portable workstations showed that users could manually cache files while disconnected; Coda uses a single mechanism to deal with both voluntary and involuntary disconnections. Server replication, though expensive, is much more physically secure and persistent. The choice of optimistic replica control rests on the observed low write-sharing in Unix-style systems; it buys high availability by permitting access to anything in reach and detecting conflicts afterward. Prioritized cache management is maintained with the help of the hoard database. Venus mediates application requests to the Coda servers and has three states: (1) Hoarding, the normal operation mode; (2) Emulation, the disconnected operation mode; and (3) Reintegration, for propagating changes and detecting inconsistencies.

Optimistic replication and file hoarding address communication failures and voluntary disconnections in Coda. According to the paper, hoarding is central to Venus, and it successfully implements dynamic prioritization via the hoard walk. Emulation mode provides persistence to the system; the CML is also managed in this mode. Strong persistence is further promised because Venus's metadata updates are made through atomic transactions.

The single most dominant contribution of the paper is the ability to work while disconnected in an efficient manner. The authors observe that conflicts are infrequent, but this cannot be taken as guaranteed for a universal solution. The concept of reintegration implemented in Coda works well in strongly connected environments, but for slow links there should be an alternate mechanism for weak connectivity. Callbacks are a significant part of the Coda system, yet while reading the paper I felt this aspect was left somewhat in the shadows. Overall, the complexity of implementing disconnected operation is tackled in a step-by-step manner.

This paper uses caching of data to improve availability for machines that are mobile or have unreliable networks and sometimes become disconnected from their servers.

The main concern is how to keep systems working even when the connection to the servers is down. The approach is quite optimistic, as the design assumes that most people use their local data most of the time; in fact, this assumption is verified by the experiments later. However, the designers still need to make a tradeoff between caching prefetched data for performance and caching current data for availability. One interesting design point is that the system keeps a high level of transparency for users whether the connection is up or down.

The general design of Coda is based on AFS. Coda has three states. In Hoarding, the connection is up and the system caches files for both performance and availability. When the connection goes down, it enters the Emulation state; in this state, Venus acts as a server and every update is logged. When physical connectivity returns, the system enters the Reintegration state, reconciles all the updates to the same files, and then returns to the normal working state, Hoarding.
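The three states and the transitions between them can be summarized as a tiny state machine. This is a deliberate simplification (the real Venus tracks state per volume and handles more events); the transition table and event names are invented for illustration.

```python
# Toy state machine for the three Venus states described above.
HOARDING, EMULATION, REINTEGRATION = "hoarding", "emulation", "reintegration"

TRANSITIONS = {
    (HOARDING, "disconnect"): EMULATION,          # connection lost
    (EMULATION, "reconnect"): REINTEGRATION,      # connectivity returns
    (REINTEGRATION, "replay_done"): HOARDING,     # log replayed, back to normal
    (REINTEGRATION, "disconnect"): EMULATION,     # lost again mid-replay
}

def step(state, event):
    # Unlisted (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = HOARDING
for event in ["disconnect", "reconnect", "replay_done"]:
    state = step(state, event)
print(state)  # hoarding
```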

Coda contributes an optimistic approach to disconnected operation. Its cache management takes offline operation into consideration and offers high availability as well as good performance. One good design choice is making the connection to the server transparent to users: users don't notice a disconnection from the servers and continue to work well when the connection is down.

Coda is quite applicable, as the research group has run the Coda FS in real computer labs. They found that the likelihood of conflicts is low, so their optimistic approach suits normal use. But I think Coda may not fit medical use, since it uses whole-file transfer: that works well for small files but not for large medical images or videos. Coda inherits its model of trust and integrity from AFS, which its wide use has proved to be well designed.

Summary:
The Coda filesystem can be thought of as an adaptation of AFS's philosophies to an optimistically replicated setting with the possibility of communication failure. Coda relaxes the pessimistic always-connected AFS paradigm by allowing disconnected updates and resolving conflicts later.

Problem:
Larger systems and portable clients pose new challenges to distributed filesystems. As networks get bigger, communication is disrupted with greater frequency; mobile clients also challenge the assumption of always-on networking used in existing distributed filesystems like AFS.

Coda tries to address the issue of inconsistent connectivity, both intentional and unintentional.

Contributions:
Coda is a system with multiple levels of replication. It fuses server-to-server (first-class) replication with partially autonomous client replication (second-class). Servers are able to tolerate partitions and resolve most conflicts without manual intervention. The same scheme is used between clients and servers, albeit with less symmetry.

The fundamental concept behind Coda is the trade-off between availability and consistency that we have consistently seen in distributed systems. Although the ideas in Coda were not groundbreaking, the authors did build and test a practical system. This proof of concept showed that optimistic replication can work in the context of file systems with good performance and minimal user intervention.

Practical Applications:
Coda continues to be used today, although it was never as popular as AFS. One interesting application that has appeared recently is the combination of git (Linus's distributed revision control system) with Coda. It turns out that git's filesystem operations are all efficiently expressible in Coda. Coda+git also almost guarantees a lack of conflicts unless two users concurrently write the same content to the same file, which is arguably not a dangerous operation.

Disconnected Operation in Coda File System

- Summary:
The design choices and implementation of disconnected operation in the Coda File System, which allows clients to continue accessing data during temporary failures of the shared servers, are discussed. Disconnected operation leverages the idea of caching to improve availability.

- Problem:
During a temporary disconnection from the shared central servers, how can client operation remain feasible? Which data needs to be cached? Should we choose a thin client or a fat client? Which replica control technique should be used? How can conflicts be resolved efficiently at reintegration?... These are important issues that need to be addressed in order to make disconnected operation feasible and efficient.

- Contribution
I think the main contribution of this paper is the detailed description of the design choices and implementation techniques the authors faced while building this system. This description exposes many useful lessons a system designer can learn.
+ The paper adopts many techniques to enhance scalability, such as callback-based cache coherence, whole-file caching, and avoidance of system-wide change. All of these allow a simplified implementation.
+ Since the targeted use of disconnected operation in this paper is portable machines with normal (i.e., low-sharing, not highly concurrent) workloads, optimistic replica control is used to maximize availability.

To enable disconnected operation, the client needs to somehow anticipate disconnection by pre-caching desired files, log operations during disconnections, and propagate all the changes when reconnecting. The authors describe how to implement this in detail, which is too long to elaborate on here. But the key is that they have several optimizations to make it work (e.g., mechanisms to reduce the log size, persistent storage of metadata using RVM, etc.).

- Flaw/Comment/Question:
(1) I think the idea of the HDB is kind of cool, but it might not work for novice users. And the fact that the user has to specify the hoard files reduces the transparency of the system (which is one of its goals).
(2) I don't get the optimization that reduces the size of the log for *store* operations. The paper mentions that a store record for a file is discarded when a new one is appended to the log. What concerns me is how that previous store is excluded: it seemingly requires a scan of the log to find that entry and somehow discard it, which looks expensive to me.

Caching becomes even more useful when it's implemented to increase availability instead of just improving performance. The Coda file system utilizes cached files for availability in the event of being disconnected from the host servers.

Ever been working on a project hosted on a distributed file system and been rudely interrupted by a failure on the other end? Coda creates a way around the service outage associated with remote failures by using a disconnected operation scheme. Never again will you be put out of commission by temporary network failures or remote system crashes: continue to work locally!

The option to continue working seems to be the largest draw of this idea. Files brought to the local machine are cached on non-volatile storage and represent what you and Venus decided were the critical files, so really you can only continue what you were working on prior to the outage. The prioritized algorithm Venus uses for its hoard walk is interesting, because it tries to anticipate and gather the information needed to operate in its emulation state.

Sorry for being biased and narrow-minded, but this concept seems dumb. I've personally encountered only one instance of the problem described. It seems like a lot of complexity for very little cure. Introducing the possibility of inconsistency into a file server doesn't seem worth the added functionality of continued pseudo-service in failure scenarios. Maybe if your system fails all the time, but that hardly seems the general case. Who intentionally deploys a faulty system and then addresses it with a patch like this?

Problem:-
To make "disconnected operations" work transparently on a distributed file system ( Coda ) by caching data on local disks present on the clients to allow disconnected operations on the unreachable file system.

Summary:-
The paper presents the design and implementation of "disconnected operation" in the Coda distributed file system, whereby a user can continue working on a piece of data while the file system actually holding it is inaccessible due to failures. The goal is to provide high availability (e.g., using optimistic replica control) while being as transparent as possible. It presents algorithms for choosing files to cache on the local client in preparation for a possible disconnection (e.g., hoard walking to meet users' availability expectations). It also presents algorithms for managing the cache during disconnections and reconciling the local cache with the server after reconnection.

Contributions:-
The main contribution of the paper is to show the feasibility of a system that allows disconnected operation on a shared file system while providing very high availability to users. In this direction, the paper gives insight into various design decisions and presents various techniques. It presents the concept of first- and second-class replicas and the benefits of server replication in the presence of disconnections. The paper chooses optimistic replication in view of the low write-sharing observed in Unix. It also discusses the various tasks such a system has to perform: hoard walking to choose files to cache, emulation to act as a pseudo-server during disconnections, and reintegration using the replay log and conflict resolution.

Applicability to real systems :-
I can imagine a similar system being used in cloud based services for an ever increasing number of Internet based mobile users ( eg:- users with smartphones, netbooks) who can't be online all the time. Such services would face issues similar to those mentioned in the paper, resulting from the disconnected operations on the data. Such a system should make an optimum use of local resources on the clients with variable capabilities ( laptops vs smartphones ).

The paper proposes a filesystem, Coda, which would be able to handle the clients becoming disconnected from the servers, voluntarily or involuntarily. The goal was for users to be able to continue to work even without being able to communicate with the file servers, and to have everything handled transparently to the users.

The problem that the paper addresses is that it is difficult to anticipate when a user might become disconnected, and what files they might need in order to continue working. Voluntary disconnects are much easier, but unexpected involuntary disconnections pose problems. Another difficulty is that when working disconnected, a log must be kept of the changes being made, so that upon reconnection the server can be updated. A problem that the authors expected to encounter far more than they actually did was that two different users who were not communicating at the time might make conflicting changes, leading to difficulties in reintegration. However, they found that usually only one author edited any particular file.

The paper contributes the idea of a hoarding and an emulation stage -- in the hoarding stage, it tries to cache all the files that are important to the user, or that are being used. The emulation stage is after the client is no longer connected, and involves operating on the cached copies and keeping a log of changes that can be played back during re-integration, the third phase. The paper also contributes some quantitative data about the file modification patterns by users on the system. It also argues for optimistic (rather than pessimistic) replica control for both the client and the server replicas.
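The change log kept during the emulation stage can be sketched as a toy in Python. This is an illustrative sketch only, not Coda's actual implementation; the class and method names (`ReplayLog`, `record`, `cancel_inverse_pairs`) are invented here, and the log-compaction step loosely mirrors the paper's idea that a later store to a file makes an earlier store redundant.

```python
# Toy sketch (hypothetical names, not Coda's code) of emulation-phase
# logging: mutating operations performed on cached copies are appended
# to a replay log, which is played back to the server at reintegration.

class ReplayLog:
    def __init__(self):
        self.entries = []

    def record(self, op, path, data=None):
        # Log a mutating operation performed while disconnected.
        self.entries.append({"op": op, "path": path, "data": data})

    def cancel_inverse_pairs(self):
        # Compaction: if a file is stored several times, only the most
        # recent store needs to be replayed at reintegration.
        latest = {}
        for i, e in enumerate(self.entries):
            if e["op"] == "store":
                latest[e["path"]] = i
        self.entries = [e for i, e in enumerate(self.entries)
                        if e["op"] != "store" or latest[e["path"]] == i]

log = ReplayLog()
log.record("store", "/coda/u/paper.tex", b"draft v1")
log.record("mkdir", "/coda/u/figs")
log.record("store", "/coda/u/paper.tex", b"draft v2")
log.cancel_inverse_pairs()
assert [e["data"] for e in log.entries if e["op"] == "store"] == [b"draft v2"]
```

The compaction step is one reason the paper can keep replay logs small: only the net effect of the disconnected session needs to reach the server.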

Coda asks the users to specify a list of files that they consider important enough to be cached; I think this would get annoying, since if you forgot one you might find yourself unable to get anything done, while if you include too many it might push more important ones out of the cache, and the list of important files might change frequently. I think a version control system such as svn is more intuitive to use, even if it is less transparent.

Problem:-
To make "disconnected operation" work transparently on a distributed file system (Coda) by caching data on local disks at the clients, allowing continued operation while the file system is unreachable.

Summary:-
The paper presents the design and implementation of "disconnected operation" in the Coda distributed file system, whereby a user can continue working on a piece of data while the file system holding it is inaccessible due to failures. The goal is to provide high availability (e.g., using optimistic replica control) while remaining as transparent as possible. The paper presents algorithms for choosing files to cache on the local client in preparation for a possible disconnection (e.g., hoard walking to meet the user's availability expectations), for managing the cache during disconnections, and for reconciling the local cache with the server after reconnection.

Contributions:-
The main contribution of the paper is to show the feasibility of a system that allows disconnected operation on a shared file system while providing very high availability to the users. In this direction, the paper gives insight into various design decisions and presents various techniques. It presents the concept of first-class replicas and the benefits of server replication in the presence of disconnections. The paper chooses optimistic replication in view of the low degree of write-sharing in Unix. It also discusses the various tasks such a system has to perform: hoard walking to choose files to be cached, emulation to act as a pseudo-server during disconnections, and reintegration using the replay log and conflict resolution.

Applicability to real systems :-
I can imagine a similar system being used in cloud-based services for an ever-increasing number of Internet-based mobile users (e.g., users with smartphones or netbooks) who can't be online all the time. Such services would face issues similar to those mentioned in the paper, resulting from disconnected operations on the data. Such a system should make optimal use of local resources on clients with variable capabilities (laptops vs. smartphones).

Summary:
This paper describes the disconnected operation feature of the Coda file system, a file system intended for maximum availability in a primarily read-only distributed setting. The paper gives a design overview of Coda and a rationale that includes the reasons for wanting to maximize availability and scalability. It covers design and implementation, a brief evaluation, and conclusions.

Problem description:
In the presence of network and machine failures, it may not be uncommon for a machine to lose access to a critical piece of information, without which it cannot proceed. If that information is already cached locally however, there really is no reason to not allow the client to speculatively operate on the data. Conflicts, if any, can be resolved when the failures are resolved. Coda's disconnected operation provides a method for doing exactly this: allowing clients to continue to work on data in the presence of partitions or server failures, as if the failures didn't exist. The authors describe the challenges, implement a solution, and provide experimental data.

Contributions:
This paper is rather simple, but it's effective in its simplicity. The authors claim that they are the first to exploit caching for availability as well as performance. If this is true, then they do a good job of laying out the issues and showing that consistency need not be greatly compromised to maximize availability through disconnected operation. The authors also do well at motivating their approach and articulating the differences between optimistic and pessimistic replication.

Applicability:
This work almost seems naive from a modern perspective. Why would you not do this? To build a truly scalable system you would have to. In fact, as the authors pointed out, we do "disconnected operation" all the time with our laptop computers, and in our daily functioning within society. The connection I drew with a modern application was with distributed version control systems, such as Git. Git takes disconnected operation to its greatest extreme, saying that there is *no* master copy. Everybody is disconnected until they choose to connect, and then they share if that is what they want to do.

Summary:
The key idea of this paper is that caching of data, while commonly used to improve performance, can also be used to enhance the availability of critical data. The data replication employed in caching enables disconnected operation, which allows a client to continue using cached data while a remote host is experiencing a failure. The authors demonstrate how disconnected operation can be implemented in the Coda File System.

Problem:
The main problem is that, in the absence of disconnected operation, once the remote host(s) responsible for serving the data go down, the data effectively becomes unavailable. Disconnected operation makes it possible to keep using the data of the failed remote hosts. The challenge is to implement disconnected operation in a transparent, scalable, and portable manner.

Contributions:
Scalability is promoted by pushing functionality to the client side. The notion of first- vs. second-class replication is introduced. Pessimistic and optimistic replica control approaches are discussed, with both their advantages and disadvantages; due to the low degree of write-sharing as well as transparency considerations, optimistic replica control is selected. Lastly, the authors provide a detailed description of the system, including the protocol and the novel hoarding aspect of caching. Once the server nodes are back up, reintegration is performed via a log.
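The optimistic replica control discussed in the reviews can be illustrated with a toy version-comparison check. This is a hypothetical sketch, not the paper's actual certification protocol; the function `reintegrate` and its integer version counters are invented for illustration.

```python
# Illustrative sketch (not Coda's real protocol): under optimistic
# replica control, conflicts are detected at reintegration time by
# comparing the version the client last saw with the server's current
# version, rather than by locking files up front.

def reintegrate(client_base_version, server_version, client_update):
    """Apply a disconnected update only if no one else wrote meanwhile."""
    if client_base_version == server_version:
        # Common case in the paper's measurements: no one else wrote,
        # so the update applies cleanly.
        return ("applied", server_version + 1)
    # Someone wrote while we were disconnected: flag a conflict and
    # preserve both versions for later (possibly manual) repair.
    return ("conflict", server_version)

assert reintegrate(3, 3, "edit") == ("applied", 4)
assert reintegrate(3, 5, "edit") == ("conflict", 5)
```

The optimistic bet is that the conflict branch is rare, which the paper's observation of low write-sharing supports.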

Comments:
This seriously reminded me of Bulk SC, transactional memory and so forth.
What makes me worried is the situation in which many files for a particular event are available, but then some critical file is not. So, in the context where requests proceed like transactions (all or nothing), once it is discovered that a particular file is unavailable, care must be taken to ensure that this doesn't cause some major problem. To be more precise, there may be some operations that assume that if one file is available, another file is also available.

Problem addressed:
This work addresses the problem of maintaining availability of a shared file system even under disconnection from the network (i.e., in total isolation). The authors propose enhancements that allow clients to continue operating on a shared data repository even while disconnected.

Summary:
The basic idea behind this paper is to use client-side caching to provide availability during disconnection. The authors propose whole-file caching at the client to allow continued operation on the cached file while disconnected. On reconnection, the proposed system offers "best-effort" reconciliation of the client's local copy of the file with the replicated repository servers. To achieve this, the system keeps a "replay log" at the client for the operations performed during disconnection. Later, on reconnection, the replay log is used to perform those operations on the replicated servers. If, during reintegration, the system finds that conflicting operations have occurred, it cancels the updates made in disconnected mode and saves the log of operations for later analysis. For the system to be effective, the files to be accessed during disconnection need to be available in the client-side cache; this problem amounts to predicting the future. The proposed system offers a "best-effort" mechanism in this regard by analyzing the history of accesses to figure out which files are likely to be accessed in the near future (implicit information), and it also allows the user to provide hints (explicit information). If, during disconnected operation, the client tries to access a file not in its cache, it simply fails the operation. During normal operation, when the client is connected to a set of replicated repository servers, it uses callback-based methods to keep the client cache coherent.
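The mix of implicit (recency of use) and explicit (user hints) information for deciding what to cache can be sketched as a toy prioritized cache. All names and the weighting scheme here are invented for illustration; Venus's real priority computation and hoard-walk mechanics differ.

```python
# Toy sketch (hypothetical, not Venus's actual algorithm) of prioritized
# cache management: each object's priority blends an explicit hoard
# priority (user hint) with an implicit recency-of-use component, and
# the lowest-priority object is evicted when the cache is full.
import itertools

class HoardCache:
    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity
        self.alpha = alpha            # made-up weight: explicit vs implicit
        self.clock = itertools.count()
        self.objects = {}             # path -> (hoard_priority, last_use)

    def touch(self, path, hoard_priority=0):
        # Record a reference (and optional explicit hint), then evict.
        self.objects[path] = (hoard_priority, next(self.clock))
        self._evict()

    def priority(self, path):
        hp, last_use = self.objects[path]
        return self.alpha * hp + (1 - self.alpha) * last_use

    def _evict(self):
        while len(self.objects) > self.capacity:
            victim = min(self.objects, key=self.priority)
            del self.objects[victim]

cache = HoardCache(capacity=2)
cache.touch("/coda/src/main.c", hoard_priority=100)  # explicitly hoarded
cache.touch("/coda/tmp/a")
cache.touch("/coda/tmp/b")  # cache over capacity: lowest priority goes
assert "/coda/src/main.c" in cache.objects           # hint keeps it cached
```

The point of the sketch is the design choice the paper makes: explicit hints keep critical files resident even when recent activity would otherwise push them out.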


Short summary:
This paper proposes using client-side file caching to ensure availability of a shared file repository under total disconnection of the client from the file servers. It uses whole-file caching and prioritized cache object management (with both implicit and explicit feedback) in order to provide a best-effort service that allows disconnected operation.

Relevance:
With the proliferation of portable computing platforms (e.g., laptops, iPhones) and the idea of a shared file system, the ability to support even restricted or best-effort disconnected operation on shared repository data is valuable. The system has a very naive conflict detection mechanism and no conflict resolution mechanism for conflicting operations that happen during disconnected operation. The authors justify this design by showing that different users seldom modify the same file/directory structure within a small time period. I was also wondering how the conflict detection/resolution mechanisms of shared repository systems like SVN, ClearCase, or CVS could be applied in the disconnected operation environment of Coda.

This paper discusses providing file system service even in scenarios where the client is unable to reach the distributed file system server.

The main idea presented here is to make use of the storage capacity and the data available at the client's end to ensure progress in the system while the server is unavailable or unreachable. This property, called disconnected operation, is offered by the Coda file system. Coda is a distributed file system like its precursor, AFS. The important characteristics of the system are server replication and disconnected operation, targeting scalability and availability.

The design decisions are presented with reasoning to support the claims, such as callback-based cache coherence, use of replicas on the server, and adopting optimistic replica control. Venus (as found in AFS) is the file service provider from the client's perspective and a cache manager from the server's perspective. Venus operates in one of three modes: (1) Hoarding - tries to accumulate as much valuable data as possible before a disconnection can happen; this makes use of hoard profiles, prioritized cache management, and hoard walking, a mechanism to ensure high-priority data is cached. (2) Emulation - Venus acts as the main service provider while the server is unavailable; the client's actions are stored in a replay log which is used during reintegration. (3) Reintegration - activities from the disconnected phase are synced with the server.
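The three Venus modes and the transitions between them can be sketched as a small state machine. This is illustrative only; the state and event names are invented for the sketch, and the real Venus handles partial connectivity and failures during reintegration that this toy ignores.

```python
# Minimal sketch (not Coda's implementation) of the Venus mode machine:
# hoarding while connected, emulation while disconnected, reintegration
# on reconnection, then back to hoarding.

TRANSITIONS = {
    ("hoarding", "disconnect"): "emulating",
    ("emulating", "reconnect"): "reintegrating",
    ("reintegrating", "done"): "hoarding",
}

def step(state, event):
    # Events with no transition defined leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = "hoarding"
state = step(state, "disconnect")      # network partition hits
assert state == "emulating"
state = step(state, "reconnect")       # server reachable again
assert state == "reintegrating"
state = step(state, "done")            # replay log applied cleanly
assert state == "hoarding"
```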

Contributions: (1) The ability to perform offline file system operations. (2) Hoarding can also be thought of as a form of prefetching, which can improve performance in distributed systems. (3) The persistence ensured during the emulation phase, which enables the system to resume work even after a restart.
An example of voluntary disconnection is something like a source/version control repository (check-in/check-out as needed); for involuntary disconnection, mail clients. It would be interesting to know whether Coda drew inspiration from these applications or the reverse happened.

Questions: (1) The second optimization (inverse operations): are these kinds of operations common? (2) Is the conflict handling protocol for resolving conflicts between replicas different from the one used during the reintegration phase?

Post a comment