CS 739 - Reviews - Spring 2011: The LOCUS Distributed Operating System


The LOCUS Distributed Operating System

The LOCUS Distributed Operating System, Bruce Walker, Gerald Popek, Robert English, Charles Kline and Greg Thiel. Proceedings of the 9th Symposium on Operating Systems Principles, Nov. 83

Reviews due Thursday, 2/3

Posted by Michael Swift on January 27, 2011 08:23 PM

Comments

The LOCUS distributed operating system was developed with the specific but quite ambitious goal of presenting all users on all workstations with a single and consistent interface. All fundamental OS functions are considered, though this review focuses on the implementation of the file system.

The primary concern of the authors and creators of LOCUS is creating "network transparency," which is the illusion that all actions are performed locally despite the fact that resources are spread out among many sites. With regards to the filesystem, they seek to allow identification of any given file through a unique file path common to all machines in the system. Through replication, the added benefits of availability and reliability are sought, though this produces further challenges in the area of consistency in the face of updates.

Besides being a valuable case study exposing the various challenges and viable solutions on the way to network transparency, this paper introduces several specific concepts that outlived the system itself. The desire to allow file updates in the presence of network partitioning created the need to merge multiple changes within a file; this brings to mind functionality currently found in version control systems such as SVN or Git. The concept of transactional file systems developed here also lives on today. Finally, it's possible, if not likely, that the LOCUS creators' obsession with network transparency in their filesystem influenced the developers of NFS, which was developed at Sun shortly after this paper.

A blatant and surprising omission from this paper is any motivation for solving the problem. Incredible efforts are made and difficulties endured to shield the user from being exposed to the notion of a network, but there is very little evidence provided that this is a feature that users actually care about. Certainly it can be assumed that the computing experience of a user is simplified if she doesn't need to adapt to a new environment when switching workstations, but are the admitted performance concerns inherent to the solution (including the fact that the transparency disallows the choice of not making a remote request) outweighed by the benefits?

Clearly the LOCUS operating system must have provided benefits to its users at UCLA while it was in use, but proving that such a system is feasible and at least somewhat usable is probably the greatest contribution of this paper. By introducing the idea of a completely transparent network, LOCUS undoubtedly influenced the perspective of many developers and architects when considering the potential of networks and their usage.

Posted by: Rich Joiner | February 1, 2011 09:05 PM

Summary:
This paper presented LOCUS, a distributed operating system which provides network transparency by manifesting itself like a system which runs on a single machine. This transparency is achieved through lots of mechanisms, including a distributed and replicated filesystem, and dynamic partition reconfiguration.

Problems:
How to build a distributed operating system that hides its distributed nature from the user while remaining robust, highly available, free from consistency problems, and reasonably fast.

Contributions:
1. This paper developed a distributed and replicated filesystem model. Files are replicated across multiple sites, and the CSS serves as a centralized place to coordinate potentially conflicting accesses to the same file. A transaction-like commit mechanism also prevents a file from being left in an inconsistent state.
2. Although the problem of detecting and reconciling file conflicts caused by partitioned updates had been well studied before, this paper does offer some experience and insight into implementing those mechanisms in a distributed operating system context.
3. This paper developed a dynamic reconfiguration protocol that automatically adjusts to changes in network topology, a situation that every distributed system must deal with. Though this three-stage repartition-merge-recover protocol is inherently complex, it does offer a complete solution for network partitions.
4. Most importantly, the authors showed that building such a distributed, network-transparent operating system is feasible and presented a prototype of it.

Flaws:
Though the authors strive to provide so-called network transparency and use many complex mechanisms to achieve it, they barely offer any reason why this transparency is so desirable. Nor do they clearly compare the benefit and overhead of this feature. (Merely stating that “the performance penalty is not significant from our experience” doesn’t count.)

Relevance:
To my knowledge, distributed, network-transparent operating systems are not in widespread use today. However, I believe the transaction-like file commit protocol and the dynamic reconfiguration protocol could still find uses in other distributed systems.

Posted by: Suli Yang | February 2, 2011 09:27 PM

This paper presents a broad review of the LOCUS distributed operating system architecture. The main goal of LOCUS was to make a network of separate machines appear to work together as one large reliable machine. The authors detail several major components of the system, including process management and dynamic network partition recovery, but the main focus is on the distributed file system.

The LOCUS file system addresses a number of problems that arise naturally in a distributed context. First is network transparency: a user should have access to any file in the system and should not need to know where it is located or whether it has moved. Second is consistency: file modifications should be atomic and no two concurrent writes should intermix. Finally, there is reliability: the system should not lose any data after partial failures and it should continue to operate when they do occur.

LOCUS provides network transparency in the filesystem using a common namespace shared across all machines in the system and uses a dedicated Current Synchronization Site (CSS) to enforce global synchronization policies and to locate nodes that store the file data. This makes the file I/O API the same regardless of whether the file is local or remote. File writes are made atomic by first disallowing concurrent write access at the CSS and by using a shadow-page-based commit mechanism for writes (similar to databases). Finally, LOCUS provides file system reliability through file-level replication. Replication occurs entirely on the storage site/CSS side, by propagating changes once a primary storage site has been updated.
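The shadow-page commit mentioned above can be illustrated with a small in-memory sketch. This is not the LOCUS implementation; the class name `ShadowPagedFile` and its methods are hypothetical, and real shadow paging works on disk pages and an on-disk page table rather than Python lists.

```python
class ShadowPagedFile:
    """Toy in-memory file whose writes go to shadow pages until commit.

    Hypothetical illustration only: names and structure are assumptions,
    not taken from the LOCUS paper.
    """

    def __init__(self, pages):
        self._committed = list(pages)  # page table of the committed version
        self._shadow = {}              # page index -> uncommitted new contents

    def read(self, index):
        # Readers of the committed version never see uncommitted data.
        return self._committed[index]

    def write(self, index, data):
        # The old page is left untouched; the change lands in a shadow page.
        self._shadow[index] = data

    def commit(self):
        # Build the new page table first, then switch to it in a single step.
        new_table = list(self._committed)
        for index, data in self._shadow.items():
            new_table[index] = data
        self._committed = new_table
        self._shadow = {}

    def abort(self):
        # Dropping the shadow pages restores the file to its old state.
        self._shadow = {}


f = ShadowPagedFile([b"aaaa", b"bbbb"])
f.write(1, b"XXXX")
before = f.read(1)   # still the old page
f.commit()
after = f.read(1)    # the new page is now visible
```

The point of the design is that a failure before `commit` leaves the old page table intact, so the file is always either entirely old or entirely new.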

As other reviewers pointed out, the authors did not present sufficient motivation for their work. It seems reasonable to presume that the high cost of computers at the time drove the motivation to maximize the utility of a small number of machines. The abstractions provided by LOCUS would make a single system more useful for a single user (more storage and computational capacity), and hopefully it would avoid idle single machines. However, with hindsight it is clear that the precipitous fall of hardware costs eventually drove the market toward personal computers with simpler non-distributed operating systems despite the inefficiency of idle hardware.

Nevertheless, several of the components that made LOCUS work do seem still relevant. Particularly, the filesystem presaged NFS and other networked file systems. While they function very differently, both LOCUS FS and NFS provide a transparent interface so that the user need not be concerned whether the file is local or remote. LOCUS also provides a nice approach to detecting and responding to network partitions and merges without interrupting usage of the system that could be very applicable to current distributed systems.

Posted by: Kris Kosmatka | February 3, 2011 12:31 AM

I will review the distributed file system part of LOCUS, because "file system activity typically predominates in most operating systems".

== Summary ==

LOCUS provides a replicated distributed file system, featuring network transparency, high performance, and robustness.

== Problem / Goal ==

There are primarily three goals of designing the distributed file system of LOCUS: 1) Network transparency. 2) High performance. 3) Robustness.

== Solution ==

To achieve the above goals, LOCUS provides the following mechanisms:

* Replication improves availability and the performance of read workloads -- for example, top-level directories in the hierarchy are seldom modified, so they gain substantial read-access performance from replication.

* Lazy update allows multiple replicas of the same file across different sites to hold different versions of its contents. This relaxes consistency a bit, but increases overall performance.

* Current synchronization site (CSS) acts as a master node or match maker in coordinating file accesses. CSS keeps a global view, and makes high level global decisions. Therefore, CSS may improve file access performance.

* RPC-like communication mechanism contributes a lot to network transparency.

* Read-ahead technique is used in sequential read, either within a single site, or across multiple sites. This improves performance.

* DBMS-like transaction commit semantics are provided, improving robustness.

* A shadow page mechanism is implemented at the SS to support commit semantics, and is transparent to the US. Thus, this provides some transparency to users.
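The CSS's role as coordinator and matchmaker in the list above can be sketched as follows. This is a hedged illustration, not the LOCUS protocol: the class `SyncSite`, the single-writer policy, and the rule of preferring a local up-to-date copy are all assumptions made for the example.

```python
class SyncSite:
    """Toy synchronization site for one file (hypothetical names throughout)."""

    def __init__(self, replicas):
        # replicas maps storage site name -> version number of its copy.
        # With lazy update, some copies may lag behind the newest version.
        self.replicas = dict(replicas)
        self.writer = None
        self.readers = set()

    def open(self, using_site, mode):
        """Grant an open request and return the storage site to use."""
        if mode == "w":
            if self.writer is not None:
                # Assumed policy: at most one writer at a time.
                raise PermissionError("file already open for writing")
            self.writer = using_site
        else:
            self.readers.add(using_site)
        # Match making: only sites holding the newest version are candidates.
        latest = max(self.replicas.values())
        candidates = sorted(s for s, v in self.replicas.items() if v == latest)
        # Prefer the requester's own copy when it is up to date.
        return using_site if using_site in candidates else candidates[0]

    def close(self, using_site, mode):
        if mode == "w" and self.writer == using_site:
            self.writer = None
        else:
            self.readers.discard(using_site)


css = SyncSite({"A": 3, "B": 3, "C": 2})
local_read = css.open("B", "r")    # B holds the newest version, so it reads locally
remote_read = css.open("C", "r")   # C's copy is stale, so it is sent elsewhere
```

Because every open passes through one site, that site can see all current users of the file, which is what makes the global decision possible.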

== Flaw ==

It seems that scalability is not well considered in the LOCUS distributed file system design. For example, the mount mechanism stores all sites' state information at every site. This would harm scalability.

In addition, caching is not considered, which is critical to performance.

== Relevance ==

There are two relevant distributed file systems from roughly the same era as LOCUS, namely NFS and AFS.

Similar to LOCUS distributed system, both NFS and AFS maintain UNIX semantics, provide location transparency, and use RPC for communication.

Different from LOCUS, both NFS and AFS use caching to improve performance, while LOCUS doesn't.

Posted by: Wenbin Fang | February 3, 2011 12:49 AM

Summary:
The paper is about Locus which is a distributed OS with a transparent network wide file system.

Problem:

The paper aims to convince the reader that a high-performance, fault-tolerant, network-transparent distributed file system with remote processes is feasible even in a small-machine environment.

Contributions:

The Locus system tackles the problem of network transparency by providing a generalized name service. It provides a good mechanism for recovering from failures, dividing its machines into partitions so that the system continues to function even if parts of it fail. It provides better availability through replication, which also gives users faster read access from closer copies. It provides database-style atomic commits, implemented with shadow paging, and these give greater consistency. All of this is implemented with only a small performance penalty relative to a conventional UNIX system.

Flaws:

1.> One main flaw I see is that a lot of state information is maintained by various components, so every minor failure has to be reported to all nodes

Applications:

1.> This is analogous to a modern cloud storage system; Amazon provides storage as well as lending machines for computation

Posted by: Vinod Ramachandran | February 3, 2011 02:34 AM

SUMMARY
The paper describes a distributed operating system, LOCUS, which aims to achieve transparency, fault tolerance and robustness in a dynamically changing environment. The important aspects of LOCUS, namely the file system, distributed process execution, recovery and dynamic reconfiguration, are presented.
The distributed file system is given importance in the following review.
PROBLEM
We have a typical distributed system consisting of a network of nodes on which we try to accomplish our tasks efficiently. However, we are confronted with issues like transparency, dynamic behaviour, data consistency and the handling of node and other types of failures. LOCUS strives to solve all these problems.
CONTRIBUTIONS
The file system presents the same interface as Unix to user programs, thereby hiding the underlying semantics of the distributed system. Each file is given a global identifier, which facilitates dynamic data migration. A uniform naming mechanism is also presented. Replication is used to improve read performance. Transactional semantics are provided for file commit: a shadow paging technique stores both the changed copy and the old one so that the system can respond accordingly to a commit or abort message. Also, by enforcing that only one synchronisation site can hold a particular logical file group, global access synchronisation is maintained.
In addition to the file system, distributed process execution (a remote fork, exec and error handling) is described. Partitioning is described as a way of recovery. A dynamic reconfiguration technique that partitions the nodes into subnets and uses a merge protocol to resolve conflicts is also explained.
FLAWS
LOCUS' file system is a stateful file system, as opposed to NFS, a stateless file system. The conclusion of the paper says "further work is needed to assure SCALING to a larger network will successfully maintain performance characteristics". With so much state information present in the US, CSS, SS, shadow pages and page tables, it is hard to believe that the system would scale.
Also, no thorough evaluation is presented in the paper to establish the fact that performance was indeed not compromised.
RELEVANCE TO MODERN SYSTEMS
Even the AFS used in our department intends to provide the services LOCUS provided (transparency, fault tolerance). However, AFS is different in the sense that it has caching on local disks. Also, partitioning and replication are the most commonly used techniques in modern distributed systems to improve availability, performance and fault tolerance.

Posted by: Karthik Narayan | February 3, 2011 02:45 AM

summary
LOCUS is a distributed operating system whose main goal is to provide a high-performance, transparent, eventually consistent and reliable operating system to its users. This paper gives a good overall picture of the challenges involved in designing a distributed system.

problem:
Given the distributed nature of the operating system, it is very important that users and programs running on it are unaware of this fact. To provide this abstraction, LOCUS builds transparency into its design. LOCUS meets the goals above through replication of storage, generic semantics for communication, remote process creation, merging and reconciliation of inconsistent files, and adjustment to the dynamic configuration of the network, with equal attention to failure and error management.

contributions:
-All the semantics and design issues are in line with Unix, so that there is not much drift in the semantics of the system
-This paper also reinforces the observation that reads of the data are comparatively more frequent than writes and that the probability of conflicting updates is low.
-Having a CSS for every file provides a global view of the file, so synchronisation and many file maintenance tasks become easier.
-There are many CSS’s, so there is no single point of failure and the load is distributed among CSS’s.
-Provides loose semantics for eventual consistency of the data.

Flaws :
-The paper refers to the use of tokens to implement file sharing between processes, but somehow this does not convince me, since performance degrades as communication between remote processes increases.
-Also how efficiently signalling can be done between processes running on different machines is not known.
-The mounting information is stored at all the sites, so it may pose a problem for scalability of the system.
-Data transfer between the systems is in terms of pages. This would be bad if the data requested is random and less than page size.
-Changes to the files are atomically committed. The whole process involves network too. There may be high latency in performing such tasks which can affect availability of the data.

Relevance:
-The ideas presented in this paper are followed in present-day network file systems, distributed computing (grid computing), cloud computing and cloud storage.

Posted by: Pratima Kolan | February 3, 2011 05:00 AM

Summary

The paper focuses on the system architecture of the LOCUS Distributed Operating System. It provides an overview of the design details of several key components of the O/S, including the file system and the handling of and recovery from network partitioning.

Problem

The aim of the LOCUS system is to provide the users and processes of the system with a uniform and transparent view of access to local and remote resources in a distributed environment. Their aim was to remove any need for users to know an object’s locality and to allow access to an object (whether local or remote) via a single interface. Furthermore, the LOCUS system aims to increase availability (of objects), performance (of accessing an object) and consistency (via centralized synchronization) of the distributed environment via replication.

Contributions (Distributed File System)

The LOCUS system made several contributions to a distributed file environment. First and foremost is the transparency with which any object may be accessed by a user/process. Furthermore, their transparency mechanism allowed a process to be moved and executed on any remote machine, with correct access to an object guaranteed without any changes being made to the process.

Additionally, LOCUS allowed updates during network partitioning. Unlike existing systems, which disallowed updates once network partitioning occurred, the LOCUS system allowed modifications to be made, then checked for and detected any conflicts among the replicated versions of an object, and attempted to resolve the conflicts by merging or by informing the owner of the file of the conflict.
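The review does not say how conflicting partitioned updates are detected. One standard technique for this kind of check is a version vector, which counts the updates each site has applied to a copy; the sketch below assumes that technique, and the function name `compare` and the site names are hypothetical.

```python
def compare(a, b):
    """Compare two version vectors (dicts of site name -> update count).

    Returns "equal", "a_newer", "b_newer", or "conflict". A conflict means
    each copy has seen updates the other has not, which is what happens when
    both sides of a partition modify the same file.
    """
    sites = set(a) | set(b)
    a_ahead = any(a.get(s, 0) > b.get(s, 0) for s in sites)
    b_ahead = any(b.get(s, 0) > a.get(s, 0) for s in sites)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# Only site A updated the file during the partition: B's copy is simply stale.
stale = compare({"A": 2, "B": 1}, {"A": 1, "B": 1})
# Both sites updated the file independently: the copies must be merged.
diverged = compare({"A": 2, "B": 1}, {"A": 1, "B": 2})
```

When one vector dominates the other, the stale copy can just be replaced; only the "conflict" case needs a merge or a notice to the file's owner.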

Flaws (Distributed File System)

There are several potential flaws in the system design of the LOCUS file system. Firstly, as there is only a single CSS node for a particular filegroup, failure of a CSS node results in the entire partition being unable to access any files stored within a filegroup maintained by an inaccessible CSS node.

Additionally, assigning a filegroup to a single CSS node poses scalability problems as the size of the system increases and the number of file access requests increases accordingly. This is especially likely, as the CSS is involved in the directory name resolution when accessing a file, which may involve several remote calls.

One other potential shortcoming of the system seems to be the replication mechanism. The aim of the replication was to increase availability and performance of the system. However, the number of copies of an item is determined at create time, and no mention is made of adjusting this number later on. A more flexible protocol would have allowed the system to monitor the frequency and access mode of an object and, when necessary to improve performance, increase or reduce the number of copies of the object.

Relevance

While the LOCUS system is no longer in use today, the principle of transparency lives on in today’s distributed environments. The conflict detection mechanism is also seen today in most version control programs.

Posted by: Greig Hazell | February 3, 2011 06:33 AM
