Distributed Computing in Practice: The Condor Experience
D. Thain, T. Tannenbaum and M. Livny, Distributed Computing in Practice: The Condor Experience, Concurrency and Computation: Practice and Experience 17, 2-4, February-April 2005, pp. 323-356
Review due for this on Tuesday, 2/1
Comments
Summary
In this technical paper, the authors introduce the Condor system and its technical details. They explain the Condor design philosophy, the Condor software, its architectural history, planning and scheduling in Condor, and Condor problem solvers. They also touch on how security is managed in Condor and present case studies that demonstrate how Condor has been used in the real world.
Problem Statement
The problem that the Condor project addresses is how to build a distributed system on PCs that are owned by different users and/or organizations with different policies. Moreover, we need to enable heterogeneous jobs to run safely and reliably in this environment while isolating the owners from any harm that the jobs might cause.
Also, we want to be able to use resources from pools of PCs in different locations around the world.
Contributions
In this paper, a successful distributed system is described and its essential properties are explained. I found reading this paper very valuable. Some important contributions of the Condor project include:
The final design is dependent on the resources that we have available and the kind of jobs that we intend to run on the system.
For example, if we compare this paper with the Google paper that we read before this, we can see that Google uses highly dedicated servers and runs mostly data-intensive and highly parallelizable applications on them.
However, Condor uses spare resources on machines that have different owners with possibly different policies, and runs a wide variety of jobs. These jobs can be highly heterogeneous in nature (some are data-intensive, but most seem computation-intensive). These differences in workloads and resource ownership lead to quite different designs for the two systems.
Limitations
It seems that data-intensive jobs, which are quite frequent today, would have a hard time running on Condor. This is mostly due to the fact that the PCs that Condor connects together are by nature not co-located, and thus locality is very hard to achieve. Therefore, it seems that there will not be enough network bandwidth for data-intensive jobs to run on Condor with reasonable latency.
Also, it is obvious that the resources in the Condor distributed system are by definition limited to the PCs that are organized under it. Therefore, one should not assume that we could completely eliminate the need for buying dedicated servers by utilizing Condor; this depends on the kind of jobs that we try to run. It is true that we can use resources from other Condor pools as well; however, this seems to be a bit of a challenge, as we need organization- and user-level agreements in order to do so.
Applications in the real world
It is clear that Condor can have many applications in the research community and in businesses that do not have many data-intensive jobs to run. It saves them the cost of buying dedicated servers for the task. The case studies at the end of the paper are indicative of this.
Posted by: Fatemah | January 30, 2011 12:08 PM
Summary:
This paper presents a historical overview of Condor, a system for opportunistic, high-throughput computing across heterogeneous clusters of machines. Condor fulfills its purpose by using a system architecture that clearly defines the responsibilities of each component. The result is a system that simply and flexibly meets the desires of both the users and owners of computational resources.
Problem Description:
Condor attempts to solve the problem of providing a platform for large, computationally expensive computations utilizing distributed clusters of commodity machines from different administrative domains. This problem garners interest because a solution could provide large computation at a lower cost as compared to systems designed from the ground up for large, computationally expensive work such as supercomputers. Cheaper computation allows a wider variety of scientists, researchers, and practitioners to explore problems that wouldn’t have been otherwise investigated. The paper makes allusions to previous work that attempted to solve the problem with some success, but this work failed when applied in a different environment.
Contributions Summary:
Condor has contributed or refined a large amount of ideas related to distributed computing. From the large number of contributions, I found the following contributions to be particularly important or interesting.
I think Condor’s biggest claim to fame is the idea of opportunistic computing. Idle machines are often seen as a waste. Opportunistic computing allows idle cycles to be utilized for meaningful computation. Condor provides a means to manage and allocate these idle cycles according to the desires of users and resource owners.
I also particularly liked Condor’s solution for transparent remote I/O. By introducing a different linking step when building programs, Condor replaces the standard system call interface for I/O with a level of indirection that effectively provides a job with the illusion that it is executing in its originally conceived context.
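To make that indirection concrete, here is a toy Python sketch of the idea. Condor's actual mechanism relinks C programs against a special system-call library and forwards calls to the shadow over the network; the names below are invented for illustration, and the "shadow" here is just an in-process stand-in.

```python
# Toy sketch of remote I/O indirection. Condor's real mechanism relinks C
# programs against a system-call library that forwards I/O to the shadow
# over the network; here we just swap out open() for a stand-in shadow.
import builtins
import io

class ShadowFileSystem:
    """Stands in for the shadow process on the submitting machine."""
    def __init__(self, files):
        self.files = files  # path -> contents, as if on the home machine

    def open(self, path, mode="r"):
        if "r" in mode and path in self.files:
            return io.StringIO(self.files[path])
        raise FileNotFoundError(path)

def run_with_remote_io(job, shadow):
    """Run job() with open() redirected to the shadow, restoring it after."""
    real_open = builtins.open
    builtins.open = lambda path, mode="r", *a, **kw: shadow.open(path, mode)
    try:
        return job()
    finally:
        builtins.open = real_open

shadow = ShadowFileSystem({"input.txt": "data that lives on the submit machine"})

def job():
    # The job's code is unchanged; it believes it is in its home environment.
    with open("input.txt") as f:
        return f.read().upper()

print(run_with_remote_io(job, shadow))
```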
Shortcomings:
Condor targets a specific type of problem, namely problems that don't require fast response times. That is, a job for a specific problem can be submitted and run for a significant period of time, so that the time lost to resource arbitration and data collection is amortized over the long running time. For example, Condor could not be used in a web-service setting, where response time is usually the highest priority.
As more of a minor flaw, the paper mentions that the Parrot system uses the system debugger interface to intercept system calls that make references to data in a file system. Without knowing their motivation for this approach, this is inefficient because a debugger trap requires the debuggee to stop while the debugger processes the trap. I think an approach similar to what is done for remote I/O should be employed in this situation, for efficiency purposes.
Application to real systems:
Condor is a true testament to how to build a system that can evolve with changing technology and continue to meet the demands of users. As the paper states, the Condor architecture hasn’t changed since 1988. I think this indicates that the lessons learned through the development of Condor contain significant value to any system builder and can be applied to any real system, regardless of application domain.
Posted by: Dan McNulty | January 30, 2011 08:03 PM
Summary:
This paper is a good summary of the distributed system called Condor. It provides the history and philosophy of the project along with a description of its core components.
Problem:
In the 1970s it became clear that distributed computing was feasible. But it was also clear that it would be difficult. There were issues regarding message losses, corruption and delays. The need for consistency, availability and performance in a distributed environment was a new challenge. The Condor system was built with the aim to tackle these issues.
Contributions:
Condor is a distributed system whose design philosophy is “being flexible”. Over the years it has aimed to accommodate the organic growth of computing in an efficient manner. The system allows the members of a group to have their own policies, and it tries to make use of desktop machines with minimal configuration. It was also designed by borrowing ideas from, and lending ideas to, the software engineering and research world.
The system consists of two main software products: the Condor high-throughput computing system and the Condor-G agent. The first piece of software is like a fabric management service for one or more sites; it provides job management, scheduling policy, and resource management. The second piece of software is a reliable submission and job management tool for one or more sites; it uses the GRAM protocol of the Globus toolkit to provide a uniform interface for job submission.
Both of the above pieces of software help the user make use of the Condor distributed system. The system allows the user to join multiple communities through gateway flocking and direct flocking. It also allows an agent to create private Condor pools across communities using gliding in. It combines planning and scheduling in order to execute jobs in an efficient manner.
One of the important features of Condor is that it provides split execution. In split execution, the user side is represented by an object called the shadow and the resource side by a sandbox. The shadow is responsible for providing the arguments and executables needed to run the job correctly. The sandbox is responsible for giving the job the correct environment for execution and for protecting the resource from any malicious behavior. Split execution is thus a cooperative way of executing jobs between an agent and a resource.
Condor also provides secure communication and secure execution. With these features it has definitely provided a good solution to many of the difficulties in distributed computing.
Flaws:
1.> Condor is not a guaranteed service; it compromises on availability. One could submit a job and then wait for several hours before it starts running, due to a lack of available resources.
Applications
Condor is used in many places, for example in:
1.> Human genome processing
2.> Video rendering
3.> Computational analysis of semiconductor chips
Posted by: Vinod Ramachandran | January 31, 2011 02:06 AM
Summary: Condor is a well-developed system for identifying unused compute cycles and making them available to other users, both within and outside an organization. Important problems considered by the Condor project include representing diverse management policies in a scheduling system and securely executing untrusted code without placing a large burden on programmers.
Problem: The computing needs of a reasonably sophisticated user can vary considerably over time. Condor addresses the problem of smoothing out the discrepancies between computing needs and capabilities caused by this variation. This service allows users to rapidly acquire more computing power, while preventing the waste of excess power. Without Condor, an organization would need to build specific infrastructure for its most advanced computing needs. Instead, it merely has to ensure that its entire infrastructure, including parts shared by partners, can collectively meet all its computing needs.
Contributions: Condor combines computing resources across a variety of boundaries, including geographic and organizational lines. Recognizing that it cannot impose a uniform structure on resource availability and expect all resource owners to cooperate, Condor opts to remain as flexible as possible with respect to resource scheduling and to leave the owner in control. The owner can even instantly rescind resource allocations, which increases owners' trust and encourages them to keep their resources in the system. Condor's flexibility is made possible by a special language for describing resource availability and requirements. Unlike some other batch scheduling systems, where resources are described by a fixed set of configuration options, Condor's ClassAds language claims to support any policy.
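As an illustration of that claim, here is a minimal Python sketch of ClassAd-style two-sided matching. The attribute names and policies are invented, and real ClassAds are expressions in their own declarative language rather than Python lambdas.

```python
# Minimal sketch of ClassAd-style matching. Each ad is a schema-free bag of
# attributes plus a Requirements predicate over the other party's ad. The
# attribute names and policies below are invented for illustration.

machine_ad = {
    "Type": "Machine", "Arch": "INTEL", "OpSys": "LINUX",
    "Memory": 4096, "KeyboardIdleMinutes": 42,
    # Owner policy: only run jobs once the keyboard has been idle a while.
    "Requirements": lambda other, me: me["KeyboardIdleMinutes"] > 15,
}

job_ad = {
    "Type": "Job", "Owner": "alice", "ImageSize": 1024,
    # User policy: needs a Linux machine with enough memory for the job.
    "Requirements": lambda other, me: (other["OpSys"] == "LINUX"
                                       and other["Memory"] >= me["ImageSize"]),
}

def matches(a, b):
    # A match requires BOTH Requirements expressions to be satisfied.
    return a["Requirements"](b, a) and b["Requirements"](a, b)

print(matches(job_ad, machine_ad))  # True: both policies are met
```

A match succeeds only when each side's Requirements expression is satisfied by the other's ad, which is how owner policy and user policy meet without any fixed schema.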
Condor's role is still important, because many organizations still have the kinds of heterogeneous computing resources that Condor specializes in scheduling. The current trend towards cloud computing or even private clouds still involves an added cost that can be avoided if an organization can scavenge cycles from its existing resources to satisfy its needs. However, there may be applications better suited to a low-latency datacenter cloud environment. An example may be data-intensive applications, as data still needs to be staged to multiple sites in a distributed Condor setting.
Flaws: For a system that executes code from widely distributed sources, Condor makes a dangerous assumption that jobs will be reasonably well-behaved. Even with the existing safeguards, poorly or even maliciously crafted jobs can still cause minor headaches for resource users. The very nature of cycle scavenging can incentivize resource owners to configure Condor in ways that do not serve the best interests of actual resource users, increasing the likelihood that background jobs will cause problems in subtle yet noticeable ways.
Another flaw of Condor is that it does not mask the extreme unreliability of the underlying computing resources. The check-pointing system only works for certain kinds of jobs, and job creators must otherwise segment their work into small units of time to avoid losing results from resources that become unavailable.
Applicability: The paper discusses several real systems where Condor has succeeded. The most promising are those that exist within a single organization, because the policy negotiation and latency constraints are not as ambitious as a worldwide Condor grid. The worldwide map in the paper depicts mostly independent pools, suggesting that the touted inter-organizational flexibility of Condor has its limits.
Posted by: Todd Frederick | January 31, 2011 10:01 AM
Summary: The paper discusses the mechanisms employed by Condor for allocating and utilizing a distributed heterogeneous pool of compute resources. Class advertisements, matchmaking, problem solvers, shadowing, sandboxing, and checkpointing are all critical components that have evolved over time and contributed to Condor's success.
Problem: The high-level problem addressed by the Condor system is providing users with batch compute capabilities from a pool of sometimes available, heterogeneous resources. Unlike traditional distributed systems, Condor manages hardware resources whose primary purpose are serving specific individuals, not participating in a batch computing resource pool. This means Condor must carefully manage resources to ensure resource owners' needs are met before serving the users of the distributed system. Owners may frequently come and go, requiring Condor to constantly deal with changing resource availability. Furthermore, Condor must provide a mechanism to support a heterogeneous mix of owner policies when assigning tasks to the compute resources.
At the same time, users of Condor pools expect their jobs to be completed to their specifications. Jobs may require specific memory, CPU, and disk resources to complete. Additionally, jobs may be composed of tasks that must be completed in a specific order. Users of Condor are well aware of its distributed nature, designing jobs which are composed of many small tasks. However, users want changes in resource availability and the heterogeneity of computing environments to be transparent to the execution of their jobs.
Contributions: Condor employs several key designs to address the problem of building a batch compute system from sometimes-available, heterogeneous resources. First, class advertisements and matchmaking allow resource owner policies to mesh with Condor user requirements. The relatively unstructured class advertisements provide the necessary flexibility to accommodate the heterogeneity that is unique to the Condor distributed system. Second, the shadow and sandbox address the homogeneity desired by users and the system protection desired by owners. The standard universe and Java universe provide consistent environments across all machines in a Condor pool, and the universes allow execution to be checkpointed if resource availability changes and task migration is required. The sandbox also protects resource owners from malicious or out-of-control Condor jobs.
Flaws: The requirement for a Condor pool to employ a single, logically centralized matchmaker is one flaw in Condor's design. While the design makes it easy to match resources to jobs, it has the potential to become a bottleneck in the system. Reliability is also a concern, although Condor can re-establish matchmaking state by requesting information from agents and resources in the event of matchmaker failure.
Applicability: Condor is a widely deployed tool in active use by many researchers around the world. It is a real system that must evolve to meet the changing technology and functionality demands of users and resource owners. However, the ideas employed in Condor also have applicability to other distributed systems. Most notably, the sandbox, which provides a homogeneous environment, and checkpointing can be employed in distributed systems which need to easily move jobs between hardware resources. Although cloud computing has dedicated resources and uniform owner policies, the heterogeneity of user requirements is similar to Condor's. Here, mechanisms such as class advertisements and matchmaking are similar to the resource assignment requirements in cloud computing.
Posted by: Aaron Gember | January 31, 2011 10:26 AM
Summary
This paper presents the evolution and structure of Condor, an opportunistic high-throughput distributed batch computing system. Once a user has submitted a job, Condor is able to locate idle resources, distribute the job to these resources in a manner safe for both the user and the resource, and return results back to the user.
Problem
Condor provides a distributed computing model in which users are able to take advantage of free resources without worrying about the intricate details of ensuring this system works correctly and fairly. Unlike contemporary systems such as Locus and Grapevine, Condor provides opportunistic, high-throughput computing that flexibly adapts as resource availability changes.
Contributions
Condor provides a cooperative computing environment in which participating machines are used for batch job processing in a flexible and unobtrusive manner. Users submit jobs to the system, and when an appropriate machine or set of machines is available, the job is executed. Each of the three main components--agents, resources, and the matchmaker--enforces various policies put forth by system users. Agents enforce user policies, resources enforce machine owners' policies, and the matchmaker enforces general community policies.
Flexibility and control are key contributions of Condor. Users have the ability to make requests of the machines that will execute their jobs, and machine owners can likewise restrict the jobs that can be run on their systems, both using the ClassAd language. Thus, users are provided with enough control that it is unlikely they will be adversely affected by using Condor; this is especially important for machine owners, who might not want to participate if they could not explicitly control how and when their resource is used. When a resource leaves the pool, jobs currently running on the machine can be seamlessly migrated to another machine in the pool.
Protection is also a key component of Condor and an important contribution. Using shadows and sandboxes, users are assured that their job can run effectively and with the correct resources, and machine owners are protected from any harm caused by the job.
Flaws
Job isolation and security may be a more fragile aspect of the system. Although jobs are shadowed and sandboxed to protect the user and machine owner, it seems possible that malicious jobs that understood how the execution environment was constructed might be able to break out of the sandbox and affect the machine. Such jobs might not only cause security threats but also break the model in which the resource specifies when a job can run: for instance, a job may be supposed to run only while the resource is idle, but continues to run even after the machine owner has begun to work again. Jobs might even inadvertently cause some or all of these behaviors--just last week there was a Condor job that caused hangs on CSL machines.
Applicability to Real Systems
Condor is extremely applicable to real systems because Condor is a real, deployed system that is used by real clients. The evolution of Condor provides an important history of user needs and the corresponding design decisions. Further, Condor addresses issues of flexibility, reliability, and security that are relevant for many kinds of distributed systems.
Posted by: Emily Jacobson | January 31, 2011 02:31 PM
Summary: This technical paper describes the design ideas, evolution and the working of distributed computing system Condor.
Problem statement: Lots of problems (see the Condor case study examples) need large amounts of computational power. Organizations have a lot of workstations at their disposal which are not busy all the time. If only we could harness this raw compute power... But putting together a system with all the desirable properties of a DS -- scalable, fault-tolerant, secure, performant, highly usable -- is not so easy. Condor is such a system.
Summary of contributions: DS architects can take many lessons from Condor. Large systems are complex by nature, and hence should incorporate flexibility as a founding principle. The design should give room for the natural growth of a DS (scalability and extensibility). The harnessing of power should be as unintrusive to the user as possible. Organizing cluster nodes into communities simplifies the design and enables us to enforce policies of admittance and usage. There is a distinction between planning (how to get a job done) and scheduling (how much a resource will be able to contribute). Security should be built in at multiple levels. Abstracting out system layers and resources aids homogeneous design.
The paper describes how the Condor system supports, or evolved to support, the features described above. Points of note are the concepts of flocking (interaction between Condor pools), gliding in (an elastic, virtualized Condor pool that can be run in an alien batch system via GRAM), and checkpointing and split execution, which enable the migration of live jobs, thereby providing fault tolerance.
Condor has a large number of moving parts and is a complex system. The authors attribute the complexity to the responsibility-oriented (as opposed to functionality-oriented) design. Another reason could be the timescale of its existence (and other engineering requirements such as backward compatibility).
Flaws:
1. I'm not sure how well the schema-free ClassAds language works between various Condor pools -- attribute names of the same things might differ between organizations unless there are some agreed upon primitives.
2. This seems like a very complex design to replicate.
3. Job submission requires extending Condor C++ classes, wrapper scripts, relinking, etc. The setup time for jobs might be a significant overhead.
4. It's not clear whether Condor has support for monitoring live jobs, at least to provide an ETA.
5. Comparison with similar systems would have put things in perspective.
Discussion: This is a different flavor of distributed systems from what we read about in the other papers (Grapevine, Google, etc.). Here the focus is on cycle-stealing rather than full utilization.
I wonder how much manual intervention the system needs when executing jobs.
As I was reading the paper, I could not help thinking about VM-based clusters. Many design issues and concepts appear common between cloud-based systems and Condor: protection, remote execution, hot migration, abstraction of resource access (say, files), policy-based resource management, etc.
Posted by: Srinivasan T | January 31, 2011 02:48 PM
Summary
This paper gives a broad overview of the history and design of Condor, a distributed computing platform.
Problem
Communities would like to share computing resources for batch job processing. Idle time is usually available on networked desktop PCs; however, configuration and speed vary wildly, and nodes may leave and join often. The owner of each resource may want to be able to control when it is available and who may use it. Additionally, programming distributed batch jobs is difficult due to the many types of failures and the varying environments on the nodes.
Contributions
Condor ClassAds describe job requirements and resource abilities and requirements, as well as a ranking order for both. Condor matchmakers then suggest matches between jobs and resources. The ClassAd system is generic and easily extended, so it can be used to match many types of shared resources, even non-computing resources.
The shadow and sandbox system isolates jobs from each other and from the computer offering the resource. It also sets up a friendly environment for jobs, giving them easy access to needed files.
Condor provides a master-worker abstraction to jobs to simplify distributed programming, handling failures and retries. DAGMan takes lists of job dependencies and runs jobs in an appropriate order (a toy sketch follows). Condor can also checkpoint jobs to allow for fast recovery after a failure.
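As a rough illustration of DAGMan's role, the following Python sketch runs invented jobs in dependency order; real DAGMan instead reads JOB and PARENT/CHILD statements from a DAG file and submits each job to Condor.

```python
# Toy sketch of DAGMan-style ordering: run each job only after all of its
# parents have completed. The jobs and dependencies below are invented;
# real DAGMan reads JOB and PARENT/CHILD statements from a DAG file.
from graphlib import TopologicalSorter  # Python 3.9+

# Equivalent to: PARENT A CHILD B C ; PARENT B C CHILD D
dependencies = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def run(job):
    print(f"running {job}")  # stand-in for submitting the job to Condor

for job in TopologicalSorter(dependencies).static_order():
    run(job)  # A first, then B and C in some order, then D last
```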
Flaws
The paper gives a good general overview of Condor without dealing too much with specifics, which are left to the cited papers. I felt that one flaw in the structure of the paper was that a description of the system from the point of view of someone writing a job did not appear until late in the paper. I felt this should have come earlier, because it gives a good indication of what the actual capabilities of the system are.
Discussion
I felt that the Condor design was very general, because it can be used with many different types of shared resources. This generality means that the design is unlikely to become outdated soon, and Condor’s longevity speaks to this effect. Condor’s generality also has the downside that more work is needed to track and support many different resource and job attributes. Further, the available attributes will likely change as new resources become available, which adds difficulty in supporting older jobs. A more uniform running environment for jobs would be easier to support, but would not expose as many resources to jobs.
Posted by: Aaron Brown | January 31, 2011 06:06 PM
Summary:
This paper gives a fairly extensive overview of the components and development of Condor, UW Madison’s distributed computing project. It outlines the design goals behind Condor as well as the software framework under which it operates.
Problem:
Condor seeks to provide a flexible distributed computing system, capable of running on a wide variety of heterogeneous machines. Because Condor is not the primary user of these machines, it needs to be able to operate unobtrusively in the background, backing off and redistributing its work to other nodes when need be and interfering minimally with the active computing environment. It also needs to be able to match nodes to jobs given a specified set of requirements for both the job and the node, potentially doing so across networks without violating administrative policies. Finally, it needs to be compatible with GRAM, a generic protocol for distributed computing.
Contributions:
Condor’s primary contributions arise from its need for flexibility in dealing with heterogeneous machines, most of which may have higher priorities than running Condor. ClassAds provide a means for matching jobs to resources, ensuring both that resources are capable of running a given job and that they are willing to run it. In addition, universes deal effectively with the problem of ensuring a standardized environment across disparate machines; they comprise shadows, which provide the details of the job on an ad-hoc basis to allow maximum flexibility in allocation, and sandboxes, which provide safe, consistent environments in which jobs run. Both of these features are not often seen in corporate distributed computing frameworks, which can often rely on relatively homogeneous nodes dedicated to the jobs assigned to them. In addition, because it provides two separate programming frameworks, Condor can address a broader variety of problems and tasks than some of its more focused brethren.
Flaws:
While the paper provides a good overview of Condor’s functionality, there are a few areas that are lacking. The paper handles the question of resource node failure loosely, claiming it will be handled by the agent, but never outlining the precise means of doing so. In addition, it provides little information on how it handles problems like stragglers. Finally, even though ClassAds have no schema, it seems like it would be useful for local Condor pools, and perhaps the global Condor system, to publish a set of commonly employed attributes to minimize requirements that have no matches.
Implications:
The paper outlines a robust, powerful framework for distributed computing, addressing the many issues necessary for users to write effective programs easily. In many ways, it resembles a more generalized counterpart to something like MapReduce; more powerful and expressive, but perhaps not as optimized or as easy to use. It seems primarily useful for academic purposes; without knowing its performance, I’m unsure how suited it would be for industry use.
Posted by: Chris Dragga | January 31, 2011 08:16 PM
Summary:
This paper outlines the core components of the Condor system and describes how the design of these components reflects flexibility, the basic design philosophy of the system.
Problem:
The basic philosophy of the Condor system is flexibility. The problem this paper solves is how to design a distributed system that provides ready access to computing resources while remaining flexible.
Driven by their experience in providing distributed computing resources, the authors think that the flexibility requirements mainly come from the following aspects:
1. There are many kinds of relationships among people in a Condor community, so the Condor system should support many kinds of schemata, protocols, relationships, and obligations.
2. In order to attract more participants, maximum control of computing resources should be kept by their owners.
3. Many unexpected things happen during the execution of a job, so the system should provide strategies to retry or reassign the computing work.
Solution:
In a local Condor pool, there are four parts: problem solvers, agents, resources, and the matchmaker. A problem solver is built on top of the Condor agent and is a programming model for managing large numbers of jobs. Users submit their jobs to agents, which record the jobs in persistent storage and find resources willing to run them. Resources are the nodes that run jobs and return results to the agent. The matchmaker is responsible for matching job requests with resources.
The job-resource matchmaking process includes four steps. First, agents and resources advertise their characteristics and requirements to the matchmaker. Second, the matchmaker finds a pair that can fulfill both the agent's and the resource's requirements. Third, the matchmaker informs the related agent and resource. Finally, a claiming step allows the resource and agent to independently verify the match (a sketch of this cycle follows). Resources keep the right to schedule the jobs placed on them, but agents can make plans based on the performance of resources. A language called ClassAds is created and used by both agents and resources to publish their constraints and preferences. There is a shadow component on the agent side and a sandbox component on the resource side; these are used by agents and resources to communicate with each other, to wrap local system calls, and to protect the local environment.
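A hypothetical Python sketch of this four-step cycle (all names are invented; the real protocol runs over the network and involves many more checks):

```python
# Hypothetical sketch of the four-step matchmaking cycle described above.
# All names are invented for illustration; this is not Condor's actual API.

class Resource:
    def __init__(self, name):
        self.name, self.busy = name, False

    def still_willing(self):
        # Re-checked at claim time: the owner may have reclaimed the machine
        # between the match and the claim.
        return not self.busy

class Agent:
    def claim(self, machine_ad):
        # Step 4: claiming. Agent and resource verify the match independently.
        resource = machine_ad["resource"]
        if resource.still_willing():
            resource.busy = True
            print(f"claimed {machine_ad['Name']}")
        else:
            print(f"{machine_ad['Name']} refused the claim; will retry later")

class Matchmaker:
    def __init__(self):
        self.jobs, self.machines = [], []

    def advertise(self, ad, kind):
        # Step 1: agents and resources advertise their ads periodically.
        (self.jobs if kind == "job" else self.machines).append(ad)

    def scan(self):
        # Step 2: find a mutually acceptable pair.
        for job in self.jobs:
            for machine in self.machines:
                if job["Requirements"](machine) and machine["Requirements"](job):
                    job["agent"].claim(machine)  # step 3: notify the agent
                    break

mm = Matchmaker()
mm.advertise({"Name": "node1", "Memory": 2048, "resource": Resource("node1"),
              "Requirements": lambda job: job["ImageSize"] <= 2048}, "machine")
mm.advertise({"agent": Agent(), "ImageSize": 512,
              "Requirements": lambda m: m["Memory"] >= 512}, "job")
mm.scan()  # prints: claimed node1
```

The point of the separate claiming step is visible in still_willing(): by the time the agent arrives, the owner may have reclaimed the machine, so both sides re-verify before any work starts.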
In order to provide resources beyond the local Condor pool, the authors describe three architectures: gateway flocking, direct flocking, and gliding in.
Flaw:
1. As the basic philosophy of Condor's design, flexibility is a kind of trade-off between cost and computing power. In order to attract more resources into Condor pools, the authors keep control of resources with their owners and design Condor's architecture to support many kinds of computing resources. Since the price of PCs has been falling in recent years, this flexibility-first philosophy is open to question: communicating among, and tolerating, many different kinds of PCs in different areas wastes a lot of computing capacity and programmer effort on architectures and strategies. Today, buying many PCs with the same configuration may be a better way to provide computing resources.
2. There is only one matchmaker in a Condor pool, and if this matchmaker has a problem, the pool cannot work correctly. Condor should replicate the matchmaker and provide a recovery strategy for when the working matchmaker crashes.
Implication :
1. The matchmaker is a kind of control point in a local Condor pool. Agents and resources publish their constraints and preferences to the matchmaker in ClassAds. This makes it possible to put different kinds of agents and resources together in the same Condor pool and make their differences transparent.
2. Shadow-sandbox design: the shadow is responsible for providing everything needed to specify the job at run time. The sandbox is responsible for giving the job a safe place to work, and also provides protection from harm that a malicious job might cause. This design provides communication between the two sides, sets up the preferred run environment for the job, and protects the resource machine.
3. Programming models: like MapReduce, master-worker and DAGMan are programming models provided by Condor. By making full use of Condor's architectural features, these two programming models help users describe their jobs.
Posted by: Linhai Song | January 31, 2011 09:31 PM
Summary
This paper describes the evolution and current state of the Condor distributed system. Condor is a distributed system that allows various communities to share their computing resources with each other, enabling batch execution.
Problem
There are often times when people need access to more computing power than is available in their academic or enterprise community. There are also frequent times when many of the computers in a community are idle. The purpose of Condor is to reduce this imbalance, increasing the compute power available while reducing waste. Creating a system that encourages communities to share is challenging, because everybody has different policies and people are often unwilling to be greatly inconvenienced while helping others.
Contributions
Flaws
Application to Real Systems
Condor is already a real system and has been used by C.O.R.E Digital Pictures and Micron Technology for solving various problems. Perhaps the most broadly applicable lesson to other systems is that there needs to be an intelligent and adaptable way for pairing jobs with nodes in a heterogeneous environment (such as ClassAds).
Posted by: Tyler Harter | January 31, 2011 10:23 PM
Summary
This paper presents Condor, a distributed computing project which allows sharing of computing resources on a massive scale, even when those resources are highly heterogeneous, scattered, unreliable, and under the control of different administrators. The notion of independence permeates Condor, with the owner of each participating machine retaining ultimate control over what happens on that machine. Condor provides a matchmaking system which matches jobs to machines, a set of “problem solver” templates which can be used to easily represent distributed computing jobs, and a set of “universes” which are consistent execution environments for job code.
Problem:
Condor aims to provide a distributed system for job execution which solves not only the technical problems inherent in distributed systems, but also the attendant social and organizational problems. The authors mention the vision of “grid computing”, where computing resources are shared across organizational boundaries. For such a vision to be realized, resource sharing must not expose any participant to undue security risk, it must not conflict with the policies of the administrators of any participating machine, and the barriers to participation must be low (hence the Condor motto “leave the owner in control, regardless of the cost”).
Contributions:
The notion of independence permeates the design of Condor. Individual machines act in one or more of three roles (agent, resource, or matchmaker) and independently enforce whatever policies their owner sets for them. Agents and resources make independent local decisions about which nodes to trust and communicate with, while matchmakers enforce communal policies for a pool of machines (note that agents and resources may be part of more than one pool, each with its own matchmaker). An agent and resource matched by a matchmaker are not obligated to honor the match, and either side may refuse to work with the other according to local policies.
Matchmaking in Condor is very flexible thanks to the schemaless ClassAd system, which allows jobs with arbitrary user-defined requirements submitted by agents to be matched with resources meeting those requirements. The lack of a fixed schema for ClassAds allows users to define new classes of requirements as needed.
Condor-G allows Condor to interoperate with other batch-processing systems supporting the GRAM protocol, and can even coax such systems into supporting expected Condor features by “gliding in” a Condor server as a job.
To ease the use of Condor, “problem solver” templates are provided for master-worker and directed-acyclic-graph tasks (where some jobs depend on others completing); a rough sketch of the master-worker idea follows. Condor also provides several “universes”, which are consistent execution environments for programs, including a standard UNIX universe and a Java universe.
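As a rough sketch of the master-worker idea, the following toy Python loop requeues tasks whose worker fails, which is essentially how the problem solver masks evictions and crashes; the task list and the failure simulation are invented.

```python
# Toy sketch of the master-worker problem solver: the master keeps a work
# queue, hands tasks to workers, and requeues tasks whose worker "fails".
# The task list and the failure simulation are invented for illustration.
import collections
import random

def flaky_worker(task):
    """Stands in for a remote worker that may vanish mid-task."""
    if random.random() < 0.2:
        raise ConnectionError("worker evicted by the machine's owner")
    return task * task

work = collections.deque(range(8))
results = {}
while work:
    task = work.popleft()
    try:
        results[task] = flaky_worker(task)
    except ConnectionError:
        work.append(task)  # the master recovers by reassigning the task
print([results[t] for t in sorted(results)])
```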
Flaws:
The authors argue that the schemaless nature of ClassAds is a strength. However, in order for an agent and a resource to be matched, they must describe their requirements and capabilities in the same format (i.e. they must agree on at least a partial schema), so ClassAds really just push the responsibility for defining a schema onto the users (although, in some cases, the ability to support heterogeneous and organically evolving schemas could still be seen as an advantage).
The sandboxes provided by some of the universes may be vulnerable to exploit and constitute a security risk. If there is an unpatched local vulnerability on a resource machine, an attacker could use it to take over that machine (this seems most likely in the standard UNIX universe, which is not chrooted and relies on ACLs and the restricted privileges of “nobody” users for security). Furthermore, since many machines act as both agents and resources, a machine thus compromised could spread the exploit to many more machines. The malicious exploit could thus spread through the “web of trust” of even non-promiscuously configured Condor nodes. Running the universes inside isolated virtual machines, as is done in PlanetLab, for example, may offer some protection against this type of attack.
Applicability:
As anyone who has run “top” on a CSL workstation knows, Condor is alive and in use today. It has evolved to meet the needs of a widely distributed and diverse user base, and its particular emphasis on independence and local control seems important for any grid computing system which crosses organizational boundaries.
Posted by: Craig Chasseur | January 31, 2011 10:54 PM
Summary:
The paper describes a broad history and the design of the Condor project, which is a high-throughput distributed software framework.
Description of Problem:
Some workloads require a lot of computing time and power, such as image rendering or the NUG30 problem mentioned in the paper. A lot of the time, our computers and gaming systems (such as the PS3 and Xbox 360) are just idling, not doing any work. It would be nice to be able to distribute such workloads across all the processing power that is being shared. But there are many issues that need to be considered, such as:
1. The system needs to be scalable, supporting different types of machines
2. Owners of machines will want to keep control of their machines
3. Planning and scheduling of a workload
Summary of contributions:
There were many ideas listed in the paper, but I'm not sure which ones were novel ideas.
1. Creating a flexible distributed system using idle machines
2. The Condor Kernel design
3. Matchmaking techniques such as direct flocking
4. Scheduling and Resource management through ClassAds
5. Split Execution represented by a Shadow and a Sandbox
Flaws:
I did not fully understand how the matchmaker is picked, but there seemed to be only one matchmaker per pool, which looks like a possible bottleneck if the matchmaker is overloaded or fails. Also, from reading the paper I got the feeling that debugging a workload is a tough task; that might not be a flaw, but it would be nice to have more debugging functionality.
Application to real systems:
Condor is already being used in research and in business, for example to solve complicated math problems, to study genetics, and to render images. I wonder if there are any practical uses of such a system for running a video game.
Posted by: Kong Yang | January 31, 2011 11:23 PM
Distributed Computing in Practice: The Condor Experience
Summary :
The paper describes the experiences with, and the various features and elements of, the high-throughput computing system called Condor. Condor is based on a philosophy of flexibility, allows users to set policies for the usage of their systems, and is built around handling failures and retrying.
Description of the problem they are trying to Solve:
Condor provides high-throughput batch computing using free resources from community PCs. It provides a job management mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Condor's architecture allows it to perform well in environments of high-throughput computing and opportunistic computing; high throughput is achieved by opportunistic means.
Contributions:
The paper describes various components which were developed to support high-throughput computing. Features of the system include ClassAds (a framework for advertising the availability and requirements of computing resources), job checkpointing (providing fault tolerance and easy migration from one machine to another), and important components like the agent, the resource, the matchmaker (which connects agents to resources), and the shadow and sandbox execution environments. The paper discusses policies like gateway flocking and direct flocking, the planning of resources, and scheduling policies for assigning jobs to resources. Condor implements two types of problem solvers: Master-Worker, in which the master assigns work to the workers and their output affects the work queue, and DAGMan, in which a dependency tree of jobs can be defined (a job might depend on the execution of other jobs) and hence the order of execution. The shadow and sandbox serve the needs of the jobs and provide a secure environment for execution.
Flaws:
Although the paper talks about the evolved architecture of the system, its components, and its policies, it lacks the experiences of where such a system broke and what was done to resolve it, or whether there are mechanisms for automatic recovery. I somewhat related the system to MapReduce: not exactly similar, but it is a distributed system that tries to match available resources with the requirements of a set of jobs, as compared to one job over a large dataset (MapReduce). I could not tell whether they have considered or dealt with malicious jobs which could possibly corrupt or break the system, or cases where the sandbox and shadow environments were insufficient to contain such jobs. Also, an evaluation of the system and how it performs in production is missing (probably available in the references): what were the bottlenecks of the system, and what could it not achieve?
Ideas Applicable to the real world system:
The paper provides a lot of ideas which are applicable to real-world systems; indeed, Condor is deployed and used in the real world, proving the usefulness of its components. Components like checkpointing and automatic failure detection with retries are heavily useful and often implemented in the real world. The problem solvers (Master-Worker and DAGMan) and the shadow and sandbox, which provide a safe, home-like environment for the execution of jobs, are also good examples of components truly applicable to the real world.
Posted by: Ishani Ahuja | January 31, 2011 11:27 PM
Summary
This paper describes the philosophy, history, and basic implementation of Condor, a distributed system for batch computation. Condor’s key objective is the efficient utilization of idle and heterogeneous resources, and the guiding design principle is extreme flexibility.
Problem
Easily commanding the computational power of heterogeneous and non-local resources is a pervasive problem in distributed computing. Within this scope, the Condor system primarily targets computationally intensive batch jobs, where throughput and efficiency are of much higher concern than latency.
The fundamental approach uses components which have distinctly separated concerns. At the top level, a “Problem Solver” manages the flow of the portions of the job. It relies on an “Agent” to procure resources and provide reliability. The agent communicates with a “Matchmaker,” which tracks and suggests possibly compatible resources. The “Resource” itself is responsible for determining appropriate scheduling.
Implementation Challenges
One of the biggest challenges is making the system useful for a variety of environments. To this end, Condor introduces ClassAds, which is a language that is used between a matchmaker and potentially many resources and agents, to coordinate their compatibility. ClassAds have no specific schema, so system implementers are free to invent their own properties/requirements, or dynamically change them at will.
Condor must also provide a high degree of control to system administrators so that their computing systems are used only in the intended manner. The ClassAds language provides mechanisms for resources to specify who is able to run jobs and when they are allowed to run. Condor also limits job interference by using sandboxing techniques, which restrict the system privileges of running jobs.
Because of the varied nature of Condor resources, user transparency is especially important. One way Condor furthers this goal is by providing a runtime library that intercepts I/O system calls and relays them to the user’s file system, which makes it seem as if the job is running in the user’s home environment. Condor also provides transparency through a checkpoint mechanism, which is able to pause and resume jobs on different machines according to resource restrictions or hardware failures; a toy sketch of this idea follows.
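A toy Python sketch of the checkpoint idea (Condor's real standard universe checkpoints the entire process image transparently, with no source changes; this just saves one loop's state under an invented file name):

```python
# Toy sketch of checkpoint/restart: periodically save the job's state so a
# migrated or restarted job resumes where it left off. Condor's standard
# universe checkpoints the whole process image; this just pickles one loop.
import os
import pickle

CHECKPOINT = "job.ckpt"

def run(total_steps):
    step, acc = 0, 0
    if os.path.exists(CHECKPOINT):  # resume after migration or failure
        with open(CHECKPOINT, "rb") as f:
            step, acc = pickle.load(f)
    while step < total_steps:
        acc += step                 # one unit of "work"
        step += 1
        if step % 100 == 0:         # periodic checkpoint to stable storage
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((step, acc), f)
    return acc

print(run(1000))  # kill and rerun: it picks up from the last checkpoint
```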
Limitations
One fundamental limitation of Condor is that there must exist some trust between the user and the resource owner. The sandbox is the natural target for exploitative attacks. Hopefully there is a mechanism to detect such discrepancies and blacklist offending users.
It’s fairly convincing that the checkpoint mechanism would be able to transparently shut down and resume a job. However, I’m not sure how this method achieves consistent I/O in the face of hardware failures.
Also, a schema-free resource allocation policy seems desirable for flexibility, but interoperability between organizations could suffer because of incompatible requirement expressions.
Thoughts
Condor’s widespread use has proven that there is a distinct market for parallel computing systems which opportunistically take advantage of available resources. Even in its simplest capacity, I’ve found Condor to be a useful tool for increasing productivity with batch computations. Overall, the Condor experience has shown us how a powerful system can still be reliable, secure, and above all flexible.
Posted by: Tony Nowatzki | January 31, 2011 11:31 PM
Summary:
The paper describes how Condor evolved from an experimental system into a massive and widely deployed high-throughput cooperative batch computing solution. The authors describe various components of Condor's design and outline how these components interact to present the user with a secure, distributed computing facility that harnesses idle computing resources in and across organizations.
Problem:
Large amounts of computational power achieved through supercomputers have been an expensive and scarce resource. People realized that the idle CPU time wasted on widely available, inexpensive commodity computers could be collectively put to work to achieve much-needed computational power. But fundamental issues arising from uncontrollable failures and the heterogeneity of machines have always kept people from utilizing such distributed computing power. Condor provides a highly scalable and flexible computing system which can consolidate the available distributed computing power for use by computationally intensive jobs.
Contributions:
Condor has various modules, viz. the shadow, sandbox, matchmaker, agent, and problem solver, which are responsible for separate but inter-related tasks. The modular design of Condor allowed it to incorporate diverse policies for job scheduling and resource management, support various communication protocols, and seamlessly interface with other batch systems. The division of responsibility among independent modules made it easy for Condor to plug in functionality (e.g., gateway/direct flocking, secure communication) as and when it became critical. Throughout its evolution, Condor did not undergo radical changes in its core architecture. Condor puts forward an example of how good design can allow a seemingly complex system to remain flexible, scalable, and robust.
The success of Condor in achieving its goal shows that a distributed system need not be technically hindered by the heterogeneity of machines and their operating systems, the implementation technologies used by tasks, or the communication protocols used among machines. Condor shows how an absolutely alien machine can be used to run tasks on behalf of other machines by creating an appropriate execution environment and fetching the required data for processing as and when needed.
Flaws:
Since the various jobs executed on Condor are fairly independent, centralizing the matchmaking process does not make sense. Having a single matchmaker to manage both idle resources and unscheduled jobs can severely limit the size of a Condor pool. Also, multiple matchmakers could help improve the availability of the system in the event of machine failures or network partitions that make the matchmaker inaccessible to all or part of the pool.
One limitation that I see with Condor is that it forces jobs to run only inside sandboxes; thus a user can only run jobs for which Condor supports a corresponding universe. If an organization wants to run its own special jobs, it should be allowed to do so, as long as it makes sure that the necessary execution environment is set up on the machines that would be matched to run such jobs.
Applicability:
Condor has provided people with an inexpensive solution for aggregating the idle computation power within their reach. It has seen wide adoption in academic institutions as well as in industry, and researchers have employed Condor to run their computationally intensive jobs.
Posted by: Sandeep Dhoot | February 1, 2011 12:12 AM
Paper summary:
This paper introduces the technical evolution and philosophy of Condor, a high-throughput distributed batch computing platform. The authors share the problems encountered during development and the core technologies, such as direct flocking and split execution, that they used to handle those problems.
The main problems they were trying to solve:
1. To build a decentralized system in which every participant can control how its resources are used according to its preferences. A solution is needed to serve user requests efficiently in such a situation.
2. To build an international distributed system on top of heterogeneous commodity machines all over the world. Those machines may have different hardware, operating systems, applications, configurations, private policies, and network reliability. They may even crash. It is hard to manage those machines correctly and effectively.
3. To build a grid computing platform in which individuals can use the available machines in different communities and even different batch systems.
The main contributions of the paper:
1. It introduces a matchmaking algorithm to let agents and resources work together without violating their own requirements. In the algorithm, agents and resources first tell the matchmaker their characteristics and requirements using ClassAds, and the matchmaker determines which resource is assigned to which agent based on global information. After being notified by the matchmaker, the matched agent and resource contact each other and execute the job independently. This algorithm solves problem 1 above.
2. Introduces "split execution" technique to guarantee the job to be executed securely and correctly in a hostile environment. The "shadow" component in the agent is used to provide necessary information needed the running job, while the "sandbox" component is used to reproduce the appropriate environment in the resource machine for the job to run.
3. To solve problem 3, this paper presents the different solutions used along the evolution of Condor: gateway flocking, direct flocking, GRAM, and gliding in. The final solution, "gliding in", combines the advantages of GRAM and the Condor machinery: Condor servers are submitted to multiple remote batch systems as normal jobs, so that a user can form a personal Condor pool, with a personal agent and matchmaker, to run his jobs.
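As a rough Python sketch of the sandbox side of split execution mentioned in point 2 (POSIX-only, and far simpler than Condor's real sandboxes, which also run jobs as a low-privilege user and clean up afterwards; all names below are invented):

```python
# Hypothetical sketch of the sandbox side of split execution: give the job a
# scratch directory and cap its resource usage before running it. POSIX-only;
# Condor's real sandboxes do far more (e.g., run as a low-privilege user).
import resource
import subprocess
import tempfile

def run_sandboxed(argv, cpu_seconds=60, mem_bytes=256 * 2**20):
    scratch = tempfile.mkdtemp(prefix="sandbox-")  # private working area
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return subprocess.run(argv, cwd=scratch, preexec_fn=apply_limits)

run_sandboxed(["echo", "hello from the sandbox"])
```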
The potential flaw of the paper:
Based on my understanding of the "gliding in" approach, when the servers begin executing on the remote machines and try to contact the personal matchmaker, or while the matchmaker is making match decisions, if the matchmaker is down there is no way for the servers to know whom to contact next after the failure. In this situation, the agent may need to resubmit the Condor servers to form another personal Condor pool, which may be time consuming.
In order to execute a job, the resource needs to create a sandbox to make a safe place for the job to run, which takes time. It would be helpful if the ClassAd could include more information about the job, such as its configuration and environment, so that the matchmaker could assign the job to a machine whose sandbox already has an environment similar to what the job requires. This would make sandboxes sharable. Otherwise, although the sandbox successfully makes a hostile machine usable by the job, it imposes some overhead to set up the environment. If there are lots of jobs with short running times, the overhead of the sandbox may be large compared to the job running time.
How the ideas in the paper are applicable
As described in the paper, mathematicians used Condor's Master-Worker (MW) model to execute a parallel NUG30 problem solver over 2500 CPUs at ten different sites spanning eight institutions. "Direct flocking" and "gliding in" were used to make resources around the world accessible via the Globus GRAM protocol.
Posted by: Weiyan Wang | February 1, 2011 02:37 AM
Summary:
This chapter talks about the Condor software, which enables opportunistic use of the idle processors of commodity-class computers in a community-like environment. The variety of hardware and software configurations makes the design of Condor challenging. Condor conquers this complexity with flexibility: rather than using rigid configurations to enable the sharing of computing resources, Condor uses cooperative means to enable dynamic sharing. This is achieved by its modular nature, where different pieces of software communicate with each other before deciding which job is submitted to which resource.
Problem Statement:
In an organization with many commodity-class computers, there is always an opportunity to exploit idle computers to execute processes that would overburden the local system and reduce its responsiveness. To do this, one needs to find an idle or somewhat idle machine in the organization that is compatible with executing the process and willing to spend time doing the job, then transfer the job to this remote machine with all the supporting information, and finally collect the necessary state to continue local execution. All this needs to be done in a fault-tolerant way to recover from local or remote crashes, using sometimes transparent and sometimes blind mechanisms.
Contributions:
In designing Condor, flexibility was the goal, and we see that this goal is what drives Condor to its success. Condor allows each machine or group of machines to participate in cooperative sharing with its own set of rules. This is achieved by the use of a schema-free language called ClassAds, which allows jobs to specify their requirements for execution and machines to specify their rules for allowing the execution of jobs from other machines. Condor also performs periodic checkpointing of jobs to stable storage to enable preemptive scheduling and migration, which tends to make the whole system fault tolerant. Other fault-tolerance features include the sandbox and shadow components that enable remote execution. The shadow allows the remote job to securely perform remote I/O against the machine where the job was initiated, without transferring all the required files to the remote machine and without the need for a shared file system. Other contributions include the development of high-level abstractions for controlling large numbers of jobs, such as the Master-Worker scheme and the DAGMan service. Remote program execution without programmer intervention is made possible by ideas like the standard and Java universes. These universes try to incorporate all the standard resources and libraries that programs need in their respective environments, facilitating more fault tolerance in remote execution.
Thoughts:
The phrase ‘high throughput’ is sometimes misleading in the text. Condor may actually slow down the execution of an individual process, but it improves the overall completion rate of processes across the many machines participating in the system. This limits Condor’s usefulness in cases where we want to improve the throughput of a distributed system while maintaining data locality during execution. It also feels that, in terms of its placement in the software stack, Condor sits near the top. I wonder whether it would help to incorporate cooperative execution in the operating system layer; this would increase the complexity of the OS, but Condor is no trivial piece of software either. Perhaps RPC is not an afterthought when incorporating remote execution at the OS layer; maybe it is the best one could do. I got goosebumps while reading about how Condor works across different protocols to enable remote I/O. I see all of this as the first ingredient of unpredictable performance and uncertain failure. Maybe I am paranoid, maybe I am not.
Posted by: Paras Doshi | February 1, 2011 03:03 AM
Summary:
This paper discusses the history and development of the Condor distributed system. Starting with its inception in the 1980s and continuing up to the time of writing, it covers design philosophies, problems the team encountered as the system grew, and how they addressed those problems.
Problem:
One of Condor's more interesting aspects is its focus on flexibility: it seeks to bring together a large, diverse population of machines into one distributed system. Since it can make no assumptions about the hardware or software running on nodes in the system, getting these machines to interact and work well together is a challenge. On top of this, one of Condor's philosophies is that it should only make use of a node when that node is idle. This creates a very dynamic environment and introduces problems with machines entering and leaving the network, with stopping, starting, and migrating jobs, and so on.
Contributions:
The paper discusses a great many problems the Condor team encountered and how they fixed them. Some of their solutions have persisted over the years; some haven't. I think the biggest contribution of this paper is showing the evolution of a real-world system over several decades and how its needs and requirements changed as the system scaled. However, there are more tangible contributions as well. One lasting aspect of their system is the ClassAds language, used for matching resource requests with available resources. ClassAds provides flexibility in allocation policy (how resources are distributed) as well as in planning (how users acquire resources). With ClassAds, jobs and machines are matched together based on fine-grained criteria about both the job and the machine. Another important contribution, one that again improves the flexibility of the system, is CEDAR, the communication library used by Condor. CEDAR allows varying security policies and mechanisms to function together; it controls authentication, privacy, and data integrity for data and users on the network.
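To illustrate the bilateral nature of this matching, where each ad's Requirements are evaluated against the other party's ad, here is a toy Python sketch. The attribute names and the use of Python functions in place of the real ClassAd expression language are simplifications for illustration only.

    # Toy bilateral matchmaking in the spirit of ClassAds: each ad carries
    # a Requirements predicate evaluated against the *other* party's ad.
    # Attributes and lambdas are stand-ins for real ClassAd expressions.

    job_ad = {
        "Owner": "alice",
        "ImageSize": 300,
        "Requirements": lambda other: other["Arch"] == "INTEL"
                                      and other["Memory"] >= 300,
    }

    machine_ad = {
        "Arch": "INTEL",
        "Memory": 1024,
        # Owner policy: only run jobs from users this owner trusts.
        "Requirements": lambda other: other["Owner"] in {"alice", "bob"},
    }

    def matches(a, b):
        """A match requires both sides' Requirements to hold."""
        return a["Requirements"](b) and b["Requirements"](a)

    print(matches(job_ad, machine_ad))  # True: both sides are satisfied

The symmetry is the point: the machine's owner gets a veto expressed in the same language the job uses, which is how Condor lets each resource keep its own policy.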
Flaws:
For once, I actually would have been interested in seeing more performance results than just the three case studies they mention. I suppose that's not really the focus of the paper, but it still would be nice to see how Condor stacks up against other distributed systems. Does it handle some workloads better than other systems do? What workloads does it do poorly on? How could it be improved to handle these limitations?
Application:
This paper showed quite clearly how useful Condor can be in the real world. Both its longevity and its widespread deployment speak strongly to its applicability. I think Condor's focus on heterogeneous networks makes it a very applicable system: with so many diverse computers in the world today, and more and more of them connected to the internet, the ability to take a wide range of them and use them together is a very powerful tool.
Posted by: Ben Farley | February 1, 2011 03:25 AM
Summary:
Condor is a workload management system targeting high-throughput computing by utilizing the idle resources of a large pool of distributed commodity machines. Its users can contribute resources and execute jobs on this platform. This paper introduces the history of the Condor project and several of its main components.
Problem:
High-throughput computing is in demand in both research and business. To avoid the high cost of buying and maintaining a large number of machines, Condor tries to provide a uniform framework for users to submit their compute-intensive jobs easily, safely, and flexibly by scavenging idle computing resources from a large sharing community.
Contributions:
1. The idea of using idle resources to save money and benefit the environment is very attractive at a high level.
2. Condor provides a very flexible environment for its users to contribute and consume shared resources. This directly makes users' lives easier and enlarges the Condor community.
3. It has a complete planning and scheduling framework that makes distributed computing efficient.
4. It provides a very friendly interface for its users, including the standard (UNIX) universe and the Java universe.
5. It also provides a certain degree of security.
Flaws:
1. It seems that Condor does not handle data-intensive workloads well. In particular, the matchmaker does not consider file or storage locality when deciding where to execute a job, so when the job begins to execute it may need to copy files from other machines over the network; indeed, there is no uniform file system in Condor. Compared with MapReduce's model of moving computation to the data, Condor focuses more on computing resources than on data locality (see the back-of-envelope sketch after this list).
2. For security, Condor assumes that a fair amount of trust exists between machine owners and users. In such a complex and large distributed system, less trust is better. In particular, one user may belong to multiple communities in Condor, so a malicious user who breaks out of the sandbox could cause large-scale damage.
3. Failures must happen very often in Condor's distributed environments, so failure handling should be a first-class citizen in Condor. However, there is little content about how failures are handled and how they affect the general design of Condor.
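To put a rough number on the data-locality concern in point 1, here is a back-of-envelope Python calculation with assumed (not measured) figures:

    # Assumed numbers, for illustration only: a 10 GiB input staged over
    # a 100 Mbit/s wide-area link before the job can begin computing.
    data_bytes = 10 * 1024**3          # 10 GiB of input data
    bandwidth = 100e6 / 8              # 100 Mbit/s expressed in bytes/s
    transfer_min = data_bytes / bandwidth / 60
    print(round(transfer_min, 1))      # ~14.3 minutes of pure transfer

Even under these generous assumptions, a data-intensive job pays many minutes of transfer before doing useful work, which a locality-aware scheduler avoids by moving the computation to the data instead.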
Applicability:
Condor is widely used in many organizations across the world: universities, national labs, and even companies. There is a dedicated team developing Condor and providing support for its users. For example, Condor manages the desktop machines in the Computer Sciences department at UW for the students. More importantly, in terms of research, Condor provides enormous computing resources for several departments here, such as biology, physics, and chemistry.
Posted by: Lanyue Lu | February 1, 2011 10:32 AM
Summary: Lamport presents a method for constructing a total ordering of events amongst processes, based on a partial ordering. The addition of tightly synchronized physical clocks provides a stricter ordering and avoids potentially anomalous behavior present in the total ordering.
Problem: The problem the paper seeks to address is the synchronization of events in a distributed system. Unlike in a single-threaded application, multiple actions can occur simultaneously in a distributed system. Coordinating events to provide some desired property, e.g. mutual exclusion, requires a method for synchronizing the many processes executing in the system. We typically think of events as occurring according to a physical notion of time, but synchronizing on physical time requires strictly conforming clocks across the system. Instead, we need a method to synchronize events based on their relative order. For example, a purchase confirmation should not be provided to a user until after the sale has been recorded. Solving the problem is even more challenging because message passing, which can incur noticeable delay, is the only means of establishing an ordering of events between two or more processes in the distributed system.
Contributions: Lamport's first contribution is a distributed algorithm for establishing a total ordering of events amongst multiple processes. The foundation of the algorithm is a partial ordering: event a comes before event b if 1) a and b are events in the same process and a occurs before b, or 2) a is the sending of a message by one process and b is the receiving of that message by another. By adding a logical clock to each process, every event can be assigned a logical time of occurrence. Events occurring within the same process are assigned increasing timestamps as they occur. A message sent between processes causes the receiving process to advance its clock beyond the sending time carried in the message, which is based on the sending process's clock. Ordering events by these timestamps, with ties between processes broken in some fixed way, yields a total ordering of all events. This total ordering can be used to serve a queue of requests, for example requests for mutually exclusive access to a resource.
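A minimal Python sketch of those clock rules, assuming a toy two-process exchange (illustrative only; (clock, pid) pairs use the process id to break ties, which is how the partial order is extended to a total order):

    class Process:
        """Toy Lamport logical clock: local events and message passing."""
        def __init__(self, pid):
            self.pid = pid
            self.clock = 0

        def local_event(self):
            self.clock += 1                  # tick on every event
            return (self.clock, self.pid)

        def send(self):
            self.clock += 1                  # sending is itself an event
            return self.clock                # timestamp travels with message

        def receive(self, msg_time):
            # advance past the sender's timestamp, then tick
            self.clock = max(self.clock, msg_time) + 1
            return (self.clock, self.pid)

    p, q = Process(1), Process(2)
    a = p.local_event()   # (1, 1)
    t = p.send()          # message stamped 2
    b = q.receive(t)      # (3, 2): q jumps past the sender's clock
    c = q.local_event()   # (4, 2)
    # Sorting (clock, pid) pairs yields a total order consistent with
    # the happens-before partial order.
    print(sorted([a, b, c]))  # [(1, 1), (3, 2), (4, 2)]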
The second contribution is a stricter total ordering enabled by the addition of tightly synchronized physical clocks. Physical clocks avoid the anomaly where a request issued by one process arrives and is served first even though it was generated after another process's request. This matters in the purchasing example identified above, where the first person to click the purchase button should receive the last concert ticket, not the person who submits an order second but has a faster connection. Lamport provides specific bounds on how inaccurately a clock can keep time and how far out of synchronization it can drift without causing any loss of ordering.
Flaws: No major flaws exist in Lamport's algorithm. However, one item to consider is the number of message exchanges required to keep systems synchronized. Events which require tight synchronization may spend significant time exchanging messages instead of doing actual work. Therefore, developers need to balance the need for ordering events against the overhead of communication in order to still reap the benefits of executing tasks in a parallel, distributed fashion.
Applicability: Synchronizing events through message exchange is important in real-world systems which rely on a sequence of events. This is easy to see in the purchasing example used in this review. However, synchronization is also important in some less obvious tasks. For example, a Google search does not require the search itself to happen in a synchronized fashion, but merging results from multiple clusters to present to the user requires some form of synchronization. File systems are another common example of the need for synchronization, and file systems which rely heavily on physical timestamps can experience problems when tight time synchronization is not present.
Posted by: Aaron Gember | February 7, 2011 08:24 PM