
Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Zaharia, Matei, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker and Ion Stoica. EuroSys 2010.

Reviews due Thursday, 4/19

Comments

The paper presents a scheduling algorithm for running jobs on data-intensive cluster computing systems such as Hadoop and Dryad. As these systems have become popular for computations over large data sets, they are increasingly shared by multiple users, for reasons such as improving utilization and consolidating data. The scheduler of such a system then faces a conflict between fairness in scheduling and data locality. The paper offers a solution to that problem with a technique called delay scheduling, which improves locality while keeping fairness: if the job that should be scheduled next cannot launch a local task, it is skipped for a short period. This simple technique is shown to suit cluster computing systems with programming models like MapReduce, and the experiments presented in the paper show significant improvements over a plain fair scheduler.

One of the advantages of the algorithm is that it strikes a nice balance between utilization, fairness and locality, all three of which are needed to maximize the performance of the system. By not scheduling the head-of-line job for a short amount of time, the scheduler temporarily violates the fairness condition. However, with a good choice of wait interval, fairness still holds in the long run, and the probability of launching a local task grows rapidly with each skip (the miss probability shrinks roughly exponentially). The nature of the jobs and of the system is the main reason this works: jobs consist of small tasks that complete in a small fraction of the whole job's running time, and the system has many machines, so slots free up frequently and the wait between scheduling opportunities is short. Thus the scheduler provides a good trade-off between fairness and locality without compromising utilization.
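To make the "exponential" intuition concrete: if each freed slot independently holds the job's data with probability p (a simplifying independence assumption, and the numbers below are illustrative rather than taken from the paper), the chance of still having no local slot after k skipped scheduling opportunities is (1 - p)^k. A tiny Python sketch:

    # Back-of-the-envelope: probability of STILL not having found a local slot
    # after skipping k scheduling opportunities, assuming each freed slot
    # independently has the job's data with probability p (illustrative value).
    p = 0.1
    for k in (1, 5, 10, 20):
        print(k, round((1 - p) ** k, 3))
    # -> 0.9, 0.59, 0.349, 0.122: the miss probability shrinks geometrically,
    #    so even a short wait buys a lot of locality.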

In addition, the Hadoop Fair Scheduler (HFS) has features that are useful for multiple users with different needs. One good idea is having two levels of scheduling policies: a global policy across pools, and an internal policy customizable by each user. The per-pool settings let each user specify a minimum number of slots and the internal scheduling discipline, so the user can pick the internal scheduler that best matches the type of job.
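A toy sketch of that two-level structure in Python (the field and policy names here are my own, not HFS's actual configuration keys):

    # Two-level scheduling, sketched: a global policy divides slots among pools
    # (respecting each pool's minimum share), and each pool orders its own jobs
    # with a policy of its choice, e.g. FIFO for batch pipelines, fair sharing
    # for ad-hoc users.
    from dataclasses import dataclass, field

    @dataclass
    class Pool:
        name: str
        min_slots: int                 # guaranteed share for this pool
        policy: str = "fair"           # "fair" or "fifo", chosen per user/pool
        jobs: list = field(default_factory=list)

        def next_job(self):
            if self.policy == "fifo":
                return min(self.jobs, key=lambda j: j.submit_time, default=None)
            return min(self.jobs, key=lambda j: j.running_tasks, default=None)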

As mentioned in the discussion section, the basic idea of the scheduler can be applied to scheduling preferences other than data locality. Research has shown that a job's performance can vary with the architecture of the machine it runs on and with the tasks sharing that machine. So instead of checking only locality, the scheduler could use a performance-maximizing function: a task would be skipped for a short time if the performance function, evaluated on the task and the candidate machine, does not meet a minimum value. However, we need to be careful not to add too much computational overhead.
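As a sketch of how that might look, here is a hypothetical variant of the scheduling loop where the locality test is replaced by a user-supplied score function (all names here are mine, not the paper's, and the score is assumed cheap to evaluate to keep the overhead low):

    # Hypothetical generalization of delay scheduling: skip a job briefly unless
    # some (task, node) pairing scores above a minimum on a user-supplied
    # performance function (e.g. one that accounts for machine architecture or
    # co-located tasks). `score` must be cheap, since it runs on every offer.
    def assign_task_generalized(node, jobs, score, min_score, wait_threshold, now):
        for job in sorted(jobs, key=lambda j: j.running_tasks):
            if not job.pending_tasks:
                continue
            best = max(job.pending_tasks, key=lambda t: score(t, node))
            if score(best, node) >= min_score:
                job.skip_start = None          # reset the wait clock
                return job, best
            if job.skip_start is None:
                job.skip_start = now           # start waiting for a better match
            elif now - job.skip_start >= wait_threshold:
                job.skip_start = None
                return job, best               # accept a sub-optimal placement
        return None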

This paper is about a technique for scheduling jobs in map-reduce-like clusters. The authors specifically focus on developing a scheduler for Hadoop.

The two main goals of the authors in developing their scheduler are fair sharing and locality. Fair sharing means that each job gets a roughly proportional share of the cluster's task slots. Locality means that tasks tend to be scheduled on nodes that hold their input data.

Under the naive fair sharing algorithm, jobs are scheduled in ascending order of the number of running tasks. Thus, when a node requests a new task, the job with the fewest tasks currently running is selected. The two problems with this algorithm are head-of-line scheduling and sticky slots. Head-of-line scheduling is the problem that, once a job reaches the head of the line, its tasks launch on whatever slot becomes available next, regardless of data locality. The sticky slots problem is that, if a job is next in line and one of its tasks finishes, the job tends to get back the same slot it just released. Since a job's data is typically spread across the cluster, that slot is unlikely to hold the data needed by the next task.

The authors propose to solve these problems with delay scheduling. The core idea is that if a job comes up for scheduling and none of its unlaunched tasks have data on the offered node, the job waits up to some number of seconds D to see whether a slot becomes free on a node that does hold one of its tasks' data. If D seconds elapse, it goes ahead and schedules on the next available node. The authors also introduce refinements, like multiple levels of locality, to gain further performance improvements.
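A minimal Python sketch of that core loop (my own rendering of the idea, with invented helper names like pending_task_with_data_on; I use a wall-clock threshold here, though one could equally count skipped scheduling opportunities):

    import time

    # Single-level delay scheduling, sketched: jobs are considered in fair-share
    # order (fewest running tasks first); a job with no node-local work is
    # skipped, but only for up to wait_threshold seconds, after which it may
    # launch a non-local task.
    def assign_task(node, jobs, wait_threshold):
        for job in sorted(jobs, key=lambda j: j.running_tasks):
            local = job.pending_task_with_data_on(node)   # assumed helper
            if local is not None:
                job.skip_start = None                     # reset the wait clock
                return job, local
            if job.skip_start is None:
                job.skip_start = time.time()              # start waiting
            elif time.time() - job.skip_start >= wait_threshold:
                job.skip_start = None
                return job, job.any_pending_task()        # give up on locality
            # otherwise keep skipping this job and consider the next one
        return None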

One flaw with this paper is that the authors fail to make a convincing case that data locality is worth all this effort in their environment. I was somewhat surprised to see that delay scheduling did not gain that much performance over fair scheduling alone. Part of this may be because the authors used a delay of 5 seconds for their benchmarks. Given that this delay gets them almost 100% locality, I suspect it was too high. For example, in their IO-heavy workload, bins 1-8 almost always take less than 50 seconds, so a 5-second delay could add 10%. (Of course, just because there is a 5-second budget doesn't mean all 5 seconds are typically used. It would have been nice to see distributions of how much delay was actually incurred per task.)

I am also a little upset that I toiled through so much math only to see it largely go to waste. Much of the mathematical reasoning seemed to rest on questionable assumptions about the independence of certain classes of events. Equation (2), the most interesting equation in the paper, isn't really used, except that the authors expect administrators to use it to calculate D. I think that expectation is generous. I don't understand why the software couldn't just take desired locality as a configuration parameter and calculate D online from its knowledge of the system at that moment.
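For instance, under the same kind of independence assumption the paper's analysis already makes, the scheduler could invert the miss-probability relation on its own; a hypothetical sketch (this is my suggestion, not how HFS actually works):

    import math

    def skip_budget(target_locality, p_local):
        # Smallest D such that 1 - (1 - p_local)**D >= target_locality, where
        # p_local is the (measured) chance that a freed slot is local for this job.
        if p_local >= 1.0 or target_locality <= 0.0:
            return 0
        if p_local <= 0.0 or target_locality >= 1.0:
            return float("inf")        # waiting alone cannot get there
        return math.ceil(math.log(1.0 - target_locality) / math.log(1.0 - p_local))

    # e.g. skip_budget(0.95, 0.1) -> 29 scheduling opportunities for ~95% locality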

This paper presents a simple but efficient algorithm for scheduling tasks in clusters and evaluates it comprehensively in terms of running time, locality and speedup. The algorithm is based on a simple observation: tasks can usually get better data locality if they wait a moment before being scheduled, so the running time of all tasks is reduced thanks to less IO time.

To make the most of a cluster's computing capacity, effective scheduling is essential. The traditional factors to consider are fairness and utilization, usually meaning fairness and utilization of CPU. Though these are important, considering only CPU-level factors does not give effective scheduling, because other factors, like disk IO and network communication, also significantly affect performance. This paper specifically aims at improving data locality to reduce network communication.

Delay scheduling is a smart and simple idea, and it is quite intuitive to see why it works. With the goal of increasing data locality in mind, if a task's preferred nodes are currently full, we have little choice but to wait for one of them to become idle again. The longer a task waits, the more likely high locality can be achieved; on the other hand, the longer it waits, the more latency it incurs. So there is a trade-off, and finding it and handling it properly is the essence of delay scheduling. Another characteristic of delay scheduling is exploiting hierarchical locality, mirroring the hierarchical structure of the cluster: rack locality is not as effective as node (disk) locality, but it is easier to achieve and covers more cases.

The evaluation in this paper is also very impressive. There are many different experiments measuring the algorithm, with genuinely interesting results shown in figures and tables. One particularly interesting finding is that delay scheduling is much better than FIFO scheduling when task sizes are small, but worse when they are large. One possible explanation is that when tasks are large, the expected waiting time for a free slot also becomes large, so the benefit of locality is overwhelmed by the cost of waiting.

I cannot figure out why delay scheduling achieves much better results for the CPU-heavy workload when the task size is small. The paper says it is because the cluster is more heavily loaded and because delay scheduling has a negligible effect. For the first reason, I don't really see its direct connection to the speedup. The second reason would apply equally to the IO-heavy case, yet we don't see that much speedup there. If anything, I was expecting the speedup to be smaller for the CPU-heavy workload: delay scheduling by nature exploits data locality to reduce IO time, so there should be less to gain when tasks are CPU-heavy and the bottleneck is no longer IO.

One flaw of the paper is that it does not include a solid discussion of parameter tuning. It has some mathematical analysis of the scheduling algorithm, but the analysis involves too many assumptions to be useful for tuning in practice. Also, in the evaluation section, I was expecting some experiments on how the wait threshold affects the results.

In all, delay scheduling is a smart idea, and its deployment at Facebook and Yahoo! shows that it really works in the real world.

Summary:
This paper is just one in a series of Hadoop/cluster papers from the Berkeley RAD Lab or, more accurately, from Matei. :) The delay scheduling idea is presented in the context of the Hadoop Fair Scheduler. The basic idea of delay scheduling is to trade a little fairness for locality/performance: when the job that would normally be picked cannot launch a local map task on a node, it waits instead. They demonstrate that by making the job with the fewest running tasks wait a small amount of time for a node that meets its locality requirements, it is possible to achieve higher throughput while maintaining fairness. This idea applies only to map tasks. Delay scheduling can also be done at multiple levels (and actually is, in the Apache Hadoop implementation): if node locality cannot be met after waiting for some time, rack locality is also allowed; furthermore, the map task can be placed on any node if rack locality is still impossible after an additional wait.
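A rough Python rendering of that multi-level relaxation (thresholds and helper names are illustrative, not taken from the Hadoop source):

    # Multi-level delay scheduling, sketched: relax the locality requirement one
    # level at a time (node -> rack -> anywhere) as a job's wait grows past
    # successive thresholds.
    LEVELS = ("node", "rack", "any")

    def allowed_level(waited, node_wait, rack_wait):
        if waited < node_wait:
            return "node"
        if waited < node_wait + rack_wait:
            return "rack"
        return "any"

    def pick_task(job, node, waited, node_wait=5.0, rack_wait=5.0):
        level = allowed_level(waited, node_wait, rack_wait)
        for candidate in LEVELS[:LEVELS.index(level) + 1]:
            task = job.pending_task_at(node, candidate)   # assumed helper
            if task is not None:
                return task
        return None                                       # keep waiting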

Delay scheduling is based on the assumptions that most tasks are short compared to jobs and that there are multiple locations (map slots on several nodes, in the MapReduce context) in which a task can run to read a given data block.

Extensive experiments (with CPU-, I/O- and communication-intensive applications) are conducted to show that a small amount of waiting can bring significant improvement.
--------------------------------------------------------------------------------------------------------------------
Motivation:
As a Hadoop cluster is shared and multiplexed among many users, there is a conflict between fairness and performance/locality. With the original FIFO scheduling, small jobs may wait a long time behind big jobs and thus are treated unfairly. Even with naive fair sharing, if we enforce fairness strictly, we still have the head-of-line and sticky-slots problems, since we always pick the job with the fewest running tasks. In that case, the launched map task may be forced to fetch data from a remote node, degrading performance.
--------------------------------------------------------------------------------------------------------------------
Contributions:
- It's not just an idea or a paper. It runs in real production clusters! HFS is shipped with the Apache Hadoop distribution and is a widely used scheduler. You can find delay scheduling there in the assignTasks function!

- Delay scheduling is a nice idea that can be applied generally to schedulers and frameworks other than HFS and MapReduce. For example, the authors also implemented delay scheduling in a hierarchical scheduler, Nexus. They have also been applying it to Mesos, a cluster manager they developed recently that supports multiple frameworks such as MapReduce, MPI and Spark.
--------------------------------------------------------------------------------------------------------------------
Flaws:
- Reduce tasks in a job are often launched after a few map tasks complete, and a reduce task cannot proceed until it has collected the intermediate output of all map tasks. Delay scheduling of map tasks might therefore make some reduce tasks wait longer for their corresponding maps, while holding resources on their nodes that could otherwise have been freed earlier and used by other tasks.

- Locality is an important aspect, but the improvement from locality might be limited. For example, it is well known that the shuffle in the reduce phase is usually the bottleneck of the whole job, and another paper, Orchestra (SIGCOMM '11), from the same lab (RAD Lab at Berkeley), pointed out that shuffle may account for a third of total completion time. Furthermore, for applications other than MapReduce, they found the network may still be the bottleneck in inter-task communication; e.g., some machine learning jobs are iterative and need a broadcast after each round of computation. I'm just curious how much benefit we can get from locality compared to the network part. (I admit locality also helps reduce network traffic, so by "network" here I mean traffic that can hardly be reduced by locality, such as shuffle and broadcast.)
--------------------------------------------------------------------------------------------------------------------
Applicability:
As mentioned above, the shuffle phase in reduce is usually the bottleneck. Although there is no data locality for reduce, the "waiting" idea might be adapted for reduce tasks. Specifically, we want to avoid network congestion, since shuffle is communication-intensive. So instead of picking a node for a reduce task at random, we might delay and schedule the reduce task on a different node where it could use a different set of routes through the cluster and bypass congested links. (For example, we know where the map tasks will be based on their input split locations, because with delay scheduling of map tasks they will, with high probability, be placed on the same nodes as their splits, so we could estimate the routes/links used if we place a reduce task on a particular node.)
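A very rough sketch of that idea (entirely hypothetical; the congestion estimator and thresholds are invented for illustration, not something the paper implements):

    # Hypothetical "delay scheduling for reducers": estimate how congested the
    # links between this node and the job's map outputs are, and skip the offer
    # for a bounded time if the estimate is too high.
    def place_reduce(node, job, congestion_estimate, max_congestion,
                     waited, wait_threshold):
        cost = congestion_estimate(node, job.map_locations)   # assumed: float in [0, 1]
        if cost <= max_congestion or waited >= wait_threshold:
            return True      # launch the reduce task here
        return False         # skip this offer; try another node or wait longer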

Also, as 10Gbps Ethernet and SSDs become popular, it will be interesting to see whether locality still matters that much with higher disk I/O and network bandwidth, or which part of the cluster becomes the new bottleneck.
