
MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters. 
Jeffrey Dean and Sanjay Ghemawat, 
OSDI'04

Reviews due Thursday, 2/12.

Comments

Summary of the paper:
This paper presents MapReduce, a novel distributed computation paradigm in which developers cast a computational problem in terms of two components, Map and Reduce. Programmers write just one function for each component. These two functions are then run in massively parallel fashion: mappers run the map function, and their output is fed to reducers, which run the reduce function. Mappers are scheduled for data locality, moving computation to where the data is stored in order to minimize network communication. The map phase essentially performs some data-parallel operation, while the reduce phase aggregates results from the map phase to produce the final output.
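As a rough illustration, the whole paradigm fits in a few lines of Python. The word-count map and reduce functions follow the paper's canonical example, but `run_mapreduce` is a hypothetical single-process stand-in for the distributed runtime, not the paper's C++ implementation:

```python
from collections import defaultdict

# User-supplied functions for word count, mirroring the paper's example.
def map_fn(doc_name, contents):
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: run map_fn over every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: group values by intermediate key, then reduce.
    output = {}
    for ikey, ivalues in intermediate.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            output[okey] = ovalue
    return output

result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```

In the real system the `intermediate` grouping is the distributed shuffle between map and reduce workers; here it is just a dictionary.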

Problem this paper tries to solve:
MapReduce aims to provide a framework that allows programmers to process huge amounts of data over a distributed system consisting of commodity machines. The challenges of such a framework include how to achieve parallelization, fault tolerance, and load balancing in a simple manner with less powerful commodity machines instead of a few centralized, powerful supercomputers.

Contribution of this paper:
The major contribution of this paper is the MapReduce programming paradigm itself, especially its simplicity and high level of abstraction. Such a simple, limited programming model can accommodate a wide variety of tasks, and almost all of the complexity of running code on thousands of machines is abstracted away from the programmer. In addition, its transparent fault tolerance mechanism is impressive. A single master manages all the workers, and failure of a worker is handled transparently by restarting its tasks. This is possible because the output from the map stage is durably written to disk storage and then read by the reducers. Mapper input, of course, is durably stored as well, so map tasks can also be easily restarted.

Flaws of this paper:
Since a single master manages all the workers in a cluster, the master may become a single point of failure. The master is in charge of assigning resources to workers as well as tracking their progress; it may therefore become overloaded and cause high latency.
In addition, although the MapReduce model is designed to be as simple as possible (just map and reduce functions), not all applications can work, or work well, under such a framework.
Finally, this paper provides little performance comparison between MapReduce and other related parallel computation frameworks, so the benefit and efficiency of MapReduce over other frameworks are not clear to me.

Application:
MapReduce is now widely used in the real world, and many applications (especially large-dataset processing) work well on the framework. Its most attractive points are its simplicity and low cost. Its simple programming abstraction makes programmers working on large-dataset processing more efficient.

Summary:
The paper presents MapReduce, a simple yet powerful programming interface that enables automatic parallelization and distribution of large-scale computations. It can be deployed on a large cluster of commodity PCs to achieve high performance. All the magic here is that the user provides just two functions -- map and reduce -- and the MapReduce framework takes care of the rest.

Problem:
The motivation of MapReduce is to provide an automatic way to parallelize execution on a large cluster of commodity machines. As small commodity PCs became popular, it was no longer a good idea to run applications with high resource requirements on a few supercomputers. But these small commodity PCs come with less capacity and weaker capability, so it is desirable to find a way to make computation fault tolerant and make such a large cluster easy to program.

Contributions:
+ MapReduce is a general and easy-to-use programming paradigm for distributed computing, in particular for data-intensive applications. Developers only need to write their own map and reduce functions and they are done!

+ MapReduce is a flexible framework. There are no dependencies between different map tasks or reduce tasks, so it is easy to re-run failed tasks or even run speculative tasks at the same time without incurring much overhead.

Flaws:
- MapReduce is certainly the most popular distributed computing framework so far, but I do think some applications may not fit this architecture. Compared to Dryad, which models a job as an arbitrary task graph, MapReduce seems to be a special case: a bipartite graph consisting of just two stages of tasks -- map and reduce.

- As the cluster scales up, the master could easily become the bottleneck and get overloaded. The functionality of the master includes at least two parts: assigning cluster resources to tasks and keeping track of task progress. For example, in the Apache Hadoop implementation, the workers (TaskTrackers in Hadoop) need to periodically exchange resource and task information with the master (JobTracker in Hadoop) via heartbeats, and the number of heartbeats the master can process per second is limited. So it is easy to imagine workers getting delayed responses from the master in a large-scale cluster.

- The generality may come at the cost of performance. For iterative machine learning jobs, for example, it has been shown that the general MapReduce framework does not work well, and an optimized system for this kind of application has emerged -- Spark (from Berkeley).

Applicability:
Although the basic framework of MapReduce remains the same, more and more optimized variants and distributions of MapReduce are still emerging, e.g., the Apache Hadoop distribution backed by Yahoo, and another distribution by Cloudera. Yahoo even has a spin-off now dedicated to developing the next-generation Hadoop, which basically re-architects the implementation, splitting cluster resource management and task runtime monitoring into separate modules to relieve the load on the master. And as flash technology matures, researchers have begun to think about holding data in memory for optimization. Furthermore, in some cases people may even choose to run on a single machine rather than use MapReduce. For example, a friend of mine in the database group in our department said they have now turned to a single-machine solution, because it is feasible to add tens of GB of memory to a single machine. They can now hold the whole data set in memory and no longer use MapReduce, due to its high communication overhead. In fact, the main reason they used MapReduce before was that they could not hold such a large data set on any single node.

MapReduce is a tool that helps developers write parallel data processing code by abstracting away a lot of the coordination issues. It cannot be used for every such project, but for problems that fit the model of a function applied to local data whose results need to be aggregated, the system is a popular option.

One thing most of us should have noticed is the existence of a single master to solve all the coordination, consensus, and assignment problems involved in setting up a parallel algorithm. By backing up local state at another node, the designers diminish the "single point of failure" concern. Fortunately, the amount of direct network access the workers need to the master (mostly location look-ups) is much smaller than the amount of access required to communicate with each other, so the network bottleneck effect is not as significant. Local memory at the master still provides a hard ceiling that the rest of the system cannot compensate for.

Scaling can be done by increasing the number of machines involved in a given computation. More interesting is the way load balancing is done by making each task smaller, splitting the original data set into smaller pieces. Small tasks can be reassigned to an idle machine if the machine originally running them appears to be overloaded. So long as there are no issues with multiple conflicting copies of the same output, this is a great idea. The designers solve this by storing output locally and trusting the file system to keep access to it atomic, so either all the data for a particular key is transferred and saved, or none of it is. Happy side effects include greater reliability, since failed components won't cause the whole system to fail, and quicker running time, since the system is not constrained to wait on any single slow machine.

The strength of MapReduce is also a weakness. Its simple model is easy to use but also constrained, and doesn't provide all the coordination operations one might want. As the map functions never speak to each other, it would take multiple iterations to do any operation that can't be localized to the particular data chunk sizes allowed. The authors built in a query system to collect stats and status, but it sounds as if debugging is still hard.

I think the system is very good at the narrow range of problems it targets. There may be issues with scaling if problem sizes increase. There was no testing in the paper of how bad performance gets if the input data must be accessed remotely or if the machines involved in the computation have to be split over more than one data center.

This paper presents a novel programming model called MapReduce that provides a simple abstraction for doing distributed computing over a large cluster.

Designing error-free, robust, and efficient parallel programs has long been considered a hard and time-consuming task compared to writing sequential programs. Efficiency is one major incentive behind parallelizing a problem (better performance than the sequential counterpart). It should be noted that the method of parallelizing a sequential problem is highly dependent on the problem at hand. This paper tries to provide an abstraction that can be used to solve a huge class of data-parallel problems. One major goal of the authors was to make this abstraction easy to understand and program, hiding all the gory details of parallelization in the runtime. At Google, they had huge data-parallel problems that take in terabytes of input, and they wanted to parallelize these problems over a large cluster of commodity machines.

The major contribution is the MapReduce programming model. In this model, a problem is split into two phases, a Map and a Reduce phase. The map phase generates an intermediate set of key-value (kv) pairs from the input files, and the reduce phase operates on this intermediate result to generate a final output file; the latter mostly involves merging the set of intermediate kv pairs into fewer kv pairs. The authors provide many examples that help in understanding this model. Another major contribution is the implementation of the runtime for the MapReduce programming model. The implementation varies based on the scale of the problems to be solved. The authors were looking at large data-parallel problems that take in thousands of terabytes of input and produce a few hundred terabytes of output, with a computing environment of a large number of commodity PCs connected via (100 Mbps / 1 Gbps) Ethernet. The implementation uses a master-worker architecture to transfer data and control between a set of worker machines. The master assigns the map tasks (each with its own piece of the input file) to a set of workers. After the completion of the map tasks, control is transferred to another set of workers to do the reduce tasks. Fault tolerance is achieved by re-executing the task that a failed worker was responsible for; re-execution is possible by discarding any intermediate and final outputs that might have been generated. A master failure requires a restart of the whole MapReduce job. The authors also suggest many extensions to optimize this basic implementation of the MapReduce interface.


Though the programming model seems to greatly help in parallelizing many types of data-parallel problems, certain design decisions in the runtime implementation, and their trade-offs, are not clear from the paper. For example, it is not clear why a reduce task needs to run on a separate machine instead of reusing some of the workers that completed map tasks. Reduce tasks cannot start until at least one map task completes (after which that worker is unused). Though it is true that one map task's output can feed more than one reduce task, that doesn't require all the reduce tasks to be on different machines (for example, reusing the same worker machine used by a map task could avoid a fraction of the data transfer cost). Also, on a master failure, rare though it may be, it doesn't make sense to restart the whole MapReduce operation when some or most of the map and/or reduce tasks have already completed (involving many terabytes of input and burned compute cycles). Master failures could have been handled better; considering the large number of worker machines involved (157 on average), the cost of restarting the whole operation for one master failure doesn't make sense.

The MapReduce framework provides a common interface for programming many types of data-parallel problems. This reduces the turnaround time for parallelizing such problems, but involves a trade-off between the cost of implementing a large MapReduce runtime and the number of problems a company wants to parallelize. It made sense for Google, which had many problems (4,000+) that could be parallelized easily using a single implementation of MapReduce.

This paper is about MapReduce, a system developed at Google for distributed computing. The programming model exposed by MapReduce is based on a well-known paradigm from functional programming. In MapReduce, the user supplies a map function and a reduce function. Data is partitioned and delivered as input to multiple instances of the map function as key/value pairs. The map function instances then output a set of intermediate key-value pairs. The intermediate pairs are delivered to the user-supplied reduce function such that all intermediate key-value pairs with the same key are delivered to the same reduce function instance. The reduce function outputs a list of values.

MapReduce provides a number of optimizations and additional features. The master can detect stragglers (workers that are taking too long) and launch redundant tasks on other workers in response. It can also detect worker failure and relaunch tasks appropriately. Users can supply their own partitioning function if needed. MapReduce guarantees that within each partition, intermediate key-value pairs are processed by reduce tasks in ascending key order. MapReduce allows the user to supply a combiner function, a sort of intermediate reduce function that can be used to prevent a reduce function instance from getting much larger amounts of data than other instances. MapReduce can detect probable "bad records" -- i.e., records that cause the user functions to crash -- and can skip over them. The authors supply an alternative implementation of the MapReduce library that runs everything locally for use in development, testing, and debugging.
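A minimal sketch of the combiner idea (function names here are hypothetical, not the paper's C++ API): the combiner is applied to one mapper's local output before the shuffle, so only one pair per key crosses the network.

```python
from collections import defaultdict

def combine_locally(map_output, combiner):
    # Group this one mapper's output by key, then apply the combiner
    # so only one pair per key is sent to the reducers.
    grouped = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)
    combined = []
    for key, values in grouped.items():
        combined.extend(combiner(key, values))
    return combined

# For word count, the combiner is the same code as the reducer: summation.
def word_count_combiner(word, counts):
    yield (word, sum(counts))

local = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine_locally(local, word_count_combiner))
# four pairs shrink to two: ("the", 3) and ("cat", 1)
```

This only works when the reduce function is associative and commutative, which is why the combiner is an optional user-supplied hook rather than something the runtime applies automatically.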

The only major problem with this paper that I could find is that the authors do not compare the performance of their MapReduce programs to the equivalent single-machine implementations. In parallel programming, this is very important in showing the benefits of a parallel version of an algorithm. Although it is certainly clear, from the descriptions of their implementations, that the MapReduce version would be much faster than the serial version of grep and sort, the authors do not give speedup or efficiency results, so we do not know, for example, how the performance of MapReduce scales with the number of nodes available. This information is extremely important. The ideal for data parallel applications is linear speedup, but we don't know how close MapReduce gets to this goal.

Regarding the system itself, I do not have much to say. Most of my objections to it involve things that could be easily added. For example, for some applications, it might be nice to allow map instances to communicate with one another. Under the current implementation, it is possible for programmers to implement this themselves, but it would be useful if MapReduce supplied a lookup service for tasks.

This paper describes MapReduce, a programming model for processing large data sets, and its implementation. The model transforms data in two steps -- map and reduce. The map function takes a key/value pair and produces intermediate key/value pairs. The reduce function takes an intermediate key and all of its associated values and merges the values to generate a value or a list of values for the key. MapReduce abstracts out the complexity of parallelizing the computation, distributing the data, and handling failures, so the users do not need to worry about those. The implementation section discusses Google’s optimizations such as improving locality and assigning backup tasks. Although some of the optimizations are discussed in the context of other Google systems such as GFS, the general ideas could be applied in other implementations. The experiments further demonstrate that MapReduce is scalable and can perform computations on terabytes of data in a matter of minutes by operating on thousands of machines.

The main contribution of the paper could be the programming model of using map and reduce functions. It allows the programmer to focus on what he wants to achieve with a large set of data, without the need to worry about underlying details of the system such as parallelizing the computations and handling server failures. The model is also powerful enough to handle a variety of computations we could possibly want to perform on large data, as described in the examples in the paper. I think the model could be used for computations that do not require global knowledge of the data.

Google’s implementation of the model scales up to thousands of machines. One advantage of this scalability is that computation time is no longer the main constraint for most jobs. A system with thousands of machines has enough resources to finish most jobs in a relatively short amount of time, so repeating small parts of computations to improve other aspects of the system becomes possible. We can see this in the implementation discussed in the paper, where duplicate executions are used to simplify failure handling and to improve overall computation time.

Similar to GFS, this implementation also has only one master. This has some advantages, such as having a global view and simplifying the design. However, it could reduce availability: the computation is aborted in case of a master failure, and the client needs to restart the computation from the beginning. This probably is not a big concern for the workload intended by Google, because an average job takes only about 10 minutes. However, there should be some mechanism to recover and continue from partial computation for jobs that take hours or days.

To conclude, Google’s MapReduce programming model could be useful for many different computations of large data sets. Its implementation also provides some useful tips for other implementations of MapReduce.

MapReduce is designed as a simpler method for processing large amounts of data that can easily be run across a large number of machines. It consists of two phases: a map phase, which processes the input data, and a reduce phase, which outputs results from the computation.

MapReduce was built by Google to provide a simple method for processing data in a number of different tasks. It needed to be simple to use while not sacrificing too much performance. The goal was to provide parallelism without requiring a user of MapReduce to be responsible for the details of this parallelism.

MapReduce consists of many different nodes, one of which is the master. The master assigns out map and reduce tasks to the different nodes. The first phase of MapReduce is the map phase, which emits pairs consisting of a key and a value. The pairs from different nodes in the map phase are then separated by key and provided to the reduce phase. The reduce phase produces some sort of output per key, usually a single value. A user of MapReduce writes two components for a specific task, one for the map phase and one for the reduce phase. These are then loaded and used on the different nodes. This design has been shown to be useful for a large number of different tasks at Google.

Node failures are handled by giving the task that was assigned to the failed node to a different node and canceling it on the node that failed. Another issue is the reliance on all tasks being completed before MapReduce can continue: if a single piece is delayed, MapReduce may have to stall until that single node completes. One solution used is to schedule duplicates of the last few remaining tasks onto multiple machines and use the result that comes back first. In terms of performance, the transfer of data from one node to another during the different phases can slow down tasks. Some sense of data locality has been added to try to assign tasks to nodes that are likely to have the data, or to be near a node that does.
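The duplicate-task trick can be sketched as follows; `slow_task` and the delay values are made up purely for illustration, with the "primary" copy playing the straggler:

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# A stand-in for a map or reduce task; delay models machine speed.
def slow_task(delay, result):
    time.sleep(delay)
    return result

with ThreadPoolExecutor(max_workers=2) as pool:
    # The original copy is stuck on a slow machine...
    primary = pool.submit(slow_task, 0.6, "primary copy")
    # ...so the master launches a backup on a fast one.
    backup = pool.submit(slow_task, 0.05, "backup copy")
    # Take whichever copy finishes first, as the master does.
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    winner = done.pop().result()
    print(winner)  # the faster duplicate wins: "backup copy"
```

Because map and reduce tasks are deterministic and side-effect-free (outputs are committed atomically), taking either copy's result is safe.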

The paper discusses some improvements made to decrease the amount of data that must be transferred between nodes, but does not discuss as much about the general trade-offs in MapReduce that lead to this large amount of data transfer in the first place. It would have also been nice to learn more about what properties make a specific task a good candidate for running on MapReduce in terms of performance.

MapReduce can be used for many different parallel data computations in systems as many of the examples in the paper show. It is especially useful in cases where a generic design is needed and different tasks can be easily added in without having to spend a large amount of time making sure that the parallelism works properly for any specific task. Whether performance will be acceptable for a certain task is an important question. MapReduce is also good in situations where the number of nodes may be frequently changing.

The paper discusses MapReduce, which provides an abstraction for automatically parallelizing and distributing tasks on large data sets. Tasks executed on the MapReduce model consist of a Map function, which maps the input key-value pairs into a set of intermediate key-value pairs, and a Reduce function, which groups all the values belonging to the same intermediate key and produces the output key-value set by merging them. The MapReduce framework accepts the Map and Reduce functions and the input key-value pairs, and generates the output key-value pairs by transparently parallelizing execution across a large number of worker machines.

The MapReduce architecture consists of a master and workers performing the map or reduce functions. The master takes care of scheduling idle workers, handling failures, and coordinating the map and reduce workers. First, the input is split into chunks of 16 MB to 64 MB, each of which is processed by a worker assigned a map task. The map function is executed by workers that buffer the output and write it to a partitioned space on local disk. The workers that execute the reduce function receive the location and size of the partitions from the master and sort the keys. Once sorted, they execute the reduce function for each key and its list of values, producing the output key-value pairs, which are written to the global file system.
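The partitioned space mentioned above is chosen by a partitioning function; the paper's default is hash(key) mod R. A sketch (md5 stands in for a stable hash, since Python's built-in hash() is salted per process):

```python
import hashlib

def partition(key, R):
    # Route an intermediate key to one of R reduce tasks,
    # mimicking the paper's default hash(key) mod R.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % R

R = 4
buckets = {}
for word in ["apple", "banana", "apple", "cherry"]:
    buckets.setdefault(partition(word, R), []).append(word)

# Every occurrence of the same key lands in the same bucket,
# so a single reduce task sees all values for that key.
assert partition("apple", R) == partition("apple", R)
```

This determinism is what lets reduce workers fetch exactly their own partition from every mapper's local disk.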

Some issues with the paper are:
1) For the word count example, the paper says that the map function emits each word with a 1 and appends it to the output, from which the combiner reads and produces intermediate results. Wouldn't it be advantageous to move the combiner's work into the map function, so that the map worker maintains a hash table in memory and accumulates the word counts until memory runs low, and only then writes intermediate results to disk? Wouldn't this save at least the time spent writing all the records to disk, plus the time the first level of combiner spends reading and processing them again?

2) The paper argues that checkpointing the master is not needed, as there is just one master and hence failure is unlikely. I do not agree with this argument, which sounds as if no steps need to be taken even for a single point of failure. The data sets are usually large and take a long time to compute, and a master failure in the final stages of the computation would then require a redo of the entire computation from the beginning. Checkpointing the master seems to be a reasonable overhead considering the cost of a restart.

3) In many examples the reduce function seems to be a mere identity function that copies the contents from the local disks to the global file system. This hurts performance and is not addressed by the paper.
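Returning to issue (1), the in-memory accumulation suggested there can be sketched as follows. This is a hypothetical illustration, not the paper's implementation; `max_entries` is an artificial stand-in for a real memory budget:

```python
def map_with_in_memory_combining(words, max_entries=2):
    # The mapper accumulates partial counts in a hash table and
    # "spills" it only when the memory budget is exceeded.
    counts = {}
    spills = []  # each spill models one batch written to local disk
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        if len(counts) > max_entries:
            spills.append(counts)
            counts = {}
    if counts:
        spills.append(counts)
    return spills

spills = map_with_in_memory_combining(["a", "a", "b", "c", "a"])
# Fewer records hit disk than the five ("word", 1) pairs a plain
# mapper would emit; the reducer simply sums the partial counts.
```

The trade-off is that the map function is no longer a pure per-record transformation, which complicates the skip-bad-records and re-execution machinery.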

MapReduce is not radically different from the traditional techniques of parallelism. However, in my opinion, the key features of MapReduce that have made it so popular are: (1) higher throughput through parallelization, which is always necessary with large data sets; (2) abstraction -- parallelizing a task for better throughput is not a new idea, but doing it transparently, with the system handling all the scheduling and failures, takes a huge burden off the programmer, and this is probably the key contribution of the paper; (3) code simplification -- hiding all the complexity involved in managing such large systems and requiring the user to specify only the map and reduce functions makes code easy to write, maintainable, and easy to debug.


This paper discusses the concept of MapReduce and how it allows for easy building and deploying of a distributed system application. MapReduce is an application framework for problems that can be cast into a key/value space: the application takes a key/value input and produces intermediate key/value pairs (the map phase), then all of the aggregated key/value pairs are processed by a reduce operation to generate the output. A small example usage given in the paper is counting the occurrences of words in a document: each map task processes a small piece of the document, and the counts from each piece are then added up to get a full document word count.

There are three major components to a MapReduce-style system.
-A scheduler (or manager), which handles startup of the master node and other administrative tasks.
-A master node, which tracks the progress of the MapReduce job and controls the spawning of both map and reduce workers.
-Worker nodes, which perform the computation of either the map or the reduce section.

When a user decides to run a job in MapReduce the following steps occur to launch a job and run the job on worker nodes.
-Input data is split into small sections (< 64 MB).
-The master node is launched to control job flow.
-The scheduler writes data out to the physical servers that will be assigned to worker nodes by the master (this data is laid out in a smart manner to ensure that the data needed by a worker is close by).
-Periodically, data is sent to reduce nodes.
-When enough data has been collected by a reduce node, it performs the reduction and writes the output.
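The first step above, input splitting, can be sketched as follows; the split size is shrunk from the real 16-64 MB to a few bytes for demonstration:

```python
SPLIT_SIZE = 16  # the paper uses 16-64 MB per split; bytes here for the demo

def split_input(data: bytes, split_size: int = SPLIT_SIZE):
    # Cut the input into fixed-size pieces; the master assigns
    # each piece to a map worker as one map task.
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

splits = split_input(b"x" * 40)
print([len(s) for s in splits])  # [16, 16, 8]
```

Real splits are aligned with GFS chunk boundaries so that a map task can usually read its input from a local disk, which is the locality optimization mentioned in the step list.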

This system is rather interesting because it allows easy scaling for a certain small subset of problems. However, this simplicity may come at the cost of performance for some applications, especially if the application does not map perfectly to this key/value style computation model. Any changes to an application or an algorithm to make it fit this model may cause performance issues. Another issue to take into account is that MapReduce is only as fast as its slowest-running worker, so any application built on this framework should take nearly constant time for each segment of data run on a worker. Wild variances in run times between workers could increase total runtime. A final consideration is data volume: applications that pass small amounts of data to the reduce phase are preferred, while applications that pass larger amounts of data may drastically slow computation and limit scalability when using MapReduce.

Overall this paper was quite good, and MapReduce is in fairly wide use today. While an application written using MapReduce may never match the performance of a custom solution, it is a cheap and easy way to get some speedup. This is likely enough for most applications and can save the time and money of building an unneeded custom framework just to get the maximum theoretical performance from an algorithm at scale.

This paper discusses the design and implementation of MapReduce as a simple and efficient programming framework for highly distributed computing. MapReduce hides the details of job scheduling and fault tolerance so that users can focus on decomposing the problem into parallel subproblems and easily make use of large clusters of computers.

People are dealing with larger and larger data sets, and it is necessary to use clusters of computers to improve the speed of computation. However, this is not easy. To combine the computation capacities of many computers, there are many issues to solve, including how to distribute data, how to do load balancing, and what to do if some of the computers fail. It is possible to create special designs and algorithms for every application, but it would be more efficient for development, and simpler to use, if a framework existed that could hide the details of parallelization and expose only the problem-solving logic to users.

The key contribution is that MapReduce is easy to use and provides decent speedup. Apparently, MapReduce cannot promise the best parallelization, load balancing, or speedup for every application. It is not a panacea, but MapReduce does support many useful applications and enables more people to enjoy the benefits of big clusters. I think the magic of MapReduce lies in the transition from the map phase to the reduce phase. The map phase produces R partitions, one for each reduce instance; reduce workers can then remotely fetch the data they are responsible for. MapReduce is simple also because it has only a single master, which can maintain the status of every map task, reduce task, and worker. Having a global view simplifies fault tolerance and makes MapReduce more robust. So the key idea is very simple, and it seems to me that highly scalable ideas are usually simple.

Besides the simple design, some parts of the implementation are also great. The backup task is a great idea because it identifies the bottleneck of the computation and solves the problem at very little cost. When a MapReduce operation is close to finishing, some in-progress tasks are re-executed. The assumption is that those tasks are likely stuck for some reason, rather than executing normally; by scheduling another copy of each such task, it is highly likely that the copy will not get stuck in the same place. There is a trade-off here in how to define "close to finishing". The earlier the rescheduling starts, the earlier stuck tasks are restarted, but more normal tasks are also restarted, wasting computation resources; the later the rescheduling happens, the fewer normal tasks are involved, but the longer stuck tasks hold up the job. Another part that impresses me is that MapReduce provides very friendly debugging support. Even though it is not a major topic of this paper, this also shows that MapReduce is easy to use.

One flaw of the paper is that the performance section only gives data about MapReduce, with no data about other methods. I can see that MapReduce works on large data sets and gives good results, but how good is it compared to other systems? One exception is the mention of the previous best reported result for the TeraSort benchmark, but I would need the specific setup behind that number for the comparison to make sense. The fact that MapReduce can produce a better number does not prove MapReduce is good unless the other major variables are fixed.

In all, MapReduce is a very popular technique for distributed computing. Google uses it for several commercial services, which proves its validity. There are also open-source implementations of MapReduce, and more and more applications are being adapted to it. So there is still much to expect from MapReduce.

Victor Bittorf, David Capel, Seth Pollen, Igor Canadi
MapReduce is epic and well known. In short summary, you have a map function and a reduce function -- first you map, then you reduce. This is all distributed, and you read from and write to GFS. Everything is fault tolerant (except masters...?). This is MapReduce, and it is hugely popular and hyped in the web and database communities.

The origins of MapReduce come from the need to do computation at web scale. Doing anything at web scale requires a lot of computation power, which can come in the form of either a few big computers or a lot of small computers. Small computers are cheaper but fail, so we make things fault tolerant to compensate. But fault tolerance and large-scale computing are difficult, and most programmers are ignorant of such complexities. Thus MapReduce is born: computing at web scale for non-system-guru programmers.

Google’s magic sauce spread into Hadoop and proliferated all around the web. The fact that it is clean, simple to understand, and reasonably fast contributed to this popularity. Now even some no-name google engineer can program a 10000 node cluster. Flowers bloom in the valley and everything google touches turns to gold. One member of our reading group hails this paper as the “biggest contribution in the last decade” (I suppose the facebook “like” button is a close second).

In our department, some of our database seminars pay homage to MapReduce in one way or another. But let us turn away from this sweet kool-aid and instead have a taste of reality: yes, google uses MapReduce every day, yes, MapReduce means money-in-the-bank for some $billion companies.

When I’m listening to a talk and I see a slide showing the performance results of a MapReduce (Hadoop) job “running on 24 nodes,” I think to myself, “what does this mean?” After much deliberation in our group, we decided that it doesn’t really mean anything: running MapReduce on 24 nodes offers, as far as we can tell, very little grounds for comparison. MapReduce really shines on very, very large unstructured datasets. It is common to see people quoting figures from small instances and small problem sets -- but this isn’t what MapReduce is good at; you might as well just use the debug version and run it on one big computer. You’re not going to have failures with 20 computers, and yet you are paying a high price for redundancy and scalability.

Basically, MapReduce really only pays for itself at web scale. So then, a benchmark of 20 computers processing 30GB of data isn’t very meaningful, but people report findings for such problem instances. This in some way reflects how popular MapReduce has become, perhaps overly so.

The paper talks about tricks and add-ons to make MapReduce faster, but this is really a mixed bag. Speeding up MapReduce is, in some ways, just required to make it viable. In the opinion of some of our group members, a job on MapReduce will run significantly slower than on a cluster custom-built for that job. Of course MapReduce has significantly less development overhead, but at the end of the day these ‘speedups’, such as pre-aggregating at the mapper node, are only minor improvements to a system that has largely sacrificed performance for the sake of simplicity.
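For reference, the pre-aggregation mentioned here is what the paper calls a combiner function: a local reduce run over each map task's output before it crosses the network. A minimal sketch (mine, not Google's) for a word-count job:

```python
from collections import defaultdict

# Combiner sketch for word count: thousands of ("the", 1) pairs emitted
# by one map task collapse into a single ("the", n) pair before the
# shuffle, saving network bandwidth.

def combine(pairs):
    """Locally sum counts per word; output is sorted for determinism."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

map_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
print(combine(map_output))  # [('fox', 1), ('the', 3)]
```

The combiner only works because word-count's reduce (addition) is associative and commutative, which is also why it can be applied again at the reducer without changing the result.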

We believe that this trade-off space between simplicity and performance at the scale of mapreduce is interesting and incredibly subtle, but the paper only briefly touches on it (mainly only talking about locality). Our suspicion is that this is a rich trade-off space to which google has applied their own blend of proprietary magic; hence the rumors that google’s MapReduce is way better than Hadoop.

Our group is divided on this point and we would like an answer: “How many nodes do you need in your system for MapReduce to be worth using?” As we already said, 20 computers and 30GB of data seems too small, since this could be done in memory on one machine. Some members of our group, however, maintain that MapReduce is worth using even on a cluster of 2 or 3 nodes -- the position was even taken that “there are no jobs that are really bad for mapreduce (aside from large scale physics simulation or other niche problems).”

Our group is divided on whether MapReduce is the panacea for all (or at least the vast majority of) large-scale computing...

MapReduce is a model for parallel computation focused on aggregating results from huge numbers of log-file entries. The paper gives an outline of the high-level architecture used in MapReduce. A single master node acts as a coordinator which assigns the map and reduce workers and directs the workers to the appropriate outputs of the previous step. This single master may be a problem with respect to availability if it goes down, but MapReduce is generally executed as an offline process, so the failure is not that critical. Some current implementations of MapReduce, like MapR, replicate the namenode to preserve availability in the presence of failures. There is also an optimization when the master recovers from a failure: the computation can be restarted from a checkpointed state. I feel that the system has been implemented in a very simple and efficient manner. One example of such simplicity is the fault-tolerance mechanism employed when worker processes, or the machines executing them, are killed. Also, the whole data set lives on GFS, where data is replicated to around three machines for availability in the presence of failures. There are several optimizations in Google's MapReduce implementation, like
a) locality of worker processes with respect to the data for map and reduce states
b) Tolerance to slow processes that would otherwise slow down the whole MapReduce operation. Backup tasks are a good way of optimizing this, and load can be more efficiently utilized across the whole cluster.
c) A combiner function to batch many map emissions for a key into a single emit.
d) Skipping retries of bad records
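Item d) can be illustrated with a hypothetical single-process sketch. In the actual system a signal handler reports the failing record's sequence number to the master, which tells re-executed tasks to skip records that have crashed more than once; here the master's bookkeeping is stood in for by an ordinary dict:

```python
# Simplified sketch of "skipping bad records". Failure counts per record
# sequence number live in a local dict instead of the master's state.

MAX_FAILURES = 2

def run_map_with_skips(records, user_map, failures):
    """Run user_map over records, skipping ones that crashed repeatedly."""
    output = []
    for seq, record in enumerate(records):
        if failures.get(seq, 0) >= MAX_FAILURES:
            continue  # the "master" has marked this record as skippable
        try:
            output.extend(user_map(record))
        except Exception:
            failures[seq] = failures.get(seq, 0) + 1
            raise  # the task dies; the master re-schedules it
    return output

def crashy_map(record):
    """Hypothetical user map function with a deterministic bug."""
    if record == "bad":
        raise ValueError("deterministic bug on this record")
    return [(record, 1)]

failures = {}  # stands in for the master's failure bookkeeping
for _ in range(MAX_FAILURES):
    try:
        run_map_with_skips(["a", "bad", "b"], crashy_map, failures)
    except ValueError:
        pass  # task "re-scheduled" by the master

# The next attempt skips the bad record and completes.
print(run_map_with_skips(["a", "bad", "b"], crashy_map, failures))
# [('a', 1), ('b', 1)]
```

The trade-off is the same as in the paper: losing a few records is acceptable for statistical jobs over huge data sets, where finishing matters more than perfect coverage.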
One of the most difficult parts of implementing MapReduce as a developer is debugging the application, since it is distributed over a wide variety of machines for map and reduce. This is made easier for developers, to an extent, by a job tracker in their implementation that monitors each worker process in the job.
If the state of map workers were frequently checkpointed to GFS, would it be a better idea to restart a failed process from the state on disk rather than restarting the map or reduce task from scratch? I know this is a more complex implementation strategy, but in the case of huge reduces, if a reduce fails just before the end of execution, the whole operation is restarted.
If locality is used, is there a way to make sure that some nodes are not more loaded than others? I am not clear whether the load on the cluster is consulted before assigning input to a map or reduce worker. If so, is load monitored at a high level based on the number of outstanding jobs in the system? This information is present in the master node.
Overall, using the architecture provided by Google, I feel it is pretty simple for any programmer to implement map and reduce operations. The developer can go ahead implementing the parallel logic of the application without worrying too much about fault tolerance. There are factors, however, that must be taken into careful consideration, like the number of map and reduce workers. The performance metrics in the paper for the various operations clearly illustrate how much effect the distributed computation has as opposed to the normal sequential one. It would be interesting to know what MapReduce infrastructure is currently in use at Google and what additions have been made to it.

MapReduce is a framework for completing parallel and distributed computing tasks
within a simple, constrained programming model. The model requires the user to
write two functions: one to "map" a large problem onto many worker nodes, and
one to "reduce" the output of these nodes, whose functionality can also be
carried out on many worker nodes.

It appears that the problem that MapReduce was originally created to solve was
to simplify the programming model of large web crawling and analyzing tasks on a
large amount of machines. MapReduce provided a solution to this and also
provided a generic framework to do other useful things like sorting and
searching through very large amounts of data. This paper contributed a
programming model, first and foremost. The simplicity of this model is a
definite contribution which led to its popularity, as the authors moved more
complex concerns like fault tolerance and load balancing into the MapReduce
library and away from the programming model. This idea seems to be the largest
contribution, as this specific implementation was designed for Google
datacenters running the Google File System, but the library has been implemented
for a variety of platforms.

An obvious piece of the design to pick on is the single point of failure: the
possibly overloaded master, which keeps track of system information on each
worker node. However, I believe their reasoning makes sense in that it is okay
for a compute job to fail because of a master-node failure (which is made rare
by the fact that there is only one). Perhaps a flaw is that I did not see a
compelling argument for why this one master absolutely needed to be the
conduit between the files created by the map stage and the reduce stage.
The single master does allow "backup tasks" to be executed to preserve
liveness of the program. This information on program state leveraged by
backup tasks would not be possible without a single master, but it also needs to
be updated very regularly by worker nodes, increasing tension on the master.
Also, this paper and its optimizations relied very much on Google-based
software, some of it proprietary. This may not be applicable to many other
distributed systems.

This paper is quite applicable to real systems because parallelism is needed for
great performance of large-sized and/or data-intensive problems. MapReduce
shows how simplifying a programming model, while sacrificing some power and
control in the model, makes this parallel programming a lot easier to achieve. A
framework like this is great for distributed NUMA hardware and is always viewed
as something to consider if you are building such a system (or datacenter).
Obviously, some analysis should be done on whether a machine could even run
MapReduce functions to its benefit, as in Google's conception.

The paper talks about a programming framework where users can parallelize
their tasks by expressing the computation as a map and a reduce function. The
framework lets one process huge amounts of data and perform computations on
them by parallelizing across different machines.

The framework requires the programmer to specify the computation as a map
function that processes the input into intermediate key/value pairs and a
reduce function that aggregates the values for each intermediate key. The
programmer defines two parameters, M and R, which decide how the input is
partitioned. The input is split into M parts, and each split is assigned to
an idle worker, which produces R outputs that are written locally. Once the
mapping is done, the workers sort the intermediate values, and the master
assigns a worker to each of the R intermediate partitions; the reduce
function is applied to each partition to produce a final output written to a
global file system. The master takes care of balancing the load on the
workers, detecting failures, and rerunning tasks. It does so by marking
tasks from failed machines as idle and assigning them to the next available
worker. To overcome slow workers, it duplicates the computation and kills
off the other worker once either of the two completes the task. Master
failures are tolerated by checkpointing and restarting. Most of the workers
read their input from local disks, and the master schedules the M splits
onto workers to take advantage of data locality. This way, the master moves
the computation closer to the data instead of moving huge amounts of data
around.
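The execution flow described above can be simulated in a single process. The sketch below is illustrative, not the paper's code, and uses word count (the paper's running example) with each document standing in for one map split and an illustrative R:

```python
from itertools import groupby

# Single-process simulation of the MapReduce flow: partition the
# intermediate pairs into R buckets, sort each bucket by key, group,
# and reduce. Each input doc stands in for one map split.

def map_fn(doc):
    return [(word, 1) for word in doc.split()]

def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(docs, R=2):
    # "Map phase": bucket intermediate pairs into R partitions.
    partitions = [[] for _ in range(R)]
    for doc in docs:
        for k, v in map_fn(doc):
            partitions[hash(k) % R].append((k, v))
    # "Reduce phase": sort each partition, group by key, reduce.
    results = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            results.append(reduce_fn(key, [v for _, v in group]))
    return sorted(results)

print(mapreduce(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```

The sort-then-group step inside each partition mirrors what a reduce worker does after fetching its partition from every map worker; it is what guarantees each call to `reduce_fn` sees all values for one key.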

The main contribution of the paper is the idea of abstracting the
parallelization and fault tolerance away from the actual computation being
done. Most computations can be perceived as a series of map-reduce
operations, and it is easy to start a map-reduce job and to add or remove
machines as they become idle from other tasks. Though the paper claims that
the programmer need not know how to write parallel programs, one still
needs to understand how to model the underlying problem as a map and reduce
computation. I think map-reduce is ideal for computations that operate on
terabytes of data and involve a simple sequence of operations on them.

The key step of sorting the intermediate keys is not explained clearly and
is only briefly described in the related work section. The paper also does
not clearly explain how the master tracks the load on workers
(CPU-intensive vs. memory/IO-intensive). Not all jobs that can be done with
parallel algorithms can be expressed as a map-reduce computation, and I
think the tricky part is modeling the problem that way. In some cases, like
matrix multiplication, the workers can perform better if they have some
synchronization among themselves.

The paper introduces the MapReduce framework, which helps programmers process large amounts of data on top of a distributed file system.

The main challenge is to develop such a framework so that the details of parallelization, fault tolerance, load balancing, and so forth are hidden from the user. The solution is the simple MapReduce approach, where the user implements only a mapper and a reducer. The data is partitioned, and each mapper is responsible for generating intermediate key-value pairs from the pairs in its partition. The reducers are responsible for reading the outputs of the mappers and merging these intermediate (key, set of values) pairs.

The main contribution of the paper is the introduction of the MapReduce paradigm. Although critics say the ideas are not novel (they already exist in parallel database management systems), the easy-to-install, easy-to-use interface that scales to large clusters is itself a valuable contribution. While I can start using MapReduce and write initial programs in a couple of minutes, there is no comparably easy-to-use parallel database management system around.

MapReduce does not provide transaction management (ACID) as RDBMSs do. Failures are handled fairly naively. I don’t know the current status of the system, but the master described in the paper assigns the job to some other worker if the running worker fails, and the computation simply aborts/restarts when the master fails.

One of the criticisms I agree with is the lack of indexing and random-access features. This makes MapReduce weak and inefficient at search, in comparison with standard database management systems.

Another missing feature is a declarative query language such as SQL. I believe this would be a great addition to usability and to the range of applications. On the other hand, some current systems (such as Hive) have already built such features on top of MapReduce.

Although some ideas like indexing and declarative querying would be useful for MapReduce, I would also like to point out that it is somewhat inappropriate to compare relational database management systems with MapReduce. MapReduce targets massive computational tasks that you want to distribute among many machines, which relational DBs cannot do.

People are developing many systems related to MapReduce and Hadoop (e.g., Hive, Pig, Cassandra, Mahout), and many companies sell products related to MapReduce. I believe it will get more and more popular in the future. On the other hand, there is an obvious need for real-time search and data processing on large data sets, for which MapReduce is not suitable since it is designed for offline batch processing. I believe future improvements to MapReduce will focus on making it more efficient for real-time jobs.

MapReduce: Simplified Data Processing on Large Clusters

In this paper Dean and Ghemawat present a Google framework for simplifying mass data processing on large clusters. It is a very interesting system, but it seems very tailored to a Google-specific workload, so the general utility of MapReduce might be a bit questionable. It also seems to lack supporting infrastructure (i.e., debugging, tolerance to load, etc.).

Can map reduce be extended to non-discrete data sets, i.e., time-series data? Splitting time-series data into chunks could leave an event spanning chunks. An example application might be processing MPEG video: if there is not an I-frame at the start of each chunk, processing might be hard, as there would not be a complete picture of a frame.

If memory serves, GFS did not have a way to automatically re-route load if a particular file became a hot spot. Yes, files could be given a higher replication factor if required, but not automatically. If the scheduler mentioned in '(5)' had knowledge of the load on the file system, it might be able to temporarily increase the replication factor for a file, lowering it when finished (maybe in an exponential-backoff fashion?).

The topic above invites the question of what load levels are typically seen by a Google cluster. The MapReduce paper makes it seem like there is no mechanism to keep a particular job from monopolizing the entire cluster. Why couldn't a job be run that consumes all memory in the cluster, especially when the map phase of MapReduce buffers intermediate key-value pairs in memory?

They do not take advantage of having a common API for MapReduce. On page 141 they mention that network maintenance was causing groups of 80 machines to go unreachable for a few minutes at a time. Why couldn't there be an administrative API to take those groups of 80 machines out of the pool available for scheduling, with some kind of callback to note when all jobs running on those machines were complete, making them available for maintenance? Yes, the system continued to make progress in the face of this maintenance, but obviously work was wasted.

It is unclear how this system will degrade under load. They claim that network bandwidth is a precious commodity. High network utilization leads to high latency when fetching non-local data in GFS, which might mean worker processes take longer. As a job comes to an end, any workers not yet completed can be labeled stragglers and have backup tasks scheduled. Couldn't this end up increasing load on already-loaded parts of the system?

Debugging a problem on a MapReduce cluster would seem to be problematic, particularly depending on the user's interpretation of the output from the skip-bad-record functionality. If it is not very obvious, it could lead to misinterpretations by the user and skewed results. It is odd that they did not mention any kind of built-in debugging facility for when a map or reduce function crashes. Ptrace or something of its ilk could be used to get a stack trace at the time of the crash, so some information would be available to the user or developer.

This paper introduces the programming model MapReduce, which automatically parallelizes jobs and executes them on large clusters of commodity machines. The paper also presents an associated implementation of MapReduce, including the techniques used to achieve fault tolerance, high throughput, and so on.

Although many special-purpose parallel programs have been developed to deal with large amounts of data, their programming models vary, and their interfaces are usually not easy to use: they either have complicated application programming interfaces or leave fault tolerance to the programmers. It is also hard to write programs that can execute on a large number of commodity machines. Thus, whether there is a general-purpose programming model for most distributed computations becomes a question, as well as how to make it fault tolerant and efficient.

The main contribution of this paper is the abstraction of a general-purpose programming model: MapReduce. MapReduce consists of a map operation and a reduce operation. The map operation takes an input pair and produces a set of intermediate key/value pairs. The reduce function accepts an intermediate key I and a set of values for that key, and merges these values to form a possibly smaller set of values. This programming model is abstracted from the experience of distributed-programming development at Google and is applicable to most parallel computations.
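These signatures can be made concrete with the inverted-index example the paper lists among its applications. The sketch below is illustrative, not the paper's code: map takes a (doc_id, contents) pair and emits (word, doc_id) pairs; reduce takes a word and its set of doc_ids and emits the posting list.

```python
from collections import defaultdict

# Inverted index as a MapReduce computation: map emits one
# (word, doc_id) pair per distinct word in a document; reduce merges
# the doc_ids for each word into a sorted posting list.

def map_fn(doc_id, contents):
    return [(word, doc_id) for word in set(contents.split())]

def reduce_fn(word, doc_ids):
    return (word, sorted(doc_ids))

docs = {1: "cat dog", 2: "dog fish"}

# Shuffle step: group intermediate values by key.
intermediate = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        intermediate[word].append(d)

index = sorted(reduce_fn(w, ds) for w, ds in intermediate.items())
print(index)  # [('cat', [1]), ('dog', [1, 2]), ('fish', [2])]
```

Note how the intermediate key (a word) has a different meaning from the input key (a document id), which is exactly the flexibility the model's type signature allows.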

Also, this paper presents an implementation of MapReduce in the form of a library. When a MapReduce program starts, it is forked into a master process and some worker processes for map and reduce tasks. The master keeps the state of each map and reduce task and the identity of each worker machine. The MapReduce library handles both worker and master failures. A worker is marked as failed if the master does not receive a response from it within a certain amount of time. Failed tasks, together with completed map tasks on the same failed machine, are re-executed. If the master fails, the MapReduce computation is aborted, and clients can retry the operation if they desire.

A locality optimization is included in the implementation to reduce network traffic: the MapReduce master takes location information into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Backup tasks are used to reduce the impact of slow machines.

Some other useful extensions are also provided by the MapReduce library: user-defined partitioning functions, combiner functions, user-defined input and output types, the ability to skip bad records, counters, local execution for debugging, profiling, and small-scale testing, status information for human consumption, and ordering guarantees.

The idea sounds simple but useful. However, one flaw of this paper is the absence of any discussion of resource management. Different instances might execute at the same time on the same set of machines, in which case "stragglers" are much more likely to appear. Good resource management might be necessary to better isolate MapReduce instances and achieve better performance.

Overall, this paper is well written and presents the simple idea of MapReduce and the corresponding implementation, which I think is still applicable nowadays. It is abstracted from the majority of distributed computations and is applicable to various applications. Besides the implementation by Google, Hadoop includes an open-source implementation of MapReduce that is widely used among a large number of companies and organizations. This also shows the applicability of MapReduce nowadays and its success.

This paper talks about the MapReduce programming model and runtime system that allows users to express processing of large data sets using two primitives “map” and “reduce” and abstract away the distributed, fault-tolerant implementations of the two primitives (often, over a cluster of commodity workstations).
The chief problem the paper is trying to solve is ‘providing a simple and highly abstract programming model that allows for the definition of tasks that operate on huge datasets and the transparent execution of the tasks on the underlying infrastructure (which is often a cluster, and hence the execution must be distributed and fault – tolerant)’.
In the past, the problem of transparently processing large data sets over a cluster has been solved for other data models (parallel DBs). Also, the programming model and implementation aren’t novel: map and reduce are commonly used higher-order functions in functional languages, and MPI creates and executes tasks on a cluster. However, what is novel about MapReduce is that it adds transparency (process management, fault tolerance) to MPI-style distributed execution on a KV data model and provides a functional programming interface to the user. This is probably the biggest contribution of the paper.
Flaws:
Probably the biggest flaw in the paper is that the functional paradigm (and the original semantics of map and reduce) seems to have been thrust into MapReduce just for the sake of the names ‘map’ and ‘reduce’ being borrowed from the functional world. The primary idea behind functional programming is the absence of destructive assignment. The program cannot have a ‘state’ that is updated by statements: the input is transformed from one form to another (a list to another list, a list to a value, etc.) by every application of a function, and finally we get the output. The (original) map function operates on a list and applies the map operation to all elements of the list, producing a new list. The (original) reduce function applies the reduce operation to all elements of a list, resulting in a value. These functions really make sense only in a functional setting, where we can’t have or update state of any form.
However, for the problems MapReduce was supposed to tackle, there are no sufficient grounds for why a functional paradigm is even necessary. Just because of the functional paradigm, map ends up producing a humongous intermediate list (which is later reduced by the combiner function). Since the amount of data being processed is non-trivial, this ends up using a lot of IO. If there were no functional paradigm to follow, map could instead use some state (say, a hash table) to hold the intermediate values: instead of appending an intermediate KV pair to the output list, we could get the existing value of the intermediate key from the state, update the value (probably by applying the reduce function), and store the new value back. The intermediate states produced by the workers could then be partitioned and assigned to R workers, which further apply reduce and produce the final values. This would reduce much of the space used by the intermediate data and save a lot of IO and network traffic.
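The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration (word count, with addition playing the role of the reduce operation), not anything from the paper; later literature calls this style "in-mapper combining":

```python
from collections import defaultdict

# Instead of emitting one intermediate pair per input record and
# combining afterwards, keep a hash table inside the map task and
# apply the reduce operation (here: addition) as records arrive,
# so the intermediate list never materializes.

def map_with_state(records):
    state = defaultdict(int)   # the mutable state the functional model forbids
    for record in records:
        for word in record.split():
            state[word] += 1   # reduce applied eagerly; no list growth
    return sorted(state.items())

print(map_with_state(["the fox", "the the"]))  # [('fox', 1), ('the', 3)]
```

The catch, which the review's argument implicitly assumes, is that the reduce operation must be associative and commutative for this eager application to give the same result as reducing the full list at the end.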
The next flaw is the lack of sufficient information about dynamic load balancing. Load balancing of the reduce tasks can be done statically or dynamically. Static load balancing is smart partitioning such that each reduce worker ends up with relatively the same amount of work; dynamic load balancing is reassigning part or all of the work of a loaded worker to another process. The paper does not do static load balancing (it says that the partitioning should be provided by the user). This is rather naïve given the existence of many sophisticated partitioning algorithms, especially in parallel hash-based joins. The paper mentions dynamic load balancing in the experiments section, but nothing about it appears in Sections 3 or 4 (if ‘backup tasks’ count as dynamic load balancing, they are a rather naïve approach). The master does not seem to do dynamic load balancing by incorporating the size of the datasets assigned to workers and the processing power of the workers. So what happens if a reduce partition is too big or too small? Also, the locality optimization does not really seem like an optimization in the implementation algorithm proper, but just an offshoot advantage of GFS.
There is nothing particularly interesting or innovative about the paper (it either punts interesting problems worthy of engineering innovation to the user or resorts to naïve solutions). I dare say that the wide audience for MapReduce is probably due to the convenience offered by transparency, fault-tolerance guarantees, and the familiar interface, and definitely not due to engineering innovation of any sort.

In this paper, the authors describe MapReduce, a large-scale parallel key-value data processing system.

Google has a huge amount of web data that needs to be processed within limited time, such as computing the reverse web-link graph, the inverted index, and so on. MapReduce is designed to solve these problems by providing a simple interface for programmers to process data in parallel over a large cluster of commodity PCs, so programmers can focus on the logic of the task without worrying about the details of parallelization.

MapReduce provides a quite simple programming model consisting of two parts. The map function processes the input key-value data and generates intermediate key-value data; the semantics of the new key need not be the same as the input key. The intermediate data is output to R files, where R is the number of reduce tasks. Each reduce task takes a file from each map task and groups the key-value pairs with the same key, then generates a final file containing part of the total result, which can be further processed by another MapReduce job or by user tools.

Since MapReduce is implemented on cheap commodity PCs, it must handle machine failures gracefully. If a worker machine fails, the map and reduce tasks running on it are restarted on other available machines. If the master machine that schedules and monitors the tasks fails, the whole job might have to be restarted under the implementation at that time. The authors argue that the single master is unlikely to fail, and that checkpoints on the master machine can reduce the work of a restart.

Other mechanisms are introduced to improve overall performance. First, backup tasks are scheduled for very slow workers to speed up the overall job: sometimes there are stragglers that are very slow due to resource competition with other tasks or hardware problems, and the whole job has to wait for them to finish. Second, network bandwidth is a scarce resource in distributed systems, so MapReduce takes advantage of locality and replication to reduce the data transmitted through the network.

The MapReduce system has been used at Google for many production jobs, including building the web-page index for Google search, which shows the performance and reliability of the system. The authors also evaluate some sample jobs on MapReduce to give a better understanding of its performance in different situations. I think MapReduce is an innovative distributed system that has had a big effect on the whole industry. Hadoop, which started from this paper, has become one of the most popular distributed systems, and Microsoft SCOPE also has a similar programming model.

In this paper, the author described the MapReduce, a large scale parallel key-value data processing system.

Google has a huge amount of web data needs to be processed within limited time, such as computing the reverse web-link graph, inverted index and so on. MapReduce is designed to solve these problems by providing a simple interface for programmers to process the data parallelly over a large cluster of commodity PC, and the programmers can focus on the logic of the task without worrying about the details of parallelization.

MapReduce provides a quite simple programming model consisting of two parts. The Map function processes input key-value pairs and generates intermediate key-value pairs; the intermediate keys need not have the same semantics as the input keys. The intermediate data is partitioned into R files, where R is the number of Reduce tasks. Each Reduce task takes one file from each Map task and groups together the key-value pairs that share the same key. Each Reduce task then generates a final file containing part of the total result, which can be further processed by another MapReduce job or by user tools.
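As a concrete illustration of this model, here is a minimal single-process sketch in Python of the paper's canonical word-count example. Only the shape of the map and reduce functions comes from the paper; the in-memory runner is my own toy stand-in for the distributed runtime:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here), value: document contents.
    # Emits one (word, 1) pair per occurrence, as in the paper's example.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: all the counts emitted for that word.
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy in-process runner: the real system distributes all three phases.
    intermediate = defaultdict(list)
    for key, value in inputs:                 # map phase
        for ik, iv in map_fn(key, value):
            intermediate[ik].append(iv)       # "shuffle": group by key
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}  # reduce

counts = run_mapreduce([("doc1", "the quick fox"), ("doc2", "the lazy dog")],
                       map_fn, reduce_fn)
```

In the real system the grouping step ("shuffle") is distributed: each mapper partitions its output into R local files, and each reducer fetches and merges its own partition from every mapper.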

Since MapReduce is implemented on cheap commodity PCs, it must handle machine failures gracefully. If a worker machine fails, the Map and Reduce tasks that were running on it are restarted on other available machines. If the master machine that schedules and monitors the tasks fails, the whole job might have to be restarted under the implementation at that time. The authors argue that failure of the single master is unlikely, and that the master's state can be checkpointed to reduce the work of a restart.
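The re-execution rule is subtly asymmetric: completed Map tasks must be re-run when their worker dies, because their output sits on that worker's local disk, whereas completed Reduce output already lives in the global file system. A hedged sketch of the master's bookkeeping for this rule (the policy is the paper's; all names and data structures here are my own):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in-progress", or "completed"
    worker: Optional[int] = None  # id of the worker it was assigned to

def handle_worker_failure(tasks, failed_worker):
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "map":
            # Map output is on the failed worker's local disk, so even
            # completed map tasks must be reset and re-executed.
            task.state, task.worker = "idle", None
        elif task.state != "completed":
            # Completed reduce output is already in the global file
            # system, so only in-progress reduce tasks are reset.
            task.state, task.worker = "idle", None
    return tasks
```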

Other mechanisms are introduced to improve the overall performance of the system. First, it launches backup executions of the last few in-progress tasks to speed up the overall job. This is because stragglers, workers that run very slowly due to resource competition with other tasks or hardware problems, can force the whole job to wait for them to finish. Second, because the network is a scarce resource in distributed systems, MapReduce takes advantage of locality and GFS replication to reduce the data transmitted over the network: the master tries to schedule Map tasks on machines that hold a replica of the input data.
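The straggler mitigation can be sketched as a simple scheduling rule: when a job is close to completion, the master schedules duplicate ("backup") executions of the remaining in-progress tasks and takes whichever copy finishes first. A toy version (the policy is from the paper; the code and names are my own):

```python
def schedule_backups(tasks, idle_workers, job_nearly_done):
    # tasks: list of dicts like {"id": 2, "state": "in-progress"}.
    # Returns (task_id, worker_id) pairs for backup executions; the
    # first copy of each task to finish "wins" and the other is discarded.
    backups = []
    if not job_nearly_done:
        return backups
    for task in tasks:
        if task["state"] == "in-progress" and idle_workers:
            backups.append((task["id"], idle_workers.pop()))
    return backups
```

The paper reports that this costs only a few percent extra compute but cuts the sort benchmark's completion time substantially.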

The MapReduce system has been used at Google for many production jobs, including building the web-page index for Google search, which demonstrates the performance and reliability of the system. The authors also provide an evaluation of sample jobs on MapReduce to give us a better understanding of the system's performance under different situations. I think MapReduce is an innovative distributed system that has had a big effect on the whole industry. Hadoop, which started from this paper, has become one of the most popular distributed systems, and Microsoft's SCOPE has a similar programming model.
