
MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

Review due Thursday, 4/15.

Comments

Summary
This paper describes a parallel batch-processing programming model called MapReduce. MapReduce is mildly similar to Condor: users submit a job consisting of a Map and a Reduce function to be applied to a large data set, and the system handles all of the fault tolerance, scheduling, and distribution.

Problem
Programmers want the ability to process large amounts of data quickly and efficiently without worrying about all the details involved in distributed systems. This programming model supplies exactly that.

Contributions
Map Reduce makes numerous contributions.
1. A new and very simple abstraction for processing any size of data that fits into this programming model. Many data processing jobs do fit into this category, or can be modeled to.
2. Fault tolerance is a first order consideration. The master node monitors workers and restarts chunks of the work if necessary.
3. Locality is taken into account when scheduling tasks, along with performance enhancements such as backup tasks that keep the last few stragglers from bogging down the job.

Applicability
I think this is a good abstraction for programmers. I think its use is limited to companies or research involving large- to internet-scale data. But most companies would probably like a way to process the data they collect, and this gives them an easy abstraction for doing so.

Summary:
This paper introduces the MapReduce programming model, which uses parallelism to handle large data sets on distributed systems. The model provides fault tolerance by re-scheduling failed tasks and also considers I/O locality to reduce network traffic. The process is:
1. The master splits the input file and forwards each sub-file to worker nodes.
2. Each node applies the map function and produces intermediate key-value pairs.
3. The reduce function accepts an intermediate key and the set of values for that key and merges those values to produce the result (a minimal sketch of the map and reduce functions follows).
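
To make steps 2 and 3 concrete, here is a minimal sketch of the paper's word-count example, written in Python purely for illustration (the paper's own interface is C++); the function names are placeholders, not Google's actual API.

    def map_fn(doc_name, contents):
        # Step 2: emit an intermediate (word, 1) pair for every word.
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Step 3: merge all values emitted for one intermediate key.
        return (word, sum(counts))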

Problem:
The problem this paper tries to address is building a distributed computation framework for operating on large-scale datasets. It uses the Map and Reduce programming model to provide users with an easy-to-use library, which takes I/O locality and load balance into account and provides fault tolerance for node failures.

Contributions:
Compared to the traditional mechanism (the Parallel DB), Map-reduce has several advantages:
1. There is no requirement for the input data format, which could handle different applications very well.
2. Using a master node to do the scheduling and storing the intermediate files on disk enables the system to tolerate node failures by rescheduling the work to another available node. This considerably increases system efficiency, because the traditional approach, for example the Parallel DB, needs to redo the whole job and wastes much time.
3. The master node also provides a lot of flexibility to design different scheduling policies for different applications.

Questions/Comments:
Compared to the simple method of reading the input file sequentially and performing the operation, MR adds two I/O operations: the map nodes need to write the intermediate data to disk and the reduce nodes need to read it from disk. So I'm wondering why MR provides better performance than this simple method. If the reason is that MR can use parallel I/O, why does MR still need to do the partitioning? We could just use disk arrays to enhance I/O performance. I guess only when there are a very large number of nodes and very big input files can MR provide much better performance.

Summary:
The paper describes MapReduce, a programming model that supports parallel execution of a computation task on a cluster of machines.

Detailed Description:
Often, programmers need to perform huge, data-intensive computations. Doing this on a single computer would take a long time, while buying a powerful machine is too expensive. One idea is to use the compute cycles of multiple machines in a cluster. If the computation is inherently parallelizable, this provides very good performance. MapReduce is a programming framework that lets users abstract a task into two phases: Map and Reduce. During the map phase, a set of pairs are operated on to produce another set of intermediate pairs. The reduce function takes the intermediate values and produces a set of values as the result. It then becomes easy to distribute the map and reduce operations to different processing entities. There are quite a few applications that are amenable to MapReduce processing, as the authors note.

The way the paper implements MapReduce is somewhat similar to Condor, with the addition of parallelization: a master program initiates and keeps track of a bunch of worker programs performing map and reduce operations. A map worker reads a list of tuples from the input file, processes it, and signals the master. The master then signals the reduce instantiations, which pull the intermediate tuples from the map locations. The outputs of the reduce operations are stored in files. Google's MapReduce library also has an array of features and optimizations like tolerating worker failure, choosing workers close to the input file locations, and using a combiner at the map location.
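
As a rough illustration of that data flow, here is a toy single-process driver (a hypothetical sketch, not Google's library): it splits the input, runs the map tasks, groups the intermediate pairs by key (standing in for the master's coordination and the reducers' pulls), and then runs the reduce tasks. It can be fed the word-count map_fn/reduce_fn sketched in an earlier comment.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn, num_splits=4):
        # "Master" splits the list of (key, value) input records into chunks.
        splits = [inputs[i::num_splits] for i in range(num_splits)]

        # Map phase: every record in every split yields intermediate pairs,
        # which are grouped by intermediate key (the shuffle).
        intermediate = defaultdict(list)
        for split in splits:
            for key, value in split:
                for ikey, ivalue in map_fn(key, value):
                    intermediate[ikey].append(ivalue)

        # Reduce phase: each intermediate key and its value list is merged.
        return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]

    # run_mapreduce([("doc1", "a b a"), ("doc2", "b")], map_fn, reduce_fn)
    # -> [("a", 2), ("b", 2)]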

Contributions:
From what I get from their related work, the contribution of this paper is the generic idea of using Maps and Reduces to solve a parallelizable problem in two stages. The other ideas, like using idle machines in a cluster, are old, and the area of parallel computing was already quite active at that time.

Applicability:
While this is not novel, I think it is a neat way of abstracting parallel computation and crafting it as a program whose distribution is easy to implement. There is a wide variety of applications that can be parallelized into Maps and Reduces. This framework definitely has a lot of applicability, as is evident from its wide adoption and the popularity of many MapReduce systems. It is highly scalable and gets work done quickly using a cluster of commodity machines. As for comparing MR to databases, I think the two were designed to solve different problems. When MapReduce became popular, it appears the database people patched up their design with UDFs to show that it can theoretically do whatever MapReduce does. As the other paper says, these two technologies complement each other, and there are applications where one or the other performs better.

Summary
--------
The paper describes Map Reduce, the highly popular programming model for
distributed systems.

The programmer just provides the Map and Reduce functions for a particular task. The MapReduce framework internally splits the input data and creates multiple map instances and reduce instances to work on the data. This way, programmers are relieved of the burden of parallelizing the application, handling inter-node communication, and providing fault tolerance in a distributed system.

Interestingly enough, this simple programming interface is good enough to carry out many tasks on large data sets.

Problem Statement
------------------
Programmers who want to use distributed systems are typically bogged down by the need to handle fault tolerance, inter-node communication, and parallelization of the task. We need a simple programming interface which does most of this work for the programmer, yet is not overly restrictive.

Contributions:-
---------------
1) Map Reduce programming paradigm.

Relevance:-
------------
The usefulness of the idea can be understood from the many implementations of Map Reduce that are existent today.

It handles data-parallelism very well. However, I feel there is no way to specify and make use of intra-task parallelism.

With the availability of cloud computing, Map Reduce is going to gain further popularity.

Summary and description:
The paper describes a new programming model which allows automatic distributed computation of large tasks. Such a programming model eases the writing of programs whose inherent methodology is simple, but where explicitly specifying the distribution of the program, along with things like fault tolerance and reliability, makes them difficult to code. In the new programming model, users specify the location of the input data files, the output files, a method of mapping the input data (which are key-value pairs) into intermediate lists of key-value pairs, and how the output should be derived from the intermediate set of key-value pairs. The paper tries to show that this model can be used to easily design a large number of simple, data-intensive applications.

Contributions:
1. The design of the Map-Reduce model that will allow processing large data sets in a distributed system without explicitly specifying the mechanics for distribution of the computation.
2. The design of an implementation of the model in a network of commodity PCs.

Applications and comments:
As the paper shows, MapReduce can be used to implement a large number of simple applications that involve processing a large amount of data. Besides Google using the model internally, other implementations of the model (e.g., Hadoop) exist. The paper is well written and describes a model that seems to be useful. However, it still remains to be seen whether MapReduce can be used to accomplish much larger distributed tasks (e.g., a distributed database engine).

Rather than writing a new distributed system to solve each new problem as it arises, MapReduce aims to solve a more general problem. The algorithm isn't anything new, but the pattern is very common in everyday code, so it's easy to benefit from MapReduce.

MapReduce installations take an efficient approach to consistency. All communication except for task outputs is unreliable, since that communication will usually be correct anyway. Map and reduce task outputs use two-phase commit for consistency. There is no consistency for 'side-effects', as these should not be an important part of the output but merely status updates for the task. I presume that distributing tasks does not need consistency, since if a worker does not start a task, the task will just be redistributed with little loss of progress.

Partitioning and availability are implemented in a straightforward way: if a worker does not return results, the job goes to a different worker; if the master does likewise, the MapReduce job is restarted. Since there are many small workers, the system is also far more available than a small set of database servers. If a worker simply cannot send its results back, it will presumably delete them after some period of time.

MapReduce works well with respect to all of the distributed computing fallacies, with one possible exception. It certainly is able to load-balance effectively by making tasks small. One might assume that this just gives granularity so each node gets an equal share, but it works in a different way: nodes that have more slack take more tasks from the master. If a slow node is working on a task needed to finish the job, other nodes, which may be faster, can work on the same task so it finishes earlier. Google probably also just makes all systems homogeneous anyway, but that is probably administrative and not for load balancing.
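
A toy sketch of that backup-task idea, assuming the tasks are deterministic and free of side effects so either copy's result is acceptable (the names here are illustrative, not the paper's):

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_with_backup(task, args, pool):
        # Near the end of a job the master launches a duplicate of a
        # still-running task and keeps whichever copy finishes first.
        primary = pool.submit(task, *args)
        backup = pool.submit(task, *args)   # duplicate of the suspected straggler
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        return next(iter(done)).result()    # first finished copy wins

    # with ThreadPoolExecutor(max_workers=4) as pool:
    #     total = run_with_backup(sum, ([1, 2, 3],), pool)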

MapReduce is more useful than database servers in a few ways. An organization does not need 2000 servers, as is the case with Google, to make good use of MapReduce; it can be effective and efficient with just a handful of computers (as long as task replication is adjusted accordingly). It is also much more general than a database server, since it can solve tasks that don't require storage, like finding large primes or factoring large numbers. Most databases are relational and so do not fit many data models the way GFS (or just about any other data store) can.

[sorry for the lateness]

Map-Reduce is a new computational model for distributed processing of heavy tasks. It exploits the inherent parallelism in a given task to split it into multiple independent sub-tasks that can be executed simultaneously. The results of these sub-tasks are then aggregated by another set of sub-tasks.

This particular paper implements the Map-Reduce operation for a cluster of commodity machines. The target execution environment presents several challenges. The networking hardware is also built from commodity parts, and the bisection bandwidth of the network is far less than the aggregate bandwidth of the machines. A large cluster of small machines is also prone to machine failures.

The Map-Reduce computation consists of a Map phase and a Reduce phase. The Map phase accepts the user-given inputs and produces intermediate output as key-value pairs. The Reduce operation processes these intermediate results to produce the final output. Both the Map and the Reduce function are user specified. The interesting aspect of the paper is its detailed description of the overall architecture of this system, specifically targeted at a certain environment. The system consists of many workers executing a Map or a Reduce operation. A special process designated as the master handles the communication between the workers, implements fault tolerance, partitions the data into chunks, and distributes them to the workers. To lower network communication, the input files are divided into 16-64MB chunks stored on the local disks of the worker machines. A number of replicas are made to ease the scheduling process. A map task writes the results of its operation to the local disk and passes these locations on to the master. These results are accessed by a reduce worker using RPCs. The results of a reduce worker, however, are written to the distributed file system, once again to ease access by the user. The scheduler tries to execute a Map operation as close as possible to the machine containing the corresponding data on its local disk.

Fault tolerance: In this model the master monitors all the workers periodically. It restarts the jobs from failed workers and reassigns tasks to any free ones. This model is suitable for this system, as the number of workers is typically very large and using a distributed algorithm would be unsuitable. Also, as the workers are independent, they need not know about each other's status. A simple restart for failed workers is also suitable, since each Map or Reduce operation is quite small and there are many available workers. Checkpointing and restart are used to manage the master's failures.

The paper also describes a combiner function that partially merges the results of a map operation before they are shipped, and a user-defined partitioning function that controls how the intermediate data is partitioned across the reduce tasks (a small sketch of both follows). These offer quite a bit of flexibility to the end user.
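
A small Python sketch of both refinements; hash(key) mod R is the paper's stated default partitioning scheme, while the value of R and the word-count combiner are illustrative assumptions:

    from collections import defaultdict

    R = 8  # number of reduce tasks (assumed here)

    def default_partition(key):
        # Decide which of the R reduce tasks receives this intermediate key.
        return hash(key) % R

    def word_count_combiner(pairs):
        # Partially merge (word, 1) pairs on the map worker before they are
        # written to local disk, shrinking the intermediate data.
        partial = defaultdict(int)
        for word, count in pairs:
            partial[word] += count
        return list(partial.items())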

Applications: The paper describes several candidates for the Map-Reduce operation: distributed sort, grep, reverse web-link graph; typically anything that involves heavy computation and can be parallelized such that the sub-tasks do not need to communicate with each other. I think this particular implementation on commodity systems is unsuitable for time-critical tasks or for tasks that require heavy communication between sub-tasks.

The paper presents MapReduce, a programming model to solve computational problems with large datasets and outputs that can be effectively partitioned over their input data. The paper introduces us to Google's implementation of MapReduce specific to its execution environment of thousands of networked commodity systems.

The problem tackled by the paper is one of implementing a programming model that automates, to a large extent, the process of partitioning a problem into sub-problems, distributing the workload over a number of workers, and then combining the various sub-results using a user-defined function to generate the solution. What is common to most problems that will benefit from this programming model is the presence of large datasets as inputs, outputs, or both.

The implementation presented in the paper is best suited for an environment where there are thousands of worker nodes available, with each node being fairly small (in computation power) and the networking resources between nodes being a scarce commodity. The implementation takes these factors into account: it splits the input data into small chunks and most often manages to localize the data that each worker is assigned (if not on the same machine, then on a machine close to it in the networking hierarchy). This implementation allows for exactly one map and one reduce phase, but as shown by the various examples, this is a sufficient model for many common tasks that would benefit from it. The system is built to be highly fault tolerant: each data chunk is replicated up to 3 times, and the master keeps track of all the workers that are up and running. When a worker failure is detected, the master simply assigns another worker to the sub-problem that the failed node was working on. Master failures are treated as rare and usually result in the entire job having to be restarted in the present implementation, although logging and checkpointing as in distributed database systems is an option.

The system is very relevant to modern cloud-based capabilities since there is a definite trend away from large supercomputers to networked clusters, where one problem that always remains in porting is the need for programmers to be aware of the environment and implement the communication required between nodes to work together on a single problem. MapReduce manages to alleviate this difficult task and automatically sets up the necessary communication to solve the large problem that is submitted. In comparison to parallel databases, it seems like the two are relevant to different domains of problems. The complex schemas stored in DBMS applications are necessary for solving complex queries efficiently, whereas MapReduce with its simple key-value view of the data cannot be used for solving such queries. But parallel database applications are expensive, and MapReduce is able to solve many simple yet relevant problems just as efficiently, if not more so, for free.

Summary:
This paper describes how Google designs and implements MapReduce, a programming model for processing large data sets on a large cluster of commodity machines. Users simply specify the map and reduce functions, and the system handles the data partitioning, scheduling, failures, etc. This model can automatically solve many real-world problems, while not requiring programmers to have much experience with parallel programming.

Problem Description:
The problem this paper is trying to solve is one faced by Google every day: how to process large amounts of raw data, such as crawled documents, web request logs, etc., over a large cluster of hundreds or thousands of commodity machines. The map and reduce idea exists in many functional languages such as Lisp. The goals of this paper include:
-propose an abstraction based on map-reduce, which can automatically handle parallelization and hide details from programmers.
-Implement this model, and provide simple and powerful interface to users
-Achieve performance, failure tolerance, and load balancing.

Contributions:
The main contributions of this paper include:
-the implementation of the MapReduce interface, which is very easy for programmers to use and solves many parallel computing problems, as shown in the examples. The experimental results show very impressive throughput.
-The implementation is failure tolerant: the master detects worker failures and schedules re-execution of these tasks. The system is resilient to large-scale worker failures.
-The implementation also includes other refinements to speed up the computation, such as locality-aware scheduling and partial combining.

Applications:
The level of parallelization provided by Google MapReduce naturally restricts the computation model and cannot cover every distributed computing problem. However, it is already very expressive and very useful in real-world applications. The implementation by Google has also turned out to be very successful and is now a classic. The single master seems to be the bottleneck and single point of failure of the system.

Summary:
MapReduce is a programming model for processing large data sets while abstracting implementation details such as parallelism, fault tolerance, data distribution, and load balancing.

Problem:
Data sets continue to grow in size, so the need for parallelized computation to process the data in a reasonable amount of time is also growing. As we have learned in this class, distributed parallel computing is not a simple problem. Google engineers were finding that much of the code they were writing to solve simple problems existed just to handle the difficult aspects of parallel distributed processing, making their solutions no longer simple. They needed a way to abstract away the distributed-systems code in order to keep the simple problem's code simple.

Contributions:
MapReduce is Google's answer to the above problem. The basic MapReduce algorithm has two steps: map and reduce. Map takes as input a key and value and produces a list of intermediate key-value pairs. Reduce takes that list and reduces it down to a list of values. These are the functions written by the user to solve their particular problem, while the rest of the details are handled by the underlying system. The user also creates a specification containing input/output names and parameters that is fed into the MapReduce library. The paper goes on to discuss at a high level a MapReduce implementation that runs on large clusters of commodity PCs. A master node takes the input and splits it into small chunks. It then assigns worker nodes to run map on the chunks. After they finish, the master partitions the results and passes the intermediate key-value lists to worker nodes to run reduce functions in parallel. Fault tolerance is a major issue taken into account given the large number of nodes being used; in the case of worker node failure, the master simply reschedules the workload on a new node and continues on. For master failures, it would be possible to restart from checkpoints, though Google engineers found it easier to just fail and restart the computation since such failures were rare. The paper outlines a number of optimizations and refinements, including exploiting network locality and handling straggler nodes by rescheduling certain partitions on new nodes. MapReduce has good performance and is used widely at Google.

Applications/Thoughts:
The paper discusses a number of interesting applications of MapReduce, including functions used for web search indexing, like creating reverse web-link graphs and inverted indexes (a small inverted-index sketch follows). However, there are some tasks that would be better handled by a full database system.
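
For instance, the inverted-index job mentioned above might look roughly like this (a hypothetical Python sketch, not Google's code):

    def map_inverted_index(doc_id, contents):
        # Emit (word, doc_id) for every distinct word in the document.
        for word in set(contents.split()):
            yield (word, doc_id)

    def reduce_inverted_index(word, doc_ids):
        # The posting list: all documents that contain the word.
        return (word, sorted(doc_ids))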

This paper presents map and reduce, two concepts from functional programming, as they are used at Google. The map-reduce model works by users providing a map function, which processes key/value pairs to generate intermediate key/value pairs, and a reduce function, which takes the intermediate pairs and merges all values associated with the same key. Tasks that fit into the map-reduce model are easily parallelized without the programmer having to be concerned about the parallelization. Google's implementation of map-reduce runs on a cluster of inexpensive commodity hardware. These computations are scalable and process huge amounts of data (sometimes many terabytes across many machines).

Map-reduce is a fairly simple programming model that is particularly effective for certain operations. Some things it is useful for include counting URL access frequency and creating an inverted index. Like Amazon with Dynamo, Google saw a specific problem for which existing solutions certainly existed (databases can offer the same functionality) and chose to create its own in-house solution. Their implementation of map-reduce has to be scalable to many machines in a cluster and potentially multiple clusters. It has to tolerate faults while running (with such a large number of machines running commodity hardware, failure during execution is extremely likely).

This paper provides a good overview of the map-reduce functionality, but it is not very technical. Map-reduce is a good way to parallelize the computation for certain problems. These problems tend not to require perfect answers. If the number of people who visited a URL is off by a few counts, there will not be any repercussions. That is why map-reduce may be appropriate in some instances while use of a database is appropriate in others. You would never consider storing bank records as pairs that you then summed together (in the reduce phase) to get a user's balance.

Summary:
This paper presents a general interface, map-reduce, for handling a major class of parallel programs on clusters.

Problem Description:
At Google, people need to write many different parallel programs to handle various tasks. Writing highly scalable parallel programs while taking fault tolerance, locality, etc. into consideration is painful. The authors recognize that by using the map-reduce model, many of the parallel programs Google deals with can be generalized. Instead of handling hundreds of individual cases, the system provides a general interface to use.

Contributions:
(1) It recognizes that many problems have a common property: they work on an input (which can be represented as a set of (key, value) pairs) and reorganize that input (manipulating the keys and values). In a functional language such as Lisp, these can be naturally modeled as map and reduce functions. A parallel programming model can thus be provided to parallelize such calculations.
(2) A powerful and natural interface for programmers to use. The input is divided into equal-sized chunks; the map step turns the raw input into intermediate (key, value) sets; reduce executes the user's reduction operation on the (key, value) sets from the intermediate step to generate the final (key, value) set. In theory, map and reduce are the only two interfaces that the programmer needs to provide.

(3) It keeps the complexity of fault tolerance away from the programmer. The runtime monitors the status of worker nodes and re-executes a task if a node fails to respond within a period of time.
(4) It takes advantage of locality: the scheduler tries to make the input files local to the mapper and the intermediate results local to the reducer (a toy sketch of this preference follows the list).
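
A toy sketch of that locality preference (a hypothetical helper, not the actual scheduler):

    def pick_worker_for_map_task(chunk_replica_hosts, idle_workers):
        # Prefer an idle worker that already stores a replica of the task's
        # input chunk on its local disk; otherwise fall back to any idle
        # worker. (The real scheduler also considers same-rack hosts.)
        for worker in idle_workers:
            if worker in chunk_replica_hosts:
                return worker
        return idle_workers[0] if idle_workers else None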

Applicability:
The paper says that hundreds of MapReduce programs have been implemented and are running at Google. As a programming model, I feel it is a successful one, because it is easy to program with and covers a substantial number of problems. It works very well on clusters (as it doesn't require complicated communication) and can solve Google's problems (which are mostly embarrassingly parallel). I think the key to MapReduce's success is that it abstracts at the right level: if it were an "easier" abstraction, many programs couldn't be modeled by it; if it were a more complicated abstraction, the framework would be very hard to get right.

This programming model, however, has limitations, such as weak expressive power: it cannot express parallel programs that require complicated synchronization. It would be nice if someone did a study to show what percentage of parallel programs can be modeled using MapReduce. For example, one could take the "13 dwarfs" proposed in Berkeley's position paper on parallel programming and see how many dwarfs can be modeled using just MapReduce. I won't say it would show the weakness of MapReduce, because MapReduce is not designed to replace the current parallel programming paradigm. But it might show that map-reduce is quite general despite the simple interface.

The paper presents MapReduce, a programming model for processing huge amounts of data in parallel on cheap, commodity machines. Tons of data are periodically processed to generate different kinds of statistics or to perform other computations. The inherent possibility of partitioning this data enables systems to perform tasks in parallel on large clusters. The authors indicate that there are lots of such data sets and associated operations performed within Google, which motivated the design and implementation of MapReduce.

The goal of MapReduce is to divide the computation into many smaller, independent tasks, run them in parallel on hundreds of machines in a cluster, and then merge the results into a final output. The challenges involved include efficient partitioning of data, proper scheduling across idle machines, handling machine failures, and hiding all of this from the programmer.

The input and output are both sets of key/value pairs. The user provides a map and a reduce function. The map function produces a set of intermediate results, which are then combined by the reduce function to produce the final result expected by the user. One instance of the user program acts as the master, which creates map/reduce workers as needed. The master also acts as the intermediary for communication between the map and reduce workers. The intermediate results are stored locally on disk, while the final results are stored in GFS.

The key contribution of the paper is providing a simple abstraction to the programmer for performing parallel data processing across hundreds of commodity machines without exposing the complexities involved. The authors have also implemented the design, and the paper shows it works well for various tasks within Google. Some interesting things in the paper include the detail about stragglers; I don't understand why this effect should kick in only towards the end of the operation. They also have a nice optimisation of skipping bad records on repeated failures, which is a reasonable approach in the presence of a very large data set. The strategy of scheduling workers on machines which are close to the input data is an interesting optimisation, and this is enabled by the homogeneous nature of the hardware.

The system is applicable to different types of statistics-mining tasks, indexing data, etc. However, it doesn't seem to be suitable for performance-critical tasks, as it is hard to put an upper bound on the time it would take to finish the operation because of the distributed nature of the computation. Since the map and reduce tasks could run on the same machines (as different processes), could the intermediate buffering of results in memory be exploited to avoid reading data from disk and transferring it using RPCs? It would be interesting to discuss how the scheduling of workers happens, whether the number of workers is dynamically adapted according to the load, or whether they are all started at the beginning.

Summarization:
This paper proposes a restricted programming model, MapReduce, and designs and implements a framework that is capable of parallelizing MapReduce tasks across a large number of machines. Fault tolerance is carefully incorporated into the framework so that neither failed machines nor bad records will stall progress. Locality is also considered for high performance. The results show that this framework can be applied to a variety of jobs successfully.

Problem Description:
The problem this paper tries to solve is building a distributed computation framework for a specific programming model: Map and Reduce. Although the concepts of map and reduce have been used in many programming languages, this paper observes the property that makes MapReduce an ideal parallel programming model: no data dependency exists within either of the two stages. The authors built the first such distributed framework and successfully tested a huge number of problems on it.

Contributions:
First, this paper proposes a very practical programming model which is intrinsically easy to parallelize. MapReduce divides the problem into two stages, where in each stage the sub-problems are completely independent of each other. This independence is achieved by dividing the data along two different dimensions in the two stages: in the mapping stage data is divided by "chunks", and in the reducing stage data is divided by "keys". It is exactly like a basis transformation, transforming the data from chunk space to key space. The paradigm itself is a successful abstraction, and the simple nature of the problem itself makes MapReduce scalable and tolerant to failure.

Second, the implementation of MapReduce takes full advantage of locality. Many of the workloads on the MapReduce framework are I/O bound, and without locality, MapReduce would not see a big performance gain. In their implementation, locality is leveraged in several different ways. First, the scheduler knows the exact location of each input file, so it can try to schedule the mapping task to the machine where the input file is stored. Second, the intermediate result is stored on local disk, and the reduce phase can read part of that data locally.

Applications/Limitations:
The programming model of MapReduce is successful because it can be applied in a variety of situations, from simple word-counting problems to complicated machine learning problems. In fact, as long as the problem has the property of "calculation on the list of values associated with a key", it can be done using MapReduce (a distributed-grep sketch below gives another instance).
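
Distributed grep, another example from the paper, fits the same mold; this is a hypothetical Python sketch in which the reduce step is simply the identity:

    import re

    def make_grep_map(pattern):
        # Map: emit every input line that matches the pattern.
        regex = re.compile(pattern)
        def map_fn(filename, line):
            if regex.search(line):
                yield (filename, line)
        return map_fn

    def grep_reduce(filename, matching_lines):
        # Reduce: identity, just pass the matches through.
        return (filename, matching_lines)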

MapReduce is a successful abstraction of many real-world applications, but it is not suitable for all environments. For example, while doing our course project, we found that MapReduce is more suitable for working with GFS than with other network file systems, like AFS. GFS naturally splits large files into chunks and stores them distributed across different machines, so locality can be leveraged. On the contrary, on AFS it is difficult for each node to read a portion of a large file without caching the complete file, so much of the bandwidth is wasted.

There are lots of papers comparing the performance of MapReduce with parallel DBMSs and pointing out the inferior performance of MapReduce. I want to say that this comparison is not fair. I believe the original goal of MapReduce is not to compete with these complex database systems, but to accomplish highly structured and time-consuming tasks, such as word frequency counting. We should not consider MapReduce an all-around miracle solution to all problems, but rather a highly specialized framework dealing with a single type of task.

Problem
In a distributed system, even a simple task requires a huge amount of code. Without proper middleware, each task has to invent its own way to distribute jobs, tolerate faults, and move data between nodes. There have been middleware frameworks, but they tended to be general frameworks rather than specialized programming models. Companies such as Google have tons of data with inherent parallelism. The problem addressed here is how to easily write programs that compute over and process this inherently parallel data.

Summary
The paper describes an implementation of map/reduce for distributed systems. The key idea is that a map/reduce job has inherent parallelism, so the input data can be partitioned in any way and processed in parallel. The runtime system takes care of all sorts of problems that arise in a distributed system, such as job scheduling, worker failure, and slow nodes. Also, the runtime system makes efficient use of data locality: rather than moving data around the data center, it assigns tasks to the nodes where the data already exists.

Contribution
While map/reduce severely restricts the freedom of programming, the paper proves that there are many use cases and that it can be extremely efficient. Together with the Google File System, the paper shows that the entire data processing pipeline (collect the data, invent algorithms that fit the map/reduce model, calculate) can be an easy job.
The biggest contribution, I think, is bringing a restricted model into distributed systems. Prior to map/reduce, I think most people focused on building general frameworks such as MPI and Condor, or distributed OSes such as Amoeba and Sprite. By sacrificing some freedom, map/reduce gains huge efficiency. Assigning map/reduce tasks to the data node is quite an interesting concept.
Other than that, almost every idea in this paper is actually borrowed from other papers. The paper just shows a perfect mix of them.

Applicability/Flaws
One of the flaws is that it only allows a single data flow. Unlike Dryad and Pig Latin, Google Map/Reduce is useful only if there is a single stage of map and reduce. If a job consists of multiple map/reduce stages, the only way to express it in map/reduce is to run each stage sequentially (a toy sketch of such chaining follows below).
The good thing is that many jobs can still be computed in a single stage. Log analysis and data mining, in particular, fit the map/reduce programming model perfectly.
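
A hypothetical sketch of such sequential chaining, reusing the toy run_mapreduce driver sketched in an earlier comment above (it is an assumption for illustration, not part of Google's library):

    def run_pipeline(inputs, stages):
        # Each stage is a (map_fn, reduce_fn) pair; the output list of one
        # stage becomes the input of the next, i.e. one MapReduce job per stage.
        data = inputs
        for map_fn, reduce_fn in stages:
            data = run_mapreduce(data, map_fn, reduce_fn)
        return data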

Map-Reduce is a programming model used to perform parallel tasks efficiently. Since we discussed the datacenter as a computer recently, this paper follows up with a simple method for performing many useful tasks. Sorting, machine learning, and data processing are very simply adapted to Map Reduce. This paper is from Google, so it is important to keep in mind that the implementation is suited to the infrastructure that Google has set up. That being said, as many people are mimicking Google these days, tasks that are suited for Google are also useful in many other places.

Especially with systems that use commodity hardware, it is a problem that each individual node cannot do much work on its own. Parallelizing tasks is usually a fairly difficult problem. By presenting a general solution (mapping and reducing) for making tasks run in parallel, all sorts of problems can be solved easily. The other problem that needs solving is performance under failure. As is shown, failure of everything except the master is tolerated under the Map-Reduce algorithm. Getting optimal results from a distributed algorithm is really a problem, but Map-Reduce just tries to break the problem into subproblems such that each one can be performed efficiently. The performance is measured, so how well the system performs is very well defined.

It is not compared against any other system that can perform similar tasks, so it is presumed to be as good as it gets for this type of task. Several optimizations are presented that seem reasonable, but I'm not sure I'm buying the "it works so it must be good" argument. Besides that, the paper presents some information on how the Map-Reduce model is well suited to the environment that is commonly used, and a brief analysis of the scalability helps some. This is one of the better papers we have read in that they don't leave any loose ends lying around, but to some extent I was left wondering what they added to this problem space.

Obviously this system has real-world applicability. Google uses it every day, and as shown by the pretty graph, more and more things are using it all the time. Pinning problems down into Map-Reduce is a very handy technique, and figuring out which tasks can be done this way is where the applicability comes in. As we move towards the cloud model, where compute cycles are a more nebulous idea that needs to be abstracted, the answers given by Map-Reduce will greatly help with solving the new problems. This paper does a good job of presenting the hows and whys of Map-Reduce, so all that remains is to see what will continue to happen in this area.

This paper describes a programming model which handles terabytes of data on a cluster of 2000 commodity machines and which, at the same time, is designed for programmers without distributed or parallel systems experience.

MapReduce tries to turn many common tasks over large amounts of data into fine-grained Map tasks and Reduce tasks. As performance is found to be highly influenced by stragglers, the MapReduce model needs to pay extra attention to load balance, fault tolerance, and so on. One benefit of MapReduce is that it can handle huge data sets using a cluster of commodity machines, so the requirements on hardware are low. The second benefit of MapReduce is that it hides the details of parallelization, fault tolerance, locality optimization, and load balancing from users, so the required experience with parallelism and distributed systems is also fairly low. The purpose of MapReduce is to make users' lives easier; however, this means the problems related to locality, load balance, fault tolerance, parallelization, and performance have to be handled by the MapReduce model itself.

MapReduce is cheaper than a traditional database management system. It partitions the data set into 16MB to 64MB chunks, which is a kind of divide-and-conquer idea and lowers the requirements on individual machines. I also think MapReduce has higher fault tolerance than a traditional database system. If a job fails, a traditional database system has to start over from the beginning or from the most recent checkpoint. With MapReduce, however, every task is small and the cost of restarting one task is small; moreover, if a machine fails after a Reduce task completes, that Reduce task does not need to be redone, since its result has already been stored in the global file system. MapReduce also benefits the indexing system. Thanks to the divide-and-conquer idea used here, the indexing code is simple, direct, and small, and maintenance of the indexing system becomes easier; each update takes much less time. The simpler and clearer structure of the indexing system leads to obvious improvements in operating and scaling it.

This paper talks directly and openly about the ideas of the MapReduce programming model and is easy to follow. It also offers enough examples to convince us that the MapReduce model suits many kinds of ordinary programs that are simple but need to handle large amounts of data. Even though most of the example functions are related to the web search applications used at Google, the example set covers text and number processing and is clear enough for me to understand the basic idea of MapReduce. So in general, I enjoyed reading this paper very much and would like to try some of its interesting ideas in our own MapReduce project.

Problem addressed:
=============
This paper describes the MapReduce programming paradigm, which allows users to easily make use of the parallelism available in a large cluster of machines without actually worrying about writing parallel code. It does, though, restrict the programming model to ensure that programs written following the restricted model can be automatically and easily parallelized.

Summary:
========
The basic programming paradigm of MapReduce requires the programmer to specify map and reduce tasks. The MapReduce infrastructure splits the user-given input file and invokes the map task over those chunks on several nodes simultaneously. The map function produces intermediate key-value pairs, and many reduce tasks are then invoked on those intermediate key-value pairs. The MapReduce infrastructure thus takes care of managing the parallelism and also provides fault tolerance by monitoring the workers executing map/reduce tasks. This leads to two primary benefits: 1) the programmer can take advantage of parallelism and a distributed execution framework without actually writing parallel code; 2) it clearly segregates the code implementing the actual functionality from the code providing parallelism and fault tolerance, which makes it better software engineering practice. It provides fault tolerance by passively monitoring the health of the workers through time-outs and reassigning the tasks of failed nodes to other healthy nodes (a toy sketch of this follows). Also, it gets around the problem of a few slow tasks impairing the performance of the whole job by starting backup/redundant tasks when it detects such possibilities.
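
A toy sketch of that timeout-based re-execution; the timeout value and the task-dictionary layout are assumptions for illustration, not the paper's actual data structures:

    import time

    PING_TIMEOUT = 60  # seconds without a ping reply before a worker is presumed dead (assumed)

    def reschedule_lost_tasks(tasks, now=None):
        # Any in-progress task whose worker has stopped answering pings is
        # reset to idle so it can be re-executed on another healthy worker.
        # (The real master also re-runs completed map tasks of a dead worker,
        # since their output lives only on that worker's local disk.)
        now = time.time() if now is None else now
        for task in tasks:
            if task["state"] == "in_progress" and now - task["last_ping"] > PING_TIMEOUT:
                task["state"] = "idle"
                task["worker"] = None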


Relevance/Concern:
===============
Certainly, this work has important practical implications, given that many major tasks inside Google use this framework. The biggest promise of this work is allowing users to harness the power of large clusters of parallel machines without going through the pain of writing correct and efficient parallel programs. One more advantage of using such a common framework to parallelize many jobs is that any improvement in the framework helps all the applications using it. But the obvious downside of this paradigm is that it restricts which programs can leverage this model. Another concern I have is that if the amount of work required to process a given key/value pair is very data-dependent, i.e., the amount of work depends on what data is being processed, then there is a chance of unbalanced work distribution impacting overall performance.

Goal:
MapReduce uses a functional-style programming model to express computation. This method allows programs to be parallelized and scheduled easily on a large cluster of machines. The scheduling algorithm is locality-aware to minimize network traffic, and the system is also fault-tolerant, re-scheduling failed tasks.

Problem:
It is very hard to develop a highly parallel program, especially in a distributed system, because of the likelihood of failure. Thus, if these problems can be tackled separately, or if there is a platform that allows the computation to be expressed in a more natural way, then application complexity can be reduced dramatically.

Contribution:
There are two major contributions in the MapReduce work. Firstly, the restricted parallel programming model, similar to a functional language, allows a certain class of applications to be expressed in a concise manner. Input data is partitioned and processed via the map function, and then the reduction stage (the reduce tasks) computes the final output. This method also allows applications to scale really well without changing the algorithm, even when running on a large number of machines.

Secondly, the master-worker style scheduling is fault-tolerant and optimized for efficiency. In terms of fault tolerance, the master monitors the status of its workers and re-submits tasks if a node fails. It relies on GFS to utilize local disks in a reliable manner. It also speculatively re-submits pending tasks near the end of a job to improve overall performance. In terms of efficiency, it uses a locality-aware algorithm to assign map tasks to nodes that are closest to the data, minimizing network consumption.

Applicability:
Although the MapReduce model of parallel programming is limited to a certain class of algorithms, it is flexible enough to support many kinds of applications, even in the scientific community (via Hadoop). This also expands the options for parallel programming on large clusters of machines, which used to be limited to just MPI. On the distributed computing side, MapReduce shows how a well-designed framework can hide the distribution, heterogeneity, and failures of a distributed system so that application programmers can easily exploit the power of distributed computing.

Summary:
This paper describes how Google utilizes a large number of cheap machines to solve its daily problems. The problems that can be solved by MapReduce share a common characteristic: they can be divided into several loosely coupled sub-problems, each of which has no dependency on the others. Much like in functional programming, MapReduce users only need to implement two interfaces -- Map & Reduce. The Google infrastructure does all the other work (e.g. partitioning, dispatching sub-tasks, combining intermediate results).

Problem:
With processors becoming faster and faster relative to memory bandwidth, the performance question has shifted to how to better parallelize tasks. What Google tries to solve with MapReduce is using many cheap machines to complete work that used to be done by super-powerful but much more expensive computers.

Contribution:
Google MapReduce borrows ideas from the functional programming paradigm to parallelize tasks on a computer cluster of thousands of machines. Their main contributions are as follows:
- Fault tolerance: in Google MapReduce, the master can re-execute a failed task on the same machine or switch it to another machine.
- Locality: the master divides up tasks based on the location of the data; it tries to place map() tasks on the same machine as the physical file data, or at least on the same rack.
- Stragglers: it speeds up the map phase by redundantly executing "slow-moving" map tasks and using the results of whichever copy finishes first.

Questions:
How does Google MapReduce partition the data? It seems to depend on what kind of application is running on MapReduce. The partitioning for matrix multiplication should definitely differ from that for word counting.

Summary:
Writing parallel programs and debugging them is a difficult job because the application needs to take care of reliability, scalability, and load balancing. This paper suggests that by using functional-style programming, tasks can be automatically parallelized, and it describes a framework that takes care of reliability, scalability, and load balancing. The user of the framework has to write a map function that takes the original key/value pair as input and produces intermediate key/value pairs. The reduce function processes these intermediate key/value pairs and produces the final result. The master node splits the data to be processed and assigns the partitions to different nodes in the cluster. The intermediate results are stored locally, and the master keeps track of the assigned nodes and passes that information to the nodes doing the reduce operation. The output of reduce is stored on a global file system. To alleviate the problem of stragglers, the system creates backup tasks when the operation is close to completion. The framework also provides certain refinements to basic map reduce to make debugging and programming easier, like skipping bad records or providing status information (a toy sketch of the bad-record skipping idea follows).
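
A toy sketch of the bad-record skipping idea; the threshold and the helper names are assumptions for illustration:

    from collections import Counter

    MAX_FAILURES = 2            # assumed threshold before a record is skipped
    crash_counts = Counter()    # record id -> crashes seen (tracked by the master in the real system)

    def map_record_safely(map_fn, record_id, key, value):
        # If this record has already crashed the map function repeatedly,
        # skip it rather than letting it fail the whole job.
        if crash_counts[record_id] >= MAX_FAILURES:
            return []                        # deliberately skip the bad record
        try:
            return list(map_fn(key, value))
        except Exception:
            crash_counts[record_id] += 1     # a real worker would report this to the master
            raise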

The authors evaluate the system with sort and grep workloads. The only thing I can conclude from the results is that backup tasks are helpful.

Contributions:
Provides a flexible framework for map/reduce: features such as letting users specify the partitioning function, and refinements such as counters to keep track of progress and combiners to reduce network traffic, are pretty helpful.

Helps in writing easily parallelizable code by taking care of fault tolerance and scalability. Code becomes much easier to maintain because the application code has fewer lines, as the framework takes care of restarting in case of failures, scaling, and load balancing/data distribution.

Relevance:
As the authors point out, map reduce is applicable to only a subset of problems. I am surprised that they use a single master and restart all tasks from the beginning in case of master failure. Also, the paper does not talk about the selection of reduce hosts. It might be efficient to pick reduce nodes from the set of map nodes (or ones close to them) because this would save some traffic. Also, although the framework abstracts out the distributed aspects of the application, the user still needs to know that it is a parallel application and be able to separate out the map part and the reduce part. Embedding the parallel part in the language itself, as DryadLINQ shows, might be more beneficial because the parallelization is then completely taken care of by the framework. We would, however, be adding restrictions on the language to be used.

Problem Description:
How can we automate the processing of huge amounts of data in parallel, managing all the overheads involved in distributed parallel processing, like node failures and dynamic scheduling, without requiring the user to make these decisions? MapReduce tries to attack this problem by providing the user with an easy-to-use library and framework.

Summary:
MapReduce is a programming model for processing huge amounts of data. In this model, there is a map function that maps the input key/value pairs to intermediate key/value pairs and a reduce function that generates the final output from the intermediate key/value pairs. A master coordinates the parallelization of jobs among several thousand workers and maintains the state of the jobs and the workers. MapReduce is capable of tolerating large-scale worker failures; in case of failure, the task is re-executed by a different worker. MapReduce also tries to use the locality principle when scheduling so as to avoid high network bandwidth consumption, and it uses backup executions to speed up the process. Other than the main functions, it also offers several other useful features like a partitioning function interface, a combiner function interface, status information, a counter interface, etc.

Contributions
1) The parallel computing framework that MapReduce provides, so that the user need not worry about parallel computing overheads.
2) Nice techniques like backup execution, use of locality to reduce network bandwidth, the ability to tolerate large-scale node failures, etc.

Comments and applicability:
The paper was a really easy and interesting one to read. Since the user needs very little knowledge about parallel computing, this can be highly beneficial. Because of its flexibility and ability to handle multiple kinds of tasks, the framework should be in good demand.

Summary:
This paper presents an overview of the Google MapReduce programming model, explaining the various optimizations done in its implementation at Google and presenting a coarse-grained evaluation on some simple workloads.

Problem:
How can you easily write a program that can be run on thousands of nodes? Can the system take care of details like reliability for you? If so, what kind of programs can be run this way?

Contributions:
Even though the paper lacked many details, I still found it an interesting read. And we don't remember the details anyway :)

The major contributions would be:
1. The MapReduce programming model, and how many applications can be remolded as MapReduce applications.

2. The performance optimizations done: the straggler problem and the skipping of bad records were interesting.

3. The performance evaluation, which shows that using commodity machines you can get close to, or beat, top industry benchmarks for apps like sorting.

Comments and Questions:
It was really fun to read the DeWitt column on why MapReduce is a "giant" step backward. It was even more fun to read the comments section.

I guess MapReduce demonstrates that it might be better to forget about already done research and start fresh when working on a problem.

However, MapReduce has only been in use for some 5 years. The first databases would have been pretty light screwdrivers too. I wonder if, 5 years from now, MapReduce will be carrying along complexities and extra code which will defeat (at least for me) the lightweight, keep-things-simple philosophy with which it was started.

Is some sort of reliability built into the master node? Maybe it is a collection of nodes pretending to be one node. The paper mentions that the master node failing is unlikely, but why would that be so?

It would also be nice if we could discuss some guesses on how MapReduce would have been implemented internally. From what I remember of the key-value store project, things can get extremely hairy when you do things like replication and fault tolerance yourself. For example, the TCP REJECT message was screwing things up when we tried it. Did Google just get around TCP complexities by using UDP?

Problem :-
The paper addresses the problem of creating a simple programming model for programs that use really large datasets, are inherently parallel in nature, and need to be executed across a large number of machines simultaneously. Writing such parallel programs manually for doing simple tasks is painful because the programmer has to take care of issues such as machine failures and distributing the data and load equally among different nodes.

Summary :-
The paper presents Google's MapReduce programming model for writing parallel programs to process large datasets. MapReduce takes care of the underlying parallelization issues, making it easy for the programmer to write programs that process massive amounts of data. The paper describes the architecture of the MapReduce model: multiple workers performing the Map and Reduce tasks, and a master that assigns the Map/Reduce tasks to the other workers. It also talks about other implementation details of the MapReduce model, such as handling node failures, optimizing network usage through locality, and running the same Map or Reduce task on multiple machines to mitigate the effects of slow nodes.

Contributions :-
The main contribution of the paper is providing a simple interface for writing parallel programs while abstracting the underlying details from the programmer. This makes it easy for the programmer to write and run parallel programs on huge clusters. The paper also discusses a variety of techniques used in MapReduce to make it robust and efficient. It presents a fault tolerance mechanism to handle failures while running a program on thousands of machines and to ensure progress. The MapReduce master uses locality information to conserve network bandwidth by ensuring that most data gets read locally. It also provides load-balancing techniques and a mechanism to deal with "straggler" nodes. Overall, it provides a really useful paradigm for programmers of distributed systems to write simple programs for a large number of problems.

Other Comments :-
MapReduce started a flurry of activity in this area, and many other systems followed MapReduce in providing a simple interface for writing programs that run on large clusters. Compared to systems such as distributed/parallel databases, MapReduce is comparatively lightweight and is a better choice when the problem is relatively simple. But for complicated operations such as joins, the onus is on the programmer to implement the logic (see the sketch below), and he may find it easier to use a more powerful interface, in this case a parallel database.
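To make that concrete, here is a minimal, hypothetical sketch of how a reduce-side equi-join might be expressed in map/reduce terms; the record format, the "users"/"orders" table tags, and the function names are my own illustration, not anything from the paper.

```python
# Hypothetical reduce-side join sketch; record formats and table tags are illustrative.

def join_map(record):
    """Tag each record with its source table so the reducer can tell the sides apart."""
    table, key, value = record              # e.g. ("users", "u42", "Alice")
    yield key, (table, value)

def join_reduce(key, tagged_values):
    """Emit the cross product of matching rows from the two (assumed) tables."""
    left = [v for (t, v) in tagged_values if t == "users"]
    right = [v for (t, v) in tagged_values if t == "orders"]
    for l in left:
        for r in right:
            yield key, (l, r)
```

Even in this tiny form, the join logic (tagging, buffering one side, producing the cross product) falls entirely on the programmer, which is exactly the point being made above.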

**MAP REDUCE**
==============

*Summary*
---------
This paper discusses a new programming model specialized for processing large data sets in distributed systems with thousands of nodes. This new abstraction keeps the programmer away from the messy details of the underlying complex distributed system, leaving them to the runtime library.

*Problem*
--------
Writing a parallel/distributed program is hard due to the nature of distributed systems and unpredictable failures. In general, a parallel program needs to deal with a lot of problems such as data distribution, failure handling, load balancing, and so on. This paper introduces a new programming paradigm that hides that complexity from the programmer, which is a nice idea.

*Contribution*
-----------
I think there are two major contributions in this paper. The first is the new programming abstraction for expressing simple computations. The programmer does not have to pay attention to the physical details of the underlying system, such as how many nodes there are, what needs to be done in case of failure, or how to distribute the data. All he needs to do is specify a *map* and a *reduce* function, specify the input data, and let the library/runtime system do the magic (cool!).
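As a rough illustration of how little the programmer writes, here is a minimal word-count sketch in that style, with a toy single-process driver standing in for the runtime; the function names and the driver are my own, not the library's actual API.

```python
from collections import defaultdict

def word_count_map(_, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    """Reduce: sum the partial counts for a word."""
    yield word, sum(counts)

def run_locally(lines):
    """Toy driver: group intermediate pairs by key, then apply reduce per key."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in word_count_map(offset, line):
            groups[key].append(value)
    return {key: count
            for key, values in sorted(groups.items())
            for _, count in word_count_reduce(key, values)}

print(run_locally(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In the real system the grouping and sorting happen across thousands of machines, but the two user-written functions stay exactly this small.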

The second contribution of this paper is the runtime system which implements the new abstraction. The runtime system leverages GFS to store data and deals with failures through a set of simple but useful mechanisms. For example, pinging is used to detect worker failures, and checkpointing is used to recover in case of master failure. Locality is also considered when scheduling tasks (i.e., preferring to move the task to the data).
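The paper only sketches these mechanisms, so here is a toy, assumption-heavy illustration of ping-based worker failure detection; the timeout value, data structures, and re-enqueue policy are mine, not Google's.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping before a worker is presumed dead (assumed value)

class Master:
    def __init__(self):
        self.last_ping = {}     # worker_id -> time of last heartbeat
        self.assignments = {}   # worker_id -> set of in-progress task ids
        self.pending = []       # tasks waiting to be (re)scheduled

    def heartbeat(self, worker_id):
        """Workers ping periodically; record the time of the latest ping."""
        self.last_ping[worker_id] = time.time()

    def reap_failed_workers(self):
        """Presume silent workers dead and put their in-progress tasks back on the queue."""
        now = time.time()
        for worker_id, last in list(self.last_ping.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                self.pending.extend(self.assignments.pop(worker_id, set()))
                del self.last_ping[worker_id]
```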

*Flaw/Comment/Question*
---------------------
I feel the MapReduce model fits a certain kind of problem, but it is not a universal solution to every problem. And to me, it is somewhat weird that the workloads used for evaluation are pretty simple: just sort or merge. Perhaps more sophisticated/complex workloads should be used to demonstrate MapReduce's applicability.

Also, it is unclear to me how the system locates stragglers and schedules backup tasks, and how the scheduler makes its scheduling decisions. It seems that locality is the factor driving scheduling decisions, but wouldn't other information such as CPU, memory, and disk capacity also be useful?

The paper describes MapReduce, which is a programming model allowing developers to write programs that can be executed in a massively parallel environment without having to worry about any of the details. It requires that the program can be written in the form of two functions: a map function, and a reduce function (hence the name).

The problem that the paper addresses is that it is hard to write correct and efficient programs for clusters and very large datasets. In addition, although the specific problem may be different in each case, the majority of the code does not need to be (for example, the parts for assigning tasks and checking to see whether they have completed). Requiring developers to rewrite the code for the gather and scatter is a waste of time and likely to introduce bugs. The problems that MapReduce is designed to handle are of a specific form that allows them to be easily partitioned.

The contribution of MapReduce is the removal of the irrelevant details from the problem the developer needs to solve. With it, all that needs to be done is to formulate the problem in two parts: a map function, which takes a key/value pair as input and outputs intermediate key/value pairs, and a reduce function, which takes these intermediate pairs and merges them to obtain the final result. MapReduce also sorts the values by intermediate key. Many problems can be represented this way, allowing programming to be much easier and faster (since all that is needed is these two functions). The paper also has the idea of starting redundant copies of tasks that are slow to complete near the end of the computation. Normally, the map and reduce tasks are split up among the available machines based on locality and availability, and only reassigned if the first try fails. However, if a task is taking a very long time (like if you disable the processor cache on some machines...), it can significantly hurt the overall time to completion. To get around this, MapReduce will start extra copies of straggling tasks, to prevent slow machines from delaying the completion.
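As a rough illustration of that backup-task idea (not the actual implementation), the master might do something like the following once a phase is nearly complete; the completion threshold, the task representation, and the helper name are assumptions of mine.

```python
# Toy sketch of speculative backup tasks: near the end of a phase, re-issue
# still-running tasks to idle workers and keep whichever copy finishes first.

def maybe_schedule_backups(tasks, idle_workers, completed_fraction, threshold=0.95):
    """Once most tasks are done, clone the remaining in-progress ones onto idle workers."""
    backups = []
    if completed_fraction < threshold:
        return backups
    in_progress = [t for t in tasks if t["state"] == "in_progress"]
    for task, worker in zip(in_progress, idle_workers):
        backups.append({"task_id": task["task_id"], "worker": worker, "backup": True})
    return backups

# A task counts as complete when either its original or its backup reports first;
# the master simply ignores whichever copy finishes later.
```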

MapReduce is obviously very applicable to certain classes of problems, like some sorts of data analysis of very large datasets. However, if you cannot coerce the problem you are trying to solve into a map and a reduce piece, you would not be able to apply MapReduce at all. So, although it is useful for some things, it is not flexible enough to be a solution to every problem.

Summary:
This paper describes Google’s MapReduce programming model for distributed systems. MapReduce is used internally by Google to accomplish a wide variety of data processing tasks. MapReduce combines the “map” and “reduce” primitives that exist in many functional languages to create a versatile tool for data-parallel processing.

Problem:
Distributed programming is hard. Writing an application that can scale to multiple machines with good load-balancing, reliability, synchronization, and throughput is very difficult even for seemingly trivial applications. MapReduce provides a simple set of programming hooks that programmers can use to write data-parallel applications that are executed efficiently on clusters of machines. MapReduce takes care of the data partitioning, scheduling, synchronization, and fault tolerance issues all in one fell swoop, leaving the programmer to only worry about the data processing element of the problem.

Contributions:
The paper describes MapReduce, a highly effective programming model for data processing. The paper offers a variety of uses, and the evaluation demonstrates high throughput, good load balance, and excellent fault tolerance. MapReduce is also highly programmable. The popularity within Google demonstrates its ability to assist with the challenges of writing distributed applications.

Applicability:
Few distributed programming frameworks have been as influential as MapReduce. The key/value data store is pervasive in data centers, and MapReduce has proven to be a highly versatile tool for processing and manipulating key/value data, making it a natural fit for data center workloads. From my own experience, however, I can say that MapReduce only really excels at data processing and data transformation tasks. It is a poor fit for virtually all other tasks. Therefore, I don't expect MapReduce to ever become a "general" programming framework, even with future extensions. It is, and will always be, a data transformation tool first and foremost.

This paper presents MapReduce, a programming model for processing large data sets in a highly parallel environment by executing over the thousands of nodes available in a cluster. The basic idea is somewhat similar to the divide-and-conquer approach: the processing of a large data set is divided into smaller chunks that are carried out in parallel.

The reduce operation has to wait until all the map operations have completed their work, which implies that reduce waits for the slowest map operation. Backup tasks are used to alleviate this problem by running a map operation on a different node over the same data. The framework also provides fine-grained fault tolerance; writing the map operation's intermediate results to disk is a direct implication of this decision.

A few challenges exist, such as the possibility of input parsing taking a considerable amount of time, ensuring locality to avoid or minimize network transfer latency (through a scheduler that is aware of the locality of nodes), choosing an appropriate hash function to partition equally among the R partitions, and reading only the input that is needed rather than scanning the entire file.

Contributions
Extensibility and flexibility - the ability to accept any type of input. If the reader interface can be further extended, different types/formats of data can be read and fed to a single job (similar to the LINQ APIs). The hash function used for partitioning can be changed to suit the data and avoid skew (see the sketch after this list).
Ease of development - complexities like load balancing and failure handling are hidden; only the application logic needs to be written by the developer.
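For the partitioning point above, here is a minimal sketch of a default hash partitioner next to a user-supplied one; the hostname variant mirrors the paper's example of keeping all URLs from one host in the same output partition, while the function names (and the use of Python's built-in hash, which a real system would replace with a stable hash) are my own.

```python
def default_partition(key, R):
    """Default: spread intermediate keys roughly uniformly across R reduce tasks."""
    return hash(key) % R

def host_partition(url, R):
    """Custom partition: keep all URLs from one host in the same reduce task."""
    host = url.split("/")[2] if "://" in url else url
    return hash(host) % R
```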

Question
Is it possible to increase locality by scheduling the map and reduce operations on the same node? (Given that map operations write their intermediate files to disk, the data might still be available in memory.)

Summary:
If only I had a nickel for every time I wrote a review of the MapReduce paper. This paper describes the MapReduce programming model using some examples and discusses one possible MapReduce implementation. Jeff Dean also includes some use cases of MapReduce at Google and some performance measurements. Given Google's secretiveness, I sincerely doubt that either the performance measurements or even the use cases reflect Google's reality. Nevertheless, the paper is valuable for its discussion of the programming model.

Problem:
The major problem that underlies the need for MapReduce is the need to perform tasks on a very large data set. Fortunately, the data set processing is often amenable to parallelization through partitioning: chunks of data can be processed in parallel and the results then combined. The classic example given in the paper is the problem of counting the number of occurrences of each unique word in a large dataset. There are other examples, most of them related to search engine functions.
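One of those search-engine examples, building an inverted index, fits the model just as naturally; the following is a minimal sketch under my own naming, not code from the paper.

```python
def index_map(doc_id, text):
    """Map: emit (word, doc_id) for each distinct word in a document."""
    for word in set(text.split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Reduce: collect the sorted list of documents containing the word."""
    yield word, sorted(set(doc_ids))
```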

Contribution:
The particular contribution of this paper is not the introduction of MapReduce; many would argue that Google did not invent MapReduce. However, Google has clearly done a tremendous job of refining MapReduce to suit the particular needs of the company, which goes to show that there is value in refining previously discovered technologies. This paper in particular discusses a potential implementation of MapReduce suited to a large cluster of commodity PCs. Most of the details were very vague, but I found the problem of stragglers to be interesting. I found this paper hard to read simply because it is really a listing of features that Jeff Dean decided were OK to disclose. Many details are missing. For example, on locality, Jeff Dean says that they try to schedule tasks based on data affinity, but how they do it is not discussed. Practically all the points brought up are similar.

Flaws/Comments:
Many would argue that the value of MapReduce is its simplicity. One could also argue that the simplicity comes from the nature of the data being processed. Google recognized its data as such and was able to take advantage of the imprecision tolerance offered by the application, which is web search.

Summary:
This paper describes the Google MapReduce implementation and abstraction, the workhorse behind much of Google's giant-scale data processing. Inspired by the map and reduce idioms present in many functional languages, MapReduce extends this model to work on huge clusters of unreliable commodity machines.

Contributions:
The throughput provided by MapReduce is very impressive. Normally one only sees 30GB/s of sustained processing from disk in highly specialized, high-performance systems. Google's MapReduce benchmark uses 2000 commodity machines with a relatively low-performance (very cost-effective) network infrastructure. Even in the 1TB sort benchmark, which uses a simple implementation dependent on the properties of MapReduce itself, Google's implementation is competitive with the best current specialized TeraSort implementation.

MapReduce also leverages many properties of its functional programming heritage for fault tolerance and optimization. The possibility of using an essentially unmodified reduce function to implement an intermediate combine function is easily shown to be correct using MapReduce's functional definition. Atomic commits also guarantee a lack of side effects, making fault tolerance (at the worker level) as simple as rescheduling blocks assigned to unresponsive workers onto different nodes.
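A tiny worked example of that combiner point, using word-count sums and my own function name: because addition is associative and commutative, running the same reduce locally on each map worker's output and then again after the shuffle gives the same answer as reducing the raw pairs directly.

```python
def count_reduce(word, counts):
    """The same summing reduce, reused as both combiner and final reducer."""
    yield word, sum(counts)

# Combiner runs on each map worker's local output before the shuffle:
local_a = list(count_reduce("the", [1, 1, 1]))   # [("the", 3)]
local_b = list(count_reduce("the", [1, 1]))      # [("the", 2)]

# The final reducer sees the pre-combined partial sums and produces the same total:
final = list(count_reduce("the", [v for _, v in local_a + local_b]))
print(final)  # [('the', 5)]
```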

Finally, MapReduce is extremely easy to reason about. The ordering guarantee (stating that the result of applying deterministic map and reduce operators is itself deterministic, even in the presence of failures) frees programmers from worrying about timing and failure in distributed systems.

Questions/Comments:
For very large MapReduce tasks, it seems that the single master model could become a bottleneck, particularly if M is very large. Does the CPU, memory or network become a limiting factor at the master first?
