June 18, 2021
Here is the paper.
MapReduce is a programming model and an associated implementation for processing and generating large
data sets. Users specify a map function that processes a
key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key. Many
real world tasks are expressible in this model, as shown
in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the
details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any
experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large
cluster of commodity machines and is highly scalable:
a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers
find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters
every day.
My highlights:
- a large cluster of commodity machines
- who is taking care of the details of partitioning the input data, schedule the program's excution across a set of machines, ...
- Google implementation of MapReduce - ?
Over the past five years, the authors and many others at
Google have implemented hundreds of special-purpose
computations that process large amounts of raw data,
such as crawled documents, web request logs, etc., to
compute various kinds of derived data, such as inverted
indices, various representations of the graph structure
of web documents, summaries of the number of pages
crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually
large and the computations have to be distributed across
hundreds or thousands of machines in order to finish in
a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle
failures conspire to obscure the original simple computation with large amounts of complex code to deal with
these issues.
As a reaction to this complexity, we designed a new
abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution
and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp
and many other functional languages. We realized that
most of our computations involved applying a map operation to each logical “record” in our input in order to
compute a set of intermediate key/value pairs, and then
applying a reduce operation to all the values that shared
the same key, in order to combine the derived data appropriately. Our use of a functional model with userspecified map and reduce operations allows us to parallelize large computations easily and to use re-execution
as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and
powerful interface that enables automatic parallelization
and distribution of large-scale computations, combined
with an implementation of this interface that achieves
high performance on large clusters of commodity PCs.
Section 2 describes the basic programming model and
gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards
our cluster-based computing environment. Section 4 describes several refinements of the programming model
that we have found useful. Section 5 has performance
measurements of our implementation for a variety of
tasks. Section 6 explores the use of MapReduce within
Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.
My notes - challenges:
- Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time.
2 Programming Model
The computation takes a set of input key/value pairs, and
produces a set of output key/value pairs. The user of
the MapReduce library expresses the computation as two
functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them
to the Reduce function.
The Reduce function, also written by the user, accepts
an intermediate key I and a set of values for that key. It
merges together these values to form a possibly smaller
set of values. Typically just zero or one output value is
produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too
large to fit in memory.
2.1 Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
The map function emits each word plus an associated
count of occurrences (just ‘1’ in this simple example).
The reduce function sums together all counts emitted
for a particular word.
In addition, the user writes code to fill in a mapreduce
specification object with the names of the input and output files, and optional tuning parameters. The user then
invokes the MapReduce function, passing it the specification object. The user’s code is linked together with the
MapReduce library (implemented in C++). Appendix A
contains the full program text for this example.
Actionable Items
It takes hours to finish reading the paper. I have to get used to read a lot of articles more often.
No comments:
Post a Comment