Suppose someone hands you a copy of the entire web — every page ever crawled, tens of billions of documents, hundreds of terabytes of text — and asks a childishly simple question: how many times does each word appear? The counting itself is trivial; a first-year could write the loop. The trouble is that the data will not fit on one machine. It will not fit in one machine's memory, it will barely fit on one machine's disks, and reading it end to end on a single computer would take days. The algorithm is easy. The scale is the whole problem.
The obvious escape is to buy a thousand machines and split the work. But now you own a new set of headaches: which machine gets which slice of the data? How do the partial answers get combined? What happens when — not if — one of those thousand machines dies halfway through? Coordinating a cluster by hand is miserable, error-prone work, and you would have to redo it for every new question.
MapReduce is a programming model that makes this whole class of problem routine. You write two small, ordinary functions — a map and a reduce — and a framework takes care of everything hard: splitting the data across the cluster, scheduling the work, moving intermediate results around the network, and quietly recovering from the machines that fail along the way. This page is about that model: what map and reduce actually do, why the shape of the computation lets it scale to thousands of machines, and how re-running work makes the whole thing shrug off hardware failure.
A MapReduce job flows through three stages. The programmer writes only the first and the last; the framework owns the one in the middle.
map function runs on each shard independently. For every input
record, map emits zero or more key → value pairs. Because no map
task needs to talk to any other, this phase is embarrassingly parallel: a thousand
shards can be processed by a thousand workers at the same time, with no coordination.
"the", wherever in the cluster
they were produced, end up together. This is the one phase that moves data across the
network, and it is where the real distributed-systems machinery lives.
reduce function receives the key and the
whole collection of its values, and combines them into a final result. Different keys are
independent, so reduces run in parallel too, one bucket at a time.
The canonical example — the "hello world" of the whole model — is word count. Watch a single line of text flow through all three phases:
"the cat sat on the mat" and emits one pair per
word, each with a count of one:
That is the entire program. You wrote "emit
Nothing about the model requires a cluster — a cluster is just an optimisation for size.
The code below is a complete, single-machine MapReduce for word count: a map that emits
shuffle that groups those pairs by key,
and a reduce that sums each group. The driver runs map over several lines
(pretend each line is a shard on its own machine), shuffles, then reduces. Press Run ▶
and read the phases in the output:
Change the sentences, add a shard, press Run again. The same two user functions handle it — that reusability, across every dataset and every counting-shaped question, is exactly the point.
Two design choices give MapReduce its power, and they are worth stating precisely.
Move the computation to the data. The input shards already live spread across the
cluster's disks. Rather than haul terabytes over the network to some central CPU, the scheduler ships
your tiny map function to the machine that already holds the shard and runs it
there — this is data locality. Code is kilobytes; data is terabytes. Sending the
small thing to the big thing, instead of the reverse, is what keeps the network from becoming the
bottleneck, and it is why map throughput grows almost linearly as you add machines.
Fault tolerance by simply re-running. Across a thousand machines running for hours, something will break — a disk, a power supply, a whole rack. MapReduce's answer is beautifully blunt: if a task's machine dies, the framework just runs that task again on another machine, from the original input shard. It can do this because map and reduce are required to be deterministic and side-effect-free — the same input always yields the same output, and running a task twice does no harm because it touches nothing but its own inputs and outputs. Re-running a failed task gives exactly the answer the dead task would have produced, so a handful of failures cost a little time, not correctness. The learner never writes a line of recovery code; determinism buys it for free.
Near the end of a job, a single sluggish machine — a "straggler", limping along with a failing disk or an overloaded CPU — can hold up the entire result while thousands of others sit idle. Because tasks are deterministic and side-effect-free, the framework can play a neat trick: speculative execution. It launches a duplicate copy of the straggling task on a fresh machine, and whichever copy finishes first wins; the other is simply discarded. The same property that lets MapReduce re-run failed tasks also lets it race slow ones — tail latency, tamed by redundancy.
If the shape of MapReduce feels familiar, it should. It is exactly the pattern of
map (the sub-problems don't talk to each other); and you
combine the partial results with reduce. Merge sort splits an array,
sorts the halves, and merges; MapReduce splits the web, maps the shards, and reduces the groups. Same
idea, different scale — one runs in a call stack, the other across ten thousand machines.
The reduce step, seen from a functional angle, is precisely a fold
(an aggregate): it collapses a list of values down to one, the way
[1, 1, 1, 1] folds under addition to 4. And that framing exposes a lovely
optimisation. If the reduce operation is associative and commutative — like sum,
max, or count, where the order and grouping of the combining doesn't change the answer — then a
combiner can run a "mini-reduce" on each map machine, before the shuffle.
Instead of shipping a thousand separate
This model is a natural fit for a whole family of computations; it is one point in the wider space of
Two tempting assumptions, both wrong. First: MapReduce is a batch model — it
grinds through a whole dataset and produces a whole result, taking minutes to hours. It is
not built for low-latency, interactive queries; asking it "what's the count of
cat right now?" and expecting a millisecond answer is a category error. Use it to
build an index overnight, not to serve one live.
Second: your reduce function must not assume the values arrive in any
particular order. The shuffle gathers values for a key from machines all over the cluster,
in whatever order they happen to turn up — so a reducer that expects, say, timestamps in ascending
order is silently broken. This is also why a good reducer is associative and commutative: it must
give the right answer no matter how its inputs are ordered or grouped. "Any function can be a
reducer" is false — a combiner-safe reducer needs those algebraic properties.
MapReduce was introduced by Google in a 2004 paper, describing the system it used internally to rebuild its search index over the whole web. The idea was too good to keep proprietary: an open-source clone, Hadoop, put MapReduce (and its companion distributed file system) into everyone's hands and effectively launched the "big data" era. Hadoop's weakness — it wrote every intermediate result to disk — was answered by Apache Spark, which keeps data in memory between steps and generalises map/reduce into a richer set of operations, running the same jobs orders of magnitude faster. The programming model you learned on this page is the through line connecting all three.