Parallel Programming Models

Picture a busy restaurant kitchen the night before it opens. There are two completely different ways to get a mountain of work done fast. In the first, the head chef hands ten cooks an identical job: "each of you take a pile of onions and chop it, exactly the same way." Ten piles shrink at once — the same operation applied across many pieces of data. In the second, the chef gives each cook a different job: one chops, one stirs the stock, one plates the starters, one washes up — different tasks running at the same time.

Those two pictures are the two oldest ideas in parallel computing — data parallelism and task parallelism — and almost every way we structure parallel work is a variation or combination of them. This page is a tour of these models: the high-level shapes we pour concurrent work into, sitting a level above the threads and processes that actually run it. We will meet data vs. task parallelism, two ways processes can talk (shared memory vs. message passing), the actor model, the fork–join pattern, and the law that spoils the party — Amdahl's.

Data parallelism vs. task parallelism

Data parallelism is the ten-cooks-chopping-onions model. You have one operation and a large collection of data; you apply the operation to every element, and the elements do not depend on each other, so they can all be done at once. Doubling a million numbers, brightening every pixel in an image, running the same physics update on a million particles — all data-parallel. It shows up as map, as a parallel-for loop, as SIMD (Single Instruction, Multiple Data — one instruction driving many arithmetic lanes), and most spectacularly as a GPU kernel, where thousands of tiny cores each run the identical program on a different slice of the data.

// DATA PARALLEL: one operation ("double it") applied across every element. // Every lane is independent, so a runtime can spread them over many cores. const data = [3, 1, 4, 1, 5, 9, 2, 6]; const doubled = data.map((x) => x * 2); // conceptually all at once

Task parallelism is the other kitchen: different jobs, run concurrently. A web server that, for one request, kicks off "query the database", "call the payment API" and "render the template" at the same time is task-parallel. So is a thread pool chewing through a queue of heterogeneous jobs. Here the parallelism comes not from splitting data but from the fact that the tasks themselves are independent of one another.

The two are not rivals — real programs mix them. A video encoder might run task-parallel stages (read, transform, compress, write) where the transform stage is itself data-parallel across every pixel block. The question a model answers is: where does the independence live — in the data, or in the work?

How do the workers talk? Shared memory vs. message passing

Once several workers run at once, they usually need to coordinate. There are two fundamentally different communication models, and choosing between them shapes everything else.

In the shared-memory model, threads live in one address space and simply read and write the same variables. Communication is implicit: I write total, you read total. It is fast and convenient, and it is exactly the world of the previous pages — which is also why it needs locks and mutexes to tame race conditions. Sharing memory buys speed at the price of careful synchronisation.

In the message-passing model there is no shared memory. Each process keeps its own private data and the only way to communicate is to send a message — an explicit send and receive. This is the model behind MPI (the Message Passing Interface that powers supercomputers) and behind essentially every distributed system, where the "processes" sit on different machines and could not share memory even if they wanted to. There are no data races because there is no shared data to race over; the trade is that every exchange is explicit, and messages take time and can arrive out of order.

Shared memory — workers read/write the same variables; communication is implicit and fast, but needs locks to avoid races. Natural within one machine.
Message passing — workers own private state and exchange explicit messages; no shared data means no races, at the cost of explicit, latency-bearing communication. The only option across machines.

The actor model: state you can only reach by mail

The actor model takes message passing and makes it the only rule. The world is a swarm of actors, each a tiny independent unit with private state nobody else can touch and a mailbox. An actor does exactly three kinds of thing in response to a message: change its own state, send messages to other actors it knows, and create new actors. Crucially, an actor processes its mailbox one message at a time, so its private state is never touched by two things at once.

That single restriction is the whole point. Because no state is shared, there is nothing to lock and no data race is even possible — you have traded shared-memory hazards for the simpler discipline of "just send a message." It is the model behind Erlang and the Akka framework, and it scales beautifully: actors are cheap, so you can have millions of them, and because they only talk by messages, it does not matter whether the actor you message is in the same process or on a machine across the world.

Erlang was built at Ericsson to run telephone exchanges, which are allowed roughly a few minutes of downtime per century. Its answer was millions of lightweight actor-like processes, each isolated, each cheap to spawn — and a motto of "let it crash". If one process hits a bug, it dies alone without corrupting anyone else's memory (there is no shared memory to corrupt), and a supervisor simply restarts it. Isolated state turned crashes from catastrophes into a shrug. The same design now underpins WhatsApp, which famously handled millions of connections per server.

Fork–join: split, run in parallel, combine

Fork–join is the pattern that turns divide and conquer into parallel work. You fork a task into independent subtasks, let them run in parallel, then join — wait for them all to finish and combine their results. Because each subtask can itself fork, the pattern is naturally recursive, which is exactly why it fits divide-and-conquer algorithms like parallel merge sort, or simply summing a huge array.

Read the tree top-down for the fork: the whole array splits in half, and each half splits again, until the pieces are small enough to add up directly. Then read it bottom-up for the join: each pair of partial sums combines into its parent, and the totals bubble all the way up to the answer. The leaves are independent, so on a machine with several cores they can be computed genuinely at the same time; the joins impose the only ordering (a parent must wait for its children).

Run a fork–join sum

The sandbox below is single-threaded, so nothing here really runs in parallel — but that is fine, because the fork–join structure is what we are showing. The function splits the array, recurses into each half (these are the subtasks a real runtime would hand to different cores), and then joins the two partial sums. Watch the fork lines march inward as the work splits, and the join lines march back out as the sums combine. Press Run ▶:

// FORK-JOIN parallel sum. Each recursive call is a "task" that a real // parallel runtime could run on its own core. We fork into two halves, // (conceptually) run them in parallel, then JOIN their partial sums. function parallelSum(arr: number[], lo: number, hi: number, depth: number): number { const n = hi - lo; const pad = " ".repeat(depth); // Base case: a small chunk is summed directly (a "leaf" task). if (n <= 2) { let s = 0; for (let i = lo; i < hi; i++) s += arr[i]; console.log(pad + "leaf sum[" + lo + ".." + hi + ") = " + s); return s; } // FORK: split the range in half into two independent subtasks. const mid = Math.floor((lo + hi) / 2); console.log(pad + "fork [" + lo + ".." + hi + ") into [" + lo + ".." + mid + ") and [" + mid + ".." + hi + ")"); const left = parallelSum(arr, lo, mid, depth + 1); // subtask 1 const right = parallelSum(arr, mid, hi, depth + 1); // subtask 2 // JOIN: wait for both, then combine their results. const total = left + right; console.log(pad + "join " + left + " + " + right + " = " + total); return total; } const data = [3, 1, 4, 1, 5, 9, 2, 6]; const answer = parallelSum(data, 0, data.length, 0); console.log(""); console.log("TOTAL = " + answer);

The shape of the output is the tree from the diagram above: every fork has a matching join, and the deepest indentation is where the independent leaf work lives — the part that parallel hardware actually speeds up.

The reality check: Amdahl's law

It is tempting to think "twice the cores, twice the speed." Almost never true. Nearly every program has some part that is stubbornly serial — it cannot be parallelised (reading the input, a final combine step, a bit of coordination). Amdahl's law makes the consequence brutally precise. If a fraction p of the work is parallelisable and the rest is serial, then with N processors the speedup is

S(N) = \dfrac{1}{(1 - p) + \dfrac{p}{N}}

As N grows, the p/N term shrinks toward zero, so the whole thing is capped at S_\infty = \dfrac{1}{1 - p}. If even 5\% of your work is serial (p = 0.95), the most you can ever get — with a million cores — is a 20\times speedup. The serial slice, not the core count, is the ceiling.

Notice how flat the curves go. Each one races upward at first and then bends over toward its ceiling: more cores buy less and less, and the smaller the parallel fraction, the sooner the wall arrives. This is why real speedups are measured, not assumed — and why shrinking the serial fraction often beats buying more cores.

This is the single most common misconception in parallel programming, and Amdahl's law is its antidote. Speedup is not proportional to the number of cores; it is throttled by whatever fraction of the work stays serial. At p = 0.95 you are capped at 20\times no matter how many cores you throw at it, and you hit half that ceiling by around 20 cores — everything after that is diminishing returns. Before scaling out, ask "what is my serial fraction?" A 5% serial part is a 20× wall; a 1% serial part is a 100× wall. Reducing the serial slice is usually more valuable than adding hardware.

It removes one whole class — data races — because there is no shared state to race over. But it does not make concurrency easy; it trades one kind of hard problem for another. Instead of reasoning about locks and interleavings, you now reason about message ordering, mailbox back-pressure (what if messages arrive faster than an actor can process them?), and deadlocks of a new flavour (actor A waits for a reply from B while B waits for A). No model makes concurrency free — each one just moves the difficulty to a place you might reason about more clearly.