SQL: Joins and Aggregate Queries

A single SELECT query can already retrieve, filter and sort rows from one table. But real questions rarely fit inside one table. "Which class is each pupil in?" lives across two tables. "What is the average mark in each class?" isn't about any single row at all — it's about whole groups of rows. This page adds the two SQL tools that answer exactly those questions:

a JOIN — stitches rows from two tables together by matching a foreign key to a primary key, so one query can pull from both;
an aggregate function (COUNT, SUM, AVG, MIN, MAX) with GROUP BY — collapses many rows into one summary row per group.

Together they turn a database from a filing cabinet into something you can genuinely interrogate. These are staples of A-level Computer Science, so it pays to see exactly which rows each query hands back.

Two tables to work with

Everything below uses these two little tables. The pupil table has one row per pupil, and its classId column is a foreign key — each value is the primary key of a row over in the class table. That shared value is the link between them.

Notice the tables deliberately store each fact once: the pupil table doesn't repeat "Maths", "Ms Lee" on every maths pupil — it just stores the tiny code C1 and lets the class table hold the details. A JOIN is how we glue them back together when we want the full picture.

A JOIN combines rows from two tables

Suppose you want a list of pupils with the name of the class they're in — the pupil's name comes from pupil, but the class name comes from class. An INNER JOIN matches rows from the two tables wherever the foreign key equals the primary key:

SELECT pupil.name, class.className FROM pupil INNER JOIN class ON pupil.classId = class.classId;

Read the ON line as the matching rule: "pair up a pupil row with the class row that has the same classId." Because column names now come from two tables, we write pupil.name and class.className — table.column — so there's no confusion about which table each field belongs to. Here is exactly what comes back:

Every pupil now sits next to their class name. Under the hood the database walked each pupil's classId across to the class table, found the matching row, and merged the two into one wide row. You can filter and sort a joined result just like any other query — add WHERE class.className = 'Maths' and ORDER BY pupil.name and only the maths pupils come back, alphabetically.

They hold the same value for matched rows, but they live in different tables and mean different things. In class, classId is the primary key — it identifies a class. In pupil, classId is a foreign key — it points at a class. The ON condition is precisely the instruction "line up the pointer with the thing it points at." That single idea — foreign key equals primary key — is the join at the heart of almost every relational query you'll ever write.

Aggregate functions summarise many rows into one

An aggregate function takes a whole column of values and boils it down to a single number. There are five you must know:

COUNT(*) — how many rows;
SUM(column) — the total of a numeric column;
AVG(column) — the mean (average);
MIN(column) — the smallest value;
MAX(column) — the largest value.

Used on their own they crunch the entire table down to one row. "How many pupils are there, and what's the average mark across all of them?"

SELECT COUNT(*), AVG(mark), MAX(mark), MIN(mark) FROM pupil;

With six pupils and marks of 72, 55, 88, 64, 91 and 60, that returns a single summary row: COUNT = 6, AVG = 71.7, MAX = 91, MIN = 55. The six individual pupils have vanished — an aggregate answers a question about the group, not about any one pupil.

GROUP BY: one summary row per group

Averaging all the pupils is a blunt instrument. Usually you want the average per class. GROUP BY sorts the rows into piles that share a value, then runs the aggregate once per pile:

SELECT classId, COUNT(*), AVG(mark) FROM pupil GROUP BY classId;

The database gathers all the C1 rows into one group, all the C2 rows into another, and so on, then reports the count and average within each group:

The golden rule of GROUP BY: everything in the SELECT list must be either the column(s) you grouped on or wrapped in an aggregate function. You can select classId (you grouped by it, so it's the same for every row in the group) and AVG(mark) (an aggregate), but you cannot also select a raw name — a group of three pupils has three different names, and only one row comes back, so which name would it print?

Joining and grouping together

The two ideas combine beautifully. To show the average mark next to each class's name (not just its code), join first, then group:

SELECT class.className, COUNT(*), AVG(pupil.mark) FROM pupil INNER JOIN class ON pupil.classId = class.classId GROUP BY class.className;

The join supplies the readable class name; the GROUP BY then rolls each class's pupils into a single summary row. This is the workhorse pattern behind almost every report you've ever seen: "sales per region", "goals per team", "orders per customer".

HAVING: filtering the groups

There are two different filters, and telling them apart is a classic exam question. WHERE filters individual rows before they are grouped; HAVING filters the summary rows after the grouping, so it's the one that can test an aggregate.

SELECT classId, AVG(mark) FROM pupil GROUP BY classId HAVING AVG(mark) >= 70;

"Show each class's average mark, but only for classes averaging 70 or more." You couldn't put AVG(mark) >= 70 in a WHERE — at the moment WHERE runs, the groups (and therefore the averages) don't exist yet. A neat way to remember the running order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY.

The same idea in code

A database does joins and aggregates for you, but they're no mystery — a table is just an array of records. Here is the join written in TypeScript: for each pupil, find the matching class and merge the two. Press Run and compare it with the SQL above.

interface Pupil { pupilId: number; name: string; classId: string; mark: number; } interface Klass { classId: string; className: string; teacher: string; } const pupil: Pupil[] = [ { pupilId: 1, name: "Ada", classId: "C1", mark: 72 }, { pupilId: 2, name: "Ben", classId: "C2", mark: 55 }, { pupilId: 3, name: "Chloe", classId: "C1", mark: 88 }, { pupilId: 4, name: "Dev", classId: "C3", mark: 64 }, { pupilId: 5, name: "Ivy", classId: "C2", mark: 91 }, { pupilId: 6, name: "Sam", classId: "C1", mark: 60 }, ]; const klass: Klass[] = [ { classId: "C1", className: "Maths", teacher: "Lee" }, { classId: "C2", className: "Physics", teacher: "Roy" }, { classId: "C3", className: "Biology", teacher: "Fox" }, ]; // SELECT pupil.name, class.className // FROM pupil INNER JOIN class ON pupil.classId = class.classId for (const p of pupil) { const c = klass.find((k) => k.classId === p.classId); // the ON match console.log(p.name + " " + c.className); }

And here is the aggregate: gather pupils into piles by classId, then average each pile. The Map is doing the job of GROUP BY.

Digging deeper

Two mistakes trip almost everyone at first — know them cold for the exam.

An INNER JOIN only keeps rows that match in both tables. If a class had no pupils, or a pupil's classId pointed at no class (or was blank), that row simply vanishes from an INNER JOIN — it has no partner to pair with, so it's silently dropped. Imagine adding a brand-new class C4 "Chemistry" with no pupils yet: the inner join above would never mention Chemistry at all. When you need to keep the unmatched rows, use a LEFT JOIN instead — it keeps every row from the left-hand table and fills the missing right-hand columns with NULL. So class LEFT JOIN pupil would list Chemistry with a NULL pupil, whereas class INNER JOIN pupil would hide it. A join that mysteriously loses rows is almost always an INNER JOIN discarding unmatched data.
Aggregate + GROUP BY collapses many rows into one per group — so you can't also select a raw, ungrouped column. Writing SELECT classId, name, AVG(mark) FROM pupil GROUP BY classId is a mistake: class C1 has three pupils with three different names, but the group produces only one row, so there's no single name to show. Every column you SELECT must be either something you GROUP BY or something inside an aggregate function. (Some databases quietly pick a random name and let it slide — which is worse, because the answer looks fine but is meaningless.)

Picture the two tables as overlapping circles (a Venn diagram). An INNER JOIN returns only the overlap — rows that exist on both sides. The OUTER joins reach beyond the overlap: a LEFT OUTER JOIN keeps the whole left circle (matched or not), a RIGHT OUTER JOIN the whole right circle, and a FULL OUTER JOIN both circles entirely. "Inner" means the inner overlap; "outer" means you also reach out to the parts with no partner. At A-level you'll mostly meet INNER and LEFT — but knowing they're the same machine with a different reach makes the whole family click.

A subtle but favourite exam point: COUNT(*) counts rows, but COUNT(column) counts only the rows where that column is not NULL. So in a table where two pupils have no recorded email, COUNT(*) might be 6 while COUNT(email) is 4. The same holds for AVG and SUM: they skip NULLs entirely, which means an average is taken over "the pupils who actually have a mark", not over every row. Blanks don't count — literally.