SQL: Joins and Aggregate Queries

A single SELECT query can already retrieve, filter and sort rows from one table. But real questions rarely fit inside one table. "Which class is each pupil in?" lives across two tables. "What is the average mark in each class?" isn't about any single row at all — it's about whole groups of rows. This page adds the two SQL tools that answer exactly those questions:

Together they turn a database from a filing cabinet into something you can genuinely interrogate. These are staples of A-level Computer Science, so it pays to see exactly which rows each query hands back.

Two tables to work with

Everything below uses these two little tables. The pupil table has one row per pupil, and its classId column is a foreign key — each value is the primary key of a row over in the class table. That shared value is the link between them.

Notice the tables deliberately store each fact once: the pupil table doesn't repeat "Maths", "Ms Lee" on every maths pupil — it just stores the tiny code C1 and lets the class table hold the details. A JOIN is how we glue them back together when we want the full picture.

A JOIN combines rows from two tables

Suppose you want a list of pupils with the name of the class they're in — the pupil's name comes from pupil, but the class name comes from class. An INNER JOIN matches rows from the two tables wherever the foreign key equals the primary key:

SELECT pupil.name, class.className FROM pupil INNER JOIN class ON pupil.classId = class.classId;

Read the ON line as the matching rule: "pair up a pupil row with the class row that has the same classId." Because column names now come from two tables, we write pupil.name and class.classNametable.column — so there's no confusion about which table each field belongs to. Here is exactly what comes back:

Every pupil now sits next to their class name. Under the hood the database walked each pupil's classId across to the class table, found the matching row, and merged the two into one wide row. You can filter and sort a joined result just like any other query — add WHERE class.className = 'Maths' and ORDER BY pupil.name and only the maths pupils come back, alphabetically.

They hold the same value for matched rows, but they live in different tables and mean different things. In class, classId is the primary key — it identifies a class. In pupil, classId is a foreign key — it points at a class. The ON condition is precisely the instruction "line up the pointer with the thing it points at." That single idea — foreign key equals primary key — is the join at the heart of almost every relational query you'll ever write.

Aggregate functions summarise many rows into one

An aggregate function takes a whole column of values and boils it down to a single number. There are five you must know:

Used on their own they crunch the entire table down to one row. "How many pupils are there, and what's the average mark across all of them?"

SELECT COUNT(*), AVG(mark), MAX(mark), MIN(mark) FROM pupil;

With six pupils and marks of 72, 55, 88, 64, 91 and 60, that returns a single summary row: COUNT = 6, AVG = 71.7, MAX = 91, MIN = 55. The six individual pupils have vanished — an aggregate answers a question about the group, not about any one pupil.

GROUP BY: one summary row per group

Averaging all the pupils is a blunt instrument. Usually you want the average per class. GROUP BY sorts the rows into piles that share a value, then runs the aggregate once per pile:

SELECT classId, COUNT(*), AVG(mark) FROM pupil GROUP BY classId;

The database gathers all the C1 rows into one group, all the C2 rows into another, and so on, then reports the count and average within each group:

The golden rule of GROUP BY: everything in the SELECT list must be either the column(s) you grouped on or wrapped in an aggregate function. You can select classId (you grouped by it, so it's the same for every row in the group) and AVG(mark) (an aggregate), but you cannot also select a raw name — a group of three pupils has three different names, and only one row comes back, so which name would it print?

Joining and grouping together

The two ideas combine beautifully. To show the average mark next to each class's name (not just its code), join first, then group:

SELECT class.className, COUNT(*), AVG(pupil.mark) FROM pupil INNER JOIN class ON pupil.classId = class.classId GROUP BY class.className;

The join supplies the readable class name; the GROUP BY then rolls each class's pupils into a single summary row. This is the workhorse pattern behind almost every report you've ever seen: "sales per region", "goals per team", "orders per customer".

HAVING: filtering the groups

There are two different filters, and telling them apart is a classic exam question. WHERE filters individual rows before they are grouped; HAVING filters the summary rows after the grouping, so it's the one that can test an aggregate.

SELECT classId, AVG(mark) FROM pupil GROUP BY classId HAVING AVG(mark) >= 70;

"Show each class's average mark, but only for classes averaging 70 or more." You couldn't put AVG(mark) >= 70 in a WHERE — at the moment WHERE runs, the groups (and therefore the averages) don't exist yet. A neat way to remember the running order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY.

The same idea in code

A database does joins and aggregates for you, but they're no mystery — a table is just an array of records. Here is the join written in TypeScript: for each pupil, find the matching class and merge the two. Press Run and compare it with the SQL above.

interface Pupil { pupilId: number; name: string; classId: string; mark: number; } interface Klass { classId: string; className: string; teacher: string; } const pupil: Pupil[] = [ { pupilId: 1, name: "Ada", classId: "C1", mark: 72 }, { pupilId: 2, name: "Ben", classId: "C2", mark: 55 }, { pupilId: 3, name: "Chloe", classId: "C1", mark: 88 }, { pupilId: 4, name: "Dev", classId: "C3", mark: 64 }, { pupilId: 5, name: "Ivy", classId: "C2", mark: 91 }, { pupilId: 6, name: "Sam", classId: "C1", mark: 60 }, ]; const klass: Klass[] = [ { classId: "C1", className: "Maths", teacher: "Lee" }, { classId: "C2", className: "Physics", teacher: "Roy" }, { classId: "C3", className: "Biology", teacher: "Fox" }, ]; // SELECT pupil.name, class.className // FROM pupil INNER JOIN class ON pupil.classId = class.classId for (const p of pupil) { const c = klass.find((k) => k.classId === p.classId); // the ON match console.log(p.name + " " + c.className); }

And here is the aggregate: gather pupils into piles by classId, then average each pile. The Map is doing the job of GROUP BY.

const pupil = [ { name: "Ada", classId: "C1", mark: 72 }, { name: "Ben", classId: "C2", mark: 55 }, { name: "Chloe", classId: "C1", mark: 88 }, { name: "Dev", classId: "C3", mark: 64 }, { name: "Ivy", classId: "C2", mark: 91 }, { name: "Sam", classId: "C1", mark: 60 }, ]; // SELECT classId, COUNT(*), AVG(mark) FROM pupil GROUP BY classId const groups = new Map<string, number[]>(); for (const p of pupil) { if (!groups.has(p.classId)) groups.set(p.classId, []); groups.get(p.classId).push(p.mark); } for (const [classId, marks] of groups) { const avg = marks.reduce((a, b) => a + b, 0) / marks.length; console.log(classId + " count=" + marks.length + " avg=" + avg.toFixed(1)); }

Digging deeper

Two mistakes trip almost everyone at first — know them cold for the exam.

Picture the two tables as overlapping circles (a Venn diagram). An INNER JOIN returns only the overlap — rows that exist on both sides. The OUTER joins reach beyond the overlap: a LEFT OUTER JOIN keeps the whole left circle (matched or not), a RIGHT OUTER JOIN the whole right circle, and a FULL OUTER JOIN both circles entirely. "Inner" means the inner overlap; "outer" means you also reach out to the parts with no partner. At A-level you'll mostly meet INNER and LEFT — but knowing they're the same machine with a different reach makes the whole family click.

A subtle but favourite exam point: COUNT(*) counts rows, but COUNT(column) counts only the rows where that column is not NULL. So in a table where two pupils have no recorded email, COUNT(*) might be 6 while COUNT(email) is 4. The same holds for AVG and SUM: they skip NULLs entirely, which means an average is taken over "the pupils who actually have a mark", not over every row. Blanks don't count — literally.