Evaluation Methods

You've researched users, sketched a design, and built a prototype. Now the crucial question: is it actually any good? Believing your own design is usable is the easiest mistake in the world — you built it, so of course you can use it. Evaluation is how you find out the truth: structured ways to measure a design's usability and catch its problems, using evidence instead of hope.

There are two big families. Expert-based methods have specialists inspect the design (fast, cheap, no users needed). User-based methods watch real people try it (slower, but they reveal what users actually do). Good teams use both: experts catch the obvious problems cheaply, then users reveal the surprising ones.

Three methods you must know

Here is how each method works, who it involves, and what it's good at:

Heuristic evaluation (expert-based)

A handful of usability experts inspect the interface and check it against a list of established rules of thumb — heuristics. The most famous list is Nielsen's 10 heuristics, which includes things like visibility of system status (show what's happening — feedback!), match between the system and the real world (speak the user's language), user control and freedom (an undo/escape), consistency, error prevention, and recognition rather than recall. Each expert notes every place the design breaks a heuristic.
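
Once each expert has a list of violations, the lists are usually merged so that a problem spotted by several people stands out. The sketch below shows one possible way to do that. The findings, the expert names, and the 0-4 severity score are all made-up assumptions for illustration, not part of the method as described above.

```python
from collections import defaultdict

# Hypothetical findings: (evaluator, heuristic broken, where, severity 0-4).
# The severity score is an assumed extra; the text above only asks for notes.
findings = [
    ("expert_a", "visibility of system status", "checkout: no spinner after Pay", 3),
    ("expert_b", "visibility of system status", "checkout: no spinner after Pay", 4),
    ("expert_a", "user control and freedom", "basket: no undo after remove", 2),
    ("expert_c", "error prevention", "sign-up: date field accepts 31 Feb", 3),
    ("expert_c", "visibility of system status", "checkout: no spinner after Pay", 3),
]


def merge_findings(findings):
    """Group reports of the same problem and rank problems by mean severity."""
    grouped = defaultdict(list)
    for evaluator, heuristic, location, severity in findings:
        grouped[(heuristic, location)].append((evaluator, severity))

    merged = []
    for (heuristic, location), reports in grouped.items():
        severities = [severity for _, severity in reports]
        merged.append({
            "heuristic": heuristic,
            "location": location,
            "evaluators": sorted({evaluator for evaluator, _ in reports}),
            "mean_severity": sum(severities) / len(severities),
        })

    # Most severe problems first.
    merged.sort(key=lambda problem: problem["mean_severity"], reverse=True)
    return merged


for problem in merge_findings(findings):
    print(problem["heuristic"], "|", problem["location"], "|", problem["evaluators"])
```

Ranking by mean severity is only one choice; a team might instead rank by how many experts reported the problem.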

Usability testing (user-based)

Give real users representative tasks ("find and buy a blue size-M jumper") and watch. You measure things like success rate, time on task, errors, and where people get stuck, and often ask them to think aloud — narrating their thoughts — so you hear the confusion in real time. The gold of usability testing is watching someone fail at a task you thought was obvious.
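
To make the measures concrete, here is a minimal sketch that turns raw session notes into success rate, time on task, and error counts. The participant records and field names are invented for illustration, and reporting the median time for successful attempts only is an assumed convention, not a rule.

```python
from statistics import mean, median

# Hypothetical results for one task, one record per participant.
sessions = [
    {"user": "P1", "completed": True, "seconds": 74, "errors": 1},
    {"user": "P2", "completed": True, "seconds": 102, "errors": 0},
    {"user": "P3", "completed": False, "seconds": 180, "errors": 4},
    {"user": "P4", "completed": True, "seconds": 66, "errors": 0},
    {"user": "P5", "completed": False, "seconds": 180, "errors": 3},
]


def summarise(sessions):
    """Compute basic usability metrics for a single task."""
    successful_times = [s["seconds"] for s in sessions if s["completed"]]
    return {
        # Share of participants who finished the task.
        "success_rate": len(successful_times) / len(sessions),
        # Time on task, counted only for those who succeeded.
        "median_time_successful": median(successful_times) if successful_times else None,
        # Average number of errors across everyone, including failures.
        "mean_errors": mean([s["errors"] for s in sessions]),
    }


print(summarise(sessions))
```

With so few participants these numbers are rough indicators, not precise measurements; the observations of where people got stuck usually matter more.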

A key idea: you don't need many users. By Nielsen's widely quoted estimate, watching just 5 users uncovers roughly 85% of the usability problems, so testing often, with a few people, beats one giant study.
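
The five-user figure comes from a simple model by Nielsen and Landauer: if each user independently runs into a given problem with probability p, then n users find a share 1 - (1 - p)^n of the problems. The sketch below assumes p = 0.31, the average they reported across projects; the real value varies by product and task, so treat the output as a rough guide. The function name is made up for illustration.

```python
def proportion_found(n_users, p_detect=0.31):
    """Expected share of problems seen at least once by n_users.

    p_detect is the assumed chance that a single user hits any given
    problem; 0.31 is an average from the literature, not a constant.
    """
    return 1 - (1 - p_detect) ** n_users


# Show how quickly the curve flattens as more users are added.
for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} users: {proportion_found(n):.0%} of problems found")
```

Under this assumption five users find about 84% of the problems, and going from five to ten adds far less than the first five did.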

A/B testing (user-based, at scale)

Show version A to one random half of your live users and version B to the other half, then measure which performs better on a real metric (clicks, sign-ups, purchases, time spent). Because users are assigned at random and the groups are large, a difference bigger than chance alone would plausibly produce can be attributed to the design change: it's a controlled experiment on real behaviour.
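
Checking whether a difference is bigger than chance is often done with a two-proportion z-test. The sketch below is one common way to do it using only the Python standard library. The function name and the conversion counts are hypothetical, and real experiments also need a sample size and stopping rule decided in advance.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical data: 200 of 5000 sign up on A, 260 of 5000 on B.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value means a gap this large would be unlikely if the two versions truly performed the same. It says nothing about why B did better, which is where watching users comes back in.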

Testing with only five users can feel wrong: surely more is better? For finding usability problems, no. The same big, obvious issues trip up almost everyone, so by the fifth user you're mostly watching people hit the same walls you've already seen; extra users add little new. Nielsen's rule of thumb is that about 5 users reveal roughly 85% of the problems, so it's far smarter to run several small tests (fixing issues between them) than one huge, expensive one. (A/B testing is different: it measures a rate, so there you do want large numbers for a reliable result.)

Classic confusions to avoid: