Evaluation Methods

You've researched users, sketched a design, and built a prototype. Now the crucial question: is it actually any good? Believing your own design is usable is the easiest mistake in the world — you built it, so of course you can use it. Evaluation is how you find out the truth: structured ways to measure a design's usability and catch its problems, using evidence instead of hope.

There are two big families. Expert-based methods have specialists inspect the design (fast, cheap, no users needed). User-based methods watch real people try it (slower, but reveals what users actually do). Good teams use both — experts catch the obvious problems cheaply, then users reveal the surprising ones.

Three methods you must know

For each method, note how it works, who it involves, and what it's good at:

Heuristic evaluation (expert-based)

A handful of usability experts inspect the interface and check it against a list of established rules of thumb — heuristics. The most famous list is Nielsen's 10 heuristics, which includes things like visibility of system status (show what's happening — feedback!), match between the system and the real world (speak the user's language), user control and freedom (an undo/escape), consistency, error prevention, and recognition rather than recall. Each expert notes every place the design breaks a heuristic.

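Strengths. Fast, cheap, no users to recruit; catches many problems early. Nielsen's studies suggest that a single evaluator finds only about a third of the problems, while three to five evaluators working independently find most of them (roughly three quarters with five).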
Weaknesses. Experts aren't real users, so they miss problems only genuine users hit, and may flag "issues" that never bother anyone. It tells you what might be wrong, not what is.
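
Because each expert inspects the interface separately, their notes have to be merged afterwards. The sketch below shows one possible way to do that: group the findings by screen and heuristic, then list the most widely reported problems first. The evaluator names, screens, and findings are invented for illustration, and the function merge_findings is hypothetical rather than part of any standard tool.

```python
from collections import defaultdict

# Hypothetical findings: (evaluator, screen, heuristic violated).
findings = [
    ("expert_a", "checkout", "visibility of system status"),
    ("expert_b", "checkout", "visibility of system status"),
    ("expert_c", "checkout", "error prevention"),
    ("expert_a", "search", "recognition rather than recall"),
    ("expert_c", "checkout", "visibility of system status"),
]


def merge_findings(findings):
    """Group findings by (screen, heuristic) and collect who reported each."""
    merged = defaultdict(set)
    for evaluator, screen, heuristic in findings:
        merged[(screen, heuristic)].add(evaluator)
    # Most widely reported problems first; ties are ordered by key for stable output.
    return sorted(merged.items(), key=lambda item: (-len(item[1]), item[0]))


report = merge_findings(findings)
total_experts = len({evaluator for evaluator, _, _ in findings})

for (screen, heuristic), evaluators in report:
    print(f"{screen}: {heuristic} "
          f"(flagged by {len(evaluators)} of {total_experts} experts)")
```

A problem flagged by several experts is a strong candidate to fix first, but one flagged by a single expert can still be real, so the count is a prioritisation aid and not a filter.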

Usability testing (user-based)

Give real users representative tasks ("find and buy a blue size-M jumper") and watch. You measure things like success rate, time on task, errors, and where people get stuck, and often ask them to think aloud — narrating their thoughts — so you hear the confusion in real time. The gold of usability testing is watching someone fail at a task you thought was obvious.
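
The measures named above can be computed from very simple session records. The following is a minimal sketch, assuming one record per participant for a single task; the field names and the numbers are made up for illustration. It reports time on task for successful attempts only, which is a common convention but still an assumption here.

```python
from statistics import median

# Hypothetical session records for one task; field names are illustrative.
sessions = [
    {"user": "p1", "completed": True,  "seconds": 74,  "errors": 0},
    {"user": "p2", "completed": True,  "seconds": 131, "errors": 2},
    {"user": "p3", "completed": False, "seconds": 240, "errors": 5},
    {"user": "p4", "completed": True,  "seconds": 98,  "errors": 1},
    {"user": "p5", "completed": False, "seconds": 205, "errors": 3},
]


def summarise(sessions):
    """Return success rate, median time on task, and mean errors for one task."""
    completed = [s for s in sessions if s["completed"]]
    return {
        "success_rate": len(completed) / len(sessions),
        # Time on task for successful attempts only (an assumed convention).
        "median_seconds_on_success": (
            median(s["seconds"] for s in completed) if completed else None
        ),
        "mean_errors": sum(s["errors"] for s in sessions) / len(sessions),
    }


summary = summarise(sessions)
print(summary)
```

With only five participants these figures are rough indicators, not precise estimates; the observations of where and why people got stuck usually matter more than the numbers.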

Strengths. Reveals what users actually do (not what they say, or what experts predict); uncovers surprising, real problems.
Weaknesses. Slower and costlier (recruiting, sessions, analysis); a small sample; and people can behave slightly differently when watched.

A key idea: you don't need many users. In Nielsen's widely quoted estimate, watching just 5 users uncovers roughly 80 to 85% of the usability problems, so testing often, with a few people, beats one giant study.
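
The arithmetic behind the five-user rule can be sketched with the simple discovery model usually attributed to Nielsen and Landauer: if each user independently has probability p of running into any given problem, the expected share of problems seen after n users is 1 - (1 - p)^n. The value p = 0.31 used below is the average often quoted from their data; it is an assumption here, and real projects can sit well above or below it.

```python
def share_found(n_users, p_single=0.31):
    """Expected share of problems seen at least once after n_users sessions.

    Assumes every user independently hits each problem with the same
    probability p_single (a simplification; 0.31 is an assumed average).
    """
    return 1 - (1 - p_single) ** n_users


for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} users: {share_found(n):.0%}")
```

Under these assumptions five users reach about 84%, and each further user adds less than the one before, which is the diminishing return the rule relies on. With a lower p (rarer problems, more varied users) the curve rises more slowly and five users find a smaller share.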

A/B testing (user-based, at scale)

Show version A to one randomly chosen half of your live users and version B to the other half, then measure which performs better on a real metric (clicks, sign-ups, purchases, time spent). Because assignment is random, the two groups differ only by chance apart from the design they see; with large enough groups, a statistically significant difference can therefore be attributed to the design change. It's a controlled experiment on real behaviour.
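
As a rough illustration of the two steps, the sketch below first assigns users to a variant and then compares conversion rates with a two-proportion z-test. Everything specific is an assumption: the function names, the experiment label, and the conversion counts are invented, and hashing the user ID is just one common way to get a stable, effectively random split. The z-test is a standard textbook choice, not the only valid analysis.

```python
import hashlib
from math import sqrt
from statistics import NormalDist


def assign_variant(user_id, experiment="button-colour"):
    """Deterministically place a user in A or B (hypothetical scheme).

    Hashing experiment + user ID gives a stable split of roughly 50/50
    that does not depend on anything about the user's behaviour.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return both rates, the z statistic, and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value


# Made-up results: 480 of 10,000 users converted on A, 560 of 10,000 on B.
p_a, p_b, z, p_value = two_proportion_z_test(480, 10_000, 560, 10_000)
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value says the gap is unlikely to be chance alone. It says nothing about why B did better, and the threshold for "small" should be fixed before the test starts, not after looking at the results.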

Strengths. Hard numbers from real users at scale; settles arguments ("does the green button really beat the blue one?") with data.
Weaknesses. Needs lots of live traffic; tells you which option won but not why; and only compares options you already have — it won't invent a better idea or explain the confusion behind the numbers.
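
To give a feel for how much traffic "lots" means, here is a minimal sketch of the usual normal-approximation sample-size formula for comparing two proportions. The function name users_per_variant, the 5% significance level, and the 80% power are conventional defaults assumed for illustration, as are the conversion rates in the example.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group to detect baseline -> target.

    Uses the normal approximation for a two-sided test of two proportions;
    alpha and power are assumed conventional defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)


# Detecting a lift from a 5% to a 6% conversion rate (made-up figures).
small_lift = users_per_variant(0.05, 0.06)
# Detecting a much larger lift from 5% to 10%.
large_lift = users_per_variant(0.05, 0.10)
print(small_lift, large_lift)
```

Under these assumptions, detecting a one-point lift from 5% to 6% takes on the order of eight thousand users per variant, while a jump to 10% needs far fewer. Small improvements on low-traffic products may simply be out of reach for A/B testing, which is one more reason to pair it with the other two methods.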

Why so few users for a usability test? Five feels wrong: surely more is better? For finding usability problems, no. The same big, obvious issues trip up almost everyone, so by the fifth user you're mostly watching people hit the same walls you've already seen; extra users add little new. Nielsen's rule of thumb is that about 5 users reveal roughly 80 to 85% of the problems, so it's far smarter to run several small tests (fixing issues between them) than one huge, expensive one. (A/B testing is different: it measures a rate, so there you do want large numbers for a reliable result.)

Classic confusions to avoid:

Heuristic evaluation uses experts, not users. If real users are trying the product, that's usability testing, not heuristic evaluation. Don't mix them up.
A/B testing tells you which, not why. It compares two given options with numbers, but can't explain the reason or suggest a third, better design — pair it with usability testing to understand the behaviour behind the metric.
No method is a substitute for the others. Experts miss real-user surprises; usability tests use small samples; A/B tests need scale and only compare what you feed them. Combine them. And remember evaluation must include accessibility — a design that tests well for some users but excludes others has still failed.