May 20, 2026 · 5 min read

How I Test AI Models (And Why It Matters for Clients)

Once a week I run the same prompt through four different AI models. Same project. Same context. Same instructions. Four outputs come back. They are never the same.

One model produces 700 words of structured analysis with a gap table I can act on immediately. A second produces 250 words of correct-but-thin bullet points. A third picks up an emotional signal in the client feedback that the others missed entirely: the client went quiet not because they were busy, but because they were losing confidence and did not know how to say it. A fourth finishes in under a second and costs a fraction of a cent.

If I were building websites the way most agencies do, none of this would matter. You write the code, you ship the site, you move on. But I do not build websites that way. Every project I ship goes through multiple AI-powered reviews before it reaches a client — checks for client alignment, end-user experience, technical quality, scope creep, brand consistency. The output from those reviews is only as good as the model producing it. So I test.

What I actually measure

When I compare models, I am not looking for a winner. I am looking for fit. The right model for the right task. Here are the five dimensions I track.

1. Structural completeness

Did the model produce every section it was asked for? This sounds basic, but it is the most common failure mode. A prompt that asks for a client archetype, an expectation model, a gap table, and risk flags will routinely come back from some models with the gap table missing or the risk flags reduced to a single sentence. If I am reviewing a project before it ships, I need the full picture — not a summary of the summary.
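A completeness check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline described in this post: the function name is hypothetical, the section names are taken from the example prompt above, and the check only looks for each section name in the text. It would not catch a section that is present but reduced to a single sentence.

```python
# Hypothetical section names, taken from the example prompt in the text.
REQUIRED_SECTIONS = ["client archetype", "expectation model", "gap table", "risk flags"]


def missing_sections(output: str, required: list[str] = REQUIRED_SECTIONS) -> list[str]:
    """Return the required section names that never appear in the model output."""
    text = output.lower()
    return [name for name in required if name.lower() not in text]


# Made-up model output that leaves out the gap table.
sample = """Client Archetype: cautious first-time buyer
Expectation Model: expects visible progress every week
Risk Flags: communication is slowing down"""

print(missing_sections(sample))  # ['gap table']
```

An empty list means every requested section is at least named in the output; anything else is a reason to look closer before trusting the review.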

2. Detail density

Ten specific observations are worth more than three generic ones. "The Instagram feed ticket has been open since week 2 — the client raised it, never escalated, and is now communicating less frequently" is an observation I can act on. "Social media integration could be improved" is not. Some models consistently produce the first kind. Others default to the second. I track the ratio.

3. Emotional granularity

This is the one that matters most for client-facing work. When a client goes from daily messages to twice a week to once a week, that is not a scheduling change. That is a signal. The best models read it correctly: the client is cooling, not busy. They are not angry; they are quietly deciding whether the work is going to meet their expectations. Weaker models describe the same behavior as "reduced communication frequency" and move on. The difference between those two readings is the difference between a retained client and a lost one.

4. Speed and cost

These only enter the conversation when quality is equal. If two models produce structurally identical output at the same level of detail, the faster and cheaper one wins — not because I am cutting corners, but because efficiency that does not sacrifice quality is just good operations. Over a full project lifecycle with dozens of AI-powered reviews, those differences compound.
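The "quality first, then speed and cost" rule can be written as a sort key. The sketch below is a minimal illustration: the `Candidate` fields, the model labels, and the numbers are all made up, and ranking cost ahead of speed when both differ is an assumption, since the text does not say which of the two takes priority.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str        # hypothetical model label
    quality: int     # reviewer's quality score, higher is better
    seconds: float   # time taken to produce the output
    cost_usd: float  # cost of the run


def pick(candidates: list[Candidate]) -> Candidate:
    """Highest quality wins; cost, then speed, are used only to break ties."""
    return min(candidates, key=lambda c: (-c.quality, c.cost_usd, c.seconds))


# Made-up numbers for illustration only.
candidates = [
    Candidate("model-a", quality=9, seconds=40.0, cost_usd=0.12),
    Candidate("model-b", quality=9, seconds=12.0, cost_usd=0.03),
    Candidate("model-c", quality=7, seconds=0.8, cost_usd=0.001),
]

print(pick(candidates).name)  # model-b
```

Note that `model-c` is by far the fastest and cheapest and still loses, because speed and cost are never allowed to outrank quality.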

5. Reliability

Does the model produce consistent quality across 15 different projects, or is it hit-or-miss? A model that delivers excellent output 80% of the time and nonsense 20% of the time is not usable: I cannot ship work when there is a one-in-five chance that its review was worthless. Consistency is a dimension of its own, separate from peak performance.
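One way to make consistency measurable is to track a pass rate across runs rather than an average, since an average hides the bad runs. The sketch below assumes each run has already been given a quality score; the function names, the quality floor of 7.0, and the required rate of 95% are illustrative assumptions, not figures from this post.

```python
def pass_rate(scores: list[float], floor: float) -> float:
    """Fraction of runs whose quality score is at or above the floor."""
    if not scores:
        raise ValueError("need at least one run")
    return sum(score >= floor for score in scores) / len(scores)


def is_usable(scores: list[float], floor: float = 7.0, required_rate: float = 0.95) -> bool:
    """A model is usable only if nearly every run clears the quality floor."""
    return pass_rate(scores, floor) >= required_rate


# Made-up scores: 15 projects, 12 excellent runs and 3 nonsense runs.
runs = [9.0] * 12 + [2.0] * 3

print(pass_rate(runs, floor=7.0))  # 0.8
print(is_usable(runs))             # False
```

The mean score of these runs is 7.6, which would clear the same floor. The pass rate is what exposes the one-in-five failure.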

What this means for you

When someone builds your business website, you are trusting them with the first thing most customers will ever see of your business. The question to ask is not "do you use AI?" — everyone does now. The question is: "how do you know the AI is doing good work?"

If the answer is "we tried a few models and picked the one that felt right," that is a guess. If the answer is "we test every model on the same work every week and route each task to the one that performs best on that specific type of work," that is a process.

I do the second one. It takes about 30 minutes a week. I run the same governance review through multiple models, compare the outputs across all five dimensions, and update the routing. When a model improves on a dimension it was previously weak on, it earns more work. When a model starts skipping sections, it gets pulled back. The pipeline is built to be model-agnostic — not because switching is fun, but because betting everything on one provider is not a strategy.
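The weekly routing update can be pictured as a small table rebuild. The following is a hedged sketch, not the actual pipeline: the task names, model labels, scores, and the `skipped_sections` flag are hypothetical, and "pulled back" is interpreted here as excluding a model from a task for that week.

```python
def update_routing(current: dict[str, str], results: dict[str, dict[str, dict]]) -> dict[str, str]:
    """Route each task to the best-scoring model that did not skip sections.

    results[task][model] is a dict with a "score" and a "skipped_sections" flag.
    If no model qualifies for a task, the existing route is kept unchanged.
    """
    routing = dict(current)
    for task, by_model in results.items():
        eligible = {
            model: run["score"]
            for model, run in by_model.items()
            if not run["skipped_sections"]
        }
        if eligible:
            routing[task] = max(eligible, key=eligible.get)
    return routing


# Made-up results from one weekly comparison.
current = {"client alignment": "model-a", "technical quality": "model-b"}
results = {
    "client alignment": {
        "model-a": {"score": 9, "skipped_sections": True},   # strong, but dropped a section
        "model-c": {"score": 8, "skipped_sections": False},
    },
    "technical quality": {
        "model-a": {"score": 8, "skipped_sections": False},
        "model-b": {"score": 7, "skipped_sections": False},
    },
}

print(update_routing(current, results))
# {'client alignment': 'model-c', 'technical quality': 'model-a'}
```

Because the routing is just a mapping from task to model label, swapping providers means changing a value in the table rather than rewriting the reviews, which is the point of keeping the pipeline model-agnostic.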

I am one person. But every project is checked from multiple angles by AI that I have tested, measured, and routed deliberately. That is the difference between using AI and being disciplined about it.

If you are thinking about a new website or online store, tell me about your business. I will show you what that process looks like in practice.