
My first eval killed v1

I shipped the classifier on vibes.

That’s not quite fair. I read its outputs, eyeballed a few obvious wins, and convinced myself it worked. The pipeline was real: a PR opens, Copilot reviews it, and a small LLM sorts each Copilot comment into APPLY, SKIP_MINOR, SKIP_WRONG, or NEEDS_HUMAN. APPLY items get patched in. Skips get dropped. NEEDS_HUMAN items land in a summary so I can decide.
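
For the shape of it, here's a minimal sketch of that routing in Python. The names and helpers are illustrative, not the real pipeline code:

    from enum import Enum

    class Verdict(Enum):
        APPLY = "apply"              # patch the suggestion into the PR
        SKIP_MINOR = "skip_minor"    # nitpick, drop silently
        SKIP_WRONG = "skip_wrong"    # Copilot is mistaken, drop
        NEEDS_HUMAN = "needs_human"  # escalate into the PR summary

    def route(comment: str, verdict: Verdict,
              to_apply: list[str], summary: list[str]) -> None:
        """Send one classified Copilot comment down the right path."""
        if verdict is Verdict.APPLY:
            to_apply.append(comment)   # queued to be patched in
        elif verdict is Verdict.NEEDS_HUMAN:
            summary.append(comment)    # lands in the summary I read
        # both SKIP_* verdicts are dropped on the floor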

It felt fine. PRs were getting cleaner. Comments were getting filed somewhere. I’d built the thing.

Then I read the Braintrust article on evals.

The article that broke the spell

The premise is simple enough to sting: if you can’t measure your LLM’s behaviour against a fixed set of inputs you’ve judged yourself, you don’t know whether your changes are improving anything. You’re prompting on feel. Every “small tweak” is a coin flip you can’t tell apart from the previous one.

I’d been doing exactly that. Reading three or four classifier outputs, deciding the prompt was good, moving on. No baseline. No labelled set. No way to tell, six prompt revisions later, whether the thing was getting better or quietly getting worse.

So I picked the smallest version of an eval I could justify and ran it.

The smallest possible eval

I had a bounded test set sitting in plain sight: every comment my v1 classifier had marked NEEDS_HUMAN across two real PRs on a sister project. Nine items. Small enough to label in a single sitting.

The test wasn’t “is the classifier good.” It was narrower: when v1 said “I can’t decide, escalate this to a human,” was that the right call?

I read each Copilot comment. Read the surrounding code. Wrote down what I, the human, would actually do: apply, skip, or genuinely defer. Then compared my labels to the classifier’s verdict.
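
Mechanically, the comparison was nothing fancier than this, with placeholder data standing in for the nine real items:

    # Placeholder data; the real set was nine Copilot comments.
    my_labels = {
        "comment-1": "needs_human",
        "comment-2": "apply",
        "comment-3": "skip_minor",
    }
    v1_verdicts = {k: "needs_human" for k in my_labels}  # v1 escalated them all

    agreed = sum(my_labels[k] == v1_verdicts[k] for k in my_labels)
    print(f"{agreed}/{len(my_labels)} escalations were the right call")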

Three of the nine were genuine NEEDS_HUMAN. Judgement calls a model shouldn’t make alone.

The other six were the classifier flinching. Comments it could and should have classified, dressed up as ambiguity so it didn’t have to commit.

What 3-of-9 actually means

In a system designed to take work off my plate, a NEEDS_HUMAN verdict is the most expensive output. It’s the only one that requires my attention. If two thirds of those escalations are unnecessary, the classifier isn’t filtering. It’s rerouting nitpicks to a more polite-looking inbox.

Worse, the failure mode is invisible to vibes-based testing. If you only inspect the APPLY bucket, the system looks great. Everything that lands in the PR is reasonable. The damage is in the bucket you stop reading once you’ve decided the thing works.

That’s the part I want to remember. Evals don’t catch the bugs you’re already worried about. They catch the ones your existing inspection routine systematically skips over.

The verdict band

The Braintrust framing gives you bands: a small labelled set lets you sort your model into ship, iterate, or redesign. Three out of nine on the most sensitive bucket isn’t a tuning problem. You don’t fix that by softening a sentence in the prompt.
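
If you want the banding step as code, it's roughly this; the cut-offs are my illustration, not numbers from the article:

    def band(agreement: float) -> str:
        # Illustrative thresholds, not prescribed by the Braintrust article.
        if agreement >= 0.9:
            return "ship"
        if agreement >= 0.6:
            return "iterate"
        return "redesign"

    print(band(3 / 9))  # "redesign" -- where v1 landed on its own escalations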

So v1 is dead. The classifier is being rebuilt with:

  • A new output schema that separates verdict (apply / skip / escalate) from priority (how badly does this matter) from reason_class (why is the model saying this). Three signals where v1 had one mush; sketched just after this list.
  • A per-thread decision log: every Copilot comment gets a reply explaining what the classifier did and why, so I can audit individual calls instead of squinting at a summary.
  • A larger eval harness: 33 hand-labelled items across three repos and four review styles, so the next candidate has somewhere to land before it touches a real PR.
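
For the first bullet, a sketch of what the separated schema might look like; the field values are illustrative, not the final vocabulary:

    from dataclasses import dataclass

    @dataclass
    class Classification:
        verdict: str       # "apply" | "skip" | "escalate" -- what to do
        priority: str      # "high" | "medium" | "low" -- how much it matters
        reason_class: str  # e.g. "style_nit", "possible_bug", "needs_context"
        rationale: str     # free text, posted as the per-thread reply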

The harness is now the load-bearing artifact. The classifier is replaceable. The labelled set is the thing the project is building around.

What I think evals actually are

I came in thinking evals were a kind of test suite. Slower, fuzzier, same shape. They aren’t.

A test asks “does this code do what I said it does.” An eval asks “do I agree with what this model decided.” The labelling step is where the actual product judgement gets pinned down. Until you’ve sat with thirty examples and written down what the right call would have been, you don’t have a product spec. You have a vague sense of one.

That’s why labelling has to come before the prompt. I tried it the other way the first time, and the prompt I wrote was a description of behaviour I hadn’t pinned down yet. Of course it produced a classifier that flinched.

What I’d do differently next time

If I were starting this project today, I’d:

  • Label 20-30 examples before writing a single line of classifier prompt.
  • Define the output schema by asking what columns I want in the labelling spreadsheet, not by asking what fields the model should produce.
  • Treat NEEDS_HUMAN (or any “ask a person” verdict) as a budget. If more than a small share of items land there, the classifier is failing, not deferring. (See the sketch after this list.)
  • Refresh the harness when real-world inputs drift, not on a fixed cadence.
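
The budget idea is easy to make concrete; the threshold below is a guess, not a number I've validated:

    def escalation_rate(verdicts: list[str]) -> float:
        return verdicts.count("escalate") / len(verdicts)

    def within_budget(verdicts: list[str], max_rate: float = 0.15) -> bool:
        """Fail a candidate that hides behind 'ask a person' too often."""
        return escalation_rate(verdicts) <= max_rate

    # v1's behaviour on my nine items: six unnecessary escalations.
    print(within_budget(["escalate"] * 6 + ["apply"] * 3))  # False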

None of this is novel. It’s the eval discipline ML teams have been doing for years, and that I, in solo-dev land, had been quietly assuming I could skip.

I couldn’t.

What’s next

The next post in this thread will be the v2 design: the new schema, the per-thread decision log, and the first run of the candidate classifier against the 33-item harness. The interesting question isn’t “did v2 beat v1.” It’s whether the harness gives me strong enough signal to know.

I’m cautiously optimistic. But cautiously is doing a lot of work in that sentence, and the only thing licensing it is that this time, I’ll have numbers.