In Beyond Scout, we laid out the problems physical AI has to solve across the worker funnel. Today, we explore a core question: how do you evaluate the prompts behind an AI interviewer, and keep them producing the same correct results as the roles, questions, and models underneath them change? A prompt that correctly vets a forklift role today still has to vet it correctly after that role’s requirements shift, after a new kind of question ships, and after we swap the underlying model.
Vetting is the most universal step in the labor lifecycle within the supply chain, and it is also the highest-stakes touchpoint. A false positive sends an unqualified worker to a business and burns the relationship. A false negative costs a qualified worker a shift, and with it a chance at income. Rack up enough false positives and you defeat the point of automation altogether, because every candidate then has to be double-vetted by a human. At our volume, tens of thousands of workers processed per week, precision is a core premise rather than a nice-to-have.
SEER — the Scout Evaluation & Experimentation Rig — is the most important system we built around Scout. It’s what lets us swap models, scale new question types, and hand prompt iteration first to operators and then to an agent.
To understand its origin, let’s consider the world prior to SEER.
Shipping prompts blind
During our first iterations of Scout, improving its assessment was a manual, nervous process. An engineer hand-configured their environment, dialed in for a mock interview, eyeballed the assessment that came back, and shipped if the handful of cases they tried looked right. There was no way to know whether the change had quietly regressed something else.
That was survivable when Scout was simple and linear. It stopped being survivable as the system grew to include attribute questions, multilingual calls, deduplication, and branching evaluation criteria. Every feature multiplied the pathways and the edge cases. Meanwhile, Scout’s assessments were driving more and more real-world outcomes. Testing prompt changes against production scenarios at volume became a daily occurrence.
What about off-the-shelf eval tools?
The obvious move would be to reach for a standard prompt-eval harness. But we have a core challenge: Scout has no single prompt to evaluate. It assembles each interview at runtime from variable context, and that context changes what “correct” even means.
Every interview is built from three kinds of questions:
Logistics questions — the standard screens every worker gets: availability, pay rate, transportation to the facility.
Custom questions — operator-authored checks for a specific campaign, run through a separate model call with its own output shape.
Attribute questions — role-specific questions the operator defines along with the criteria for a good, bad, or great answer: “Have you ever operated a reach truck in a cold-storage environment?”
Within this context, even the easy case isn’t easy. Logistics questions use versioned prompts in Langfuse, which gives them a structural basis for testing in isolation, but they still evaluate against variable context: whether a pay answer is acceptable depends on that role’s rate, and whether a transportation answer works depends on the options at that facility. The same sentence from a worker is a pass for one shift and a fail for another.
Attribute questions are where the difficulty compounds. Operators first define the question set and the good/bad/great criteria for each attribute. The “prompt” here isn’t a static artifact you can drop in a playground; it’s assembled at runtime from operator-defined logic plus a specific worker’s transcript. There is a further wrinkle: we run a thin LLM layer before Scout calls to deduplicate questions that have already been asked, so two calls for the same role and the same worker can be correctly different. The set of questions, and therefore the set of evaluation criteria, varies per worker even within one campaign.
So to faithfully replay a single call, you have to reconstruct everything that call saw:
the operator-authored attribute questions for that role, as they were when the call ran
the good/bad/great criteria attached to each one
the role’s logistics parameters — pay rate, location, transportation requirements, shift times, etc.
the version of the logistics prompts that was live at call time
the worker’s prior interview history, which decides what questions get skipped
and finally, the transcript itself
None of this is static. It lives across the database, Langfuse, and per-worker state, and the assessment is only “correct” if all of it lines up the way it did in production. Off-the-shelf tools assume a stable prompt and a fixed dataset. We need something more dynamic.
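As a rough illustration of the list above, the pieces a replay needs can be written down as one typed snapshot. This is a sketch under assumptions: every type and field name here (`ReplayContext`, `missingPieces`, and so on) is hypothetical and not Scout’s real schema. The only point is that a replay should refuse to run on a partial snapshot.

```typescript
// Hypothetical snapshot of everything one replayed call needs.
// Field names are illustrative, not Scout's actual schema.
interface AttributeQuestion {
  id: string;
  text: string;
  criteria: { good: string; bad: string; great: string };
}

interface ReplayContext {
  attributeQuestions: AttributeQuestion[]; // as authored when the call ran
  logistics: {
    payRate: number;
    location: string;
    transportation: string[];
    shiftTimes: string[];
  };
  logisticsPromptVersion: number; // prompt version live at call time
  priorInterviewQuestionIds: string[]; // history that decides what gets skipped
  transcript: string;
}

// List the pieces that are still missing, so a replay can stop early
// instead of scoring the model against the wrong context.
function missingPieces(ctx: Partial<ReplayContext>): string[] {
  const required: (keyof ReplayContext)[] = [
    "attributeQuestions",
    "logistics",
    "logisticsPromptVersion",
    "priorInterviewQuestionIds",
    "transcript",
  ];
  return required.filter((key) => ctx[key] === undefined);
}
```

A caller would check that `missingPieces(snapshot)` is empty before handing the snapshot to the assessment pipeline.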
Enter SEER
So we built it. SEER replays a real production call against any prompt or model change and tells us, deterministically, whether Scout still reaches the same verdict an expert would. Three parts make that work:
A dataset of past call transcripts, each paired with the ground-truth assessment an operator confirmed
A runner that pulls each item, rebuilds the exact context that call ran with, and executes Scout’s real assessment pipeline against it
A scorer that compares the output to the ground truth criterion by criterion and rolls the run up into a stability report
Change a prompt, swap a model, point SEER at the dataset, and minutes later you have a per-criterion accuracy delta instead of a gut feeling.
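To make “per-criterion accuracy delta” concrete, here is a minimal sketch of the scoring step. It assumes each item’s expected and actual output is a flat map from criterion name to verdict label; that shape and the function names are assumptions for illustration, not SEER’s real types.

```typescript
// Assumed shape: criterion name -> verdict label (for example "pass" / "fail").
type Verdicts = Record<string, string>;

interface ScoredItem {
  expected: Verdicts; // operator-confirmed ground truth
  actual: Verdicts; // what the pipeline produced on replay
}

// Fraction of items where the model matched ground truth, per criterion.
function accuracyByCriterion(items: ScoredItem[]): Record<string, number> {
  const hits: Record<string, number> = {};
  const totals: Record<string, number> = {};
  for (const item of items) {
    for (const [criterion, verdict] of Object.entries(item.expected)) {
      totals[criterion] = (totals[criterion] ?? 0) + 1;
      if (item.actual[criterion] === verdict) {
        hits[criterion] = (hits[criterion] ?? 0) + 1;
      }
    }
  }
  const accuracy: Record<string, number> = {};
  for (const [criterion, total] of Object.entries(totals)) {
    accuracy[criterion] = (hits[criterion] ?? 0) / total;
  }
  return accuracy;
}

// Candidate minus baseline, per criterion; negative values are regressions.
function accuracyDelta(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
): Record<string, number> {
  const delta: Record<string, number> = {};
  for (const [criterion, base] of Object.entries(baseline)) {
    delta[criterion] = (candidate[criterion] ?? 0) - base;
  }
  return delta;
}
```

Running the same dataset once with the live prompt and once with the edited one, then passing both accuracy maps to `accuracyDelta`, gives one signed number per criterion.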
HITL: Ground truth comes from the field
As it turns out, the best data labels within this context don’t come from engineers. They come from the people closest to the customer — our operations team, who spend their days with businesses and know what “qualified” actually means for a given role at a given facility. This enables us to introduce a robust human-in-the-loop process.
When Scout finishes an interview, operators review the assessment in the ops console. If Scout got it wrong — passed someone who shouldn’t have, or failed someone who clearly qualified — they override it. The operator picks a reason (poor attribute-question assessment, not enough follow-up, misread transcript) and supplies what the right answer should have been. The rig is careful about which corrections it trusts: if an override was logged for a reason that isn’t actually an assessment miss — bad audio, say — SEER reverts that field to Scout’s original call rather than poisoning the label.
Every trusted override becomes a labeled example: the transcript, what Scout said, and what was actually true. Those corrections flow continuously into datasets in Langfuse, so the eval corpus grows straight from production and tracks the real distribution of calls Scout handles, not a hand-picked set of easy ones.
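The trust rule for overrides can be sketched as a small merge function. The reason identifiers below are hypothetical names for the reasons mentioned above, and the flat field map is an assumed simplification of the real assessment shape.

```typescript
// Hypothetical identifiers for the override reasons described in the text.
type OverrideReason =
  | "poor_attribute_assessment"
  | "not_enough_follow_up"
  | "misread_transcript"
  | "bad_audio";

// Only these reasons mean Scout's assessment itself was wrong.
const ASSESSMENT_MISSES: ReadonlySet<OverrideReason> = new Set<OverrideReason>([
  "poor_attribute_assessment",
  "not_enough_follow_up",
  "misread_transcript",
]);

interface FieldOverride {
  field: string;
  reason: OverrideReason;
  correctedValue: string;
}

// Build the ground-truth label: start from Scout's output and apply only
// the overrides that reflect a genuine assessment miss.
function buildGroundTruth(
  scoutOutput: Record<string, string>,
  overrides: FieldOverride[],
): Record<string, string> {
  const label: Record<string, string> = { ...scoutOutput };
  for (const override of overrides) {
    if (ASSESSMENT_MISSES.has(override.reason)) {
      label[override.field] = override.correctedValue;
    }
    // Otherwise the field keeps Scout's original value.
  }
  return label;
}
```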
SEER: replaying production, not prompts
For each item in that dataset, SEER runs Scout’s actual assessment pipeline — not a mock, not a simplified copy — injecting the real attribute questions and criteria that call used and reconstructing the full context the model saw in production. Then it compares Scout’s output against the operator-annotated ground truth and computes per-question accuracy.
That’s what separates SEER from a generic harness. Each dataset item carries its own evaluation context, so SEER measures whether the model reaches the same verdict a domain expert reached on that exact call, with everything that made the call what it was.
And it’s fully parameterized. You can swap the underlying model, override specific prompt text or evaluation criteria, and choose your scope — a single call, a whole campaign, or the entire dataset. Backtests run 40 at a time and finish in minutes; when a run completes, the results land in a Google Sheet for analysis and a notification posts to Slack so the team knows it’s ready to read.
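Running backtests “40 at a time” is a bounded-concurrency pattern. The helper below is a generic sketch of that pattern, not SEER’s runner: the function name is made up, and the per-item timeout and error capture are left out to keep it short.

```typescript
// Run `worker` over every item with at most `limit` calls in flight.
// Results come back in input order regardless of completion order.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(0, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With this shape, a backtest over a whole campaign would call `runWithConcurrency(datasetItems, 40, replayOne)`, where `replayOne` is whatever function replays a single call.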
Underneath, SEER is the first and most mature consumer of a generic dataset-runner we built alongside it, on purpose. The bet was that every eval rig at Traba would need the same machinery — fetch a dataset, run a pipeline, score it, report — and differ only in input shape and scoring logic. So the framework is a single abstract base class with six type parameters (input, expected output, actual output, run config, metadata, processed item), and it owns dataset fetching, concurrency, the per-item timeout, error capture, and reporting. A new rig implements four methods — run the pipeline, convert the output to the label shape, score it, collect the result — and brings a Zod schema for its inputs and outputs. Everything else is inherited.
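A stripped-down sketch of that base class might look like the following. All names are hypothetical, the run loop is synchronous for brevity, and the parts the real framework is described as owning (dataset fetching, concurrency, timeouts, schema validation) are omitted. The toy `KeywordRig` exists only to show how little a new rig has to supply.

```typescript
interface DatasetItem<TInput, TExpected, TMeta> {
  input: TInput;
  expected: TExpected;
  metadata: TMeta;
}

// Six type parameters: input, expected output, actual output, run config,
// metadata, processed item.
abstract class DatasetRunner<TInput, TExpected, TActual, TConfig, TMeta, TProcessed> {
  // The four rig-specific hooks.
  protected abstract runPipeline(input: TInput, config: TConfig): TActual;
  protected abstract toLabelShape(actual: TActual): TExpected;
  protected abstract score(expected: TExpected, predicted: TExpected): number;
  protected abstract collect(
    item: DatasetItem<TInput, TExpected, TMeta>,
    predicted: TExpected,
    score: number,
  ): TProcessed;

  // Shared machinery: iterate, capture errors per item, return a report.
  run(
    items: DatasetItem<TInput, TExpected, TMeta>[],
    config: TConfig,
  ): { processed: TProcessed[]; errors: string[] } {
    const processed: TProcessed[] = [];
    const errors: string[] = [];
    for (const item of items) {
      try {
        const predicted = this.toLabelShape(this.runPipeline(item.input, config));
        processed.push(this.collect(item, predicted, this.score(item.expected, predicted)));
      } catch (err) {
        errors.push(err instanceof Error ? err.message : String(err));
      }
    }
    return { processed, errors };
  }
}

// Toy rig: "does the transcript mention the keyword?"
type Call = DatasetItem<string, boolean, { callId: string }>;

class KeywordRig extends DatasetRunner<
  string, boolean, { answer: string }, { keyword: string },
  { callId: string }, { callId: string; correct: boolean }
> {
  protected runPipeline(input: string, config: { keyword: string }): { answer: string } {
    if (input.length === 0) throw new Error("empty transcript");
    return { answer: input.includes(config.keyword) ? "yes" : "no" };
  }
  protected toLabelShape(actual: { answer: string }): boolean {
    return actual.answer === "yes";
  }
  protected score(expected: boolean, predicted: boolean): number {
    return expected === predicted ? 1 : 0;
  }
  protected collect(item: Call, _predicted: boolean, score: number): { callId: string; correct: boolean } {
    return { callId: item.metadata.callId, correct: score === 1 };
  }
}
```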
Seamless model swaps: a ~70% cost cut
The clearest payoff so far? SEER let us migrate Scout’s assessment pipeline off Anthropic and onto Gemini 2.5 Flash, cutting per-assessment LLM cost by roughly 70%, without regressing accuracy.
Without SEER, a model migration is deploy-and-pray. You swap the model, run a few manual calls, and hope nothing breaks in the long tail you can’t test by hand. With SEER, it was methodical: we backtested the old model and the new one against the same dataset of operator-annotated transcripts, SEER surfaced the exact cases where Gemini diverged from ground truth, we iterated the prompts to close those gaps, re-ran to confirm, and shipped with data behind every decision.
Model swaps are exactly where production AI regresses silently. The new model is cheaper and faster, but it weighs a hedged answer differently, or reads “willing to work overtime” with a little less skepticism than the old one did. Those regressions don’t show up in happy-path testing; they show up weeks later, as operators noticing that workers are getting mis-qualified for certain roles. SEER moves that discovery to before the rollout. It catches the surprises that cut the other way, too. A newer Gemini release wasn’t automatically better: it scored worse on assessments and ran several times slower in backtest, so we stayed on the version that won on the data.
Changing a model in production is now a config flip through an internal model dispatcher (a registry that maps a logical name to a physical provider and model), and a SEER backtest gates the change before any traffic moves to it.
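One way to picture a dispatcher with a backtest gate is sketched here. This is not the internal dispatcher: the class, the logical name, the model identifier strings, and the zero-tolerance gate rule are all assumptions made for illustration.

```typescript
interface ModelTarget {
  provider: string;
  model: string;
}

// Registry from a logical name to a physical provider and model.
class ModelDispatcher {
  private readonly routes = new Map<string, ModelTarget>();

  register(logicalName: string, target: ModelTarget): void {
    this.routes.set(logicalName, target);
  }

  resolve(logicalName: string): ModelTarget {
    const target = this.routes.get(logicalName);
    if (target === undefined) {
      throw new Error(`no model registered for ${logicalName}`);
    }
    return target;
  }
}

type Accuracy = Record<string, number>; // criterion -> accuracy from a backtest

// Assumed gate rule: the candidate may not fall below the baseline on any
// criterion by more than `tolerance`.
function passesGate(baseline: Accuracy, candidate: Accuracy, tolerance = 0): boolean {
  return Object.entries(baseline).every(
    ([criterion, base]) => (candidate[criterion] ?? 0) >= base - tolerance,
  );
}

// Flip the route only when the backtest gate passes.
function switchIfSafe(
  dispatcher: ModelDispatcher,
  logicalName: string,
  target: ModelTarget,
  baseline: Accuracy,
  candidate: Accuracy,
): boolean {
  if (!passesGate(baseline, candidate)) return false;
  dispatcher.register(logicalName, target);
  return true;
}
```

Callers ask the dispatcher for a logical name such as `"scout-assessment"` and never hard-code a provider, so the swap itself touches one registry entry.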
Closing the loop: an agent that improves its own prompts
Once SEER existed, the manual prompt-iteration job started to look suspiciously mechanical. Run the backtest. Find the worst category. Read the failed transcripts. Notice the pattern — Scout is too quick to assert “yes” when a worker hedges; Scout misses hours when someone says “weekday evenings” instead of listing days. Edit the prompt. Push it. Re-run. Did it improve? Did anything else regress? Repeat.
That’s a loop, so we wrote one. We gave Claude a small set of MCP tools that read and write Langfuse prompts, and a clear contract: pick a target category, run SEER, filter to that category’s failures, fetch the current prompt, propose an edit, push it, re-run. The agent only declares success when the target category clears its bar for three consecutive runs without regressing the others — and an engineer reviews the diff before it ships.
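The agent’s stopping rule can be stated as a small predicate over backtest history. The sketch below assumes each run is summarized as accuracy per category; the function names, the `bar` parameter, and the tolerance are hypothetical.

```typescript
type RunScores = Record<string, number>; // category -> accuracy for one backtest

// A run is clean when the target clears its bar and no other category
// drops below its pre-iteration baseline.
function isCleanRun(
  run: RunScores,
  baseline: RunScores,
  target: string,
  bar: number,
  tolerance = 0,
): boolean {
  const targetOk = (run[target] ?? 0) >= bar;
  const othersOk = Object.entries(baseline).every(
    ([category, base]) => category === target || (run[category] ?? 0) >= base - tolerance,
  );
  return targetOk && othersOk;
}

// Success requires the most recent `required` runs to all be clean.
function readyToPromote(
  history: RunScores[],
  baseline: RunScores,
  target: string,
  bar: number,
  required = 3,
): boolean {
  if (history.length < required) return false;
  return history.slice(-required).every((run) => isCleanRun(run, baseline, target, bar));
}
```

Requiring consecutive clean runs rather than a single one guards against a lucky backtest being mistaken for a real improvement.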
A few choices made the loop reliable enough to leave running:
Two modes, not one. A stability-check mode just runs the backtest and reports per-category health (healthy, warning, or critical) and then offers to iterate on whichever category looks worst. The human picks the target; the agent does the work.
Iterate against latest, promote to production. While iterating, the agent writes prompt versions tagged latest, and backtests read that label, so it gets fast feedback on its own edits without touching what’s live. Only after three consecutive clean runs does it promote the version to production.
Generalize, don’t memorize. Left unchecked, an LLM editing prompts will paste the exact failing transcript in as an “example” and watch that case jump to 100%. The guardrails are baked into the contract: prefer general principles over examples, never use values from failing test cases, never reference dataset items by content. The point is to clarify the rule, not memorize the test set.
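The stability-check mode boils down to labeling each category and surfacing the weakest one. The sketch below shows that shape only; the cutoff values are invented for illustration, since the actual thresholds are not stated here.

```typescript
type Health = "healthy" | "warning" | "critical";

// Hypothetical cutoffs; the real thresholds are not given in this post.
const THRESHOLDS = { healthy: 0.95, warning: 0.85 };

function classify(accuracy: number): Health {
  if (accuracy >= THRESHOLDS.healthy) return "healthy";
  if (accuracy >= THRESHOLDS.warning) return "warning";
  return "critical";
}

// Label every category and report which one has the lowest accuracy,
// so a human can confirm it as the iteration target.
function stabilityReport(scores: Record<string, number>): {
  health: Record<string, Health>;
  worst: string | undefined;
} {
  const health: Record<string, Health> = {};
  let worst: string | undefined;
  let worstScore = Infinity;
  for (const [category, accuracy] of Object.entries(scores)) {
    health[category] = classify(accuracy);
    if (accuracy < worstScore) {
      worst = category;
      worstScore = accuracy;
    }
  }
  return { health, worst };
}
```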
The loop works because the rig was built to be read by an agent in the first place. Two architectural choices do most of it. First, failures come back structured instead of opaque: every item returns JSON with the model’s own reasoning sitting next to the expected and actual output, and for schedule availability we go further and emit a diff — missingDays, extraDays, periodMismatches — so the agent doesn’t have to re-derive what went wrong. The diagnosis is in the output. Second, the runner is typed and generic, so the agent never reasons about orchestration, retries, or concurrency. It reasons about prompts and scores, which is where the product actually lives.
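A structured availability diff of the kind described could be computed roughly as follows. The field names `missingDays`, `extraDays`, and `periodMismatches` come from the text; the `Availability` shape (day mapped to a list of periods) and the period labels are assumptions.

```typescript
type Period = "morning" | "afternoon" | "evening"; // assumed period labels
type Availability = Record<string, Period[]>; // day -> periods available

interface AvailabilityDiff {
  missingDays: string[]; // expected, but absent from the model output
  extraDays: string[]; // in the model output, but not expected
  periodMismatches: { day: string; expected: Period[]; actual: Period[] }[];
}

function diffAvailability(expected: Availability, actual: Availability): AvailabilityDiff {
  const missingDays = Object.keys(expected).filter((day) => actual[day] === undefined);
  const extraDays = Object.keys(actual).filter((day) => expected[day] === undefined);

  const periodMismatches: AvailabilityDiff["periodMismatches"] = [];
  for (const [day, expectedPeriods] of Object.entries(expected)) {
    const actualPeriods = actual[day];
    if (actualPeriods === undefined) continue; // already reported as missing
    // Compare as sets: order within a day does not matter.
    const same =
      expectedPeriods.length === actualPeriods.length &&
      [...expectedPeriods].sort().join(",") === [...actualPeriods].sort().join(",");
    if (!same) {
      periodMismatches.push({ day, expected: expectedPeriods, actual: actualPeriods });
    }
  }
  return { missingDays, extraDays, periodMismatches };
}
```

For a worker who said “weekday evenings”, a model output that drops Wednesday and marks Tuesday as mornings would come back with `missingDays: ["wed"]` and one entry in `periodMismatches`, which is the pattern an agent can act on directly.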
The lesson is the same one that makes a system legible to a new engineer: discoverable contracts, structured outputs, and a clear line between safe-to-mutate and don’t-touch state. We didn’t build a clever agent. We built a system where the agent’s job is mostly done by the time it shows up. Prompt improvement, once an engineer’s afternoon of reading transcripts, is now something the system does to itself overnight. We pick the category that needs work, and by morning there is a new production prompt with three clean runs behind it and no regressions anywhere else in the rig.
The framework scales — and so does who uses it
Because the dataset-runner is generic, standing up a new eval rig is fast: each one brings its own dataset and scoring and inherits everything else. What started as SEER now backs evals well beyond Scout’s call assessment — worker preference facts, worker attribute facts, company facts, our timesheet agents, and more. We’ve found that a coding agent can reliably one- or two-shot a new rig given the framework’s interface.
The more interesting scaling has been organizational. SEER lives in the ops console, and operators run their own backtests. That matters, because operators are the ones drafting better attribute questions and tighter criteria — they know what “qualified” looks like for a role. When an operator wants to see whether rephrasing a question or raising the bar for a “great” answer improves accuracy, they run it against real transcripts straight from the console, no engineer in the loop. The eval rig stopped being a developer tool the moment it shipped to the people closest to the work.
Lessons and generalizations
A few things we think generalize past our use case:
Design for evaluation from the start. If your system assembles prompts from variable context — and most production AI does — off-the-shelf eval won’t capture what matters. Build eval that understands your domain’s structure.
The people closest to your customers are your best source of ground truth. Engineers can build the rig; the labels that make it useful come from the people who know what a right answer looks like in practice. Make annotation easy and structured, and the corpus grows from production on its own.
Build the rig for the future you can already see coming. We built SEER to improve call assessment — but we knew we’d eventually change models, standardize questions, and test at far higher volume. Building for that was cheap up front. Not having it when the moment arrived would have been expensive.
Treat the feedback loop as first-class infrastructure. Annotations feed datasets, datasets power backtests, backtests gate prompt and model changes, and those changes get validated by more backtests. That loop, from human judgment to automated evaluation and back, is what keeps a production AI system improving instead of drifting.
Once the loop is closed, an agent can run it. The iteration agent works because SEER’s output is machine-checkable, deterministic on a fixed dataset, and rich enough to catch regressions outside the target. Once your eval is structured enough that you’d automate against it yourself, an LLM probably can too. Most of the work was on the eval side, not the agent side.
Come build the physical world with us.



