Beyond Scout Ch. 1: Industrial Grade Evals


Sumeet Bansal, Shiv Godhia, Jeff Chen


How Our Evals Grew Up With Our Agents

Last year, we wrote about Scout, Traba’s AI interviewer. Scout was the first of what’s now a fleet of agents in production: vetting workers, understanding their preferences, processing timesheets, profiling the businesses we staff, and answering operational questions across our whole data stack. We’ll dive deep into select agents in future posts, but in summary, our biggest lesson was that building the agents turned out to be the easy part.


The true challenge turned out to be keeping all of them correct and safe, while they make real decisions about real people. As the agents grew more capable, the loops that keep them honest had to grow with them: from confirming a single verdict to trusting an agent to act on its own.


Traba runs on a simple operational model: in order to build scalable architecture that works for sophisticated, in-person systems, you must first do what does not scale. Go into the field, roll up your sleeves, learn what actually works, feed it back into the product, and scale the product. Scout’s improvement loop is one instance of that machine, tightened in stages:

  • Manual (early 2025): An engineer hand-configured an environment, dialed in for a mock interview, eyeballed the result, and shipped if a handful of cases looked right, blind to whatever else broke. Essentially, deploy-and-pray.

  • Basic evals (mid 2025): Input-output unit tests for the agent: replay a real call and check the verdict it returns against the answer an expert gave. Now every change reports, deterministically, what regressed (“trust me” → “show me”). That’s what let us swap Scout’s model for one ~90% cheaper with no loss in accuracy. The same harness now backs evals for our worker-facts, company-facts, and timesheet agents.

  • Operators take the wheel (late 2025): The people closest to the field stopped just flagging where Scout was wrong and started fixing it: rephrasing a question, raising the bar for a “great” answer, validating against real transcripts themselves. Every correction they log becomes a labeled example, and the rig deliberately drops the ones that aren’t real misses (bad audio, say) so they don’t poison the dataset. No engineer in the loop.

  • Agents in the loop (early 2026): At this point, the failure signal is structured enough (each failure carries the model’s own reasoning next to the ground-truth answer) that an agent can read it. Nightly, agents hunt for weak spots in the day’s interviews, rewrite, re-run, promote, and leave a diff for the morning. The guardrail that makes this safe: the agent has to clarify the underlying rule, and may never paste a failing transcript in as an “example” just to make that one case pass.
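As a rough illustration of the basic-evals stage, the sketch below assumes each recorded call has already been replayed through two versions of the agent and only compares the resulting verdicts against the expert's label. `ReplayCase`, `regression_report`, and the call IDs are hypothetical names chosen for this example, not Scout's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCase:
    """One recorded call plus the verdict a human expert gave it."""
    call_id: str
    expert_verdict: bool  # True means "qualified"


def regression_report(cases, baseline, candidate):
    """Compare two agent versions against expert labels.

    `baseline` and `candidate` map call_id -> verdict obtained by
    replaying that call through the respective agent version.
    """
    regressed, fixed = [], []
    for case in cases:
        was_right = baseline[case.call_id] == case.expert_verdict
        is_right = candidate[case.call_id] == case.expert_verdict
        if was_right and not is_right:
            regressed.append(case.call_id)
        elif is_right and not was_right:
            fixed.append(case.call_id)
    correct = sum(candidate[c.call_id] == c.expert_verdict for c in cases)
    return {
        "accuracy": correct / len(cases) if cases else 0.0,
        "regressed": sorted(regressed),
        "fixed": sorted(fixed),
    }


# Made-up data: three replayed calls with expert labels.
cases = [
    ReplayCase("call-1", True),
    ReplayCase("call-2", False),
    ReplayCase("call-3", True),
]
baseline = {"call-1": True, "call-2": True, "call-3": True}
candidate = {"call-1": True, "call-2": False, "call-3": False}
report = regression_report(cases, baseline, candidate)
print(report)
```

In this toy run the candidate fixes `call-2` but regresses `call-3`, the kind of trade that a handful of eyeballed cases would likely miss.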


Each turn does two things:

  1. It pulls the engineer further out of the loop.

  2. It unbinds the loop from human effort: an engineer iterates a few times a week, an operator more often, and a fleet of agents is not bound by human effort at all. Further, the loops compound: evals on past calls, operators flagging from the field, and monitoring on live calls all feed the same machine. So improvement stops being something we run occasionally, when someone happens to notice a problem, and becomes something that runs constantly, in parallel, and proactively. Eval evolves into a data center of operators and strategists, running the loop all at once, all the time.


By the way: this isn’t the stop where we get off. We’re building toward a loop that closes around real outcomes: agents watching what happens on shift day — who showed up, who got sent home, who a business asked for by name again — and tuning the agents automatically against the thing we actually care about.


The end state: each agent becomes one node in a funnel that tunes itself against the ground truth of the physical world. We’re not there yet (and getting there is exactly the kind of problem we’re hiring for).


Why It’s Hard

Scout’s calls affect people’s lives, and so do its mistakes.


False positive: an unqualified worker shows up at a warehouse and a business loses trust.
False negative: a qualified worker loses a shift and the income.


Enough misses and automation is pointless — a human ends up re-interviewing everyone by hand anyway.
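To make the two failure modes concrete, here is a minimal sketch that counts them from pairs of agent and expert verdicts. The function name `miss_rates` and the sample data are assumptions made for illustration only.

```python
def miss_rates(pairs):
    """Compute false-positive and false-negative rates.

    `pairs` is an iterable of (agent_says_qualified, expert_says_qualified).
    A false positive is the agent passing a worker the expert would reject;
    a false negative is the agent rejecting a worker the expert would pass.
    """
    fp = fn = positives = negatives = 0
    for agent, expert in pairs:
        if expert:
            positives += 1
            if not agent:
                fn += 1
        else:
            negatives += 1
            if agent:
                fp += 1
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


# Made-up verdicts: (agent, expert)
pairs = [(True, True), (False, True), (True, False), (False, False), (False, False)]
rates = miss_rates(pairs)
print(rates)
```

Tracking the two rates separately matters because they hurt different parties: one costs a business its trust, the other costs a worker a shift.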


There’s no SWE-bench here: “correct” is a judgment about a person, and it moves. The same answer passes for one shift and fails for another, depending on the role’s pay, the facility’s transportation options, and what the operator decided a “great” answer means here.


The secret sauce here is the specialization of the questions: asking “can you drive a forklift?” is not enough; you have to probe with something like “tell me about cold-storage reach-truck safety.” Scout assembles each interview at runtime from that context, so there’s no single prompt to test.


To illustrate the complexity, a quick deep dive into Scout’s evals.


Every Scout interview is built from three kinds of questions: logistics (availability, pay rate, transportation), custom operator-authored checks, and predefined attribute questions. A worker who passes an attribute question earns a provisional attribute, which becomes a concrete, empirically derived attribute once they’ve successfully completed the requisite first shift.


To know a change is safe, you replay the call exactly as it happened, by reconstructing:

  • the attribute questions for that role, as they were when the call ran

  • the good/bad/great criteria attached to each one

  • the role’s logistics parameters — pay rate, location, transportation, shift times

  • the version of the logistics prompts that was live at call time

  • the worker’s prior interview history, which determines which questions were skipped as duplicates


None of it is static — it lives across the database, the prompt store, and per-worker state — and the verdict is only “correct” if all of it lines up the way it did in production.
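One way to picture the reconstruction is as a frozen snapshot that bundles everything the list above names, captured as it was at call time. This is a sketch under stated assumptions: the class names, field names, and sample values are hypothetical, and the real data is described only as living across the database, the prompt store, and per-worker state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeQuestion:
    attribute: str
    text: str
    criteria: dict  # maps "bad" / "good" / "great" to a description


@dataclass(frozen=True)
class ReplaySnapshot:
    """Everything needed to replay one call as it ran in production."""
    call_id: str
    attribute_questions: tuple      # questions for the role at call time
    logistics: dict                 # pay rate, location, transportation, shift times
    logistics_prompt_version: str   # prompt version live at call time
    previously_answered: frozenset = frozenset()  # from prior interview history

    def questions_asked(self):
        """Questions actually put to the worker, skipping duplicates."""
        return [
            q for q in self.attribute_questions
            if q.attribute not in self.previously_answered
        ]


# Illustrative values only.
snapshot = ReplaySnapshot(
    call_id="call-42",
    attribute_questions=(
        AttributeQuestion(
            "forklift",
            "Tell me about your forklift experience.",
            {"bad": "none", "good": "some", "great": "certified, recent"},
        ),
        AttributeQuestion(
            "cold_storage",
            "Tell me about cold-storage reach-truck safety.",
            {"bad": "none", "good": "general", "great": "specific procedures"},
        ),
    ),
    logistics={"pay_rate": 21.0, "transportation": "bus"},
    logistics_prompt_version="v7",
    previously_answered=frozenset({"forklift"}),
)
asked = [q.attribute for q in snapshot.questions_asked()]
print(asked)
```

Freezing the snapshot is a deliberate choice in this sketch: a replay should not be able to mutate the context it is being graded against.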


That’s why off-the-shelf eval tools don’t fit — they assume a fixed prompt and a fixed answer key, and here neither holds.


The First Payoff

The first clear payoff from building comprehensive Scout evals came quickly. Our first Scout architecture made one holistic judgment on a worker per call — qualified or not qualified — and we could reproduce results by replaying real calls against the answer an expert gave. That gave us the confidence to do something most teams won’t risk: swap the model under a live agent.


Late last year, we moved the majority of Scout LLM usage from Anthropic to Gemini, cutting costs ~90% with no loss in accuracy, because the evals caught exactly where the new model misbehaved before it reached a worker:

  • it slipped into Spanish where Anthropic never did, and

  • its rule-following degraded on multi-step logic — it lost the thread on Scout’s step-by-step pay-rate clarification.


We fixed both and confirmed the fix before rolling out.


Later, when a newer, more capable Gemini timed out on long inputs like a multi-page contract, the same evals told us to stay put.


Evals aren’t a one-time migration tool — they’re how you keep up as the models keep moving under you.


Scaling Beyond Boolean Evals

Earlier versions of Scout allowed for simple evals, because each rendered a single judgment per call. Most of the agents we build for the physical world, however, go far beyond a single verdict.


Neutron, our omnipresent agent, answers BI questions, builds dashboards and data pipelines, resolves technical support, and executes complex operational workloads.


Our Neo product is a full decision intelligence suite that connects to the entirety of a supply chain and demystifies sophisticated workflows, deploys insights, self-improves, and takes consequential actions on behalf of the business.


These agents go far beyond a boolean result, so you must grade the whole trajectory an agent took to get there. The evals we build need to account for:

  • An agent picking its own tools. We score the whole trajectory — which tools it calls, in what order, and which it correctly avoids — with a routing probe that inspects the tool it’s about to use before it runs. A right answer reached the wrong way doesn’t ship.

  • An agent acting autonomously in the world. Before we trust an agent to do things, not just answer, we run it against a seeded scenario and check the database to confirm that the reminder was actually scheduled, the dashboard actually built, the approval actually fired: proof it took the action, not just claimed to.

  • An agent talking to customers. Grading open-ended answers via LLM-as-a-judge for tone, completeness, and judgment — not just right-or-wrong — is what lets an agent hold a real conversation with a business instead of returning a verdict.

  • An agent facing adversaries. A standing suite of attacks (prompt-leak, a “boiling frog” injection buried across turns, cross-tenant probes, data-exfiltration) gates every change, so we can put an agent in front of the open world and trust it holds.

  • Move faster without breaking things. Nightly runs catch regressions that creep in over weeks, and the self-improvement cycle leaves us with end-to-end solutions in the morning.
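The tool-selection check in the first bullet can be sketched as follows. This is a minimal illustration, not the actual probe: `TrajectorySpec`, `routing_probe`, `score_trajectory`, and the tool names are all hypothetical. The probe runs before a tool executes, and the scorer passes a trajectory only if the required tools appear in order and no forbidden tool was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectorySpec:
    required_in_order: tuple  # tools that must appear, in this relative order
    forbidden: frozenset      # tools the agent must avoid in this scenario


class ForbiddenToolError(RuntimeError):
    """Raised by the probe when a forbidden tool is about to run."""


def routing_probe(tool_name, spec):
    """Inspect the tool the agent is about to use, before it runs."""
    if tool_name in spec.forbidden:
        raise ForbiddenToolError(tool_name)


def score_trajectory(calls, spec):
    """Grade a finished trajectory, given the ordered list of tool names."""
    used_forbidden = [c for c in calls if c in spec.forbidden]
    # Ordered-subsequence check: each `in` consumes the iterator up to the match.
    remaining = iter(calls)
    in_order = all(tool in remaining for tool in spec.required_in_order)
    return {
        "passed": in_order and not used_forbidden,
        "in_order": in_order,
        "used_forbidden": used_forbidden,
    }


spec = TrajectorySpec(
    required_in_order=("lookup_shift", "schedule_reminder"),
    forbidden=frozenset({"delete_shift"}),
)
good = score_trajectory(["lookup_shift", "get_worker", "schedule_reminder"], spec)
wrong_order = score_trajectory(["schedule_reminder", "lookup_shift"], spec)
unsafe = score_trajectory(["lookup_shift", "delete_shift", "schedule_reminder"], spec)
print(good["passed"], wrong_order["passed"], unsafe["passed"])
```

Note that `unsafe` reaches the right tools in the right order and still fails, which mirrors the rule that a right answer reached the wrong way doesn't ship.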


Underneath all of these is one generic harness. Every rig — call assessment, worker facts, company facts, the timesheet agents — extends the same base: it brings its own dataset, its own scoring, and a schema for its inputs and outputs, and inherits the rest (dataset fetching, concurrency, per-item timeouts, reporting). Standing up a new one is cheap enough that a coding agent can usually one- or two-shot it from the interface.
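A base class along these lines is one plausible shape for such a harness. It is a sketch, assuming an asyncio-based runner: the names `EvalRig`, `run_item`, and `EchoRig` are invented here, and the real interface may differ. Each rig supplies its dataset, its agent call, and its scoring; the base supplies concurrency limits, per-item timeouts, and a summary report.

```python
import asyncio
from abc import ABC, abstractmethod


class EvalRig(ABC):
    """Base harness: subclasses bring data, the agent call, and scoring."""

    concurrency = 4        # max items evaluated at once
    item_timeout_s = 30.0  # per-item time budget

    @abstractmethod
    def dataset(self):
        """Return the list of items to evaluate."""

    @abstractmethod
    async def run_item(self, item):
        """Run the agent under test on one item and return its output."""

    @abstractmethod
    def score(self, item, output):
        """Return True if the output is acceptable for this item."""

    async def _evaluate(self, item, gate):
        async with gate:
            try:
                output = await asyncio.wait_for(self.run_item(item), self.item_timeout_s)
            except asyncio.TimeoutError:
                return "timeout"
        return "pass" if self.score(item, output) else "fail"

    async def run(self):
        gate = asyncio.Semaphore(self.concurrency)
        results = await asyncio.gather(*(self._evaluate(i, gate) for i in self.dataset()))
        report = {"pass": 0, "fail": 0, "timeout": 0}
        for outcome in results:
            report[outcome] += 1
        return report


class EchoRig(EvalRig):
    """Toy rig: the 'agent' sleeps, then returns a canned answer."""

    item_timeout_s = 0.05

    def dataset(self):
        return [
            {"delay": 0.0, "answer": True, "expected": True},
            {"delay": 0.0, "answer": False, "expected": True},
            {"delay": 1.0, "answer": True, "expected": True},  # exceeds the timeout
        ]

    async def run_item(self, item):
        await asyncio.sleep(item["delay"])  # stands in for a model call
        return item["answer"]

    def score(self, item, output):
        return output == item["expected"]


report = asyncio.run(EchoRig().run())
print(report)
```

Counting timeouts as their own outcome, rather than as failures, keeps a slow model distinguishable from a wrong one in the report.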


The original bet still holds: replaying real production against human ground truth is the most honest signal we have. But the more human responsibility our agents take on, the more rigorously they must earn our trust.


What We’d Take From This

  • The hard part of production AI is the eval-and-improve loop. Standing up an agent is easy now. The mark of a true engineering solution is keeping it correct as the roles, the questions, and the models all shift underneath it.

  • Your best ground truth comes from the people closest to the work. Domain experts — not engineers — know what “right” looks like for a real shift at a real warehouse. Getting their judgment into the loop, and making it cheap to give, is the WD-40 that greases the whole machine.

  • Earn trust one capability at a time. Before you let an agent take an action, prove it takes the action; before you put it in front of a customer, prove it behaves. What you can evaluate is what you can ship — so the evals have to grow as fast as the agents do.

  • Build to take yourself out of the loop. In 2026, engineering cannot be the bottleneck of development. True victory is achieved when operators and agents run the evolution loop without you, pointed at real-world outcomes. The further the humans step back and the closer the loop gets to ground truth, the better the system runs.


Lastly — we’re hiring. All of the systems that we build are designed to be resilient in the messy, high-stakes physical world. We don’t sell shovels — we run the mines. If minesweeping is your jam, put on your hard hat and come aboard.

Copyright © 2025. All Rights Reserved by Traba

Empowering businesses and workers to reach their full productivity and potential.
