
Evals on day zero

An agent without evals is a complaint waiting to happen. The discipline we hard-code into every deployment, plus the four-tier suite each of our agents goes live with.

Apr 28, 2026 3 min read
evals, agentic, production, viox-os

You can usually predict which AI agents are going to fail in production by asking one question: how do you measure when this agent gets worse?

Most teams have some version of "we don't." They built an agent, it worked in the demo, it went live, and they're hoping it still works. They'll find out from a customer.

That is the path from "AI initiative" to "the AI thing that keeps embarrassing us." So we hard-code a different rule into every deployment: no agent goes live without an eval suite. Not "we'll add evals later." Day zero, before the agent talks to a real user.

Why up front

Three reasons.

Models drift. The Claude Sonnet you tested last month can have the same name but not the same weights: Anthropic ships new checkpoints, and a model alias can move to the newer one. A prompt that worked at temperature 0.7 might over-correct at the new checkpoint. Without evals, you find out from a customer.

Prompts decay. Every time someone "improves" the prompt to handle a new edge case, there's a meaningful chance they break two or three cases that were already working. Without an eval suite running on every prompt change, prompt engineering is just vibes.

Tools change. The CRM API returns a slightly different field. The search API rate-limits one new endpoint. Your agent silently works around it by hallucinating. Evals catch this; humans don't.

If any of those three matter to your business, you need evals. They all matter to every business.
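The prompt-decay reason above is the easiest of the three to make concrete. Below is a minimal sketch, assuming an agent can be treated as a plain function from input text to output text and that each case carries its own pass/fail check. The names Case, run_suite, and newly_broken are illustrative, not part of any particular eval tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    id: str
    input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(agent: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case through the agent and record pass/fail per case id."""
    return {case.id: bool(case.check(agent(case.input))) for case in cases}


def newly_broken(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Case ids that passed under the old prompt and fail under the new one."""
    return sorted(cid for cid, ok in before.items() if ok and not after.get(cid, False))


# Stand-in agents for two prompt versions (hypothetical, hard-coded outputs).
def old_prompt_agent(text: str) -> str:
    return "Refunds are available within 30 days."


def new_prompt_agent(text: str) -> str:
    return "Please contact support."


CASES = [
    Case("refund-policy", "Can I get a refund?", lambda out: "30 days" in out),
    Case("non-empty", "hi", lambda out: out.strip() != ""),
]
```

Running run_suite once per prompt version and passing both results to newly_broken gives the list of cases the "improvement" broke, which is the question a prompt change should have to answer before it merges.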

The four-tier suite

Each of our agents goes live with the same shape: smoke, capability, adversarial, regression.

Tier 1, smoke (~20 cases, runs on every commit). Does the agent respond at all? Does the response contain the structured output the next step expects? Does tool calling actually invoke tools? These run in under thirty seconds and catch the 60% of failures that are infrastructure problems disguised as model problems.

Tier 2, capability (~80–150 cases, runs nightly). Can the sales agent draft an outreach email when given a lead and a CRM context? Can the billing agent categorize a transaction correctly? Can the voice agent capture a callback request and post it to Twilio? Curated cases that exercise the actual job. Scored against rubrics — exact match doesn't work for non-deterministic output.

Tier 3, adversarial (~30 cases, runs nightly). Prompt injections, jailbreaks, off-topic detours, customer hostility. Agents that pass capability tests still fail adversarial, and they fail in ways that get screenshotted and posted publicly. Always run these.

Tier 4, regression (grows over time). Every production failure becomes a regression test. A customer complains the support agent gave a wrong refund policy on a Tuesday — that conversation gets sanitized, added to the regression set, and runs forever. The suite gets sharper every week.
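To give a feel for what a Tier 1 smoke case checks, here is a minimal sketch. It assumes the agent returns a dict with a JSON string under "content" and a list of invoked tool names under "tool_calls". Those field names and the REQUIRED_KEYS contract are hypothetical stand-ins for whatever the next step in a real pipeline expects.

```python
import json
from typing import Any, Dict, List

# Hypothetical output contract: the keys the next pipeline step relies on.
REQUIRED_KEYS = {"intent", "reply"}


def smoke_check(response: Dict[str, Any], expect_tool: bool = False) -> List[str]:
    """Return a list of smoke failures; an empty list means the case passed."""
    content = response.get("content") or ""
    if not content.strip():
        return ["empty response"]

    try:
        payload = json.loads(content)
    except json.JSONDecodeError:
        return ["content is not valid JSON"]
    if not isinstance(payload, dict):
        return ["content is not a JSON object"]

    failures = []
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    # Tool calling is only required for cases that are supposed to use a tool.
    if expect_tool and not response.get("tool_calls"):
        failures.append("no tool was invoked")
    return failures
```

None of these checks look at answer quality. They only ask whether the plumbing works, which is why they can run on every commit.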

How they run

We run evals through LangSmith and Helicone for the standard work, and Arize for the harder evaluation tasks. Each agent run produces a trace, each trace gets scored against the rubric, and the scores land on a dashboard the operator actually sees.
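The trace, rubric, dashboard flow can be sketched in plain Python. This is an illustration of the shape only: it does not use the LangSmith, Helicone, or Arize APIs, and the Trace class, the rubric format, and the 0 to 10 scale are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    category: str  # e.g. "capability" or "adversarial"
    output: str    # final agent output recorded in the trace


# Hypothetical rubric shape: (description, predicate over the output text).
Rubric = List[Tuple[str, Callable[[str], bool]]]


def score_trace(trace: Trace, rubric: Rubric) -> float:
    """Fraction of rubric criteria the trace satisfies, scaled to 0-10."""
    hits = sum(1 for _, passes in rubric if passes(trace.output))
    return 10.0 * hits / len(rubric)


def dashboard_rows(traces: List[Trace], rubric: Rubric) -> Dict[str, float]:
    """Mean score per category, the kind of number an operator would see."""
    by_category = defaultdict(list)
    for trace in traces:
        by_category[trace.category].append(score_trace(trace, rubric))
    return {cat: round(mean(scores), 2) for cat, scores in by_category.items()}


EMAIL_RUBRIC: Rubric = [
    ("mentions the lead by name", lambda out: "Dana" in out),
    ("includes a call to action", lambda out: "call" in out.lower()),
]

TRACES = [
    Trace("capability", "Hi Dana, can we schedule a call this week?"),
    Trace("capability", "Hello, here is our brochure."),
    Trace("adversarial", "Hi Dana, I can't help with that, but happy to set up a call."),
]
```

In practice the predicates would often be replaced by a model-graded or human-graded rubric. The aggregation step stays the same either way.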

The rule that matters: a falling score is a fact, not a vibe. When the sales agent's email-quality score drops from 7.4 to 6.8 over a week, we don't argue about whether the agent feels worse. We open the regression cases that started failing and we fix them.

What it does for the deployment

When we put a system into production, the eval dashboard goes live before the agent does. The operator sees a baseline score on day one, sees the score over time, sees which categories pass and which fail. They have a number to point at when leadership asks whether the AI is working.

That number is the difference between a real AI deployment and an expensive demo.

It also means we can't hide behind plausible deniability. If the score drops, our customers see it before we tell them. The pressure to keep quality high is structural, not aspirational.

If you're running an agent in production right now and you can't answer "what's the eval score this week," start small. Pick twenty representative customer interactions from the last thirty days. Write a rubric: five to seven things a good response should do, two or three it should never do. Run the twenty cases through your agent, score each, average them. That's your baseline. Re-run weekly. If the score moves more than 10%, investigate. Twenty cases scored weekly is dramatically better than nothing, and dramatically better than what most teams running production agents have today.
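The arithmetic behind that starter routine fits in a few lines. This sketch assumes "moves more than 10%" means a relative change against the baseline in either direction, not an absolute change in points. The function names are illustrative.

```python
from statistics import mean
from typing import List


def weekly_score(case_scores: List[float]) -> float:
    """Average the per-case rubric scores into one number for the week."""
    return mean(case_scores)


def relative_change(baseline: float, current: float) -> float:
    """Signed relative change versus baseline, e.g. -0.125 for a 12.5% drop."""
    return (current - baseline) / baseline


def needs_investigation(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """True when the score moved more than the threshold in either direction."""
    return abs(relative_change(baseline, current)) > threshold
```

An unexplained jump is flagged as well as a drop, since a sudden improvement can mean the rubric or the cases changed and not the agent.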
