Evals & Infra · AI & Automation

Ship AI you can trust.
Measured, not vibes.

Every model and prompt change is graded against a frozen golden set before it ships, on accuracy, safety, format, and latency, with an LLM-as-judge, a regression gate in CI, and tracing on every run. No "looks good in the demo." The number moves, or the change doesn't merge.


/01 Drag the suite size · watch the gate hold the line

Live · eval suite, graded in CI

Pass rate 94%

▣ pytest evals/ --judge=claude --gate=0.90

Test-suite size 480 cases

Eval pass rate 94%

Regressions caught pre-ship 9 in 10

Graded test cases 5k

Median eval run 3.4 min

02 — Outcomes

Evals that held the line.

A ledger of named systems where the number that moved came from a regression caught before users felt it: pass rate, safety, latency budget, drift held in check. 6 of 30 shown · ledger updates as suites grow.

Northwind Support

RAG eval · Support copilot

Frozen golden set of 600 graded questions, LLM-as-judge scoring faithfulness — a prompt "improvement" that quietly broke citations got blocked at the gate

+22% Pass rate

Cobalt Legal

Safety eval · Contract Q&A

Adversarial safety suite run on every model swap — jailbreak and leakage cases scored automatically, refusals checked against policy before release

0 Safety regressions shipped

Vera Health

Regression gate · Clinical assistant

CI gate blocks any merge that drops below the clinical accuracy threshold — abstain-rate tracked so the model never trades safety for coverage

−87% Bad answers reaching users

Lumen Docs

Format eval · Developer assistant

Structured-output schema validated on every case — a model upgrade that broke JSON formatting was caught in 3 minutes, not by a user filing a bug

99.4% Schema-valid rate

Drift Finance

Latency budget · Research copilot

Latency tracked as a first-class eval dimension — p95 budget enforced in CI, so a slower prompt chain failed the gate before it hit production

−41% p95 latency

Forge Ops

Agent eval · Internal automation

Trajectory scoring on a tool-using agent — every run traced, tool calls graded for correctness, drift alerts fire when task success dips week-over-week

+34% Task success rate

03 — The harness, live

"Looks good" isn't a metric.
This is.

A frozen set of test cases, scored across four dimensions as pass/fail cells. Switch model/prompt versions and watch the grid recolour — v2 lifts accuracy, v3 regresses safety. Move the threshold and the gate decides: ship, or block. This is what "evaluated" actually looks like.

Golden set · 48 frozen cases

LLM-as-judge · scoring

Legend: pass · fail · dimension off

▣ Baseline = Prompt v1 · scoring 48 cases × 4 dimensions

Gate threshold · pass rate 90%

Prompt v2 lifts accuracy but the gate also watches safety — that's why you weight dimensions. A win on one metric can hide a regression on another.
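The weighting point above can be sketched as a small gate function. This is a minimal illustration, assuming per-dimension pass rates have already been computed; the weights, the `gate_verdict` name, and the example numbers are hypothetical, not values from the demo.

```python
from dataclasses import dataclass

# Hypothetical weights; real ones would be agreed with the product team.
WEIGHTS = {"accuracy": 0.4, "safety": 0.3, "format": 0.2, "latency": 0.1}


@dataclass
class Verdict:
    weighted_pass_rate: float
    regressed: list[str]
    ship: bool


def gate_verdict(candidate, baseline, threshold=0.90, tolerance=0.0):
    """Ship only if the weighted pass rate clears the threshold AND
    no dimension falls below its baseline by more than `tolerance`."""
    weighted = sum(WEIGHTS[d] * candidate[d] for d in WEIGHTS)
    regressed = [d for d in WEIGHTS if candidate[d] < baseline[d] - tolerance]
    ship = weighted >= threshold and not regressed
    return Verdict(round(weighted, 4), regressed, ship)


# Illustrative numbers: accuracy improves, safety slips.
baseline = {"accuracy": 0.90, "safety": 1.00, "format": 0.98, "latency": 0.95}
candidate = {"accuracy": 0.96, "safety": 0.94, "format": 0.98, "latency": 0.95}
verdict = gate_verdict(candidate, baseline)
print(verdict)
```

With these made-up inputs the weighted pass rate clears 90%, yet the verdict is still "block" because safety dropped below its baseline. That is the case a single headline number would hide.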

04 — Anatomy of the pipeline

Trust is a pipeline,
not a promise.

Confidence hides in the stages between a code change and a release — what you grade against, who judges, how the gate decides, what you trace in production. This is the room we work in: each stage measured, each tool chosen for a reason.

Eval pipeline · Northwind Support Copilot

Suite 600 cases · Pass rate 94% · Gate ≥ 90%

Stage | Tool | What it does | Signal
Dataset | Golden set · frozen JSONL | Curated input/output pairs, version-pinned so scores stay comparable | 600 cases
Run | Promptfoo · pytest | Execute every case against the candidate model/prompt, capture outputs | deterministic
Metrics | Ragas · custom scorers | Accuracy, faithfulness, format-validity, latency scored per case | 4 dims
Judge | Claude Opus 4.8 · LLM-as-judge | Grades open-ended answers against a rubric where exact-match won't do | judging
Gate | GitHub Actions · CI | Block the merge if pass rate drops below threshold or a dimension regresses | ship / block
Trace | OpenTelemetry · LangSmith | Every prompt, tool call, and token logged with a trace ID for replay | traced
Monitor | Braintrust · Grafana | Live scores on real traffic, sampled and judged, dashboards on quality | live scores
Alert | Drift detection · Slack | Fire when quality dips week-over-week before users open a ticket | on drift

green: measured & in target

live: the stage running in the demo above

amber: watch · triggers on drift

05 — Ship to production

Define correct.

Before a single test runs, we pin down what "good" actually means for your system — which answers are right, which are unsafe, what format the product needs, and the latency users will tolerate. Vague quality goals make vague evals.

/ Week 00 · Define correctness

Accuracy: What counts as a correct answer (exact-match where possible, rubric where not)

Safety: Refusal policy, leakage, jailbreak resistance (the cases you can't fail)

Format: Schema, structure, citations (what the product downstream depends on)

Latency: p95 budget the experience can live with, scored like any other dimension
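Once those four definitions are written down, each one can become a pass/fail cell per test case. The sketch below shows one way to do that with deterministic checks only. The case fields (`expected`, `forbidden`), the required JSON keys, and the 2000 ms budget are all assumptions for illustration; a real suite would take them from the agreed definitions.

```python
import json

# Hypothetical product contract: outputs must be JSON objects with these keys.
REQUIRED_KEYS = {"answer", "citations"}
# Hypothetical per-request latency budget, in milliseconds.
LATENCY_BUDGET_MS = 2000


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences don't fail a case."""
    return " ".join(text.lower().split())


def score_case(case, raw_output, latency_ms):
    """Return one pass/fail cell per dimension for a single test case."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = None
    format_ok = isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)
    answer = str(parsed["answer"]) if format_ok else ""
    leaked = any(s.lower() in raw_output.lower() for s in case.get("forbidden", []))
    return {
        "accuracy": normalise(answer) == normalise(case["expected"]),
        "safety": not leaked,
        "format": format_ok,
        "latency": latency_ms <= LATENCY_BUDGET_MS,
    }


case = {"expected": "Refunds take 5 business days", "forbidden": ["sk-live-"]}
good = score_case(case, '{"answer": "refunds take 5 business days", "citations": ["kb/refunds"]}', 1200)
bad = score_case(case, "Refunds take 5 business days", 2500)
print(good)
print(bad)
```

Exact match is used for accuracy here because it grades itself. Open-ended answers would go to a rubric-based judge instead.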

Build the golden set.

Pull real inputs from tickets, logs, and edge cases, label the expected behaviour, and freeze it. A frozen set is the whole point — scores only compare if the questions never move underneath you.

/ Week 01 · Golden set

Source: Tickets · logs · adversarial · synthetic edge cases

Label: Expected output · acceptable variants · hard fails

Cover: Happy path · long tail · safety · format breakers

Freeze: Version-pinned JSONL · checked into the repo

Curate: Grow from production failures · prune the stale
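One way to make "frozen" enforceable is to pin a content hash next to the JSONL file and refuse to run if it moves. The sketch below assumes one JSON object per line. The `load_golden_set` helper and the field names in the sample record are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_golden_set(path, expected_sha256=None):
    """Load a JSONL golden set; raise if its content hash differs from the pinned one."""
    raw = Path(path).read_bytes()
    digest = hashlib.sha256(raw).hexdigest()
    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(
            f"golden set changed: expected {expected_sha256[:12]}, got {digest[:12]}"
        )
    cases = [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]
    return cases, digest


# Usage: write a tiny set, pin its hash, then show that an edit is detected.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "golden.jsonl"
    path.write_text(
        '{"id": "c01", "input": "How long do refunds take?", "expected": "5 business days"}\n',
        encoding="utf-8",
    )
    cases, pinned = load_golden_set(path)
    with path.open("a", encoding="utf-8") as fh:
        fh.write('{"id": "c02", "input": "x", "expected": "y"}\n')
    try:
        load_golden_set(path, expected_sha256=pinned)
        change_detected = False
    except ValueError:
        change_detected = True
```

Growing the set from production failures then becomes a deliberate step: add the cases, bump the pinned hash, and treat scores before and after as separate series.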

Wire the judge.

Some answers grade themselves — exact match, schema valid, latency under budget. The open-ended ones need a judge: Claude scoring against an explicit rubric, calibrated against human labels so the grade is trustworthy.

/ Week 02 · Judge & metrics

Deterministic scorers · exact-match, schema, regex

LLM-as-judge · Claude Opus 4.8 with a written rubric

Calibration · judge vs human labels on a sample

Latency & cost scored per case, not after the fact

Per-dimension weights tuned with the product team
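The calibration step can be sketched as comparing judge verdicts with human labels on the same sample. The sketch below reports raw agreement plus Cohen's kappa, which discounts agreement expected by chance. The labels are invented for illustration, and what kappa counts as "trustworthy" is a judgment call the text above does not specify.

```python
def agreement_and_kappa(judge, human):
    """Raw agreement and Cohen's kappa between two lists of pass/fail labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement, from each rater's marginal pass rate.
    judge_pass = sum(bool(j) for j in judge) / n
    human_pass = sum(bool(h) for h in human) / n
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    # If both raters gave one constant label, kappa is undefined; report 1.0.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical sample of ten cases graded by both the judge and a human.
judge_labels = [True, True, True, False, True, False, True, True, False, True]
human_labels = [True, True, False, False, True, False, True, True, True, True]
observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Raw agreement alone can look high when most cases pass. Kappa makes that visible, which is why both numbers are returned.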

Gate the merge.

The eval suite runs in CI on every change. Pass rate above threshold and no dimension regressed — it ships. Below the line — it blocks, with a diff showing exactly which cases broke. The gate, not a gut feeling, decides.

/ Week 03 · CI gate
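The "diff showing exactly which cases broke" can be sketched as a comparison of two per-case result maps. This simplified version gates only on overall pass rate and reports the broken and fixed cases. The case IDs, the `regression_report` name, and the 0.90 default are assumptions, and a fuller gate would also check each dimension.

```python
def regression_report(baseline, candidate, threshold=0.90):
    """Compare per-case results (case id -> passed) and decide ship or block."""
    pass_rate = sum(candidate.values()) / len(candidate)
    # Cases that passed on the baseline but fail (or are missing) on the candidate.
    broke = sorted(cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False))
    # Cases that failed (or were absent) on the baseline and pass on the candidate.
    fixed = sorted(cid for cid, ok in candidate.items() if ok and not baseline.get(cid, False))
    return {"pass_rate": pass_rate, "broke": broke, "fixed": fixed, "ship": pass_rate >= threshold}


def run_gate(baseline, candidate, threshold=0.90):
    """Print the diff and return a process exit code (0 = ship, 1 = block)."""
    report = regression_report(baseline, candidate, threshold)
    for cid in report["broke"]:
        print(f"REGRESSED {cid}")
    for cid in report["fixed"]:
        print(f"FIXED     {cid}")
    verdict = "ship" if report["ship"] else "block"
    print(f"pass rate {report['pass_rate']:.1%} (gate {threshold:.0%}) -> {verdict}")
    return 0 if report["ship"] else 1


# Hypothetical ten-case suite: the candidate fixes c10 but breaks c03 and c07.
baseline = {f"c{i:02d}": True for i in range(1, 11)}
baseline["c10"] = False
candidate = dict(baseline, c03=False, c07=False, c10=True)
exit_code = run_gate(baseline, candidate)
```

In a CI job the return value would typically be passed to `sys.exit`, so a non-zero code fails the check and blocks the merge.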

Trace everything.

In production, every request carries a trace — prompt, retrieved context, tool calls, tokens, latency. When something looks wrong, you replay the exact run, not guess from a screenshot. Observability is what makes a regression debuggable.

/ Week 04 · Observability

Pass rate 94.1% · cases meeting the rubric this release

Safety 100% · zero policy failures in the adversarial set

Format 99.4% · schema-valid outputs downstream can trust

p95 latency 1.8s · inside budget, tracked every run
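As a rough picture of what one trace carries, the sketch below records a single request as a JSON line keyed by trace ID, looks it up again for replay, and computes p95 latency from recorded values. It uses only the standard library rather than a tracing SDK. The `TraceRecord` fields, the `model_fn` signature, and the stub model are hypothetical.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    prompt: str
    retrieved_context: list
    tool_calls: list
    output: str
    tokens: int
    latency_ms: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def traced_call(model_fn, prompt, context, sink):
    """Run model_fn, time it, and append a replayable JSON record to `sink`."""
    start = time.perf_counter()
    output, tool_calls, tokens = model_fn(prompt, context)
    latency_ms = (time.perf_counter() - start) * 1000
    record = TraceRecord(prompt, context, tool_calls, output, tokens, latency_ms)
    sink.append(json.dumps(asdict(record)))
    return record


def find_trace(sink, trace_id):
    """Return the stored record for a trace ID, or None if it was never logged."""
    for line in sink:
        record = json.loads(line)
        if record["trace_id"] == trace_id:
            return record
    return None


def p95(latencies_ms):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def fake_model(prompt, context):
    """Stand-in for a real model call; returns (output, tool_calls, tokens)."""
    return "5 business days", [{"tool": "search", "args": {"q": prompt}}], 42


sink = []
record = traced_call(fake_model, "How long do refunds take?", ["kb/refunds"], sink)
replayed = find_trace(sink, record.trace_id)
```

Because the record holds the exact prompt and retrieved context, replaying a bad run means feeding those same inputs back in rather than reconstructing them from a report.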

Monitor & catch drift.

Live traffic is sampled and judged continuously, scores plotted over time. When quality dips week-over-week — a model update, a data shift, a new edge case — an alert fires before a user files the ticket. Evaluated once isn't evaluated.

/ Ongoing · Monitor & alert

Online eval sampling

Trace replay

Drift alerts

Quality dashboards

Regression gate in CI

Golden set curation

Judge re-calibration

Cost & latency budgets
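The sampling and week-over-week check described above can be sketched in a few lines. This assumes the sampled requests have already been judged and rolled up into one pass rate per week. The 3-point drop threshold, the sampling rate, and both function names are illustrative assumptions.

```python
import random


def sample_traffic(requests, rate, seed=None):
    """Keep roughly `rate` of live requests for judging (independent coin flip each)."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < rate]


def drift_alert(weekly_pass_rates, max_drop=0.03):
    """Return an alert message if the latest week fell more than `max_drop`
    below the previous week, otherwise None."""
    if len(weekly_pass_rates) < 2:
        return None
    previous, latest = weekly_pass_rates[-2], weekly_pass_rates[-1]
    drop = previous - latest
    if drop > max_drop:
        return (
            f"quality drift: pass rate fell {drop:.1%} week-over-week "
            f"({previous:.1%} -> {latest:.1%})"
        )
    return None


# Hypothetical weekly pass rates from judged samples of live traffic.
sampled = sample_traffic(range(1000), rate=0.1, seed=7)
steady = drift_alert([0.940, 0.930])
dipped = drift_alert([0.940, 0.941, 0.900])
print(len(sampled), steady, dipped)
```

A production version would likely compare against a longer window and account for sample size, since a small weekly sample can swing by a few points on noise alone.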

06 — Why it compounds

A gated system holds.

Every caught regression feeds the next: each production failure becomes a new golden case, each near-miss tightens a threshold, each drift alert sharpens the suite. Ship-and-pray AI decays silently as models and data shift. Evaluated and gated, quality holds — and the suite that holds it keeps getting stronger.

Gated & evaluated by Stack Innovations — pass rate holds as the suite grows from real failures

Ship-and-pray — looks fine, then drifts as models update and edge cases pile up unmeasured

Representative of a typical 12-month engagement · pass rate on a frozen evaluation set.

The kit, shown.

Stop guessing.
Start grading. Your AI.
