Evals & Infra · AI & Automation

Ship AI you can trust.
Measured, not vibes.

Every model and prompt change graded against a frozen golden set before it ships — accuracy, safety, format, latency — with an LLM-as-judge, a regression gate in CI, and tracing on every run. No "looks good in the demo." A number that moves, or it doesn't merge.

/01 Drag the suite size · watch the gate hold the line
Live · eval suite, graded in CI
Pass rate 94%
pytest evals/ --judge=claude --gate=0.90
Test-suite size 480 cases
Eval pass rate: 94%
Regressions caught pre-ship: 9 in 10
Graded test cases: 5k
Median eval run: 3.4 min
02 — Outcomes

Evals that held the line.

A ledger of named systems where the line that moved was a regression caught before users felt it — pass rate, safety, latency budget, drift held in check. 6 of 30 shown · ledger updates as suites grow.

Northwind Support
RAG eval · Support copilot
Frozen golden set of 600 graded questions, LLM-as-judge scoring faithfulness — a prompt "improvement" that quietly broke citations got blocked at the gate
+22% Pass rate
Cobalt Legal
Safety eval · Contract Q&A
Adversarial safety suite run on every model swap — jailbreak and leakage cases scored automatically, refusals checked against policy before release
0 Safety regressions shipped
Vera Health
Regression gate · Clinical assistant
CI gate blocks any merge that drops below the clinical accuracy threshold — abstain-rate tracked so the model never trades safety for coverage
−87% Bad answers reaching users
Lumen Docs
Format eval · Developer assistant
Structured-output schema validated on every case — a model upgrade that broke JSON formatting was caught in 3 minutes, not by a user filing a bug
99.4% Schema-valid rate
Drift Finance
Latency budget · Research copilot
Latency tracked as a first-class eval dimension — p95 budget enforced in CI, so a slower prompt chain failed the gate before it hit production
−41% p95 latency
Forge Ops
Agent eval · Internal automation
Trajectory scoring on a tool-using agent — every run traced, tool calls graded for correctness, drift alerts fire when task success dips week-over-week
+34% Task success rate
03 — The harness, live

"Looks good" isn't a metric.
This is.

A frozen set of test cases, scored across four dimensions as pass/fail cells. Switch model/prompt versions and watch the grid recolour — v2 lifts accuracy, v3 regresses safety. Move the threshold and the gate decides: ship, or block. This is what "evaluated" actually looks like.

Golden set · 48 frozen cases
LLM-as-judge · scoring
Legend: pass · fail · dimension off
Baseline = Prompt v1 · scoring 48 cases × 4 dimensions
Gate threshold · pass rate 90%
Prompt v2 lifts accuracy but the gate also watches safety — that's why you weight dimensions. A win on one metric can hide a regression on another.
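
A minimal sketch of why one blended number is not enough. The weights, the 10-case grid, and every function name here are hypothetical: the page says dimensions are weighted but gives no values.

```python
DIMENSIONS = ("accuracy", "safety", "format", "latency")

# Hypothetical weights: the page says dimensions are weighted but gives no values.
WEIGHTS = {"accuracy": 0.4, "safety": 0.3, "format": 0.2, "latency": 0.1}


def make_grid(n_cases, failing_cells):
    """Build an all-pass grid, then mark the given (dimension, case index) cells as failed."""
    grid = [{d: True for d in DIMENSIONS} for _ in range(n_cases)]
    for dimension, case_index in failing_cells:
        grid[case_index][dimension] = False
    return grid


def dimension_pass_rates(grid):
    """Fraction of cases that pass each dimension."""
    return {d: sum(row[d] for row in grid) / len(grid) for d in DIMENSIONS}


def weighted_score(rates):
    """Single blended number: sum of weight x pass rate over the dimensions."""
    return sum(WEIGHTS[d] * rates[d] for d in DIMENSIONS)


def deltas_vs_baseline(baseline, candidate):
    """Per-dimension change; a negative value is a regression on that dimension."""
    return {d: candidate[d] - baseline[d] for d in DIMENSIONS}


# Made-up example: the candidate fixes three accuracy misses but breaks two safety cases.
baseline = dimension_pass_rates(make_grid(10, [("accuracy", i) for i in range(3)]))
candidate = dimension_pass_rates(make_grid(10, [("safety", i) for i in range(2)]))

print(f"blended: {weighted_score(baseline):.2f} -> {weighted_score(candidate):.2f}")
for dimension, delta in deltas_vs_baseline(baseline, candidate).items():
    print(f"{dimension:>8}: {delta:+.2f}")
```

With these made-up numbers the blended score rises from 0.88 to 0.94 while safety drops by 0.20, which is the case a per-dimension check exists to catch.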
04 — Anatomy of the pipeline

Trust is a pipeline,
not a promise.

Confidence hides in the stages between a code change and a release — what you grade against, who judges, how the gate decides, what you trace in production. This is the room we work in: each stage measured, each tool chosen for a reason.

Eval pipeline · Northwind Support Copilot
Suite 600 cases · Pass rate 94% · Gate ≥ 90%
Stage | Tool | What it does | Signal
Dataset | Golden set · frozen JSONL | Curated input/output pairs, version-pinned so scores stay comparable | 600 cases
Run | Promptfoo · pytest | Execute every case against the candidate model/prompt, capture outputs | deterministic
Metrics | Ragas · custom scorers | Accuracy, faithfulness, format-validity, latency scored per case | 4 dims
Judge | Claude Opus 4.8 · LLM-as-judge | Grades open-ended answers against a rubric where exact-match won't do | judging
Gate | GitHub Actions · CI | Block the merge if pass rate drops below threshold or a dimension regresses | ship / block
Trace | OpenTelemetry · LangSmith | Every prompt, tool call, and token logged with a trace ID for replay | traced
Monitor | Braintrust · Grafana | Live scores on real traffic, sampled and judged, dashboards on quality | live scores
Alert | Drift detection · Slack | Fire when quality dips week-over-week before users open a ticket | on drift
green: measured & in target
live: the stage running in the demo above
amber: watch · triggers on drift
05 — Ship to production

Define correct.

Before a single test runs, we pin down what "good" actually means for your system — which answers are right, which are unsafe, what format the product needs, and the latency users will tolerate. Vague quality goals make vague evals.

/ Week 00 · Define correctness
Accuracy: What counts as a correct answer; exact-match where possible, rubric where not
Safety: Refusal policy, leakage, jailbreak resistance; the cases you can't fail
Format: Schema, structure, citations; what the product downstream depends on
Latency: p95 budget the experience can live with, scored like any other dimension
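
One way to keep the definition from staying vague is to write it down as data. The sketch below is illustrative only: `CorrectnessSpec`, its field names, and the nearest-rank p95 helper are assumptions, not part of any named tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectnessSpec:
    """Hypothetical container for the four definitions above."""
    accuracy_mode: str                     # "exact" or "rubric"
    hard_fail_categories: tuple[str, ...]  # safety cases that may never fail
    required_output: str                   # what downstream depends on, e.g. "json"
    p95_latency_budget_ms: float

    def __post_init__(self) -> None:
        if self.accuracy_mode not in ("exact", "rubric"):
            raise ValueError("accuracy_mode must be 'exact' or 'rubric'")
        if self.p95_latency_budget_ms <= 0:
            raise ValueError("latency budget must be positive")


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_within_budget(spec: CorrectnessSpec, latencies_ms: list[float]) -> bool:
    """Score latency like any other dimension: pass or fail against the budget."""
    return p95(latencies_ms) <= spec.p95_latency_budget_ms


# Example values are made up for illustration.
spec = CorrectnessSpec(
    accuracy_mode="rubric",
    hard_fail_categories=("leakage", "jailbreak"),
    required_output="json",
    p95_latency_budget_ms=2000,
)
```

Freezing the dataclass mirrors the idea of a pinned definition: changing what "correct" means becomes an explicit, reviewable edit rather than a silent one.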

Build the golden set.

Pull real inputs from tickets, logs, and edge cases, label the expected behaviour, and freeze it. A frozen set is the whole point — scores only compare if the questions never move underneath you.

/ Week 01 · Golden set
Source: Tickets · logs · adversarial · synthetic edge cases
Label: Expected output · acceptable variants · hard fails
Cover: Happy path · long tail · safety · format breakers
Freeze: Version-pinned JSONL · checked into the repo
Curate: Grow from production failures · prune the stale
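
A sketch of what "frozen" could mean in practice: pin the file's SHA-256 and refuse to run if it changes. The field names in `REQUIRED_FIELDS` and both function names are assumptions; the page only says the set is version-pinned JSONL checked into the repo.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical minimum schema for one golden case.
REQUIRED_FIELDS = {"id", "input", "expected"}


def fingerprint(path: Path) -> str:
    """SHA-256 of the raw file bytes; any edit to the set changes this value."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_golden_set(path: Path, expected_sha256: str) -> list[dict]:
    """Load a JSONL golden set, failing loudly if it no longer matches the pinned hash."""
    actual = fingerprint(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"golden set changed: pinned {expected_sha256[:12]}, found {actual[:12]}"
        )
    cases = []
    lines = path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"line {line_no} is missing fields: {sorted(missing)}")
        cases.append(case)
    return cases
```

Growing the set from production failures then means committing new cases and a new pinned hash together, so scores before and after the change are never compared by accident.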

Wire the judge.

Some answers grade themselves — exact match, schema valid, latency under budget. The open-ended ones need a judge: Claude scoring against an explicit rubric, calibrated against human labels so the grade is trustworthy.
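
The self-grading checks can be very small. A sketch using only the standard library; the function names and the shape of the `required` mapping are illustrative assumptions, and a real suite might use a full JSON Schema validator instead.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass when the answer equals the label, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def schema_valid(output: str, required: dict[str, type]) -> bool:
    """Pass when the output is a JSON object with every required key of the right type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(key in data and isinstance(data[key], kind) for key, kind in required.items())


def matches_pattern(output: str, pattern: str) -> bool:
    """Pass when the regex is found anywhere in the output, e.g. a citation marker."""
    return re.search(pattern, output) is not None


def under_budget(latency_ms: float, budget_ms: float) -> bool:
    """Pass when this case's latency is within the per-case budget."""
    return latency_ms <= budget_ms
```

Each scorer returns a plain boolean, so deterministic cells and judged cells can sit in the same pass/fail grid.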

/ Week 02 · Judge & metrics
Deterministic scorers · exact-match, schema, regex
LLM-as-judge · Claude Opus 4.8 with a written rubric
Calibration · judge vs human labels on a sample
Latency & cost scored per case, not after the fact
Per-dimension weights tuned with the product team
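
Calibration needs a number too. The page does not say which agreement statistic is used; Cohen's kappa is swapped in here as one common choice, and the labels below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement between judge and human labels, corrected for agreement expected by chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sample: the judge and a human reviewer grade the same ten answers.
judge_labels = [True, True, True, False, False, True, False, True, True, False]
human_labels = [True, True, False, False, False, True, False, True, True, True]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```

Raw agreement in this sample is 8 of 10, but kappa comes out near 0.58 because some of that agreement would happen by chance. What counts as "trustworthy enough" is a threshold to agree with the product team, not something this sketch decides.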

Gate the merge.

The eval suite runs in CI on every change. Pass rate above threshold and no dimension regressed — it ships. Below the line — it blocks, with a diff showing exactly which cases broke. The gate, not a gut feeling, decides.

/ Week 03 · CI gate
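
The gate logic described above can be sketched in a few lines. Assumptions: results are stored as case id to dimension to pass/fail, a case passes only if every dimension passes, and `GateVerdict` and `evaluate_gate` are hypothetical names rather than a real tool's API.

```python
from dataclasses import dataclass

Results = dict[str, dict[str, bool]]  # case id -> dimension -> passed


@dataclass(frozen=True)
class GateVerdict:
    ship: bool
    pass_rate: float
    broken_cases: list[str]          # passed on baseline, fail on candidate
    regressed_dimensions: list[str]  # dimension pass rate fell below baseline


def case_passes(cells: dict[str, bool]) -> bool:
    return all(cells.values())


def dimension_rate(results: Results, dimension: str) -> float:
    return sum(cells[dimension] for cells in results.values()) / len(results)


def evaluate_gate(baseline: Results, candidate: Results, threshold: float = 0.90) -> GateVerdict:
    """Ship only if the pass rate clears the threshold and no dimension regressed."""
    if not candidate or baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same non-empty frozen set")
    pass_rate = sum(case_passes(cells) for cells in candidate.values()) / len(candidate)
    broken = sorted(
        case_id for case_id in candidate
        if case_passes(baseline[case_id]) and not case_passes(candidate[case_id])
    )
    dimensions = next(iter(candidate.values())).keys()
    regressed = sorted(
        d for d in dimensions if dimension_rate(candidate, d) < dimension_rate(baseline, d)
    )
    ship = pass_rate >= threshold and not regressed
    return GateVerdict(ship, pass_rate, broken, regressed)
```

In CI, a wrapper script could print `broken_cases` as the diff and exit non-zero when `ship` is false, which is what actually blocks the merge. Note that a candidate can clear the 90% line and still be blocked if a single dimension slipped.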

Trace everything.

In production, every request carries a trace — prompt, retrieved context, tool calls, tokens, latency. When something looks wrong, you replay the exact run, not guess from a screenshot. Observability is what makes a regression debuggable.
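
To make "replay the exact run" concrete, here is a standard-library stand-in for a trace record. It is not the OpenTelemetry or LangSmith API; `TraceRecord` and its fields are hypothetical and only mirror the items listed above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    """Everything needed to replay one request, keyed by a trace ID."""
    prompt: str
    retrieved_context: list[str]
    tool_calls: list[dict]
    input_tokens: int
    output_tokens: int
    latency_ms: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialise to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, line: str) -> "TraceRecord":
        """Rebuild the record from a logged line so the run can be replayed."""
        return cls(**json.loads(line))


# Made-up example request.
record = TraceRecord(
    prompt="How do I reset my password?",
    retrieved_context=["kb/account-recovery.md"],
    tool_calls=[{"name": "search_kb", "args": {"query": "reset password"}}],
    input_tokens=412,
    output_tokens=96,
    latency_ms=1340.0,
)
restored = TraceRecord.from_json(record.to_json())
```

The round trip is the point: if a logged line rebuilds the same record, the prompt, context, and tool calls that produced a bad answer can be fed back through the system instead of reconstructed from memory.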

/ Week 04 · Observability
Pass rate: 94.1% · cases meeting the rubric this release
Safety: 100% · zero policy failures in the adversarial set
Format: 99.4% · schema-valid outputs downstream can trust
p95 latency: 1.8s · inside budget, tracked every run

Monitor & catch drift.

Live traffic is sampled and judged continuously, scores plotted over time. When quality dips week-over-week — a model update, a data shift, a new edge case — an alert fires before a user files the ticket. Evaluated once isn't evaluated.

/ Ongoing · Monitor & alert
Online eval sampling
Trace replay
Drift alerts
Quality dashboards
Regression gate in CI
Golden set curation
Judge re-calibration
Cost & latency budgets
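
A week-over-week drift check can start as a comparison of two means. The 3-point tolerance, the function name, and the sample numbers are assumptions for illustration; a production check would likely also account for sample size and noise.

```python
from statistics import mean


def weekly_drift(last_week: list[float], this_week: list[float], max_drop: float = 0.03):
    """Compare mean judged scores across two weeks.

    Returns (drop, alert): drop is last week's mean minus this week's,
    and alert is True when the drop exceeds the tolerated amount.
    """
    if not last_week or not this_week:
        raise ValueError("both weeks need at least one sampled score")
    drop = mean(last_week) - mean(this_week)
    return drop, drop > max_drop


# Made-up samples: 1 = judged pass, 0 = judged fail, 100 sampled requests per week.
last_week_scores = [1] * 94 + [0] * 6
this_week_scores = [1] * 88 + [0] * 12

drop, alert = weekly_drift(last_week_scores, this_week_scores)
if alert:
    print(f"quality drift: pass rate down {drop:.0%} week-over-week")
```

Here the sampled pass rate falls from 94% to 88%, so the check flags it. Where the message goes (Slack, in the pipeline above) is a delivery detail kept out of the sketch.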
06 — Why it compounds

A gated system holds.

Every caught regression feeds the next: each production failure becomes a new golden case, each near-miss tightens a threshold, each drift alert sharpens the suite. Ship-and-pray AI decays silently as models and data shift. Evaluated and gated, quality holds — and the suite that holds it keeps getting stronger.

Gated & evaluated by Stack Innovations — pass rate holds as the suite grows from real failures
Ship-and-pray — looks fine, then drifts as models update and edge cases pile up unmeasured
Representative of a typical 12-month engagement · pass rate on a frozen evaluation set.
07 — Tools · honest kit

The kit, shown.

The frameworks, judges, and infra we actually wire together to grade, gate, trace, and monitor AI in production. No mystery harness — just the kit that keeps quality measured.

RAG eval
Ragas
Eval runner
Promptfoo
Observability
LangSmith
Online eval
Braintrust
Tracing
OpenTelemetry
Test runner
pytest
Judge
Claude Opus 4.8
CI gate
GitHub Actions
Dashboards
Grafana
Store
Postgres
Serving
FastAPI
Alerts
Slack
Start the build

Stop guessing.
Start grading your AI.

A free eval audit to start — bring a model or prompt you ship today and a set of real cases, and we'll build a golden set, run it through an LLM-as-judge, and show you exactly where quality is silently slipping. A scorecard, not a pitch.

Get an eval audit