RAG & Search · AI & Automation

Answers from your data.
Grounded. Cited.

Retrieval-augmented generation that reads your own documents, retrieves the right passages, and answers with citations you can click — not a confident guess. Built to a faithfulness target, evaluated every release, and shipped to production.

Start a project → See it retrieve →

/01Drag the corpus · watch retrieval sharpen

Live · grounded retrieval, indexed

Faithfulness 96%

⌕ "What's our refund window for enterprise?"

Chunks indexed0

Recall @50%

Grounded rate0%

p95 latency0ms

Knowledge-base size 8,000 docs

Grounded-answer rate96%

Hallucinations cut−78%

Documents indexed40M

Median time-to-answer0.9s

02 — Outcomes

Retrieval that earned trust.

A ledger of named systems where the line that moved was a real answer, with a source — deflection, accuracy, time-to-answer, hallucinations avoided. 6 of 30 shown · ledger updates as systems scale.

Northwind Support

Support copilot · Help center + tickets

Hybrid retrieval over 12k articles, reranked, answers grounded with clickable citations — agents stopped guessing

−47%Escalations

Cobalt Legal

Contract Q&A · 40k agreements

Clause-level retrieval with strict citation-or-abstain, every answer traceable back to the source paragraph

96%Cited-answer rate

Vera Health

Clinical knowledge · Guidelines

Guideline retrieval with confidence gating — abstains and routes to a human when grounding is weak, never invents

−63%Hallucinated answers

Lumen Docs

Developer assistant · API docs

Semantic + keyword retrieval over docs and code, version-aware chunks so answers match the SDK in use

3.2×Self-serve resolution

Drift Finance

Research copilot · Filings + memos

Long-context retrieval across filings, metadata-filtered by entity and period, answers cite the exact exhibit

−71%Time-to-answer

Forge Ops

Internal wiki · Runbooks

Eval-driven retrieval over runbooks and Slack history, weekly faithfulness checks catch regressions before users do

+58%Answers resolved

03 — The retrieval, live

The model doesn't know.
It retrieves.

Ask a question and watch the pipeline run: embed the query, pull the nearest passages from your knowledge base, rerank them so the truly-relevant one wins, then answer with citations. Toggle grounding off and the model guesses from memory — confidently, and wrong.

Knowledge base · 142,000 chunks

Vector retrieval → rerank

⌕ What's our refund window for enterprise plans?

Passages retrieved · top-k4

Grounded answer · Claude

—

Retrieved—

Cited—

Grounding—

Embedding similarity alone ranks the wrong passage first — the reranker fixes it. That's why grounded systems rerank.

04 — Anatomy of the pipeline

Built like a pipeline,
not a prompt.

Quality hides in the stages between the question and the answer — how you chunk, what you embed with, whether you rerank, how you ground. This is the room we work in: each stage measured, each tool chosen for a reason.

RAG pipeline · Northwind Support Copilot

Index 142k chunks · Faithfulness 96% · p95 0.9s

StageToolWhat it doesSignal

IngestUnstructured · LlamaParseParse PDFs, HTML, docs into clean, structured textparsed

ChunkRecursive + semanticSplit on structure into ~512-token windows with overlap142k

EmbedVoyage AI · voyage-3Map each chunk to a 1024-dim vector (Anthropic's recommended embedder)1024-d

Indexpgvector · Pinecone · QdrantHNSW approximate-nearest-neighbour index over the vectorsHNSW

RetrieveHybrid · vector + BM25Pull top-k candidates by cosine similarity and keyword matchtop-k

RerankVoyage rerank-2 · CohereCross-encoder reorders candidates so the relevant one wins+19% nDCG

GenerateClaude Opus 4.8 · 1M ctxGrounded answer with Anthropic's native Citations, cache the corpuscited

EvaluateRagas · LLM-as-judgeScore faithfulness, context recall & precision every release96%

green measured & in target

live the stage running in the demo above

amber watch · below faithfulness target

05 — Ship to production

Scope the question.

Before a vector is embedded, we pin down what users actually ask, what counts as a correct answer, and where the truth lives. Then we build an evaluation set — the questions we'll grade every release against.

/ Week 00 · Scope & eval set

Questions120 real queries pulled from tickets, search logs, and interviews

TruthSource-of-record mapped per question — which doc holds the answer

Eval setGolden Q&A pairs with cited passages, frozen for grading

TargetFaithfulness ≥ 95%, abstain when grounding is weak

Ingest & chunk.

Parse the real sources — PDFs, wikis, tickets, code — into clean text, then chunk on structure so a passage holds one coherent idea. Garbage chunks retrieve garbage; this is where most RAG systems quietly fail.

/ Week 01 · Ingest

ParseUnstructured · LlamaParse · OCR for scans

CleanStrip boilerplate · de-dupe · normalise tables

ChunkStructure-aware · ~512 tokens · 64 overlap

MetadataSource · section · version · access tags

RefreshIncremental re-index on change · no full rebuilds

Embed & index.

Turn every chunk into a vector with a strong embedding model, then index it for fast nearest-neighbour search. Hybrid from day one — vectors for meaning, keywords for the exact term users actually type.

/ Week 02 · Embed & index

Embeddings · Voyage voyage-3 · 1024-dim

Vector store · pgvector / Pinecone · HNSW

Keyword index · BM25 for exact-term recall

Metadata filters · tenant · access · version

Hybrid fusion tuned against the eval set

Rerank & ground.

Top-k by similarity is a first pass, not the answer. A cross-encoder reranks the candidates, then Claude answers strictly from what survived — with native citations, and an honest "I don't know" when the passages don't support a claim.

/ Week 03 · Rerank & ground

Evaluate honestly.

Run the frozen eval set every change and score it — faithfulness, context recall, context precision, answer relevance. No "looks good in the demo." A number that moves, or the change doesn't ship.

/ Week 04 · Evaluate

Faithfulness96.2% — answers supported by retrieved text

Context recall93% — the right passage was retrieved

Precision88% — little irrelevant context in the window

Abstain rate4% — declined rather than guessed

Ship & monitor.

Live with caching for cost, guardrails for safety, and logging on every retrieval. We watch faithfulness as the corpus grows, catch drift before users do, and keep the system answering as your data changes underneath it.

/ Ongoing · Ship & monitor

Prompt caching · corpus

Retrieval logging

Drift alerts

PII + access guards

Citations on by default

Abstain on low grounding

Weekly eval run

Human-in-the-loop review

06 — Why it compounds

An evaluated system improves.

Every eval run feeds the next: failed questions sharpen the chunking, weak retrievals justify the reranker, hallucinations tighten the grounding. Ship-and-forget RAG drifts as your corpus grows stale. Evaluated and tended, faithfulness compounds.

Eval-driven by Stack Innovations — faithfulness climbs as retrieval and prompts tighten

Ship-and-forget — plateaus, then drifts as the corpus and edge cases pile up

Representative of a typical 12-month engagement · faithfulness on a frozen evaluation set.

Answers from your data.
Grounded. Cited.

Retrieval that earned trust.

The model doesn't know.
It retrieves.

Built like a pipeline,
not a prompt.

Scope the question.

Ingest & chunk.

Embed & index.

Rerank & ground.

Evaluate honestly.

Ship & monitor.

An evaluated system improves.

The kit, shown.

Stop guessing.
Start citing. Your data.

Answers from your data. Grounded. Cited.

Retrieval that earned trust.

The model doesn't know.It retrieves.

Built like a pipeline,not a prompt.

Scope the question.

Ingest & chunk.

Embed & index.

Rerank & ground.

Evaluate honestly.

Ship & monitor.

An evaluated system improves.

The kit, shown.

Stop guessing.Start citing. Your data.

Answers from your data.
Grounded. Cited.

The model doesn't know.
It retrieves.

Built like a pipeline,
not a prompt.

Stop guessing.
Start citing. Your data.