Stack Innovations / Services / AI & Automation / RAG & Search
RAG & Search · AI & Automation

Answers from your data.
Grounded. Cited.

Retrieval-augmented generation that reads your own documents, retrieves the right passages, and answers with citations you can click — not a confident guess. Built to a faithfulness target, evaluated every release, and shipped to production.

/01Drag the corpus · watch retrieval sharpen
Live · grounded retrieval, indexed
Faithfulness 96%
"What's our refund window for enterprise?"
Chunks indexed0
Recall @50%
Grounded rate0%
p95 latency0ms
Knowledge-base size 8,000 docs
Grounded-answer rate96%
Hallucinations cut−78%
Documents indexed40M
Median time-to-answer0.9s
Trusted by teams shipping RAG to production at
02 — Outcomes

Retrieval that earned trust.

A ledger of named systems where the line that moved was a real answer, with a source — deflection, accuracy, time-to-answer, hallucinations avoided. 6 of 30 shown · ledger updates as systems scale.

Northwind Support
Support copilot · Help center + tickets
Hybrid retrieval over 12k articles, reranked, answers grounded with clickable citations — agents stopped guessing
−47%Escalations
Cobalt Legal
Contract Q&A · 40k agreements
Clause-level retrieval with strict citation-or-abstain, every answer traceable back to the source paragraph
96%Cited-answer rate
Vera Health
Clinical knowledge · Guidelines
Guideline retrieval with confidence gating — abstains and routes to a human when grounding is weak, never invents
−63%Hallucinated answers
Lumen Docs
Developer assistant · API docs
Semantic + keyword retrieval over docs and code, version-aware chunks so answers match the SDK in use
3.2×Self-serve resolution
Drift Finance
Research copilot · Filings + memos
Long-context retrieval across filings, metadata-filtered by entity and period, answers cite the exact exhibit
−71%Time-to-answer
Forge Ops
Internal wiki · Runbooks
Eval-driven retrieval over runbooks and Slack history, weekly faithfulness checks catch regressions before users do
+58%Answers resolved
03 — The retrieval, live

The model doesn't know.
It retrieves.

Ask a question and watch the pipeline run: embed the query, pull the nearest passages from your knowledge base, rerank them so the truly-relevant one wins, then answer with citations. Toggle grounding off and the model guesses from memory — confidently, and wrong.

Knowledge base · 142,000 chunks
Vector retrieval → rerank
What's our refund window for enterprise plans?
Passages retrieved · top-k4
Grounded answer · Claude
Retrieved
Cited
Grounding
Embedding similarity alone ranks the wrong passage first — the reranker fixes it. That's why grounded systems rerank.
04 — Anatomy of the pipeline

Built like a pipeline,
not a prompt.

Quality hides in the stages between the question and the answer — how you chunk, what you embed with, whether you rerank, how you ground. This is the room we work in: each stage measured, each tool chosen for a reason.

RAG pipeline · Northwind Support Copilot
Index 142k chunks · Faithfulness 96% · p95 0.9s
StageToolWhat it doesSignal
IngestUnstructured · LlamaParseParse PDFs, HTML, docs into clean, structured textparsed
ChunkRecursive + semanticSplit on structure into ~512-token windows with overlap142k
EmbedVoyage AI · voyage-3Map each chunk to a 1024-dim vector (Anthropic's recommended embedder)1024-d
Indexpgvector · Pinecone · QdrantHNSW approximate-nearest-neighbour index over the vectorsHNSW
RetrieveHybrid · vector + BM25Pull top-k candidates by cosine similarity and keyword matchtop-k
RerankVoyage rerank-2 · CohereCross-encoder reorders candidates so the relevant one wins+19% nDCG
GenerateClaude Opus 4.8 · 1M ctxGrounded answer with Anthropic's native Citations, cache the corpuscited
EvaluateRagas · LLM-as-judgeScore faithfulness, context recall & precision every release96%
green measured & in target
live the stage running in the demo above
amber watch · below faithfulness target
01
05 — Ship to production

Scope the question.

Before a vector is embedded, we pin down what users actually ask, what counts as a correct answer, and where the truth lives. Then we build an evaluation set — the questions we'll grade every release against.

/ Week 00 · Scope & eval set
Questions120 real queries pulled from tickets, search logs, and interviews
TruthSource-of-record mapped per question — which doc holds the answer
Eval setGolden Q&A pairs with cited passages, frozen for grading
TargetFaithfulness ≥ 95%, abstain when grounding is weak

Ingest & chunk.

Parse the real sources — PDFs, wikis, tickets, code — into clean text, then chunk on structure so a passage holds one coherent idea. Garbage chunks retrieve garbage; this is where most RAG systems quietly fail.

/ Week 01 · Ingest
ParseUnstructured · LlamaParse · OCR for scans
CleanStrip boilerplate · de-dupe · normalise tables
ChunkStructure-aware · ~512 tokens · 64 overlap
MetadataSource · section · version · access tags
RefreshIncremental re-index on change · no full rebuilds

Embed & index.

Turn every chunk into a vector with a strong embedding model, then index it for fast nearest-neighbour search. Hybrid from day one — vectors for meaning, keywords for the exact term users actually type.

/ Week 02 · Embed & index
Embeddings · Voyage voyage-3 · 1024-dim
Vector store · pgvector / Pinecone · HNSW
Keyword index · BM25 for exact-term recall
Metadata filters · tenant · access · version
Hybrid fusion tuned against the eval set

Rerank & ground.

Top-k by similarity is a first pass, not the answer. A cross-encoder reranks the candidates, then Claude answers strictly from what survived — with native citations, and an honest "I don't know" when the passages don't support a claim.

/ Week 03 · Rerank & ground

Evaluate honestly.

Run the frozen eval set every change and score it — faithfulness, context recall, context precision, answer relevance. No "looks good in the demo." A number that moves, or the change doesn't ship.

/ Week 04 · Evaluate
Faithfulness96.2% — answers supported by retrieved text
Context recall93% — the right passage was retrieved
Precision88% — little irrelevant context in the window
Abstain rate4% — declined rather than guessed

Ship & monitor.

Live with caching for cost, guardrails for safety, and logging on every retrieval. We watch faithfulness as the corpus grows, catch drift before users do, and keep the system answering as your data changes underneath it.

/ Ongoing · Ship & monitor
Prompt caching · corpus
Retrieval logging
Drift alerts
PII + access guards
Citations on by default
Abstain on low grounding
Weekly eval run
Human-in-the-loop review
06 — Why it compounds

An evaluated system improves.

Every eval run feeds the next: failed questions sharpen the chunking, weak retrievals justify the reranker, hallucinations tighten the grounding. Ship-and-forget RAG drifts as your corpus grows stale. Evaluated and tended, faithfulness compounds.

Eval-driven by Stack Innovations — faithfulness climbs as retrieval and prompts tighten
Ship-and-forget — plateaus, then drifts as the corpus and edge cases pile up
Representative of a typical 12-month engagement · faithfulness on a frozen evaluation set.
07 — Tools · honest kit

The kit, shown.

The models, stores, and tools we actually wire together to ingest, retrieve, rerank, ground, and evaluate. No mystery framework — just the kit that keeps answers faithful.

Generation
Claude Opus 4.8
Embeddings
Voyage AI
Citations
Anthropic Cite
Vector DB
pgvector
Vector DB
Pinecone
Vector DB
Qdrant
Reranker
Cohere Rerank
Ingest
Unstructured
Orchestration
LlamaIndex
Evaluation
Ragas
Serving
FastAPI
Comms
Slack
Start the build

Stop guessing.
Start citing. Your data.

A free retrieval audit to start — bring a stack of your real documents and a list of real questions, and we'll show you what a grounded, cited system would answer, and where today's tool falls down. A prototype, not a pitch.

Get a retrieval audit
Accent
Hero shader
Motion