Stack Innovations / Services / AI & Automation / Fine-tuning
Fine-tuning · AI & Automation

When prompting isn't enough.
A model that's yours.

Prompting and RAG come first — they solve most problems faster and cheaper. When you still need a narrow style, a strict format, lower latency, or a smaller bill, we fine-tune a small model with LoRA, eval it before and after, and ship the version that actually wins.

/01Drag the dataset · watch the loss curve fall
Live · LoRA adapter, training
Val accuracy 94%
"Classify support tickets · 14 intents · house style"
Training examples0
Val accuracy0%
$ / 1k inferences$0
p95 latency0ms
Training examples 2,000 rows
Inference cost cut−82%
Format compliance99%
p95 latency drop−61%
Adapter size28MB
Trusted by teams shipping fine-tuned models to production at
02 — Outcomes

Tuned only where it paid.

A ledger of named models where fine-tuning earned its keep — cheaper inference, stricter format, lower latency, more consistent voice. Each one was prompted and RAG-tested first; only these crossed the line where a tuned model wins. 6 of 30 shown · ledger updates as models ship.

Northwind Triage
Classifier distillation · 14 intents
Distilled a frontier prompt into a small tuned classifier — same accuracy on the frozen eval set, a fraction of the per-call cost at high volume
−82%Inference cost
Cobalt Drafts
Style/format model · Legal memos
LoRA on 3k house-style examples so first drafts land in the firm's exact structure and tone — fewer prompt tokens, no brittle formatting instructions
99%Format compliance
Vera Edge
On-device model · Clinical intake
Quantised a tuned small model to run on-prem for privacy — no data leaves the building, and the round-trip dropped to single-digit milliseconds
−61%p95 latency
Lumen Tag
Structured extraction · Product feed
Tuned for strict JSON extraction across millions of SKUs — schema-valid every time, so the downstream pipeline stopped needing a repair pass
3.4×Throughput
Drift Voice
Brand-voice model · Lifecycle copy
A small model tuned on approved brand copy so generated lifecycle messages read on-brand by default — editors review, they no longer rewrite
+58%Copy accepted
Forge Route
Latency cut · Realtime routing
Replaced a large prompted router with a tuned tiny model on the hot path — same routing decisions, fast enough to sit inline on every request
−74%Cost / decision
03 — The training run, live

Watch a model learn.
And watch it overfit.

Fine-tuning is a curve, not a button. Drag the learning rate too high and the loss spikes and diverges; too low and it crawls. Shrink the dataset and the model memorises — validation loss turns back up while training loss keeps falling. That gap is overfitting, and it's the whole game.

Loss curve · LoRA · epoch 0 / 12
Training
Train loss Val loss
Fine-tune · small model · LoRA rank 16 · ticket classifier
Learning rate2e-4
too low · slowtoo high · diverges
Dataset size4,000 rows
small · overfitslarge · generalises
Epochs12
under-trainedover-trained
Epoch
Train loss
Val loss
Val accuracy
A healthy run: train and val loss fall together, then val flattens. The moment val turns up while train keeps dropping, you're memorising the data — stop early, or add more of it.
04 — Anatomy of the pipeline

Tuning is a pipeline,
not a weekend.

The model is the easy part. The work is the data — curating it, holding out an honest eval, picking the smallest base that can win, and proving the tuned version beats the prompt it replaces. This is the room we work in: data, train, eval, deploy.

Fine-tune pipeline · Northwind Triage Classifier
Base small model · Adapter LoRA r16 · Val acc 94%
StageToolWhat it doesSignal
BaselineClaude Opus 4.8 · prompt + RAGProve prompting and retrieval first — fine-tune only what they can't reachfirst
CurateLabel studio · dedupeGather clean, balanced examples; hold out a frozen eval split before anything4k rows
FormatJSONL · chat templateShape examples into instruction/response pairs the trainer expectsvalid
Base modelLlama · Mistral · Qwen (open)Pick the smallest open base that can clear the target on the eval set7B
TrainPEFT · LoRA / QLoRA · AxolotlTrain a small adapter, not the whole model — cheap, fast, reversibler16
EvaluateRagas · LLM-as-judgeScore tuned vs. baseline on the frozen set — ship only if it actually wins+5pt
QuantiseGGUF · AWQ · bitsandbytesShrink to int4/int8 for cheap, low-latency serving with little accuracy lossint4
DeployvLLM · Modal · TritonServe the adapter behind an API; monitor drift and cost in productionlive
green measured & in target
live the stage running in the demo above
amber watch · below the eval target
01
05 — Ship to production

Try without tuning.

Before a single GPU spins up, we push prompting and RAG as far as they go and grade them on a frozen eval set. Most of the time that's the answer — cheaper, faster to ship, nothing to maintain. Fine-tuning starts only where they hit a wall.

/ Week 00 · Baseline & eval set
PromptBest prompt + few-shot examples graded on the frozen eval set first
RAGRetrieval tried for any task that's really about missing knowledge
GapThe narrow thing left over — style, format, latency, or cost
DecisionFine-tune only if that gap is real and worth the upkeep

Curate the data.

A tuned model is only as good as its examples. We gather clean, balanced, representative pairs, strip duplicates and leakage, and freeze a held-out eval split before any training. Garbage in, confidently-wrong out — this is where most fine-tunes quietly fail.

/ Week 01 · Curate
GatherReal examples · labelled outputs · golden references
CleanDedupe · fix labels · remove eval leakage
BalanceEven the classes · cover the long tail
FormatJSONL · instruction / response · chat template
Hold outFrozen eval split · never trained on

Train the adapter.

We rarely touch the full model. LoRA and QLoRA train a small adapter on top of a frozen base — cheap, fast, and reversible. Sweep the learning rate, watch train and val loss together, and stop the moment validation turns up.

/ Week 02 · Train
Base model · smallest open model that can win
Adapter · LoRA / QLoRA · rank 16 · 4-bit base
Trainer · Axolotl / PEFT · gradient checkpointing
Sweep · learning rate · epochs · batch size
Early stop on validation loss · save best adapter

Eval before and after.

A lower training loss means nothing on its own. We score the tuned model against the prompt-only baseline on the frozen set — accuracy, format compliance, refusals, regressions. If it doesn't beat the baseline by a real margin, it doesn't ship.

/ Week 03 · Evaluate

Shrink & serve.

A tuned small model only pays off if it's cheap to run. We quantise to int4 or int8, merge or load the adapter, and serve it behind an API on vLLM or Modal — measuring real p95 latency and cost-per-call, not just a benchmark.

/ Week 04 · Quantise & serve
Val accuracy94.1% — tuned, vs 89% prompt-only baseline
Format valid99% — schema-valid JSON every call
p95 latency38ms — int4 on a small GPU
$ / 1k calls−82% — versus the frontier prompt it replaced

Monitor & retrain.

Data drifts and your domain moves. We log live inputs, watch accuracy against fresh held-out samples, and retrain the adapter on a cadence — cheap, because it's only an adapter. The base stays put; the tuned layer keeps pace.

/ Ongoing · Monitor & retrain
Adapter versioning
Drift monitoring
Live accuracy checks
Shadow eval on prod
Scheduled retrains
Rollback to last adapter
Cost & latency alerts
Human-in-the-loop review
06 — Why it compounds

A tended adapter keeps winning.

A fine-tune isn't a one-time event. Every batch of fresh production data sharpens the next adapter; every drift alert triggers a cheap retrain. Tune once and forget, and accuracy decays as your domain shifts underneath it. Monitored and retrained, the tuned model holds its edge.

Monitored & retrained by Stack Innovations — accuracy holds as the domain shifts
Tune-once-and-forget — decays as data drifts and the world moves on
Representative of a typical 12-month engagement · accuracy on a rolling held-out evaluation set.
07 — Tools · honest kit

The kit, shown.

The frameworks, trainers, and serving tools we actually wire together to curate, tune, evaluate, quantise, and deploy. No mystery platform — just the kit that turns a dataset into a small model that earns its place.

Baseline
Claude Opus 4.8
Hub & models
Hugging Face
Adapters
PEFT / LoRA
Trainer
Axolotl
Framework
PyTorch
Tracking
Weights & Biases
Serving
vLLM
Compute
Modal
Serving
Triton
Evaluation
Ragas
Serving
Python
Storage
S3
Start the build

Don't tune blind.
Tune where it pays. Your model.

A free fine-tuning audit to start — bring the task you're trying to nail and a sample of your data, and we'll tell you honestly whether prompting and RAG already get you there, or whether a small tuned model is the cheaper, faster win. A straight answer, not a pitch.

Get a fine-tuning audit
Accent
Hero shader
Motion