Fine-tuning · AI & Automation

When prompting isn't enough.
A model that's yours.

Prompting and RAG come first — they solve most problems faster and cheaper. When you still need a narrow style, a strict format, lower latency, or a smaller bill, we fine-tune a small model with LoRA, eval it before and after, and ship the version that actually wins.

Start a project → See a training run →

/01Drag the dataset · watch the loss curve fall

Live · LoRA adapter, training

Val accuracy 94%

⚙ "Classify support tickets · 14 intents · house style"

Training examples0

Val accuracy0%

$ / 1k inferences$0

p95 latency0ms

Training examples 2,000 rows

Inference cost cut−82%

Format compliance99%

p95 latency drop−61%

Adapter size28MB

02 — Outcomes

Tuned only where it paid.

A ledger of named models where fine-tuning earned its keep — cheaper inference, stricter format, lower latency, more consistent voice. Each one was prompted and RAG-tested first; only these crossed the line where a tuned model wins. 6 of 30 shown · ledger updates as models ship.

Northwind Triage

Classifier distillation · 14 intents

Distilled a frontier prompt into a small tuned classifier — same accuracy on the frozen eval set, a fraction of the per-call cost at high volume

−82%Inference cost

Cobalt Drafts

Style/format model · Legal memos

LoRA on 3k house-style examples so first drafts land in the firm's exact structure and tone — fewer prompt tokens, no brittle formatting instructions

99%Format compliance

Vera Edge

On-device model · Clinical intake

Quantised a tuned small model to run on-prem for privacy — no data leaves the building, and the round-trip dropped to single-digit milliseconds

−61%p95 latency

Lumen Tag

Structured extraction · Product feed

Tuned for strict JSON extraction across millions of SKUs — schema-valid every time, so the downstream pipeline stopped needing a repair pass

3.4×Throughput

Drift Voice

Brand-voice model · Lifecycle copy

A small model tuned on approved brand copy so generated lifecycle messages read on-brand by default — editors review, they no longer rewrite

+58%Copy accepted

Forge Route

Latency cut · Realtime routing

Replaced a large prompted router with a tuned tiny model on the hot path — same routing decisions, fast enough to sit inline on every request

−74%Cost / decision

03 — The training run, live

Watch a model learn.
And watch it overfit.

Fine-tuning is a curve, not a button. Drag the learning rate too high and the loss spikes and diverges; too low and it crawls. Shrink the dataset and the model memorises — validation loss turns back up while training loss keeps falling. That gap is overfitting, and it's the whole game.

Loss curve · LoRA · epoch 0 / 12

Training

Train loss Val loss

⚙ Fine-tune · small model · LoRA rank 16 · ticket classifier

Learning rate2e-4

too low · slowtoo high · diverges

Dataset size4,000 rows

small · overfitslarge · generalises

Epochs12

under-trainedover-trained

Epoch—

Train loss—

Val loss—

Val accuracy—

A healthy run: train and val loss fall together, then val flattens. The moment val turns up while train keeps dropping, you're memorising the data — stop early, or add more of it.

04 — Anatomy of the pipeline

Tuning is a pipeline,
not a weekend.

The model is the easy part. The work is the data — curating it, holding out an honest eval, picking the smallest base that can win, and proving the tuned version beats the prompt it replaces. This is the room we work in: data, train, eval, deploy.

Fine-tune pipeline · Northwind Triage Classifier

Base small model · Adapter LoRA r16 · Val acc 94%

StageToolWhat it doesSignal

BaselineClaude Opus 4.8 · prompt + RAGProve prompting and retrieval first — fine-tune only what they can't reachfirst

CurateLabel studio · dedupeGather clean, balanced examples; hold out a frozen eval split before anything4k rows

FormatJSONL · chat templateShape examples into instruction/response pairs the trainer expectsvalid

Base modelLlama · Mistral · Qwen (open)Pick the smallest open base that can clear the target on the eval set7B

TrainPEFT · LoRA / QLoRA · AxolotlTrain a small adapter, not the whole model — cheap, fast, reversibler16

EvaluateRagas · LLM-as-judgeScore tuned vs. baseline on the frozen set — ship only if it actually wins+5pt

QuantiseGGUF · AWQ · bitsandbytesShrink to int4/int8 for cheap, low-latency serving with little accuracy lossint4

DeployvLLM · Modal · TritonServe the adapter behind an API; monitor drift and cost in productionlive

green measured & in target

live the stage running in the demo above

amber watch · below the eval target

05 — Ship to production

Try without tuning.

Before a single GPU spins up, we push prompting and RAG as far as they go and grade them on a frozen eval set. Most of the time that's the answer — cheaper, faster to ship, nothing to maintain. Fine-tuning starts only where they hit a wall.

/ Week 00 · Baseline & eval set

PromptBest prompt + few-shot examples graded on the frozen eval set first

RAGRetrieval tried for any task that's really about missing knowledge

GapThe narrow thing left over — style, format, latency, or cost

DecisionFine-tune only if that gap is real and worth the upkeep

Curate the data.

A tuned model is only as good as its examples. We gather clean, balanced, representative pairs, strip duplicates and leakage, and freeze a held-out eval split before any training. Garbage in, confidently-wrong out — this is where most fine-tunes quietly fail.

/ Week 01 · Curate

GatherReal examples · labelled outputs · golden references

CleanDedupe · fix labels · remove eval leakage

BalanceEven the classes · cover the long tail

FormatJSONL · instruction / response · chat template

Hold outFrozen eval split · never trained on

Train the adapter.

We rarely touch the full model. LoRA and QLoRA train a small adapter on top of a frozen base — cheap, fast, and reversible. Sweep the learning rate, watch train and val loss together, and stop the moment validation turns up.

/ Week 02 · Train

Base model · smallest open model that can win

Adapter · LoRA / QLoRA · rank 16 · 4-bit base

Trainer · Axolotl / PEFT · gradient checkpointing

Sweep · learning rate · epochs · batch size

Early stop on validation loss · save best adapter

Eval before and after.

A lower training loss means nothing on its own. We score the tuned model against the prompt-only baseline on the frozen set — accuracy, format compliance, refusals, regressions. If it doesn't beat the baseline by a real margin, it doesn't ship.

/ Week 03 · Evaluate

Shrink & serve.

A tuned small model only pays off if it's cheap to run. We quantise to int4 or int8, merge or load the adapter, and serve it behind an API on vLLM or Modal — measuring real p95 latency and cost-per-call, not just a benchmark.

/ Week 04 · Quantise & serve

Val accuracy94.1% — tuned, vs 89% prompt-only baseline

Format valid99% — schema-valid JSON every call

p95 latency38ms — int4 on a small GPU

$ / 1k calls−82% — versus the frontier prompt it replaced

Monitor & retrain.

Data drifts and your domain moves. We log live inputs, watch accuracy against fresh held-out samples, and retrain the adapter on a cadence — cheap, because it's only an adapter. The base stays put; the tuned layer keeps pace.

/ Ongoing · Monitor & retrain

Adapter versioning

Drift monitoring

Live accuracy checks

Shadow eval on prod

Scheduled retrains

Rollback to last adapter

Cost & latency alerts

Human-in-the-loop review

When prompting isn't enough.
A model that's yours.

Tuned only where it paid.

Watch a model learn.
And watch it overfit.

Tuning is a pipeline,
not a weekend.

Try without tuning.

Curate the data.

Train the adapter.

Eval before and after.

Shrink & serve.

Monitor & retrain.

A tended adapter keeps winning.

The kit, shown.

Don't tune blind.
Tune where it pays. Your model.

When prompting isn't enough. A model that's yours.

Tuned only where it paid.

Watch a model learn.And watch it overfit.

Tuning is a pipeline,not a weekend.

Try without tuning.

Curate the data.

Train the adapter.

Eval before and after.

Shrink & serve.

Monitor & retrain.

A tended adapter keeps winning.

The kit, shown.

Don't tune blind.Tune where it pays. Your model.

When prompting isn't enough.
A model that's yours.

Watch a model learn.
And watch it overfit.

Tuning is a pipeline,
not a weekend.

Don't tune blind.
Tune where it pays. Your model.