Stack Innovations
Computer vision that looks at an image and returns something you can act on: boxes around the objects, text lifted off the page, regions masked out, and clean structured fields. Specialised detectors find the content, Claude multimodal reads it, and every release is measured against an accuracy target before it ships.
A ledger of named systems where the line that moved was a real detection, a field read off a page, a defect caught — accuracy, review time, throughput, errors avoided. 6 of 30 shown · ledger updates as systems scale.
Point it at a scene and watch the pipeline run: a detector boxes every object it finds, the confidence threshold decides which survive, and the task toggle switches the job — detect objects, read text off the page, or segment regions. Turn on structure and the boxes become clean, typed fields you can store.
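A minimal sketch of the threshold step, assuming detections arrive as simple label, score, and box records. The `Detection` class, the `filter_by_confidence` helper, and the sample values are illustrative, not a production schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detector output (hypothetical schema for illustration)."""
    label: str
    confidence: float  # detector score in [0, 1]
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def filter_by_confidence(detections, threshold):
    """Keep only detections scoring at or above the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up sample detections from a single frame.
detections = [
    Detection("pallet", 0.92, (10, 20, 110, 220)),
    Detection("pallet", 0.41, (300, 40, 380, 200)),
    Detection("forklift", 0.77, (150, 60, 290, 260)),
]

kept = filter_by_confidence(detections, threshold=0.5)
```

Raising the threshold trades recall for precision: fewer false boxes survive, but more real objects are dropped. That trade-off is why the value is worth tuning against labelled images rather than by eye.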
Quality hides in the stages between the pixels and the field — how you pre-process, which detector you run, whether you OCR or segment, how you turn boxes into typed structure. This is the room we work in: each stage measured, each model chosen for the job it's best at.
Before a single model runs, we pin down what counts as a correct detection, where the edge cases hide, and what the output has to look like downstream. Then we build an evaluation set — labelled images we'll grade every release against.
Vision lives or dies on data. We gather real images from the real environment — the actual cameras, lighting, and angles — then label them carefully. Garbage labels train garbage detectors; this is where most vision projects quietly fail.
Pick the right detector for the job — YOLO for fast real-time boxes, Detectron2 for accuracy, Segment Anything when a box isn't precise enough. Fine-tune on your labels so it learns your classes, not a generic benchmark's.
Boxes and masks aren't the answer; structure is. OCR reads text out of the regions, then Claude multimodal understands the crop and returns typed, validated fields. When the image doesn't support a clean read, the output is an honest "low confidence, send to a human."
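Here is one way the "typed, validated, or send to a human" step could look. This sketch assumes the reading stage has already produced a dictionary of raw strings plus an overall confidence score. The invoice fields, the `structure_read` helper, and the 0.80 cut-off are all hypothetical, and the model call itself is left out:

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

REVIEW_THRESHOLD = 0.80  # assumed cut-off; tune against your own eval set


@dataclass
class InvoiceFields:
    """Typed target structure (illustrative example fields)."""
    invoice_number: str
    issued_on: date
    total: Decimal


@dataclass
class ReadResult:
    fields: Optional[InvoiceFields]
    needs_human_review: bool
    reason: str


def structure_read(raw: dict, confidence: float) -> ReadResult:
    """Turn raw strings into typed fields, or route the image to a person."""
    if confidence < REVIEW_THRESHOLD:
        return ReadResult(None, True, "low confidence")
    try:
        fields = InvoiceFields(
            invoice_number=raw["invoice_number"].strip(),
            issued_on=datetime.strptime(raw["issued_on"], "%Y-%m-%d").date(),
            total=Decimal(raw["total"]),
        )
    except (KeyError, ValueError, TypeError, AttributeError, InvalidOperation):
        return ReadResult(None, True, "validation failed")
    if not fields.invoice_number or fields.total < 0:
        return ReadResult(None, True, "validation failed")
    return ReadResult(fields, False, "ok")


good = structure_read(
    {"invoice_number": "INV-1042", "issued_on": "2024-03-15", "total": "1280.50"},
    confidence=0.93,
)
blurry = structure_read(
    {"invoice_number": "INV-1O42", "issued_on": "2024-03-15", "total": "1280.50"},
    confidence=0.55,
)
malformed = structure_read(
    {"invoice_number": "INV-1043", "issued_on": "15/03/2024", "total": "12x0"},
    confidence=0.95,
)
```

The design point is that a failed parse and a low score lead to the same place: a review queue, never a guessed value in the database.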
Run the frozen eval set on every change and score it: precision, recall, mAP, field accuracy. No "looks good in the demo." Either a number moves on the real test images, or the change doesn't ship.
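To make the scoring concrete, here is a simplified sketch of precision and recall for boxes on one image, matched at a single IoU threshold. Full mAP averages over classes and confidence levels and is not shown. The greedy matching, the 0.5 IoU threshold, and the `ships` gate are assumptions for illustration:

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, truth, iou_threshold=0.5):
    """Greedy one-to-one matching.

    `predicted` is assumed to be sorted by descending confidence.
    """
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            unmatched.remove(best)  # each labelled box can match only once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def ships(candidate, baseline):
    """Assumed gate: no metric regresses and at least one improves."""
    no_regression = all(c >= b for c, b in zip(candidate, baseline))
    return no_regression and any(c > b for c, b in zip(candidate, baseline))


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(1, 1, 10, 10), (50, 50, 60, 60)]  # one hit, one false positive
scores = precision_recall(predicted, truth)
```

In practice the same function would run over every image in the frozen set, with counts summed before dividing, so a handful of easy images can't hide a regression on the hard ones.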
We go live with optimised models on GPU, batched for throughput, with logging on every frame. We watch accuracy as cameras and lighting drift, catch model decay before users do, and keep the system seeing clearly as the world in front of the lens changes.
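One cheap early-warning signal that per-frame logging makes possible is a rolling average of detector confidence compared against a baseline. The sketch below is an assumption about how such a monitor might look, not a description of the production system. Confidence is only a proxy, so an alert here should trigger a proper re-score on labelled images. The class name, window size, and tolerance are illustrative:

```python
from collections import deque


class ConfidenceDriftMonitor:
    """Flags a sustained drop in mean detection confidence (a proxy signal)."""

    def __init__(self, baseline_mean, window=500, max_drop=0.10):
        self.baseline_mean = baseline_mean  # measured at release time
        self.max_drop = max_drop            # tolerated drop before alerting
        self.scores = deque(maxlen=window)  # per-frame means, oldest evicted

    def log_frame(self, confidences):
        """Record one frame; frames with no detections are skipped."""
        if confidences:
            self.scores.append(sum(confidences) / len(confidences))

    def drifting(self):
        """True once a full window sits more than max_drop below baseline."""
        if len(self.scores) < self.scores.maxlen:
            return False
        rolling_mean = sum(self.scores) / len(self.scores)
        return self.baseline_mean - rolling_mean > self.max_drop


monitor = ConfidenceDriftMonitor(baseline_mean=0.85, window=3, max_drop=0.10)

for frame in ([0.90, 0.80], [0.86], [0.84]):
    monitor.log_frame(frame)
healthy = monitor.drifting()

# Simulated degradation, for example a lens getting dirty.
for frame in ([0.60], [0.65], [0.70]):
    monitor.log_frame(frame)
degraded = monitor.drifting()
```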
Every eval run feeds the next: missed objects become new training data, false positives tighten the threshold, hard scenes get labelled and added. Ship-and-forget vision decays as cameras, lighting, and the world drift. Evaluated and re-trained, accuracy compounds.
The models, runtimes, and tools we actually wire together to ingest, detect, segment, read, and structure. No mystery framework — just the kit that keeps detections accurate and fields clean.
A free vision audit to start — bring a folder of your real images and the result you need out of them, and we'll show you what a detection-to-structure pipeline would return, and where today's manual process falls down. A prototype, not a pitch.
Get a vision audit →