Reading scanned PDFs with vision models: building a production LLM content pipeline
Most "AI content" demos work because the input is clean. You paste in tidy text, the model summarizes it, everyone nods. Production is different. Real input is a firehose of inconsistent documents — some native digital text, some scanned PDFs of uneven quality, some near-duplicates of each other — arriving faster than any human team can triage.
We recently built a content pipeline for a digital media publication that had exactly this problem: hundreds of source documents per day that needed to become a single ranked, deduplicated daily digest, with a searchable archive and an editorial admin panel behind it. No large editorial team to throw at it. The pipeline had to do the triage.
This post walks through the architecture and the decisions that actually mattered — the ones that separate a demo from a system that runs every day without someone babysitting it.
The shape of the problem
Strip away the domain specifics and the pipeline does five things, in order:
- Ingest documents in mixed formats, including scanned PDFs.
- Extract candidate articles from each document.
- Cluster and deduplicate overlapping coverage of the same story.
- Score each story by editorial importance.
- Deliver a ranked digest, plus a searchable web archive and an admin panel for oversight.
Each stage looks simple in isolation. The difficulty is that errors compound: a bad OCR pass poisons extraction, weak deduplication inflates the digest with repeats, and a naive importance score buries the one story that mattered. So the real work was making each stage robust enough that the next one could trust its output.
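One way to picture "each stage robust enough that the next one could trust its output" is a runner that checks every stage's output at the boundary. This is a minimal sketch, not the project's actual code: `Stage`, `StageError`, `run_pipeline`, and the toy stage functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Stage:
    name: str
    run: Callable[[list], list]    # transforms the previous stage's output
    check: Callable[[list], bool]  # cheap sanity check on this stage's output


class StageError(RuntimeError):
    """Raised when a stage's output fails its own sanity check."""


def run_pipeline(stages: Iterable[Stage], documents: list) -> list:
    data = documents
    for stage in stages:
        data = stage.run(data)
        if not stage.check(data):
            # Fail loudly at the boundary instead of passing bad data downstream.
            raise StageError(f"stage '{stage.name}' produced output that failed its check")
    return data


# Toy stages standing in for the real extract and deduplicate steps.
stages = [
    Stage("extract",
          lambda docs: [d.strip() for d in docs if d.strip()],
          lambda out: len(out) > 0),
    Stage("dedupe",
          lambda items: sorted(set(items)),
          lambda out: len(out) == len(set(out))),
]

digest = run_pipeline(stages, [" story a ", "story b", "story a", ""])
```

The point of the design is that a failure surfaces at the stage that caused it, rather than as a strange digest several steps later.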
Why we used vision models for OCR instead of a traditional engine
The single biggest quality lever was how we read scanned PDFs.
Traditional OCR engines are built to transcribe characters. They do well on clean, high-contrast scans and degrade badly on everything else — skewed pages, multi-column layouts, low-resolution faxed documents, tables, mixed fonts. When the transcription is wrong, it is wrong in ways that are hard to detect downstream: a garbled headline still looks like text, so the extraction stage happily treats nonsense as a real article.
Vision-capable LLMs take a different approach. Instead of transcribing characters in isolation, a vision model reads the page more the way a person does, using layout, context, and language understanding to resolve ambiguity. A smudged word in a headline gets corrected by the surrounding sentence. A two-column layout gets read in the right order. A table comes out structured rather than as a jumble of numbers.
The practical result: the pipeline reads a scanned PDF roughly as accurately as it reads native digital text. That single change removed an entire class of downstream failures that would otherwise have required manual cleanup.
The trade-off is cost and latency — vision inference is more expensive than a classical OCR pass. We handled that by routing only the documents that needed it (scanned or image-based inputs) through the vision path, while native-text documents took a cheaper route. Which leads to the part of the system we are most glad we built.
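The routing decision can be sketched as a check on how much usable text a document's text layer already contains. This is an illustrative heuristic under stated assumptions, not the rule the pipeline actually used: `Route`, `choose_route`, and both thresholds are hypothetical, and `page_texts` is assumed to come from whatever PDF library extracts the text layer.

```python
from enum import Enum


class Route(Enum):
    NATIVE_TEXT = "native_text"  # cheap path: the text layer is good enough
    VISION = "vision"            # expensive path: send page images to a vision model


# Assumed thresholds; real values would be tuned on sample documents.
MIN_CHARS_PER_PAGE = 200
MIN_USABLE_PAGE_RATIO = 0.8


def choose_route(page_texts: list[str]) -> Route:
    """Pick a route from the per-page text layer of one document."""
    if not page_texts:
        # No pages with extractable text at all: treat as image-based.
        return Route.VISION
    usable = sum(1 for text in page_texts if len(text.strip()) >= MIN_CHARS_PER_PAGE)
    if usable / len(page_texts) >= MIN_USABLE_PAGE_RATIO:
        return Route.NATIVE_TEXT
    return Route.VISION


native_doc = ["lorem ipsum " * 50] * 5  # every page has a real text layer
scanned_doc = ["", "  ", ""]            # scanned pages expose no text layer
```

With these inputs, `choose_route(native_doc)` returns `Route.NATIVE_TEXT` and `choose_route(scanned_doc)` returns `Route.VISION`. A character count is a crude signal; a scanned PDF with a poor embedded OCR layer would pass it, so a production check would likely need a quality test as well.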
Avoiding provider lock-in: the swappable model layer
The LLM landscape moves monthly. A model that is the best price-to-quality choice today may be beaten next quarter; a provider you depend on may change pricing, deprecate a model, or hit a capacity ceiling during your peak. Hard-coding a single provider into a daily pipeline is a standing liability.
So every model call in the pipeline goes through a provider-agnostic abstraction: a thin internal interface that describes what we need (extract candidates from this document, cluster these items, score this story) rather than which provider does it. Behind that interface, each task is mapped to a provider and model chosen for that task's needs.
Three things this bought us:
- Swap providers without touching pipeline logic. When a cheaper or better model appears for a task, it is a one-line config change, not a refactor.
- Match the model to the job. High-volume extraction runs on a cheap fast model; the reasoning-heavy importance scoring runs on a stronger one. You do not pay flagship prices for work a small model handles fine.
- Graceful failover. If a provider degrades or rate-limits during a run, the router can fall back to an alternate for that task instead of failing the whole digest.
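The three properties above can be sketched in a few dozen lines. This is a simplified illustration, not the production layer: the `ModelRouter` name appears in the stack description, but its interface here, along with `Target`, `ProviderError`, the provider names, and the model names, is assumed. Providers are modeled as plain functions so the example runs without any SDK.

```python
from dataclasses import dataclass
from typing import Callable

# A provider is anything that turns (model name, prompt) into text.
ProviderFn = Callable[[str, str], str]


@dataclass(frozen=True)
class Target:
    provider: str
    model: str


class ProviderError(RuntimeError):
    """Raised by a provider adapter on rate limits, outages, and similar failures."""


class ModelRouter:
    def __init__(self, providers: dict[str, ProviderFn], routes: dict[str, list[Target]]):
        self._providers = providers
        self._routes = routes  # task name -> ordered list of targets (primary first)

    def call(self, task: str, prompt: str) -> str:
        errors = []
        for target in self._routes[task]:
            try:
                return self._providers[target.provider](target.model, prompt)
            except ProviderError as exc:
                # Remember the failure and fall through to the next target.
                errors.append(f"{target.provider}/{target.model}: {exc}")
        raise ProviderError(f"all targets failed for task '{task}': " + "; ".join(errors))


# Fake providers standing in for real SDK adapters.
def flaky_provider(model: str, prompt: str) -> str:
    raise ProviderError("rate limited")


def steady_provider(model: str, prompt: str) -> str:
    return f"[{model}] {prompt}"


router = ModelRouter(
    providers={"provider_a": flaky_provider, "provider_b": steady_provider},
    routes={
        # Cheap model for bulk extraction, with a fallback on another provider.
        "extract": [Target("provider_a", "small-fast"), Target("provider_b", "small-fast-alt")],
        # Stronger model for importance scoring.
        "score": [Target("provider_b", "strong-reasoner")],
    },
)

result = router.call("extract", "list the articles in this document")
```

Here the primary target for `extract` fails, so the call is served by the fallback. Swapping a model is an edit to the `routes` mapping; pipeline code only ever names the task.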
If you build one thing into an LLM system that you expect to run for years, build this. The model you start with is almost never the model you finish with.
The hard part: clustering and deduplication
Extraction — pulling candidate articles out of a document — is mostly solved by a capable model and a good prompt. The genuinely hard stage is deduplication, because "the same story" is fuzzy.
Two documents might cover the same event with completely different wording, length, and framing. A naive exact-match or simple similarity threshold either misses real duplicates (and floods the digest with repeats) or collapses distinct stories into one (and hides coverage). Neither is acceptable in a digest people rely on.
Our approach was a two-pass design:
- Embed and cluster. Every candidate is embedded into a vector representation and grouped by semantic similarity. This collapses the obvious near-duplicates and surfaces clusters of related coverage cheaply, without an expensive model call per pair.
- Resolve with a model. For ambiguous clusters — items that are close but not clearly the same story — a reasoning model makes the call: are these the same underlying event, or distinct stories that happen to share vocabulary? The model also picks the strongest version within a confirmed cluster and writes the canonical rewrite.
This split matters for cost and quality both. Embeddings do the cheap bulk work; the expensive reasoning model is reserved for the genuinely ambiguous cases, which are a small fraction of the total. Throwing a frontier model at every pairwise comparison would be slower and far more expensive for no quality gain.
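A rough sketch of the two-pass idea, under several simplifying assumptions: embeddings are plain lists of floats, the thresholds are invented, ambiguity is decided per pair rather than per cluster, and `same_story` is a hypothetical callback standing in for the reasoning-model call. The all-pairs loop is only reasonable for small batches; a vector index would replace it at scale.

```python
import math
from itertools import combinations
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(
    embeddings: Sequence[Sequence[float]],
    same_story: Callable[[int, int], bool],
    high: float = 0.90,  # at or above: merge without asking a model
    low: float = 0.75,   # between low and high: ambiguous, ask the model
) -> list[list[int]]:
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    # Pass 1: cheap arithmetic merges the obvious duplicates.
    ambiguous = []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if sim >= high:
            union(i, j)
        elif sim >= low:
            ambiguous.append((i, j))

    # Pass 2: only still-unmerged ambiguous pairs reach the expensive resolver.
    for i, j in ambiguous:
        if find(i) != find(j) and same_story(i, j):
            union(i, j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


vectors = [[1.0, 0.0], [0.99, 0.05], [0.8, 0.6], [0.0, 1.0]]
strict = dedupe(vectors, same_story=lambda i, j: False)
lenient = dedupe(vectors, same_story=lambda i, j: True)
```

Items 0 and 1 merge in the first pass. Item 2 falls in the ambiguous band, so whether it joins them depends entirely on the resolver, and item 3 is never compared by a model at all.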
Scoring by importance, not volume
A digest that shows everything is not a digest. The editorial promise is ranked output — the most important stories first, the long tail demoted or dropped.
Importance is subjective, so we did not try to hard-code it. The scoring stage uses a model prompted with the publication's editorial priorities to rank each deduplicated story, producing a score the admin panel can sort by and the digest can threshold against. Editors see ranked output and retain override control — the system proposes, the humans dispose. That last point matters: the goal was to remove the grunt work of triage, not to remove editorial judgment.
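The "system proposes, humans dispose" rule can be expressed as a score with an optional override. This is a minimal sketch with assumed details: the `Story` fields, the 0 to 1 score range, the threshold, and the `limit` are all hypothetical, and the model call that produces `model_score` is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Story:
    title: str
    model_score: float                    # proposed by the scoring model (assumed 0 to 1)
    editor_score: Optional[float] = None  # set in the admin panel; wins when present

    @property
    def effective_score(self) -> float:
        return self.model_score if self.editor_score is None else self.editor_score


def build_digest(stories: list[Story], threshold: float = 0.5, limit: int = 10) -> list[Story]:
    """Drop the long tail, then rank what is left, highest score first."""
    kept = [s for s in stories if s.effective_score >= threshold]
    kept.sort(key=lambda s: s.effective_score, reverse=True)
    return kept[:limit]


stories = [
    Story("Budget vote passes", model_score=0.82),
    Story("Minor road closure", model_score=0.30),
    Story("Local school wins award", model_score=0.35, editor_score=0.90),  # editor promoted it
    Story("Quarterly filing published", model_score=0.65, editor_score=0.10),  # editor demoted it
]

digest_titles = [s.title for s in build_digest(stories)]
```

Keeping the model's score alongside the override, rather than overwriting it, also leaves a record of where editors disagreed with the model.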
What we would tell a team starting this today
A few lessons that generalize beyond this project:
- The engineering matters more than the model. The wins came from architecture — routing inputs correctly, the two-pass dedup, the provider abstraction — not from picking a magic model. Any capable model would have worked; a weak pipeline around a great model still produces a bad digest.
- Isolate the expensive calls. Use cheap operations (embeddings, small models) for bulk work and reserve frontier reasoning for the small set of cases that need it. This is the difference between a pipeline that is affordable to run daily and one that is not.
- Assume your inputs are worse than the demo. Build for the scanned, skewed, near-duplicate reality from day one. The clean-input version is a prototype; the messy-input version is the product.
- Decouple from providers before you need to. The cost of the abstraction is small; the cost of being locked in when the landscape shifts is large.
- Keep a human in the loop where judgment lives. Automate the triage, not the editorial decision. Systems that respect that line get adopted; systems that overreach get switched off.
The stack, briefly
The front end, admin panel, and public web archive run on Next.js, deployed on Vercel. The processing pipeline itself runs as scheduled Python workers on Railway, triggered by a daily cron, reading and writing to Postgres and object storage, and reaching every model through the provider-agnostic layer described above. Nothing exotic — the value was in how the pieces fit together, not in the individual choices.
[Figure: architecture diagram. Labels: Clients; Vercel · Next.js; Railway · Python pipeline; Data; Model providers · ModelRouter.]
If your team is sitting on a pile of messy inputs and wondering whether an LLM can actually make sense of them, the answer is yes — but the answer is also that the engineering around the model is where the project succeeds or fails. That is the kind of work we do at Lumaya Partners: AI systems designed and shipped to production, not prototyped and abandoned. If that is what you need, tell us about it.