The demo is always easy. You take a document, write a prompt, get back a JSON object with the fields you wanted. Thirty minutes of work, clean output, done.
Then you try to run it at scale and everything breaks in a different way.
The documents are messier than your test samples. The same field appears in four different formats. Some documents are in a different language. Some are scanned PDFs with OCR artifacts. The model occasionally hallucinates values that aren’t in the source. Your schema evolved and nothing was updated to match.
Extraction at scale is an engineering problem, not a prompting problem. This is what it actually involves.
Define your schema explicitly before writing any code
The most common mistake in extraction pipelines is treating the schema as an afterthought. Teams write the prompt first and figure out the schema from the outputs.
This produces pipelines where the schema is implicit, unversioned, and in constant flux. When you need to add a field, you’re guessing at what the model currently returns. When a downstream system breaks because a field type changed, you’re reconstructing what changed and when.
Start with an explicit schema definition. Use JSON Schema, Pydantic, Zod, or whatever fits your stack — the specific tool matters less than the fact that it’s written down and version-controlled. Your schema should specify:
- Every field by name and type
- Which fields are required vs. optional
- Allowed values for enumerated fields
- Nested structures and array types explicitly
This schema becomes the contract between your extraction pipeline and everything that consumes its output. It doesn’t constrain what the model can do; it constrains what your pipeline accepts.
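As a sketch of what "written down and version-controlled" looks like in practice, here is a minimal Pydantic model. The field names (`CompanyRecord`, `founded_year`, and so on) are illustrative assumptions, not a prescribed schema:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class CompanyStatus(str, Enum):
    # Enumerated field: only these values pass validation.
    ACTIVE = "active"
    DISSOLVED = "dissolved"

class CompanyRecord(BaseModel):
    # Required fields: an extraction missing these is a validation failure.
    name: str
    status: CompanyStatus
    # Optional fields: absent from the document is a valid result, not an error.
    founded_year: Optional[int] = None
    subsidiaries: list[str] = Field(default_factory=list)
```

Checking this file into version control gives you the history you need when a downstream consumer asks what changed and when.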
Separate the extraction from the validation
A common implementation collapses these into one step: prompt the model, parse the JSON, use the result. This works until the model output doesn’t conform to what you expected.
The more maintainable approach keeps them separate:
Source document
→ Extraction (model inference)
→ Raw output
→ Validation (schema check, type coercion, required field check)
→ Structured result (or validation error)
The extraction step doesn’t care about validity. Its job is to get the model to produce something JSON-shaped that maps to your schema. The validation step handles everything after that: confirming types, checking for required fields, normalizing values.
When validation fails, you have two choices: retry with correction prompting (pass the validation errors back to the model and ask it to fix the output), or surface the failure for human review. Which you choose depends on whether the failure is likely fixable by the model or likely a document that doesn’t contain the data you’re looking for.
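A minimal sketch of the retry-with-correction loop, again using Pydantic. `call_model` is a stand-in for your inference client, and the `Extraction` model is a hypothetical two-field schema:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Extraction(BaseModel):
    name: str
    founded_year: Optional[int] = None

def extract_with_validation(document: str, call_model, max_retries: int = 2):
    """call_model(prompt) -> raw model text; a stand-in for your inference client."""
    prompt = f"Extract name and founded_year as JSON from:\n{document}"
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            # Validation step: schema check plus type coercion, separate from extraction.
            return Extraction.model_validate_json(raw)
        except ValidationError as e:
            # Correction prompting: feed the validation errors back to the model.
            prompt = f"Your previous output failed validation:\n{e}\nFix it. Output JSON only."
    return None  # exhausted retries: surface for human review
```

The design point is that the extraction step never silently passes bad output downstream; everything crosses the validation boundary first.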
Handle document variance explicitly
Real-world documents don’t look like your test samples. You need to account for this at the pipeline design level, not the prompt level.
Format variance. The same information might appear as “Founded: 1998”, “Founded in 1998”, “Est. 1998”, or just “1998” depending on the document. Your schema accepts an integer (or null). Your extraction needs to handle all the surface forms and produce the integer, not just work for the canonical form.
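One cheap defense is to normalize surface forms in post-processing rather than relying on the model to do it. A sketch for the founding-year case (the function name and year range are assumptions):

```python
import re
from typing import Optional

def normalize_founded_year(raw: str) -> Optional[int]:
    """Map surface forms like 'Founded: 1998', 'Est. 1998', or '1998' to an integer.

    Returns None when no plausible year (1500-2099 here, an assumed range) is found.
    """
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", raw)
    return int(match.group(1)) if match else None
```

Deterministic normalizers like this are also much easier to unit-test than prompt behavior.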
Missing data. Not every document contains every field in your schema. A document that doesn’t mention the company’s founding year isn’t a failure — it’s a valid extraction result where founded_year is null. Your pipeline needs to distinguish between “field wasn’t in the document” and “field was in the document but couldn’t be extracted.”
Long documents. Models have context limits. A document that’s 50 pages of dense text can’t necessarily be passed to a model as-is. You need a strategy: chunk the document and extract from each chunk, use retrieval to identify the relevant sections first, or use a model with a larger context window. Each approach has different cost/accuracy tradeoffs.
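The chunk-and-extract strategy can be sketched as below. Chunk sizes, overlap, and the first-non-null merge policy are all illustrative choices, not the only reasonable ones:

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping chunks that fit a context window."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap so fields spanning a boundary aren't lost
    return chunks

def merge_extractions(results: list[dict]) -> dict:
    """Merge per-chunk results: keep the first non-null value seen for each field."""
    merged: dict = {}
    for r in results:
        for key, value in r.items():
            if merged.get(key) is None and value is not None:
                merged[key] = value
    return merged
```

The merge policy is where the accuracy tradeoff lives: first-non-null is simple, but conflicting values across chunks may deserve a confidence vote or human review instead.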
Multilingual content. If your documents can be in multiple languages, decide explicitly whether to translate first or extract directly. Extraction directly from non-English text works but accuracy varies. Translation adds cost and latency but produces more consistent extraction.
Design for idempotency and replay
Extraction pipelines process documents in batches, often asynchronously. Documents fail, infrastructure has incidents, schemas change. You need to be able to re-run extraction on a document without creating duplicates, corrupted state, or inconsistent results.
This means:
Store raw inputs durably before processing. Before you extract anything, write the source document to persistent storage with a stable identifier. If extraction fails or the output needs to be recalculated, you can replay from the stored input without fetching it again.
Use document fingerprints. Hash the source document content. If you need to re-extract, you can check whether the content has actually changed since the last extraction. If it hasn’t, return the cached result rather than re-running inference.
Track extraction state explicitly. Don’t infer whether a document has been processed from whether an output record exists. Maintain explicit state: pending, processing, complete, failed. This makes it possible to reliably identify what needs to be re-run after an infrastructure failure.
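The three points above fit together as in this sketch. The in-memory store is a stand-in for a real database, and the class and state names are assumptions:

```python
import hashlib
from enum import Enum

class DocState(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETE = "complete"
    FAILED = "failed"

def fingerprint(content: bytes) -> str:
    """Stable content hash: unchanged input means a cached result, no re-inference."""
    return hashlib.sha256(content).hexdigest()

class ExtractionStore:
    """In-memory stand-in for durable storage; yours would be a database."""
    def __init__(self):
        self._results = {}  # fingerprint -> extraction result
        self._state = {}    # document_id -> DocState

    def get_or_extract(self, doc_id: str, content: bytes, run_extraction):
        fp = fingerprint(content)
        if fp in self._results:  # content unchanged: replay the cached result
            return self._results[fp]
        self._state[doc_id] = DocState.PROCESSING
        try:
            result = run_extraction(content)
            self._results[fp] = result
            self._state[doc_id] = DocState.COMPLETE
            return result
        except Exception:
            self._state[doc_id] = DocState.FAILED  # explicit state, not inferred
            raise
```

Because results are keyed by content fingerprint rather than by document ID, replaying the same document is idempotent by construction.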
Cost and latency at volume
A single extraction might cost $0.002 and take 1.5 seconds. At 10,000 documents per day, that’s $20/day and a pipeline that takes 4+ hours to process a day’s input sequentially.
A few approaches that matter:
Batch where the model allows it. Some models and providers support batch inference at lower cost. If your use case tolerates higher latency in exchange for lower cost (overnight processing, for example), batch APIs are worth using.
Cache aggressively. If the same document appears multiple times — which is more common in real-world datasets than you might expect — return the cached result rather than re-running inference.
Right-size the model. A large frontier model isn’t always the right choice for extraction. For well-defined schemas with limited document variance, smaller models are often faster, cheaper, and nearly as accurate. Test smaller models before defaulting to the largest one available.
Parallelize with care. Parallel extraction is faster, but it interacts with rate limits and can produce burst costs that don’t show up until the end of the month. Set explicit concurrency limits and monitor your token usage rate, not just your document count.
Schema evolution without breaking downstream systems
Extraction schemas change. New requirements emerge, the source documents evolve, you discover fields that should have been in the schema from the start.
The safest pattern is additive-only changes:
- New optional fields are always safe to add
- Widening a type (integer → integer | string) is generally safe to produce, though consumers that assumed the narrower type may need notice
- Renaming a field, removing a field, or narrowing a type requires versioning
Version your schema explicitly and expose the schema version in your extraction outputs:
{
  "schema_version": "2.1",
  "extracted_at": "2026-01-21T09:14:32Z",
  "document_id": "doc_01HY...",
  "data": { ... }
}
Downstream consumers can check schema_version and handle changes explicitly rather than discovering them from runtime errors.
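On the consumer side, that check can be a guard at the boundary. A sketch, assuming the major version carries the breaking changes (the function name and supported version are illustrative):

```python
def consume(record: dict) -> dict:
    """Downstream consumer: check schema_version before touching the payload."""
    major = int(record["schema_version"].split(".")[0])
    if major != 2:  # this consumer only understands 2.x schemas
        raise ValueError(f"unsupported schema version {record['schema_version']}")
    return record["data"]
```

Failing loudly on an unknown major version turns a silent data corruption bug into an immediate, diagnosable error.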
When to use a dedicated extraction service
Building all of this in-house is feasible but not small. A production-ready extraction pipeline with schema validation, retry logic, idempotency, and monitoring represents several weeks of engineering work before you get to the domain-specific logic.
The build-vs-use-a-service question comes down to a few factors:
- Is extraction core to your product, or a component in a larger system?
- Do you need custom model fine-tuning for specialized domains?
- Do you have specific data residency or compliance requirements that limit what services you can use?
If extraction is a component (most cases), a dedicated service gives you the reliability infrastructure without the engineering cost, and lets you focus on what you’re actually building.
Structr is a structured data extraction API with schema validation, typed output guarantees, and retry logic built in. Define your schema, point it at your documents, get reliable structured output.