Getting Started

Structure-D in 5 minutes — from installation to your first extraction.

What is Structure-D?

Structure-D is a document extraction framework built from the ground up for high-throughput structured inference with vLLM. It ingests any document format and returns validated, schema-constrained structured data guaranteed to match your Pydantic model — using vLLM's guided_json constrained decoding to make invalid output physically impossible.

When vLLM is not available, the same pipeline runs unchanged against OpenAI, Anthropic, Gemini, or Ollama. Cloud providers are first-class citizens for prototyping, fallback, and frontier-model use cases.

It is designed around four principles:

  • vLLM-first — PagedAttention + continuous batching + guided_json means hundreds of concurrent extractions on a single GPU with guaranteed schema conformance.
  • Format-focused, not domain-specific — Structure-D handles file formats (PDF, DOCX, HTML, …), not document types. You define what you want to extract.
  • Schema-driven — Your Pydantic model is the contract. Structure-D validates every output and retries automatically on failure.
  • Async-first — Every I/O operation is async. Process thousands of documents in parallel without blocking.

Currently in beta

Structure-D v0.2.0 is production-ready for Phase 1 features. ETL scheduling and HITL workflows are on the roadmap for v0.3.0.

How it works

Every document flows through a six-stage pipeline:

  1. Ingestion — a format-appropriate parser extracts raw text, tables, and images from your file
  2. Preprocessing — text is normalized, cleaned, and chunked into optimal-size segments
  3. Model Routing — the router selects the best registered model for your task and schema
  4. Inference — vLLM uses guided_json constrained decoding to guarantee schema-valid output; other providers (OpenAI, Anthropic, Gemini, Ollama) use their structured output / function-calling APIs
  5. Validation — every output is validated against the Pydantic model; invalid outputs trigger an automatic retry with a refined prompt
  6. Storage — results are written to JSONL, CSV, Markdown, or a database destination
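
Conceptually, the six stages compose into a single async flow. The sketch below is illustrative only: plain functions stand in for Structure-D's real components, and none of these names are the library API.

```python
import asyncio
import json

# Illustrative stand-ins for the six pipeline stages (not the library API)
async def ingest(path): return f"raw text of {path}"           # 1. parse the file
async def preprocess(text): return [text]                      # 2. normalize + chunk
async def route(schema): return "local-vllm"                   # 3. pick a model
async def infer(model, chunk, schema):                         # 4. constrained decoding
    return json.dumps({"vendor": "Acme Corp"})
def validate(raw, schema): return json.loads(raw)              # 5. schema check (stubbed)
def store(records): return records                             # 6. write JSONL/CSV/DB

async def run(path, schema):
    chunks = await preprocess(await ingest(path))
    model = await route(schema)
    return store([validate(await infer(model, c, schema), schema) for c in chunks])

print(asyncio.run(run("invoice.pdf", dict)))  # [{'vendor': 'Acme Corp'}]
```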

vLLM is the primary inference engine

Structure-D is optimised for self-hosted vLLM. guided_json constrains token sampling directly to your JSON schema — the model physically cannot produce invalid output, so validation passes on the first attempt in almost all cases. Cloud providers (OpenAI, Anthropic, Gemini) and local Ollama are fully supported as drop-in alternatives for prototyping or environments without a GPU server.
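
Under the hood, a guided_json request to a vLLM OpenAI-compatible server carries the JSON Schema generated from your Pydantic model. A minimal sketch of the request shape, with a placeholder model name (extra_body is how the OpenAI Python client passes vLLM-specific parameters):

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_amount: float

# kwargs you would pass to client.chat.completions.create(...) against a
# vLLM server at http://localhost:8000/v1 (model name is a placeholder)
request = {
    "model": "my-local-model",
    "messages": [{"role": "user", "content": "Extract: Acme Corp owes $12.50"}],
    "extra_body": {"guided_json": Invoice.model_json_schema()},
}
print(sorted(request["extra_body"]["guided_json"]["properties"]))  # ['total_amount', 'vendor']
```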

Ingest → Preprocess → Route → Infer → Validate → Store

Quick start

Install Structure-D with the extras you need:

bash
pip install "structure-d[ingestion,api,llm]"

Then extract structured data from any document:

quickstart.py python
import asyncio
from pathlib import Path
from pydantic import BaseModel
from structure_d.pipeline import Pipeline
from structure_d.inference.providers import VLLMProvider  # primary engine

# 1. Define your extraction schema as a Pydantic model
class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_amount: float
    line_items: list[str]
    due_date: str | None = None

# 2. Create the pipeline — VLLMProvider uses guided_json for guaranteed schema conformance
async def main():
    pipeline = Pipeline(
        schema_cls=Invoice,
        provider=VLLMProvider(),     # default: http://localhost:8000/v1
        # provider=AnthropicProvider()  # swap in any cloud provider with no other changes
    )

    # 3. Run on a file — format is auto-detected (PDF, DOCX, HTML, images, ...)
    results = await pipeline.run(Path("invoice.pdf"))

    for result in results:
        print(result.structured_output)
        # {"vendor": "Acme Corp", "invoice_number": "INV-001", "total_amount": 1240.00, ...}

asyncio.run(main())

Structure-D automatically detects the file format, selects the appropriate parser, chunks the content, runs inference with constrained decoding (guided_json on vLLM, structured output on cloud providers), and validates the result — retrying up to 3 times with a refined prompt on failure.

Use built-in schemas

Don't want to define a schema? Use one of the built-ins: key_value, table, entity, form, classification, or summary.

python
from structure_d.pipeline import Pipeline
from structure_d.schemas.generic import get_schema
from structure_d.inference.providers import OpenAIProvider

# Use a built-in schema by name
KeyValue = get_schema("key_value")
pipeline = Pipeline(schema_cls=KeyValue, provider=OpenAIProvider())

Common use cases

Structure-D is built for any task where you need to pull structured fields out of a document.

Invoice & receipt processing

Extract vendor name, line items, totals, dates, and tax information from PDF or scanned invoices.

PDF · PNG · JPEG

Contract analysis

Pull clauses, parties, dates, obligations, and termination conditions from legal documents.

PDF · DOCX

Form digitisation

Convert paper forms, questionnaires, and survey responses into structured JSON records.

PDF · PNG · JPEG · HTML

Report extraction

Parse financial reports, research papers, and annual filings into tables, summaries, and key metrics.

PDF · DOCX · XLSX

Email triage

Classify incoming emails, extract intent, priority, named entities, and structured action items.

EML · MSG · HTML

Knowledge base RAG

Build a retrieval-augmented QA system over a document corpus with built-in vector indexing.

PDF · DOCX · HTML · TXT

Core concepts

Schema-driven extraction

Every extraction job is backed by a Pydantic model. Structure-D converts your model to a JSON Schema and passes it to the LLM as a guided_json constraint (vLLM) or a tool-use parameter (OpenAI, Anthropic). The model is forced to produce output that matches your type signature.

python
from pydantic import BaseModel, Field

class Contract(BaseModel):
    parties: list[str] = Field(description="Names of all signing parties")
    effective_date: str = Field(description="Contract start date in YYYY-MM-DD format")
    termination_clause: str | None = Field(None, description="Termination conditions verbatim")
    jurisdiction: str = Field(description="Governing law / jurisdiction")
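
The Field descriptions above are not decoration: Pydantic carries them into the generated JSON Schema, where they act as per-field instructions to the model. For example:

```python
from pydantic import BaseModel, Field

class Contract(BaseModel):
    parties: list[str] = Field(description="Names of all signing parties")
    jurisdiction: str = Field(description="Governing law / jurisdiction")

schema = Contract.model_json_schema()  # what gets passed as the constraint
print(schema["properties"]["parties"]["description"])  # Names of all signing parties
print(schema["required"])                              # ['parties', 'jurisdiction']
```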

Automatic format detection

When you pass a Path to pipeline.run(), Structure-D inspects the file extension and MIME type to select the right parser automatically. You can also specify a parser explicitly.

Extension | Default parser | What gets extracted
.pdf | pymupdf | Per-page text, page count
.pdf (with tables) | pdfplumber | Text + structured table data
.pdf (scanned) | ocr_pdf | OCR text via Tesseract
.png, .jpg, .tiff | tesseract_image | OCR text + base64 image for multimodal
.html, .htm | html | Structured text (headings, lists), tables, meta tags, links, section pages
.docx | docx | Heading-prefixed paragraphs, tables, core properties (author, title, dates), page count
.xlsx | xlsx | Per-sheet table data, cell values
.pptx | pptx | Per-slide text and notes
.eml | email | Subject, body, headers
.srt, .vtt | transcript | Cleaned transcript text (timestamps stripped)
.txt, .md | plaintext | Raw text
.csv | plaintext | Raw text (structured CSV rows)
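
The selection logic can be pictured as a suffix lookup with a MIME-type fallback. This is an illustrative sketch, not the real ParserRegistry:

```python
import mimetypes
from pathlib import Path

# Illustrative subset of the suffix-to-parser mapping above
PARSER_BY_SUFFIX = {
    ".pdf": "pymupdf", ".docx": "docx", ".html": "html",
    ".png": "tesseract_image", ".txt": "plaintext", ".md": "plaintext",
}

def pick_parser(path: Path) -> str:
    parser = PARSER_BY_SUFFIX.get(path.suffix.lower())
    if parser is None:
        # Fall back to MIME sniffing for unknown extensions
        mime, _ = mimetypes.guess_type(path.name)
        parser = "plaintext" if (mime or "").startswith("text/") else "unknown"
    return parser

print(pick_parser(Path("invoice.PDF")))  # pymupdf
```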

Validation and automatic retry

After inference, SchemaValidator runs the raw LLM output through Pydantic. If validation fails, RetryHandler sends the model a new prompt that includes the original output, the Pydantic error messages, and a reminder of the schema. The default retry budget is 3 attempts.

FAQ

Does Structure-D require a GPU?

No. When using cloud providers (OpenAI, Anthropic, Gemini), Structure-D runs entirely on CPU. The vllm and inference extras are only needed if you want to run local models on your own GPU server via vLLM.

Can I use it without an OpenAI API key?

Yes. Use OllamaProvider to run open-weight models locally via Ollama (CPU or GPU), or VLLMProvider to point at a self-hosted vLLM server. No external API key required.

What happens if the LLM returns invalid JSON?

SchemaValidator detects the validation failure and RetryHandler automatically retries with a corrective prompt. The prompt includes the original (invalid) response, the exact Pydantic validation errors, and the schema definition. After max_retries (default 3) attempts, the result is marked is_valid = False and returned with the validation_errors field populated.

How do I process an entire folder of PDFs?

Use pipeline.run_many(files, max_concurrent=8) where files is a list of Path objects. Structure-D processes them concurrently using asyncio.gather with a semaphore limit.

python
files = list(Path("invoices/").glob("*.pdf"))
results = await pipeline.run_many(files, max_concurrent=8, save_format="jsonl")
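
The underlying concurrency pattern, asyncio.gather bounded by a semaphore, can be sketched on its own (the per-file work is stubbed out):

```python
import asyncio

async def process_one(name: str, sem: asyncio.Semaphore) -> str:
    async with sem:                      # at most max_concurrent tasks run inside
        await asyncio.sleep(0)           # stand-in for one pipeline.run(...)
        return f"done: {name}"

async def run_many(names: list[str], max_concurrent: int = 8) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(process_one(n, sem) for n in names))

print(asyncio.run(run_many(["a.pdf", "b.pdf"])))  # ['done: a.pdf', 'done: b.pdf']
```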

Is the output deterministic?

Structure-D sets temperature=0.0 by default, making outputs as deterministic as possible. vLLM's guided_json mode further constrains output to the exact token space of your schema, eliminating structural variation entirely. Cloud providers (OpenAI, Anthropic) are near-deterministic at temperature 0.

Can I use my own custom file format?

Yes. Extend BaseParser, implement async def parse(self, file_path), and register your parser with the ParserRegistry. See the Custom parsers guide.
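
The registration pattern can be sketched like this. BaseParser and ParserRegistry below are stand-ins written for illustration, not Structure-D imports, and the exact signatures may differ:

```python
import asyncio
from pathlib import Path

class BaseParser:
    async def parse(self, file_path: Path) -> str: ...

class ParserRegistry:
    _parsers: dict[str, BaseParser] = {}
    @classmethod
    def register(cls, suffix: str, parser: BaseParser) -> None:
        cls._parsers[suffix] = parser
    @classmethod
    def get(cls, suffix: str) -> BaseParser:
        return cls._parsers[suffix]

class LogParser(BaseParser):
    async def parse(self, file_path: Path) -> str:
        # e.g. strip timestamps from a .log file; stubbed here
        return f"parsed {file_path.name}"

ParserRegistry.register(".log", LogParser())
print(asyncio.run(ParserRegistry.get(".log").parse(Path("app.log"))))  # parsed app.log
```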

Does it support multi-page documents?

Yes. PDF and Office parsers extract text per page. The Chunker then splits the document into overlapping segments. Each chunk is sent to the LLM independently, and pipeline.run() returns one ExtractionResult per chunk. For short documents that fit in a single chunk, the list contains just one result.
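
A minimal picture of overlapping chunking (the sizes here are illustrative, not the library defaults):

```python
def chunk(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    # Slide a window of `size` characters forward by `size - overlap` each step
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(250))
pieces = chunk(doc)
print(len(pieces))                        # 3 chunks for 250 characters
print(pieces[1][:20] == pieces[0][-20:])  # True: consecutive chunks share 20 chars
```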

How do I handle scanned PDFs vs. text PDFs?

Set parser_name="auto" (the default) and Structure-D will try pymupdf first. If the extracted text is suspiciously short (likely a scanned image), it automatically falls back to ocr_pdf using Tesseract. You can also force OCR explicitly: pipeline.run(path, parser_name="ocr_pdf").
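
The fallback heuristic is essentially a length check on what pymupdf extracted (the threshold below is illustrative; the real cutoff may differ):

```python
def choose_pdf_parser(extracted_text: str, min_chars: int = 50) -> str:
    # If text extraction finds almost nothing, assume a scanned image and OCR it
    return "pymupdf" if len(extracted_text.strip()) >= min_chars else "ocr_pdf"

print(choose_pdf_parser(""))  # ocr_pdf
print(choose_pdf_parser("Invoice #123 from Acme Corp, total $1,240.00 due 2024-06-01."))  # pymupdf
```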