v0.2.0 · Stable

Extract structured data
from any document

Built for vLLM. Structure-D ingests PDFs, DOCX, HTML, images, and 11 other formats and uses guided_json constrained decoding to guarantee schema-valid output at scale. Drop-in support for OpenAI, Anthropic, Gemini, and Ollama when a GPU is not available.

$ pip install "structure-d[ingestion,api,llm]"

or install the Rust CLI for native batch extraction

14+ File Formats
5 LLM Providers
110 Tests Passing
<1ms Validation overhead


Docs, API & Changelog

One place for documentation, AI reference, and release history.

Docs

Complete documentation

Installation guides, configuration reference, and step-by-step tutorials for every feature — from basic extraction to advanced RAG pipelines.

  • Getting Started
  • Installation & Setup
  • Configuration
  • Basic Usage
  • Advanced Guides
Read the docs →
API

API reference

Full API reference for the Pipeline, all LLM providers, schemas, validators, and storage writers. Typed signatures and live examples.

  • Core Pipeline
  • vLLM + guided_json (primary)
  • OpenAI / Anthropic / Gemini / Ollama
  • Model Registry & Router
  • Schemas & Validators
Explore the API →
Changelog

Release history

Every release documented. See what's changed, what's fixed, and what's on the roadmap for v0.3 and beyond.

  • v0.2.0 — RAG layer + Rust CLI
  • v0.1.0 — Core pipeline launch
  • Roadmap: ETL scheduling
  • Roadmap: HITL review
  • MIT Licensed
View changelog →

Features

Everything you need to
structure unstructured data

Multi-Format Ingestion

Parse PDFs, images, HTML, DOCX, XLSX, PPTX, emails, audio transcripts, and plain text through a unified async interface.
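Behind a unified interface, format detection typically keys off the file extension before dispatching to a parser. A simplified stand-in sketch (the mapping and function name here are illustrative, not Structure-D's actual detector, which would also sniff file contents):

```python
from pathlib import Path

# Simplified extension map; a real detector also sniffs file contents.
EXTENSION_MAP = {
    ".pdf": "pdf", ".png": "image", ".jpg": "image", ".html": "html",
    ".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx", ".eml": "email",
    ".csv": "csv", ".md": "markdown", ".txt": "text",
}

def detect_format(path: Path) -> str:
    """Map a file path to a parser key, case-insensitively."""
    fmt = EXTENSION_MAP.get(path.suffix.lower())
    if fmt is None:
        raise ValueError(f"unsupported format: {path.suffix}")
    return fmt

print(detect_format(Path("invoice.PDF")))  # pdf
```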

Schema-Driven Extraction

Define any Pydantic model as your extraction target. Built-in schemas for key-value, table, entity, form, classification, and summary.

vLLM-First Inference

Built for vLLM — guided_json constrained decoding guarantees schema-valid output at high throughput. OpenAI, Anthropic, Gemini, and Ollama available as drop-in alternatives.
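Concretely, guided_json works by attaching a JSON Schema to the request sent to vLLM's OpenAI-compatible server, which then constrains token sampling so only schema-valid output can be produced. A minimal sketch of such a payload (the model name is a placeholder and the schema is hand-written to mirror the Invoice model in the Quick Start):

```python
import json

# Hand-written JSON Schema mirroring the Invoice model from the Quick Start.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total", "line_items"],
}

def build_guided_request(prompt: str, schema: dict) -> dict:
    """Build a chat-completion payload using vLLM's guided_json extension.

    vLLM's OpenAI-compatible server accepts `guided_json` as an extra
    sampling parameter and restricts decoding to the given schema.
    """
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "guided_json": schema,
    }

payload = build_guided_request("Extract the invoice fields.", invoice_schema)
print(json.dumps(payload["guided_json"]["required"]))
```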

RAG & Vector Indexing

Built-in DocumentReader, VectorStoreIndex, and QueryEngine. Plug in ChromaDB, pgvector, or any custom vector store.
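A custom vector store only needs the basic add-and-query shape. An illustrative stdlib-only stand-in for that contract (real deployments would use ChromaDB or pgvector, and vectors from an embedding model rather than the toy vectors here):

```python
import math

class InMemoryVectorStore:
    """Toy vector store: cosine-similarity search over in-memory vectors."""

    def __init__(self):
        self._docs: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._docs.append((vector, text))

    def query(self, vector: list[float], top_k: int = 1) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        # Rank all stored documents by similarity to the query vector.
        ranked = sorted(self._docs, key=lambda d: cosine(vector, d[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

store = InMemoryVectorStore()
store.add([1.0, 0.0], "invoices from Acme")
store.add([0.0, 1.0], "shipping manifests")
print(store.query([0.9, 0.1]))  # ['invoices from Acme']
```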

High-Throughput Batch

Async-first pipeline with concurrent batch processing. Run hundreds of documents in parallel with configurable concurrency limits.
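The configurable concurrency limit follows the standard asyncio semaphore pattern. A self-contained sketch, where `process` is a stand-in for one pipeline run over a single document and the semaphore bound plays the role of the limit:

```python
import asyncio

async def process(doc: str) -> str:
    # Stand-in for one pipeline run over a single document.
    await asyncio.sleep(0)
    return doc.upper()

async def run_batch(docs: list[str], max_concurrency: int = 8) -> list[str]:
    """Run documents in parallel, never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(doc: str) -> str:
        async with sem:
            return await process(doc)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(run_batch(["a.pdf", "b.pdf", "c.pdf"]))
print(results)  # ['A.PDF', 'B.PDF', 'C.PDF']
```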

Validated Output

Automatic schema validation on every extraction. LLM retry with refined prompts on failure. Zero invalid JSON reaching your application.
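The validate-and-retry loop can be sketched generically. Here the model call and the prompt-refinement strategy are hypothetical stand-ins, with a plain function in place of Structure-D's real validators:

```python
import json

def validate_invoice(raw: str) -> dict:
    """Parse and check required fields; raise ValueError on any problem."""
    data = json.loads(raw)
    for field in ("vendor", "total"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    return data

def extract_with_retry(call_llm, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate, and retry with a refined prompt on failure."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return validate_invoice(call_llm(prompt))
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
            # Feed the validation error back into the prompt (hypothetical strategy).
            prompt = f"{prompt}\nPrevious output was invalid ({exc}). Return valid JSON."
    raise RuntimeError(f"extraction failed after retries: {last_error}")

# Fake model: fails once, then returns valid JSON.
attempts = []
def fake_llm(prompt: str) -> str:
    attempts.append(prompt)
    if len(attempts) == 1:
        return "not json"
    return '{"vendor": "Acme Corp", "total": 1240.0}'

result = extract_with_retry(fake_llm, "Extract the invoice.")
print(result["vendor"])  # Acme Corp
```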

Quick Start

From file to
structured JSON
in 5 lines

Define a Pydantic model as your extraction schema, pass any file, get back validated results — automatically retried if the model output is invalid.

Auto-detects file format (PDF, image, HTML, ...)
Routes to the optimal model for your task
Validates output against your schema, retries on failure
Read the guide →
extract_invoice.py
import asyncio
from pathlib import Path
from pydantic import BaseModel
from structure_d.pipeline import Pipeline

class Invoice(BaseModel):
    vendor: str
    total: float
    line_items: list[str]

async def main():
    pipeline = Pipeline(schema_cls=Invoice)
    results = await pipeline.run(Path("invoice.pdf"))
    print(results[0].structured_output)
    # {"vendor": "Acme Corp", "total": 1240.0, "line_items": [...]}

asyncio.run(main())

Architecture

A modular six-stage pipeline

Each stage is independently configurable and replaceable — override any component with your own implementation.

Ingest → Preprocess → Route → Infer → Validate → Store
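One way such a replaceable-stage contract might look, sketched with a `typing.Protocol` (a hypothetical interface for illustration; Structure-D's actual stage classes may differ):

```python
from typing import Protocol

class Stage(Protocol):
    """Contract every pipeline stage satisfies: payload in, payload out."""
    def __call__(self, payload: dict) -> dict: ...

def lowercase_preprocess(payload: dict) -> dict:
    # Example custom Preprocess stage: normalize text before routing.
    return {**payload, "text": payload["text"].lower()}

def run_pipeline(stages: list[Stage], payload: dict) -> dict:
    """Thread the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload

out = run_pipeline([lowercase_preprocess], {"text": "HELLO World"})
print(out["text"])  # hello world
```

Because each stage is just a callable with the same shape, swapping in your own implementation is a matter of passing a different function or object at construction time.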

Supported input formats

PDF · PNG/JPG · HTML · DOCX · XLSX · PPTX · Email · Audio · CSV · Markdown · Plain Text

Open Source · MIT License

Start extracting structured data
in minutes

No API key required for local models. Deploy as a FastAPI service or use the Rust CLI.