v0.2.0 · Stable

Extract structured data
from any document

Built for vLLM. Structure-D ingests PDFs, DOCX, HTML, images, and 11 other formats and uses guided_json constrained decoding to guarantee schema-valid output at scale. Drop-in support for OpenAI, Anthropic, Gemini, and Ollama when a GPU is not available.

$ pip install "structure-d[ingestion,api,llm]"

or install the Rust CLI for native batch extraction

14+ File Formats
5 LLM Providers
110 Tests Passing
<1ms Validation overhead


Docs, API & Changelog

One place for documentation, AI reference, and release history.

Docs

Complete documentation

Installation guides, configuration reference, and step-by-step tutorials for every feature — from basic extraction to advanced RAG pipelines.

  • Getting Started
  • Installation & Setup
  • Configuration
  • Basic Usage
  • Advanced Guides
Read the docs →
API

API reference

Full API reference for the Pipeline, all LLM providers, schemas, validators, and storage writers. Typed signatures and live examples.

  • Core Pipeline
  • vLLM + guided_json (primary)
  • OpenAI / Anthropic / Gemini / Ollama
  • Model Registry & Router
  • Schemas & Validators
Explore the API →
Changelog

Release history

Every release documented. See what's changed, what's fixed, and what's on the roadmap for v0.3 and beyond.

  • v0.2.0 — RAG layer + Rust CLI
  • v0.1.0 — Core pipeline launch
  • Roadmap: ETL scheduling
  • Roadmap: HITL review
  • MIT Licensed
View changelog →

Features

Everything you need to
structure unstructured data

Multi-Format Ingestion

Parse PDFs, images, HTML, DOCX, XLSX, PPTX, emails, audio transcripts, and plain text through a unified async interface.
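Behind a unified interface, format detection typically keys off the file extension before dispatching to a parser. A simplified stand-in sketch (the mapping and function name here are illustrative, not Structure-D's actual detector, which would also sniff file contents):

```python
from pathlib import Path

# Simplified extension map; a real detector also sniffs file contents.
EXTENSION_MAP = {
    ".pdf": "pdf", ".png": "image", ".jpg": "image", ".html": "html",
    ".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx", ".eml": "email",
    ".csv": "csv", ".md": "markdown", ".txt": "text",
}

def detect_format(path: Path) -> str:
    """Map a file path to a parser key, case-insensitively."""
    fmt = EXTENSION_MAP.get(path.suffix.lower())
    if fmt is None:
        raise ValueError(f"unsupported format: {path.suffix}")
    return fmt

print(detect_format(Path("invoice.PDF")))  # pdf
```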

Schema-Driven Extraction

Define any Pydantic model as your extraction target. Built-in schemas for key-value, table, entity, form, classification, and summary.

vLLM-First Inference

Built for vLLM — guided_json constrained decoding guarantees schema-valid output at high throughput. OpenAI, Anthropic, Gemini, and Ollama available as drop-in alternatives.
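Concretely, guided_json works by attaching a JSON Schema to the request sent to vLLM's OpenAI-compatible server, which then constrains token sampling so only schema-valid output can be produced. A minimal sketch of such a payload (the model name is a placeholder and the schema is hand-written to mirror the Invoice model in the Quick Start):

```python
import json

# Hand-written JSON Schema mirroring the Invoice model from the Quick Start.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total", "line_items"],
}

def build_guided_request(prompt: str, schema: dict) -> dict:
    """Build a chat-completion payload using vLLM's guided_json extension.

    vLLM's OpenAI-compatible server accepts `guided_json` as an extra
    sampling parameter and restricts decoding to the given schema.
    """
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "guided_json": schema,
    }

payload = build_guided_request("Extract the invoice fields.", invoice_schema)
print(json.dumps(payload["guided_json"]["required"]))
```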

RAG & Vector Indexing

Built-in DocumentReader, VectorStoreIndex, and QueryEngine. Plug in ChromaDB, pgvector, or any custom vector store.
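A custom vector store only needs the basic add-and-query shape. An illustrative stdlib-only stand-in for that contract (real deployments would use ChromaDB or pgvector, and vectors from an embedding model rather than the toy vectors here):

```python
import math

class InMemoryVectorStore:
    """Toy vector store: cosine-similarity search over in-memory vectors."""

    def __init__(self):
        self._docs: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._docs.append((vector, text))

    def query(self, vector: list[float], top_k: int = 1) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        # Rank all stored documents by similarity to the query vector.
        ranked = sorted(self._docs, key=lambda d: cosine(vector, d[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

store = InMemoryVectorStore()
store.add([1.0, 0.0], "invoices from Acme")
store.add([0.0, 1.0], "shipping manifests")
print(store.query([0.9, 0.1]))  # ['invoices from Acme']
```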

High-Throughput Batch

Async-first pipeline with concurrent batch processing. Run hundreds of documents in parallel with configurable concurrency limits.
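The configurable concurrency limit follows the standard asyncio semaphore pattern. A self-contained sketch, where `process` is a stand-in for one pipeline run over a single document and the semaphore bound plays the role of the limit:

```python
import asyncio

async def process(doc: str) -> str:
    # Stand-in for one pipeline run over a single document.
    await asyncio.sleep(0)
    return doc.upper()

async def run_batch(docs: list[str], max_concurrency: int = 8) -> list[str]:
    """Run documents in parallel, never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(doc: str) -> str:
        async with sem:
            return await process(doc)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(run_batch(["a.pdf", "b.pdf", "c.pdf"]))
print(results)  # ['A.PDF', 'B.PDF', 'C.PDF']
```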

Validated Output

Automatic schema validation on every extraction. LLM retry with refined prompts on failure. Zero invalid JSON reaching your application.
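The validate-and-retry loop can be sketched generically. Here the model call and the prompt-refinement strategy are hypothetical stand-ins, with a plain function in place of Structure-D's real validators:

```python
import json

def validate_invoice(raw: str) -> dict:
    """Parse and check required fields; raise ValueError on any problem."""
    data = json.loads(raw)
    for field in ("vendor", "total"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    return data

def extract_with_retry(call_llm, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate, and retry with a refined prompt on failure."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return validate_invoice(call_llm(prompt))
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
            # Feed the validation error back into the prompt (hypothetical strategy).
            prompt = f"{prompt}\nPrevious output was invalid ({exc}). Return valid JSON."
    raise RuntimeError(f"extraction failed after retries: {last_error}")

# Fake model: fails once, then returns valid JSON.
attempts = []
def fake_llm(prompt: str) -> str:
    attempts.append(prompt)
    if len(attempts) == 1:
        return "not json"
    return '{"vendor": "Acme Corp", "total": 1240.0}'

result = extract_with_retry(fake_llm, "Extract the invoice.")
print(result["vendor"])  # Acme Corp
```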

Quick Start

From file to
structured JSON
in 5 lines

Define a Pydantic model as your extraction schema, pass any file, get back validated results — automatically retried if the model output is invalid.

Auto-detects file format (PDF, image, HTML, ...)
Routes to the optimal model for your task
Validates output against your schema, retries on failure
Read the guide →
extract_invoice.py
import asyncio
from pathlib import Path
from pydantic import BaseModel
from structure_d.pipeline import Pipeline

class Invoice(BaseModel):
    vendor: str
    total: float
    line_items: list[str]

async def main():
    pipeline = Pipeline(schema_cls=Invoice)
    results = await pipeline.run(Path("invoice.pdf"))
    print(results[0].structured_output)
    # {"vendor": "Acme Corp", "total": 1240.0, "line_items": [...]}

asyncio.run(main())

Architecture

A modular six-stage pipeline

Each stage is independently configurable and replaceable — override any component with your own implementation.

Ingest → Preprocess → Route → Infer → Validate → Store
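One way such a replaceable-stage contract might look, sketched with a `typing.Protocol` (a hypothetical interface for illustration; Structure-D's actual stage classes may differ):

```python
from typing import Protocol

class Stage(Protocol):
    """Contract every pipeline stage satisfies: payload in, payload out."""
    def __call__(self, payload: dict) -> dict: ...

def lowercase_preprocess(payload: dict) -> dict:
    # Example custom Preprocess stage: normalize text before routing.
    return {**payload, "text": payload["text"].lower()}

def run_pipeline(stages: list[Stage], payload: dict) -> dict:
    """Thread the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload

out = run_pipeline([lowercase_preprocess], {"text": "HELLO World"})
print(out["text"])  # hello world
```

Because each stage is just a callable with the same shape, swapping in your own implementation is a matter of passing a different function or object at construction time.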

Supported input formats

PDF · PNG/JPG · HTML · DOCX · XLSX · PPTX · Email · Audio · CSV · Markdown · Plain Text

Open Source · MIT License

Start extracting structured data
in minutes

No API key required for local models. Deploy as a FastAPI service or use the Rust CLI.