# Configuration
Configure inference providers, parsers, chunking strategies, validation retry, and storage destinations.
## Config file
Structure-D loads configuration from a YAML file. By default it reads
configs/default.yaml relative to the working directory.

Pass a custom path when creating the pipeline:

```python
pipeline = Pipeline(
    schema_cls=MySchema,
    config_path="configs/production.yaml",
)
```

Or load the settings object directly:

```python
from structure_d.config import load_settings

settings = load_settings("configs/production.yaml")
```

## Environment variables
Every setting can be overridden via environment variables using the SD_ prefix
and __ as a nested delimiter.

| Env var | YAML path | Example |
|---|---|---|
| SD_LOG_LEVEL | log_level | DEBUG |
| SD_INFERENCE__PROVIDER__PROVIDER | inference.provider.provider | openai |
| SD_INFERENCE__PROVIDER__VLLM__API_BASE | inference.provider.vllm.api_base | http://localhost:8000 |
| SD_VALIDATION__MAX_RETRIES | validation.max_retries | 5 |
| SD_STORAGE__DEFAULT_FORMAT | storage.default_format | csv, markdown |
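The prefix-and-delimiter mapping can be pictured with a small sketch. Note this is an illustration of the naming convention, not Structure-D's actual settings loader, and the function name is hypothetical:

```python
def env_overrides(environ, prefix="SD_", delimiter="__"):
    """Map SD_-prefixed env vars onto a nested config dict (illustrative)."""
    overrides = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        # SD_VALIDATION__MAX_RETRIES -> ["validation", "max_retries"]
        path = key[len(prefix):].lower().split(delimiter)
        node = overrides
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return overrides

env = {"SD_VALIDATION__MAX_RETRIES": "5", "SD_LOG_LEVEL": "DEBUG"}
print(env_overrides(env))
# {'validation': {'max_retries': '5'}, 'log_level': 'DEBUG'}
```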
## Inference providers
Structure-D is built for vLLM
as its primary and recommended inference engine. vLLM's guided_json constrained decoding prevents the model
from emitting any output that does not match your schema, eliminating validation failures
at the source. Its PagedAttention architecture and continuous batching allow hundreds of concurrent extractions on a
single GPU.
All other providers (OpenAI, Anthropic, Gemini, Ollama) are fully supported drop-in alternatives —
useful for prototyping, cloud-only environments, or when a frontier model's reasoning is more important than throughput.
They share the same BaseLLMProvider interface and the same validation and retry logic.
| Provider | Config key | Structured output method | Best for |
|---|---|---|---|
| vLLM (default) | vllm | guided_json constrained decoding | Production, high-throughput, self-hosted GPU |
| OpenAI | openai | response_format: json_schema | Cloud, GPT-4o, Azure OpenAI |
| Anthropic | anthropic | Tool-use structured output | Cloud, Claude 3.5 |
| Gemini | gemini | Structured generation | Cloud, Google ecosystem |
| Ollama | ollama | JSON mode | Local dev, no GPU server required |
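Conceptually, the config key selects which provider class is constructed and which sub-section of the config it reads. The sketch below illustrates that dispatch pattern; the class names and constructor signatures are hypothetical stand-ins, since the real providers live behind the BaseLLMProvider interface:

```python
# Hypothetical provider stubs -- not Structure-D's actual classes.
class VLLMProvider:
    def __init__(self, api_base="http://localhost:8000", api_key=None, model=None):
        self.api_base, self.api_key, self.model = api_base, api_key, model

class OllamaProvider:
    def __init__(self, base_url="http://localhost:11434", model="llama3.1:8b"):
        self.base_url, self.model = base_url, model

PROVIDERS = {"vllm": VLLMProvider, "ollama": OllamaProvider}

def make_provider(provider_cfg):
    """Build a provider from the 'provider' key and its matching sub-section."""
    key = provider_cfg["provider"]
    if key not in PROVIDERS:
        raise ValueError(f"unknown provider: {key!r}")
    return PROVIDERS[key](**provider_cfg.get(key, {}))

p = make_provider({"provider": "vllm", "vllm": {"api_base": "http://gpu01:8000"}})
```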
```yaml
inference:
  provider:
    provider: "vllm"                  # PRIMARY — vllm | openai | anthropic | gemini | ollama
    fallback_provider: "anthropic"    # fallback on InferenceError (optional)
    vllm:
      api_base: "http://localhost:8000"
      api_key: null
      model: null                     # uses model routing if null
    openai:
      api_key: null                   # reads OPENAI_API_KEY from env
      model: "gpt-4o"
      base_url: null                  # override for Azure OpenAI
    anthropic:
      api_key: null                   # reads ANTHROPIC_API_KEY
      model: "claude-3-5-sonnet-20241022"
    gemini:
      api_key: null                   # reads GOOGLE_API_KEY
      model: "gemini-1.5-pro"
    ollama:
      base_url: "http://localhost:11434"
      model: "llama3.1:8b"
```

### Automatic fallback
When fallback_provider is set, Structure-D wraps both providers in a FallbackProvider that silently retries on InferenceError. No code changes required.
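A minimal sketch of what such a wrapper can look like, assuming a generate method on each provider (the method name is an assumption; the real interface is BaseLLMProvider):

```python
class InferenceError(Exception):
    """Raised when a provider fails to produce a completion."""

class FallbackProvider:
    """Try the primary provider; on InferenceError, retry with the fallback."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def generate(self, prompt, schema):
        try:
            return self.primary.generate(prompt, schema)
        except InferenceError:
            # Only inference failures trigger the fallback; other
            # exceptions still propagate to the caller.
            return self.fallback.generate(prompt, schema)
```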
## Ingestion & parsing

```yaml
ingestion:
  default_parser: "auto"      # auto | pymupdf | pdfplumber | unstructured | docling
  ocr_engine: "tesseract"     # tesseract | easyocr
  max_concurrent: 4           # parallel ingest workers
  timeout_seconds: 30
  connectors:
    type: "local"             # local | s3 | gcs | azure | sftp
    # S3 example:
    # type: "s3"
    # bucket: "my-bucket"
    # prefix: "documents/"
    # region: "us-east-1"
```

## Preprocessing
```yaml
preprocessing:
  normalize:
    normalize_unicode: true
    strip_boilerplate: true   # remove page numbers, headers/footers
    collapse_whitespace: true
  chunking:
    strategy: "semantic"      # fixed | sentence | heading | semantic
    max_tokens: 1024
    overlap_tokens: 128
    heading_level: 2          # for "heading" strategy: H1, H2, etc.
```

| Strategy | Description | Best for |
|---|---|---|
| fixed | Split every N tokens with overlap | Simple documents, maximum control |
| sentence | Split on sentence boundaries | Prose, articles |
| heading | Split on Markdown/document headings | Structured reports, documentation |
| semantic | Heading-aware sentence chunking (default) | Mixed documents, best overall quality |
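For intuition, the fixed strategy amounts to sliding a token window with overlap. A simplified sketch, assuming tokens are already a flat sequence (Structure-D's real chunker works on tokenizer output):

```python
def fixed_chunks(tokens, max_tokens=1024, overlap_tokens=128):
    """Split a token sequence into overlapping windows of max_tokens."""
    step = max_tokens - overlap_tokens
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

chunks = fixed_chunks(list(range(10)), max_tokens=4, overlap_tokens=1)
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each chunk starts overlap_tokens before the previous one ends, so no span of text is lost at a chunk boundary.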
## Validation & retry
```yaml
validation:
  max_retries: 3        # LLM retries on schema validation failure
  strict_mode: false    # if true, raises on first validation error
```

On each failed validation attempt, Structure-D sends the model a refined prompt that includes the original response, the validation errors, and the schema definition, which significantly improves recovery rates on subsequent attempts.
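The retry loop behaves roughly like the sketch below. The function and exception names are illustrative, and the exact prompt wording is an assumption:

```python
class SchemaValidationError(Exception):
    """Stand-in for a schema validation failure."""

def extract_with_retry(generate, validate, prompt, max_retries=3, strict_mode=False):
    """generate: prompt -> raw text; validate: raw text -> parsed result,
    raising SchemaValidationError on mismatch. Mirrors validation.max_retries
    and validation.strict_mode."""
    last_error = None
    for _ in range(max_retries + 1):
        raw = generate(prompt)
        try:
            return validate(raw)
        except SchemaValidationError as exc:
            if strict_mode:
                raise           # strict_mode: fail on the first bad response
            last_error = exc
            # Refined prompt: prior response plus the validation errors.
            prompt = f"Your response:\n{raw}\nfailed with:\n{exc}\nReturn valid JSON."
    raise last_error
```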
## Storage & output formats
```yaml
storage:
  default_format: "jsonl"     # jsonl | csv | markdown | database
  output_directory: "./output"
  jsonl:
    indent: 2                 # null = compact single-line, 2 = pretty-printed
  csv:
    delimiter: ","
    quoting: "minimal"
  markdown:
    enabled: true             # human-readable .md output — one ## section per result
  database:
    connection_string: null   # e.g. "postgresql+asyncpg://user:pass@host/db"
    table_prefix: "sd_"
```

| Format | File extension | Best for |
|---|---|---|
|---|---|---|
jsonl | .jsonl | Pipelines, log ingestion, re-processing. Pretty-printed by default (indent: 2). |
csv | .csv | Spreadsheets, BI tools. structured_output fields flattened to dot-notation columns. |
markdown | .md | Human review, docs-as-data, Git diffs. Each result rendered as ## Result N with a metadata table and fenced JSON block. |
database | — | PostgreSQL via asyncpg. Requires pip install "structure-d[storage]". |
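The CSV writer's dot-notation flattening can be pictured with a short sketch. This is a simplified illustration; the real writer's handling of lists and other edge cases may differ:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dot-notation keys, e.g. invoice.total."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat

print(flatten({"invoice": {"total": 42.0, "currency": "EUR"}, "page": 1}))
# {'invoice.total': 42.0, 'invoice.currency': 'EUR', 'page': 1}
```

Flattened keys become CSV column headers, so nested structured_output fields stay addressable in spreadsheets and BI tools.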
## Full reference
The complete default configuration with all available settings lives at
configs/default.yaml in the repository.