# Configuration

Configure inference providers, parsers, chunking strategies, validation retry, and storage destinations.

## Config file

Structure-D loads configuration from a YAML file. By default it reads `configs/default.yaml` relative to the working directory.

Pass a custom path when creating the pipeline:

```python
pipeline = Pipeline(
    schema_cls=MySchema,
    config_path="configs/production.yaml",
)
```

Or load the settings object directly:

```python
from structure_d.config import load_settings

settings = load_settings("configs/production.yaml")
```

## Environment variables

Every setting can be overridden via environment variables using the `SD_` prefix and `__` as the nested-key delimiter.

| Env var | YAML path | Example |
|---|---|---|
| `SD_LOG_LEVEL` | `log_level` | `DEBUG` |
| `SD_INFERENCE__PROVIDER__PROVIDER` | `inference.provider.provider` | `openai` |
| `SD_INFERENCE__PROVIDER__VLLM__API_BASE` | `inference.provider.vllm.api_base` | `http://localhost:8000` |
| `SD_VALIDATION__MAX_RETRIES` | `validation.max_retries` | `5` |
| `SD_STORAGE__DEFAULT_FORMAT` | `storage.default_format` | `csv`, `markdown` |
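The mapping is mechanical: strip the `SD_` prefix, split on `__`, and lower-case the pieces. A minimal sketch of that translation (the helper name is illustrative, not part of Structure-D's API):

```python
def env_to_path(name: str, prefix: str = "SD_", delimiter: str = "__") -> str:
    """Translate an SD_-prefixed env var name into its dotted YAML path."""
    if not name.startswith(prefix):
        raise ValueError(f"expected prefix {prefix!r} on {name!r}")
    parts = name[len(prefix):].split(delimiter)
    return ".".join(part.lower() for part in parts)

# Matches the table above:
env_to_path("SD_INFERENCE__PROVIDER__VLLM__API_BASE")
# → "inference.provider.vllm.api_base"
```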

## Inference providers

Structure-D is built for vLLM as its primary and recommended inference engine. vLLM's `guided_json` constrained decoding guarantees that the model cannot return output that fails to match your schema, eliminating validation failures at the source. Its PagedAttention architecture and continuous batching support hundreds of concurrent extractions on a single GPU.

All other providers (OpenAI, Anthropic, Gemini, Ollama) are fully supported drop-in alternatives — useful for prototyping, cloud-only environments, or when a frontier model's reasoning is more important than throughput. They share the same `BaseLLMProvider` interface and the same validation and retry logic.

| Provider | Config key | Structured output method | Best for |
|---|---|---|---|
| vLLM (default) | `vllm` | `guided_json` constrained decoding | Production, high-throughput, self-hosted GPU |
| OpenAI | `openai` | `response_format: json_schema` | Cloud, GPT-4o, Azure OpenAI |
| Anthropic | `anthropic` | Tool-use structured output | Cloud, Claude 3.5 |
| Gemini | `gemini` | Structured generation | Cloud, Google ecosystem |
| Ollama | `ollama` | JSON mode | Local dev, no GPU server required |
`configs/default.yaml`:

```yaml
inference:
  provider:
    provider: "vllm"                  # PRIMARY — vllm | openai | anthropic | gemini | ollama
    fallback_provider: "anthropic"    # fallback on InferenceError (optional)

    vllm:
      api_base: "http://localhost:8000"
      api_key: null
      model: null                     # uses model routing if null

    openai:
      api_key: null                   # reads OPENAI_API_KEY from env
      model: "gpt-4o"
      base_url: null                  # override for Azure OpenAI

    anthropic:
      api_key: null                   # reads ANTHROPIC_API_KEY
      model: "claude-3-5-sonnet-20241022"

    gemini:
      api_key: null                   # reads GOOGLE_API_KEY
      model: "gemini-1.5-pro"

    ollama:
      base_url: "http://localhost:11434"
      model: "llama3.1:8b"
```

### Automatic fallback

When `fallback_provider` is set, Structure-D wraps both providers in a `FallbackProvider` that catches `InferenceError` from the primary and transparently retries the request with the fallback. No code changes are required.
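The behaviour can be pictured roughly like this (a sketch only; `FallbackProvider`'s actual internals may differ, and the provider interface is simplified here to a single `generate` method):

```python
class InferenceError(Exception):
    """Stand-in for structure_d's inference error (name taken from the docs)."""


class FallbackProvider:
    """Illustrative sketch of the fallback wrapper described above."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def generate(self, prompt, schema):
        # Try the primary provider first; on InferenceError, retry with the fallback.
        try:
            return self.primary.generate(prompt, schema)
        except InferenceError:
            return self.fallback.generate(prompt, schema)
```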

## Ingestion & parsing

```yaml
ingestion:
  default_parser: "auto"      # auto | pymupdf | pdfplumber | unstructured | docling
  ocr_engine: "tesseract"     # tesseract | easyocr
  max_concurrent: 4           # parallel ingest workers
  timeout_seconds: 30

  connectors:
    type: "local"             # local | s3 | gcs | azure | sftp
    # S3 example:
    # type: "s3"
    # bucket: "my-bucket"
    # prefix: "documents/"
    # region: "us-east-1"
```

## Preprocessing

```yaml
preprocessing:
  normalize:
    normalize_unicode: true
    strip_boilerplate: true    # remove page numbers, headers/footers
    collapse_whitespace: true

  chunking:
    strategy: "semantic"       # fixed | sentence | heading | semantic
    max_tokens: 1024
    overlap_tokens: 128
    heading_level: 2           # for "heading" strategy: H1, H2, etc.
```

| Strategy | Description | Best for |
|---|---|---|
| `fixed` | Split every N tokens with overlap | Simple documents, maximum control |
| `sentence` | Split on sentence boundaries | Prose, articles |
| `heading` | Split on Markdown/document headings | Structured reports, documentation |
| `semantic` | Heading-aware sentence chunking (default) | Mixed documents, best overall quality |
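For intuition, the `fixed` strategy amounts to a sliding window where each chunk starts `max_tokens - overlap_tokens` after the previous one. A minimal sketch (illustrative code, not the library's implementation):

```python
def fixed_chunks(tokens, max_tokens=1024, overlap_tokens=128):
    """Sketch of the "fixed" chunking strategy: windows of up to max_tokens
    tokens, each overlapping the previous window by overlap_tokens."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]
```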

## Validation & retry

```yaml
validation:
  max_retries: 3          # LLM retries on schema validation failure
  strict_mode: false      # if true, raises on first validation error
```

On each failed validation attempt, Structure-D sends the model a refined prompt that includes the original response, the validation errors, and the schema definition — significantly improving recovery rates on subsequent attempts.
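The retry loop can be sketched as follows; the function and parameter names are illustrative, not Structure-D's API:

```python
def extract_with_retry(call_llm, validate, prompt, max_retries=3):
    """Sketch of validation retry: on failure, re-prompt with the previous
    response, the validation errors, and an instruction to correct them."""
    attempt_prompt = prompt
    last_errors = None
    for _ in range(max_retries + 1):
        response = call_llm(attempt_prompt)
        ok, errors = validate(response)
        if ok:
            return response
        last_errors = errors
        # Refined prompt: original task + failed output + validation errors.
        attempt_prompt = (
            f"{prompt}\n\nYour previous answer:\n{response}\n\n"
            f"It failed validation with:\n{errors}\n"
            "Return corrected JSON that satisfies the schema."
        )
    raise ValueError(f"validation failed after retries: {last_errors}")
```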

## Storage & output formats

```yaml
storage:
  default_format: "jsonl"     # jsonl | csv | markdown | database
  output_directory: "./output"

  jsonl:
    indent: 2                 # null = compact single-line, 2 = pretty-printed

  csv:
    delimiter: ","
    quoting: "minimal"

  markdown:
    enabled: true             # human-readable .md output — one ## section per result

  database:
    connection_string: null   # e.g. "postgresql+asyncpg://user:pass@host/db"
    table_prefix: "sd_"
```

| Format | File extension | Best for |
|---|---|---|
| `jsonl` | `.jsonl` | Pipelines, log ingestion, re-processing. Pretty-printed by default (`indent: 2`). |
| `csv` | `.csv` | Spreadsheets, BI tools. `structured_output` fields flattened to dot-notation columns. |
| `markdown` | `.md` | Human review, docs-as-data, Git diffs. Each result rendered as `## Result N` with a metadata table and a fenced JSON block. |
| `database` | (none) | PostgreSQL via asyncpg. Requires `pip install "structure-d[storage]"`. |
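The dot-notation flattening used for CSV columns can be sketched like this (an illustrative helper, not the library's code):

```python
def flatten(record, parent=""):
    """Flatten nested dict fields into dot-notation keys, e.g.
    {"vendor": {"name": "ACME"}} -> {"vendor.name": "ACME"}."""
    flat = {}
    for key, value in record.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```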

## Full reference

The complete default configuration, with all available settings, lives at `configs/default.yaml` in the repository.