# Configuration

Configure inference providers, parsers, chunking strategies, validation retry, and storage destinations.

## Config file

Structure-D loads configuration from a YAML file. By default it reads `configs/default.yaml` relative to the working directory.

Pass a custom path when creating the pipeline:

```python
pipeline = Pipeline(
    schema_cls=MySchema,
    config_path="configs/production.yaml",
)
```

Or load the settings object directly:

```python
from structure_d.config import load_settings

settings = load_settings("configs/production.yaml")
```

## Environment variables

Every setting can be overridden via environment variables using the `SD_` prefix and `__` as the nested-key delimiter.

| Env var | YAML path | Example |
|---|---|---|
| `SD_LOG_LEVEL` | `log_level` | `DEBUG` |
| `SD_INFERENCE__PROVIDER__PROVIDER` | `inference.provider.provider` | `openai` |
| `SD_INFERENCE__PROVIDER__VLLM__API_BASE` | `inference.provider.vllm.api_base` | `http://localhost:8000` |
| `SD_VALIDATION__MAX_RETRIES` | `validation.max_retries` | `5` |
| `SD_STORAGE__DEFAULT_FORMAT` | `storage.default_format` | `csv`, `markdown` |
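The mapping is mechanical: strip the `SD_` prefix, split on `__`, and lower-case the pieces. A minimal sketch of that translation (the helper name is illustrative, not part of Structure-D's API):

```python
def env_to_path(name: str, prefix: str = "SD_", delimiter: str = "__") -> str:
    """Translate an SD_-prefixed env var name into its dotted YAML path."""
    if not name.startswith(prefix):
        raise ValueError(f"expected prefix {prefix!r} on {name!r}")
    parts = name[len(prefix):].split(delimiter)
    return ".".join(part.lower() for part in parts)

# Matches the table above:
env_to_path("SD_INFERENCE__PROVIDER__VLLM__API_BASE")
# → "inference.provider.vllm.api_base"
```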

## Inference providers

Structure-D is built for vLLM as its primary and recommended inference engine. vLLM's `guided_json` constrained decoding guarantees that the model cannot return output that fails to match your schema, eliminating validation failures at the source. Its PagedAttention architecture and continuous batching support hundreds of concurrent extractions on a single GPU.

All other providers (OpenAI, Anthropic, Gemini, Ollama) are fully supported drop-in alternatives — useful for prototyping, cloud-only environments, or when a frontier model's reasoning is more important than throughput. They share the same `BaseLLMProvider` interface and the same validation and retry logic.

| Provider | Config key | Structured output method | Best for |
|---|---|---|---|
| vLLM (default) | `vllm` | `guided_json` constrained decoding | Production, high-throughput, self-hosted GPU |
| OpenAI | `openai` | `response_format: json_schema` | Cloud, GPT-4o, Azure OpenAI |
| Anthropic | `anthropic` | Tool-use structured output | Cloud, Claude 3.5 |
| Gemini | `gemini` | Structured generation | Cloud, Google ecosystem |
| Ollama | `ollama` | JSON mode | Local dev, no GPU server required |
`configs/default.yaml`:

```yaml
inference:
  provider:
    provider: "vllm"                  # PRIMARY — vllm | openai | anthropic | gemini | ollama
    fallback_provider: "anthropic"    # fallback on InferenceError (optional)

    vllm:
      api_base: "http://localhost:8000"
      api_key: null
      model: null                     # uses model routing if null

    openai:
      api_key: null                   # reads OPENAI_API_KEY from env
      model: "gpt-4o"
      base_url: null                  # override for Azure OpenAI

    anthropic:
      api_key: null                   # reads ANTHROPIC_API_KEY
      model: "claude-3-5-sonnet-20241022"

    gemini:
      api_key: null                   # reads GOOGLE_API_KEY
      model: "gemini-1.5-pro"

    ollama:
      base_url: "http://localhost:11434"
      model: "llama3.1:8b"
```

### Automatic fallback

When `fallback_provider` is set, Structure-D wraps both providers in a `FallbackProvider` that catches `InferenceError` from the primary and transparently retries the request with the fallback. No code changes are required.
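The behaviour can be pictured roughly like this (a sketch only; `FallbackProvider`'s actual internals may differ, and the provider interface is simplified here to a single `generate` method):

```python
class InferenceError(Exception):
    """Stand-in for structure_d's inference error (name taken from the docs)."""


class FallbackProvider:
    """Illustrative sketch of the fallback wrapper described above."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def generate(self, prompt, schema):
        # Try the primary provider first; on InferenceError, retry with the fallback.
        try:
            return self.primary.generate(prompt, schema)
        except InferenceError:
            return self.fallback.generate(prompt, schema)
```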

## Ingestion & parsing

```yaml
ingestion:
  default_parser: "auto"      # auto | pymupdf | pdfplumber | unstructured | docling
  ocr_engine: "tesseract"     # tesseract | easyocr
  max_concurrent: 4           # parallel ingest workers
  timeout_seconds: 30

  connectors:
    type: "local"             # local | s3 | gcs | azure | sftp
    # S3 example:
    # type: "s3"
    # bucket: "my-bucket"
    # prefix: "documents/"
    # region: "us-east-1"
```

## Preprocessing

```yaml
preprocessing:
  normalize:
    normalize_unicode: true
    strip_boilerplate: true    # remove page numbers, headers/footers
    collapse_whitespace: true

  chunking:
    strategy: "semantic"       # fixed | sentence | heading | semantic
    max_tokens: 1024
    overlap_tokens: 128
    heading_level: 2           # for "heading" strategy: H1, H2, etc.
```

| Strategy | Description | Best for |
|---|---|---|
| `fixed` | Split every N tokens with overlap | Simple documents, maximum control |
| `sentence` | Split on sentence boundaries | Prose, articles |
| `heading` | Split on Markdown/document headings | Structured reports, documentation |
| `semantic` | Heading-aware sentence chunking (default) | Mixed documents, best overall quality |
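For intuition, the `fixed` strategy amounts to a sliding window where each chunk starts `max_tokens - overlap_tokens` after the previous one. A minimal sketch (illustrative code, not the library's implementation):

```python
def fixed_chunks(tokens, max_tokens=1024, overlap_tokens=128):
    """Sketch of the "fixed" chunking strategy: windows of up to max_tokens
    tokens, each overlapping the previous window by overlap_tokens."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]
```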

## Validation & retry

```yaml
validation:
  max_retries: 3          # LLM retries on schema validation failure
  strict_mode: false      # if true, raises on first validation error
```

On each failed validation attempt, Structure-D sends the model a refined prompt that includes the original response, the validation errors, and the schema definition — significantly improving recovery rates on subsequent attempts.
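The retry loop can be sketched as follows; the function and parameter names are illustrative, not Structure-D's API:

```python
def extract_with_retry(call_llm, validate, prompt, max_retries=3):
    """Sketch of validation retry: on failure, re-prompt with the previous
    response, the validation errors, and an instruction to correct them."""
    attempt_prompt = prompt
    last_errors = None
    for _ in range(max_retries + 1):
        response = call_llm(attempt_prompt)
        ok, errors = validate(response)
        if ok:
            return response
        last_errors = errors
        # Refined prompt: original task + failed output + validation errors.
        attempt_prompt = (
            f"{prompt}\n\nYour previous answer:\n{response}\n\n"
            f"It failed validation with:\n{errors}\n"
            "Return corrected JSON that satisfies the schema."
        )
    raise ValueError(f"validation failed after retries: {last_errors}")
```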

## Storage & output formats

```yaml
storage:
  default_format: "jsonl"     # jsonl | csv | markdown | database
  output_directory: "./output"

  jsonl:
    indent: 2                 # null = compact single-line, 2 = pretty-printed

  csv:
    delimiter: ","
    quoting: "minimal"

  markdown:
    enabled: true             # human-readable .md output — one ## section per result

  database:
    connection_string: null   # e.g. "postgresql+asyncpg://user:pass@host/db"
    table_prefix: "sd_"
```

| Format | File extension | Best for |
|---|---|---|
| `jsonl` | `.jsonl` | Pipelines, log ingestion, re-processing. Pretty-printed by default (`indent: 2`). |
| `csv` | `.csv` | Spreadsheets, BI tools. `structured_output` fields flattened to dot-notation columns. |
| `markdown` | `.md` | Human review, docs-as-data, Git diffs. Each result rendered as `## Result N` with a metadata table and a fenced JSON block. |
| `database` | (none) | PostgreSQL via asyncpg. Requires `pip install "structure-d[storage]"`. |
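The dot-notation flattening used for CSV columns can be sketched like this (an illustrative helper, not the library's code):

```python
def flatten(record, parent=""):
    """Flatten nested dict fields into dot-notation keys, e.g.
    {"vendor": {"name": "ACME"}} -> {"vendor.name": "ACME"}."""
    flat = {}
    for key, value in record.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```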

## Full reference

The complete default configuration, with all available settings, lives at `configs/default.yaml` in the repository.