API Reference

Complete API reference for all Structure-D modules, classes, and functions.

Structure-D's API surface is organized into focused modules. Each module can be imported independently, so you install only the extras you need.

Module map

Module	Section	Key contents
`structure_d.pipeline.orchestrator`	Pipeline	Main entry point: `run()`, `run_many()`, `build_index()`
`structure_d.inference.providers`	Providers	OpenAI, Anthropic, Gemini, Ollama, vLLM, `FallbackProvider`
`structure_d.schemas.generic`	Schemas	`BUILTIN_SCHEMAS`, `get_schema()`, `GenericExtraction`, `KeyValueExtraction`…
`structure_d.schemas.base`	Base Models	`TaskType`, `DocumentFormat`, `ExtractionResult`, `ParsedDocument`, `TextChunk`
`structure_d.ingestion`	Ingestion	`IngestionManager`, `BaseParser`, `ParserRegistry`, connectors
`structure_d.preprocessing`	Preprocessing	`Chunker`, `normalize_text()`
`structure_d.validation`	Validation	`SchemaValidator`, `RetryHandler`
`structure_d.indexing`	Indexing	`DocumentReader`, `VectorStoreIndex`, `SummaryIndex`, `QueryEngine`
`structure_d.retrieval`	Retrieval	`ChromaVectorStore`, `PGVectorStore`, `EmbeddingService`
`structure_d.models`	Model Registry	`ModelRegistry`, `ModelRouter`, `ModelEntry`
`structure_d.storage`	Storage	`JSONLWriter`, `CSVWriter`, `DatabaseWriter`, cloud destinations
`structure_d.config`	Config	`Settings`, `load_settings()`, `get_settings()`
`structure_d.api`	FastAPI	`create_app()`, routes, request/response models
`structure_d.exceptions`	Exceptions	`StructureDError` hierarchy
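
Because modules are installed as optional extras, calling code may want to check which ones are present before importing them. The following is a minimal sketch that uses only the standard library. The helper names `is_available` and `available_modules` are hypothetical and not part of Structure-D; the module names come from the module map above.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current path.

    For dotted names, find_spec imports the parent packages but not the
    target module itself.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. structure_d itself) is missing.
        return False


def available_modules(names: list[str]) -> dict[str, bool]:
    """Hypothetical helper: map each dotted module name to its availability."""
    return {name: is_available(name) for name in names}


if __name__ == "__main__":
    status = available_modules([
        "structure_d.retrieval",
        "structure_d.api",
        "structure_d.storage",
    ])
    for name, ok in status.items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

A check like this lets an application degrade gracefully, for example by skipping vector-store features when the retrieval module is not installed.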

Import paths

Quick reference for common imports:

python

# Pipeline
from structure_d.pipeline import Pipeline

# Providers
from structure_d.inference.providers import (
    OpenAIProvider, AnthropicProvider, GeminiProvider,
    OllamaProvider, VLLMProvider, FallbackProvider,
    get_provider, resolve_provider,
)

# Schemas
from structure_d.schemas.generic import get_schema, BUILTIN_SCHEMAS
from structure_d.schemas.base import (
    TaskType, DocumentFormat, ExtractionResult,
    ParsedDocument, TextChunk, DocumentMetadata,
)

# Ingestion
from structure_d.ingestion.manager import IngestionManager, build_default_registry
from structure_d.ingestion.base import BaseParser, ParserRegistry

# Preprocessing
from structure_d.preprocessing.chunker import Chunker
from structure_d.preprocessing.normalizer import normalize_text

# Validation
from structure_d.validation.validator import SchemaValidator
from structure_d.validation.retry import RetryHandler

# Indexing (RAG)
from structure_d.indexing import (
    DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine
)

# Storage
from structure_d.storage.jsonl import JSONLWriter, save_as_jsonl
from structure_d.storage.csv_store import CSVWriter, save_as_csv

# Config
from structure_d.config import load_settings, get_settings

# Exceptions
from structure_d.exceptions import (
    StructureDError, ParserError, ValidationError,
    InferenceError, ConfigurationError, ModelRoutingError,
)

Type reference

Quick lookup for the most commonly used types across the Structure-D API surface.

Type	Module	Description
`Pipeline`	`structure_d.pipeline`	Main entry point. Wires all six stages together.
`ExtractionResult`	`structure_d.schemas.base`	Output of a single chunk extraction — contains `structured_output`, `is_valid`, `latency_ms`, `token_usage`.
`ParsedDocument`	`structure_d.schemas.base`	Raw output of a parser — text, page list, tables, images, and `DocumentMetadata`.
`TextChunk`	`structure_d.schemas.base`	A single text segment from the Chunker — includes `chunk_index`, `token_count`, `page_number`.
`DocumentMetadata`	`structure_d.schemas.base`	File-level metadata attached to every `ParsedDocument`.
`TaskType`	`structure_d.schemas.base`	Enum: `EXTRACTION`, `CLASSIFICATION`, `SUMMARISATION`, `ENTITY_EXTRACTION`, `TABLE_EXTRACTION`.
`DocumentFormat`	`structure_d.schemas.base`	Enum: `PDF`, `IMAGE`, `HTML`, `DOCX`, `XLSX`, `PPTX`, `EMAIL`, `PLAIN_TEXT`, `AUDIO_TRANSCRIPT`.
`BaseLLMProvider`	`structure_d.inference.providers`	Abstract base for all LLM providers. Implement `generate()` to add a new provider.
`ProviderResult`	`structure_d.inference.providers`	Raw return from a provider call — `content`, `model`, `usage`, `latency_ms`.
`ModelEntry`	`structure_d.models.registry`	Registry entry: `alias`, `model_id`, `provider`, `supported_tasks`, `context_length`.
`Settings`	`structure_d.config`	Root Pydantic config model loaded from YAML. Nested: `inference`, `ingestion`, `preprocessing`, `validation`, `storage`.
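
To illustrate how the per-chunk fields of `ExtractionResult` might be consumed, here is a rough sketch built on a stand-in dataclass. `ExtractionResultStub` and `summarize` are hypothetical names, and the field types are assumptions; only the four field names come from the table above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResultStub:
    """Stand-in with the four fields the table lists; types are assumptions."""
    structured_output: dict[str, Any]
    is_valid: bool
    latency_ms: float
    token_usage: dict[str, int] = field(default_factory=dict)


def summarize(results: list[ExtractionResultStub]) -> dict[str, float]:
    """Count chunks, count valid chunks, and total the latency."""
    valid = [r for r in results if r.is_valid]
    return {
        "chunks": len(results),
        "valid": len(valid),
        "total_latency_ms": sum(r.latency_ms for r in results),
    }


# Two illustrative per-chunk results (made-up values).
results = [
    ExtractionResultStub({"invoice_id": "A-1"}, True, 120.0),
    ExtractionResultStub({}, False, 95.5),
]
print(summarize(results))
```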

Supported formats

Every file extension Structure-D can ingest, with the default parser and what it extracts.

Extension	Format enum	Default parser	Extracts
`.pdf` (text)	`PDF`	`pymupdf`	Text per page, tables, metadata
`.pdf` (text, alt)	`PDF`	`pdfplumber`	Accurate table detection
`.pdf` (scanned)	`PDF`	`ocr_pdf`	OCR via Tesseract
`.png` `.jpg` `.tiff`	`IMAGE`	`tesseract_image`	OCR text from raster images
`.html` `.htm`	`HTML`	`html`	Cleaned body text, tables, links
`.docx`	`DOCX`	`docx`	Paragraphs, styles, tables
`.xlsx` `.xls`	`XLSX`	`xlsx`	All sheet data as string matrix
`.pptx`	`PPTX`	`pptx`	Slide text and speaker notes
`.eml` `.msg`	`EMAIL`	`email`	Subject, body, headers, attachments list
`.txt` `.md` `.rst`	`PLAIN_TEXT`	`plaintext`	Raw text
`.csv`	`PLAIN_TEXT`	`plaintext`	Raw text (CSV rows as text)
`.vtt` `.srt`	`AUDIO_TRANSCRIPT`	`transcript`	Caption text with timestamps
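
The extension-to-format mapping in the table can be sketched as a plain lookup. This is an illustration, not the library's `ParserRegistry`: `DocumentFormatStub`, `EXTENSION_TO_FORMAT`, and `guess_format` are hypothetical names, and the enum values are assumptions. Only the member names and extensions are taken from the table.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class DocumentFormatStub(Enum):
    """Stand-in mirroring the member names listed for DocumentFormat."""
    PDF = "pdf"
    IMAGE = "image"
    HTML = "html"
    DOCX = "docx"
    XLSX = "xlsx"
    PPTX = "pptx"
    EMAIL = "email"
    PLAIN_TEXT = "plain_text"
    AUDIO_TRANSCRIPT = "audio_transcript"


F = DocumentFormatStub
EXTENSION_TO_FORMAT = {
    ".pdf": F.PDF,
    ".png": F.IMAGE, ".jpg": F.IMAGE, ".tiff": F.IMAGE,
    ".html": F.HTML, ".htm": F.HTML,
    ".docx": F.DOCX,
    ".xlsx": F.XLSX, ".xls": F.XLSX,
    ".pptx": F.PPTX,
    ".eml": F.EMAIL, ".msg": F.EMAIL,
    ".txt": F.PLAIN_TEXT, ".md": F.PLAIN_TEXT, ".rst": F.PLAIN_TEXT,
    ".csv": F.PLAIN_TEXT,
    ".vtt": F.AUDIO_TRANSCRIPT, ".srt": F.AUDIO_TRANSCRIPT,
}


def guess_format(path: Union[str, Path]) -> Optional[DocumentFormatStub]:
    """Look up the format by file extension, ignoring case."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


print(guess_format("report.PDF"))
print(guess_format("movie.mp4"))
```

Note that the extension alone cannot distinguish a text PDF from a scanned one, so choosing among `pymupdf`, `pdfplumber`, and `ocr_pdf` needs information beyond this lookup.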

Exception hierarchy

All exceptions extend StructureDError, which carries a .message and a .context dict:

text

StructureDError(Exception)           # base
├── ParserError                      # .file_path, .parser_name, .format
├── ValidationError                  # .schema_name, .validation_errors, .raw_output
├── InferenceError                   # .model, .status_code, .response_body
├── ConfigurationError               # .config_key, .config_path
├── ModelRoutingError                # .task, .available_models
├── StorageError                     # .storage_type, .file_path
└── RetrievalError                   # .vector_store, .operation
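
One way such a hierarchy could be laid out is sketched below. This is not Structure-D's source: the `...Sketch` class names are hypothetical, the constructor signatures are assumptions, and copying subclass attributes into `.context` is a design guess. Only the attribute names come from the tree above.

```python
from typing import Any, Optional


class StructureDErrorSketch(Exception):
    """Base class carrying a message and a free-form context dict."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None):
        super().__init__(message)
        self.message = message
        self.context = context or {}


class ParserErrorSketch(StructureDErrorSketch):
    """Adds the parser-specific attributes listed in the tree."""

    def __init__(self, message: str, file_path: Optional[str] = None,
                 parser_name: Optional[str] = None, format: Optional[str] = None):
        # Assumption: subclass attributes are also mirrored into .context.
        super().__init__(message, {
            "file_path": file_path,
            "parser_name": parser_name,
            "format": format,
        })
        self.file_path = file_path
        self.parser_name = parser_name
        self.format = format


err = ParserErrorSketch("bad file", file_path="doc.pdf", parser_name="pymupdf")
print(isinstance(err, StructureDErrorSketch), err.message, err.context)
```

With this layout, a handler for the base class still has structured details available through `.context`, even when it does not know the concrete subclass.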

python

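from pathlib import Path

from structure_d.exceptions import InferenceError, ParserError, StructureDError

# Assumes `pipeline` is an existing Pipeline instance and that this code
# runs inside an async function, since pipeline.run() is awaited.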

try:
    results = await pipeline.run(Path("doc.pdf"))
except ParserError as e:
    print(f"Parser failed: {e.parser_name} on {e.file_path}")
except InferenceError as e:
    print(f"Model error: {e.model} — HTTP {e.status_code}")
except StructureDError as e:
    print(f"Pipeline error: {e.message}, context={e.context}")
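
Since `pipeline.run()` is awaited, the handler has to live inside a coroutine driven by an event loop. The sketch below shows that shape with stand-ins so it runs without Structure-D installed. `FakePipeline`, `PipelineError`, `ParseFailure`, and `extract` are all hypothetical names, and the stub's behaviour is invented for illustration.

```python
import asyncio
from pathlib import Path


class PipelineError(Exception):
    """Stand-in for the library's base exception."""


class ParseFailure(PipelineError):
    """Stand-in for a more specific subclass."""


class FakePipeline:
    """Hypothetical stub; only mimics an awaitable run() method."""

    async def run(self, path: Path) -> list[dict]:
        if path.suffix != ".pdf":
            raise ParseFailure(f"unsupported file: {path.name}")
        return [{"source": path.name}]


async def extract(pipeline: FakePipeline, path: Path) -> str:
    try:
        results = await pipeline.run(path)
    except ParseFailure as e:    # specific handlers come first
        return f"parser failed: {e}"
    except PipelineError as e:   # base class last, as a catch-all
        return f"pipeline error: {e}"
    return f"ok: {len(results)} result(s)"


if __name__ == "__main__":
    # asyncio.run() creates the event loop and drives the coroutine.
    print(asyncio.run(extract(FakePipeline(), Path("doc.pdf"))))
```

Ordering matters: if the base-class handler came first, it would also catch the subclass and the more specific branch would never run.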
