API Reference

Complete API reference for all Structure-D modules, classes, and functions.

Structure-D's API surface is organized into focused modules. Each module can be imported independently, so you can install only the extras you need.
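Because modules can be installed selectively, it can help to check which ones are present before importing them. The following is a minimal sketch using only the standard library. The helper name has_module is hypothetical, and the three module paths are taken from the import list in this reference purely as examples; this page does not say which modules actually depend on optional extras.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the module can be located.

    Parent packages are imported as a side effect of the lookup.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package (e.g. "structure_d") is not installed.
        # ValueError: the module exists but has no usable spec.
        return False


# Example paths only; which of these require extras is an assumption.
CANDIDATE_MODULES = [
    "structure_d.indexing",
    "structure_d.storage.jsonl",
    "structure_d.storage.csv_store",
]

if __name__ == "__main__":
    for module in CANDIDATE_MODULES:
        status = "available" if has_module(module) else "missing"
        print(f"{module}: {status}")
```

A check like this lets an application degrade gracefully, for example by skipping the indexing step when structure_d.indexing cannot be found.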

Module map

Import paths

Quick reference for common imports:

python
# Pipeline
from structure_d.pipeline import Pipeline

# Providers
from structure_d.inference.providers import (
    OpenAIProvider, AnthropicProvider, GeminiProvider,
    OllamaProvider, VLLMProvider, FallbackProvider,
    get_provider, resolve_provider,
)

# Schemas
from structure_d.schemas.generic import get_schema, BUILTIN_SCHEMAS
from structure_d.schemas.base import (
    TaskType, DocumentFormat, ExtractionResult,
    ParsedDocument, TextChunk, DocumentMetadata,
)

# Ingestion
from structure_d.ingestion.manager import IngestionManager, build_default_registry
from structure_d.ingestion.base import BaseParser, ParserRegistry

# Preprocessing
from structure_d.preprocessing.chunker import Chunker
from structure_d.preprocessing.normalizer import normalize_text

# Validation
from structure_d.validation.validator import SchemaValidator
from structure_d.validation.retry import RetryHandler

# Indexing (RAG)
from structure_d.indexing import (
    DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine
)

# Storage
from structure_d.storage.jsonl import JSONLWriter, save_as_jsonl
from structure_d.storage.csv_store import CSVWriter, save_as_csv

# Config
from structure_d.config import load_settings, get_settings

# Exceptions
from structure_d.exceptions import (
    StructureDError, ParserError, ValidationError,
    InferenceError, ConfigurationError, ModelRoutingError,
)

Type reference

Quick lookup for the most commonly used types across the Structure-D API surface.

Type              Module                           Description
Pipeline          structure_d.pipeline             Main entry point. Wires all six stages together.
ExtractionResult  structure_d.schemas.base         Output of a single chunk extraction. Contains structured_output, is_valid, latency_ms, token_usage.
ParsedDocument    structure_d.schemas.base         Raw output of a parser: text, page list, tables, images, and DocumentMetadata.
TextChunk         structure_d.schemas.base         A single text segment from the Chunker. Includes chunk_index, token_count, page_number.
DocumentMetadata  structure_d.schemas.base         File-level metadata attached to every ParsedDocument.
TaskType          structure_d.schemas.base         Enum: EXTRACTION, CLASSIFICATION, SUMMARISATION, ENTITY_EXTRACTION, TABLE_EXTRACTION.
DocumentFormat    structure_d.schemas.base         Enum: PDF, IMAGE, HTML, DOCX, XLSX, PPTX, EMAIL, PLAIN_TEXT, AUDIO_TRANSCRIPT.
BaseLLMProvider   structure_d.inference.providers  Abstract base for all LLM providers. Implement generate() to add a new provider.
ProviderResult    structure_d.inference.providers  Raw return from a provider call: content, model, usage, latency_ms.
ModelEntry        structure_d.models.registry      Registry entry: alias, model_id, provider, supported_tasks, context_length.
Settings          structure_d.config               Root Pydantic config model loaded from YAML. Nested: inference, ingestion, preprocessing, validation, storage.
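To illustrate how the ExtractionResult fields listed above might be consumed, here is a minimal sketch that aggregates per-chunk results. ExtractionResultSketch is a hypothetical stand-in, not the library class: the field names come from the table, but the field types, the summarize helper, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResultSketch:
    """Hypothetical stand-in for ExtractionResult; field types are assumed."""

    structured_output: dict[str, Any]
    is_valid: bool
    latency_ms: float
    token_usage: dict[str, int] = field(default_factory=dict)


def summarize(results):
    """Count valid chunks, total the latency, and collect the valid outputs."""
    valid = [r for r in results if r.is_valid]
    return {
        "chunks": len(results),
        "valid": len(valid),
        "total_latency_ms": sum(r.latency_ms for r in results),
        "outputs": [r.structured_output for r in valid],
    }


# Made-up sample data: one valid chunk and one that failed validation.
results = [
    ExtractionResultSketch({"title": "Invoice 17"}, True, 120.0),
    ExtractionResultSketch({}, False, 95.5),
]
summary = summarize(results)
print(summary)
```

Filtering on is_valid before reading structured_output keeps failed chunks out of downstream storage while still counting their latency.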

Supported formats

Every file extension Structure-D can ingest, with the default parser and what it extracts.

Extension         Format enum       Default parser   Extracts
.pdf (text)       PDF               pymupdf          Text per page, tables, metadata
.pdf (text, alt)  PDF               pdfplumber       Accurate table detection
.pdf (scanned)    PDF               ocr_pdf          OCR via Tesseract
.png .jpg .tiff   IMAGE             tesseract_image  OCR text from raster images
.html .htm        HTML              html             Cleaned body text, tables, links
.docx             DOCX              docx             Paragraphs, styles, tables
.xlsx .xls        XLSX              xlsx             All sheet data as string matrix
.pptx             PPTX              pptx             Slide text and speaker notes
.eml .msg         EMAIL             email            Subject, body, headers, attachments list
.txt .md .rst     PLAIN_TEXT        plaintext        Raw text
.csv              PLAIN_TEXT        plaintext        Raw text (CSV rows as text)
.vtt .srt         AUDIO_TRANSCRIPT  transcript       Caption text with timestamps
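The table above amounts to a lookup from file extension to format and default parser. The sketch below expresses it as a plain dictionary. It is an illustration, not the library's parser registry: FORMAT_TABLE and lookup_format are hypothetical names, case-insensitive matching is an assumption, and for .pdf only the first row (pymupdf) is used, since how the alternative and scanned-PDF parsers are selected is not described here.

```python
from pathlib import Path

# Extension -> (format enum name, default parser), copied from the table.
FORMAT_TABLE = {
    ".pdf": ("PDF", "pymupdf"),
    ".png": ("IMAGE", "tesseract_image"),
    ".jpg": ("IMAGE", "tesseract_image"),
    ".tiff": ("IMAGE", "tesseract_image"),
    ".html": ("HTML", "html"),
    ".htm": ("HTML", "html"),
    ".docx": ("DOCX", "docx"),
    ".xlsx": ("XLSX", "xlsx"),
    ".xls": ("XLSX", "xlsx"),
    ".pptx": ("PPTX", "pptx"),
    ".eml": ("EMAIL", "email"),
    ".msg": ("EMAIL", "email"),
    ".txt": ("PLAIN_TEXT", "plaintext"),
    ".md": ("PLAIN_TEXT", "plaintext"),
    ".rst": ("PLAIN_TEXT", "plaintext"),
    ".csv": ("PLAIN_TEXT", "plaintext"),
    ".vtt": ("AUDIO_TRANSCRIPT", "transcript"),
    ".srt": ("AUDIO_TRANSCRIPT", "transcript"),
}


def lookup_format(path):
    """Return (format, parser) for a path, or None for an unknown extension."""
    # Lowercasing the suffix is an assumption about case handling.
    return FORMAT_TABLE.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["report.PDF", "talk.srt", "archive.zip"]:
        print(name, lookup_format(name))
```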

Exception hierarchy

All exceptions extend StructureDError, which carries a .message attribute and a .context dict:

python
StructureDError(Exception)           # base
├── ParserError                      # .file_path, .parser_name, .format
├── ValidationError                  # .schema_name, .validation_errors, .raw_output
├── InferenceError                   # .model, .status_code, .response_body
├── ConfigurationError               # .config_key, .config_path
├── ModelRoutingError                # .task, .available_models
├── StorageError                     # .storage_type, .file_path
└── RetrievalError                   # .vector_store, .operation
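
Catch the most specific exceptions first and fall back to StructureDError:

python
from pathlib import Path

from structure_d.exceptions import InferenceError, ParserError, StructureDError


async def run_with_handling(pipeline, path: Path):
    """Run an already-configured Pipeline on one file, reporting failures by type."""
    try:
        return await pipeline.run(path)
    except ParserError as e:  # the document could not be parsed
        print(f"Parser failed: {e.parser_name} on {e.file_path}")
    except InferenceError as e:  # the model provider call failed
        print(f"Model error: {e.model} (HTTP {e.status_code})")
    except StructureDError as e:  # any other Structure-D error
        print(f"Pipeline error: {e.message}, context={e.context}")
    return None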
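The shape described in this section, a base error carrying .message and .context with subclasses adding their own attributes, can be sketched with stand-in classes. SketchError and SketchParserError are hypothetical names and do not reproduce the library's implementation. In particular, copying the subclass attributes into the context dict is an assumption.

```python
class SketchError(Exception):
    """Hypothetical stand-in for StructureDError: a message plus a context dict."""

    def __init__(self, message, **context):
        super().__init__(message)
        self.message = message
        self.context = context


class SketchParserError(SketchError):
    """Stand-in for ParserError with the attributes listed in the hierarchy."""

    def __init__(self, message, file_path, parser_name, format):
        # Assumption: subclass attributes are also mirrored into .context.
        super().__init__(
            message, file_path=file_path, parser_name=parser_name, format=format
        )
        self.file_path = file_path
        self.parser_name = parser_name
        self.format = format


def describe(exc):
    """Format any SketchError using only the base-class attributes."""
    details = ", ".join(f"{k}={v}" for k, v in sorted(exc.context.items()))
    return f"{type(exc).__name__}: {exc.message} ({details})"


try:
    raise SketchParserError("no text layer", "doc.pdf", "pymupdf", "PDF")
except SketchError as exc:  # the base class catches every subclass
    report = describe(exc)

print(report)
```

Catching the base class last, as in the handler above, works because every subclass instance is also an instance of the base.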