API Reference
Complete API reference for all Structure-D modules, classes, and functions.
Structure-D's API surface is organized into focused modules. Each module can be imported independently, so you can install only the extras you need.
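Because modules depend on optional extras, code that has to run across different installs may want to check for a module before importing it. A minimal sketch using only the standard library; the helper name `module_available` is hypothetical and not part of Structure-D:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the dotted module path can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example structure_d itself) is missing.
        return False


# Hypothetical usage: only take the retrieval path when that module is present.
if module_available("structure_d.retrieval"):
    print("retrieval module available")
else:
    print("retrieval module not installed")
```

Note that `find_spec` only confirms the module can be located. The import itself may still fail if a third-party dependency behind an extra is missing, so a `try`/`except ImportError` around the real import remains a reasonable fallback.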
Module map
| Module | Name | Contents |
|---|---|---|
| structure_d.pipeline.orchestrator | Pipeline | Main entry point: run(), run_many(), build_index() |
| structure_d.inference.providers | Providers | OpenAI, Anthropic, Gemini, Ollama, vLLM, FallbackProvider |
| structure_d.schemas.generic | Schemas | BUILTIN_SCHEMAS, get_schema(), GenericExtraction, KeyValueExtraction… |
| structure_d.schemas.base | Base Models | TaskType, DocumentFormat, ExtractionResult, ParsedDocument, TextChunk |
| structure_d.ingestion | Ingestion | IngestionManager, BaseParser, ParserRegistry, connectors |
| structure_d.preprocessing | Preprocessing | Chunker, normalize_text() |
| structure_d.validation | Validation | SchemaValidator, RetryHandler |
| structure_d.indexing | Indexing | DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine |
| structure_d.retrieval | Retrieval | ChromaVectorStore, PGVectorStore, EmbeddingService |
| structure_d.models | Model Registry | ModelRegistry, ModelRouter, ModelEntry |
| structure_d.storage | Storage | JSONLWriter, CSVWriter, DatabaseWriter, cloud destinations |
| structure_d.config | Config | Settings, load_settings(), get_settings() |
| structure_d.api | FastAPI | create_app(), routes, request/response models |
| structure_d.exceptions | Exceptions | StructureDError hierarchy |
Import paths
Quick reference for common imports:
```python
# Pipeline
from structure_d.pipeline import Pipeline

# Providers
from structure_d.inference.providers import (
    OpenAIProvider, AnthropicProvider, GeminiProvider,
    OllamaProvider, VLLMProvider, FallbackProvider,
    get_provider, resolve_provider,
)

# Schemas
from structure_d.schemas.generic import get_schema, BUILTIN_SCHEMAS
from structure_d.schemas.base import (
    TaskType, DocumentFormat, ExtractionResult,
    ParsedDocument, TextChunk, DocumentMetadata,
)

# Ingestion
from structure_d.ingestion.manager import IngestionManager, build_default_registry
from structure_d.ingestion.base import BaseParser, ParserRegistry

# Preprocessing
from structure_d.preprocessing.chunker import Chunker
from structure_d.preprocessing.normalizer import normalize_text

# Validation
from structure_d.validation.validator import SchemaValidator
from structure_d.validation.retry import RetryHandler

# Indexing (RAG)
from structure_d.indexing import (
    DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine,
)

# Storage
from structure_d.storage.jsonl import JSONLWriter, save_as_jsonl
from structure_d.storage.csv_store import CSVWriter, save_as_csv

# Config
from structure_d.config import load_settings, get_settings

# Exceptions
from structure_d.exceptions import (
    StructureDError, ParserError, ValidationError,
    InferenceError, ConfigurationError, ModelRoutingError,
)
```
Type reference
Quick lookup for the most commonly used types across the Structure-D API surface.
| Type | Module | Description |
|---|---|---|
| Pipeline | structure_d.pipeline | Main entry point. Wires all six stages together. |
| ExtractionResult | structure_d.schemas.base | Output of a single chunk extraction. Contains structured_output, is_valid, latency_ms, token_usage. |
| ParsedDocument | structure_d.schemas.base | Raw output of a parser: text, page list, tables, images, and DocumentMetadata. |
| TextChunk | structure_d.schemas.base | A single text segment from the Chunker. Includes chunk_index, token_count, page_number. |
| DocumentMetadata | structure_d.schemas.base | File-level metadata attached to every ParsedDocument. |
| TaskType | structure_d.schemas.base | Enum: EXTRACTION, CLASSIFICATION, SUMMARISATION, ENTITY_EXTRACTION, TABLE_EXTRACTION. |
| DocumentFormat | structure_d.schemas.base | Enum: PDF, IMAGE, HTML, DOCX, XLSX, PPTX, EMAIL, PLAIN_TEXT, AUDIO_TRANSCRIPT. |
| BaseLLMProvider | structure_d.inference.providers | Abstract base for all LLM providers. Implement generate() to add a new provider. |
| ProviderResult | structure_d.inference.providers | Raw return from a provider call: content, model, usage, latency_ms. |
| ModelEntry | structure_d.models.registry | Registry entry: alias, model_id, provider, supported_tasks, context_length. |
| Settings | structure_d.config | Root Pydantic config model loaded from YAML. Nested: inference, ingestion, preprocessing, validation, storage. |
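The ExtractionResult fields listed above are enough to summarise a batch of chunk results. The sketch below uses a stand-in dataclass rather than the real class, so it runs without Structure-D installed. The field types, the `"total_tokens"` key inside token_usage, and the helper name `summarize` are assumptions for illustration, not confirmed API:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResultStub:
    """Stand-in carrying the four fields named in the table; types are assumed."""
    structured_output: dict[str, Any]
    is_valid: bool
    latency_ms: float
    token_usage: dict[str, int] = field(default_factory=dict)


def summarize(results: list[ExtractionResultStub]) -> dict[str, Any]:
    """Count valid chunks and total latency and tokens across a batch."""
    valid = [r for r in results if r.is_valid]
    return {
        "chunks": len(results),
        "valid": len(valid),
        "total_latency_ms": sum(r.latency_ms for r in results),
        # Assumed key name; results without it contribute zero tokens.
        "total_tokens": sum(r.token_usage.get("total_tokens", 0) for r in results),
    }


batch = [
    ExtractionResultStub({"invoice_id": "A-1"}, True, 120.0, {"total_tokens": 300}),
    ExtractionResultStub({}, False, 95.5, {"total_tokens": 250}),
]
print(summarize(batch))
```

Against the real class, the same aggregation should only need the stub swapped for ExtractionResult, provided the actual field types match the assumptions above.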
Supported formats
Every file extension Structure-D can ingest, with the default parser and what it extracts.
| Extension | Format enum | Default parser | Extracts |
|---|---|---|---|
| .pdf (text) | PDF | pymupdf | Text per page, tables, metadata |
| .pdf (text, alt) | PDF | pdfplumber | Accurate table detection |
| .pdf (scanned) | PDF | ocr_pdf | OCR via Tesseract |
| .png .jpg .tiff | IMAGE | tesseract_image | OCR text from raster images |
| .html .htm | HTML | html | Cleaned body text, tables, links |
| .docx | DOCX | docx | Paragraphs, styles, tables |
| .xlsx .xls | XLSX | xlsx | All sheet data as string matrix |
| .pptx | PPTX | pptx | Slide text and speaker notes |
| .eml .msg | EMAIL | email | Subject, body, headers, attachments list |
| .txt .md .rst | PLAIN_TEXT | plaintext | Raw text |
| .csv | PLAIN_TEXT | plaintext | Raw text (CSV rows as text) |
| .vtt .srt | AUDIO_TRANSCRIPT | transcript | Caption text with timestamps |
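The table above amounts to a lookup from file extension to format. A minimal sketch of that lookup follows. It uses a stand-in enum, because this reference does not show how Structure-D detects formats internally; `DocumentFormatStub`, `EXTENSION_FORMATS`, and `detect_format` are hypothetical names for illustration only:

```python
from enum import Enum, auto
from pathlib import Path


class DocumentFormatStub(Enum):
    """Stand-in mirroring the member names of DocumentFormat; values are arbitrary."""
    PDF = auto()
    IMAGE = auto()
    HTML = auto()
    DOCX = auto()
    XLSX = auto()
    PPTX = auto()
    EMAIL = auto()
    PLAIN_TEXT = auto()
    AUDIO_TRANSCRIPT = auto()


F = DocumentFormatStub
# Extension-to-format pairs copied from the table above.
EXTENSION_FORMATS = {
    ".pdf": F.PDF,
    ".png": F.IMAGE, ".jpg": F.IMAGE, ".tiff": F.IMAGE,
    ".html": F.HTML, ".htm": F.HTML,
    ".docx": F.DOCX,
    ".xlsx": F.XLSX, ".xls": F.XLSX,
    ".pptx": F.PPTX,
    ".eml": F.EMAIL, ".msg": F.EMAIL,
    ".txt": F.PLAIN_TEXT, ".md": F.PLAIN_TEXT, ".rst": F.PLAIN_TEXT, ".csv": F.PLAIN_TEXT,
    ".vtt": F.AUDIO_TRANSCRIPT, ".srt": F.AUDIO_TRANSCRIPT,
}


def detect_format(path: str) -> DocumentFormatStub:
    """Map a file path to a format by its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return EXTENSION_FORMATS[suffix]


print(detect_format("report.PDF").name)
print(detect_format("minutes.srt").name)
```

An extension alone cannot distinguish a text PDF from a scanned one, so choosing between pymupdf, pdfplumber, and ocr_pdf has to happen after this step, by whatever means the library or caller provides.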
Exception hierarchy
All exceptions extend StructureDError, which carries a .message and a .context dict:
```text
StructureDError(Exception)    # base
├── ParserError               # .file_path, .parser_name, .format
├── ValidationError           # .schema_name, .validation_errors, .raw_output
├── InferenceError            # .model, .status_code, .response_body
├── ConfigurationError        # .config_key, .config_path
├── ModelRoutingError         # .task, .available_models
├── StorageError              # .storage_type, .file_path
└── RetrievalError            # .vector_store, .operation
```
Catch the most specific exceptions first and fall back to the base class:
```python
from pathlib import Path

from structure_d.exceptions import InferenceError, ParserError, StructureDError
from structure_d.pipeline import Pipeline


async def extract(pipeline: Pipeline, path: Path):
    # pipeline.run() is awaited, so this must be called from async code.
    try:
        return await pipeline.run(path)
    except ParserError as e:
        print(f"Parser failed: {e.parser_name} on {e.file_path}")
    except InferenceError as e:
        print(f"Model error: {e.model}, HTTP {e.status_code}")
    except StructureDError as e:
        print(f"Pipeline error: {e.message}, context={e.context}")
```
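To show how a hierarchy like this can carry structured details, here is a self-contained sketch of a base error with `.message` and `.context` and one subclass with extra attributes. The `Demo`-prefixed classes and their constructor signatures are assumptions for illustration; the real Structure-D constructors are not documented in this section:

```python
from typing import Any, Optional


class DemoStructureDError(Exception):
    """Illustrative base error carrying a message and a context dict."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None):
        super().__init__(message)
        self.message = message
        self.context = context or {}


class DemoParserError(DemoStructureDError):
    """Illustrative subclass exposing the attributes listed for ParserError."""

    def __init__(self, message: str, file_path: str, parser_name: str, fmt: str):
        super().__init__(
            message,
            {"file_path": file_path, "parser_name": parser_name, "format": fmt},
        )
        self.file_path = file_path
        self.parser_name = parser_name
        self.format = fmt


# Catching the base class still gives access to the shared fields.
try:
    raise DemoParserError("no text layer found", "doc.pdf", "pymupdf", "PDF")
except DemoStructureDError as e:
    print(f"{type(e).__name__}: {e.message}, context={e.context}")
```

Mirroring the subclass attributes into the context dict, as done here, is one design option: it lets a generic handler log every failure uniformly without knowing which subclass it caught.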