Changelog

Release history

All notable changes to Structure-D are documented here.
Follows Keep a Changelog format.

v0.2.0

March 2026 Latest

Adds the full RAG indexing layer, a high-performance Rust CLI, cloud destination writers, provider fallback chaining, and observability with Prometheus + OpenTelemetry.

RAG

Retrieval-Augmented Generation

New DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine, and RAGPipeline — compatible with the LlamaIndex API.

CLI

Rust CLI

Native binary with extract, batch, formats, models, schemas, config, and providers subcommands. Zero Python overhead for batch jobs.

FBK

FallbackProvider

Chain two providers transparently — primary runs first, fallback takes over on InferenceError. Configurable via YAML or code.

All changes

Added RAG indexing layer: DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine, RAGPipeline
Added LlamaIndex-compatible API — load_and_chunk(), as_query_engine(), insert_nodes()
Added Rust CLI with extract, batch, formats, models, schemas, config, providers subcommands
Added FallbackProvider for transparent two-provider chaining on InferenceError
Added Cloud destination writers: Snowflake, BigQuery, MySQL, Redshift
Added Monitoring: Prometheus metrics endpoint + OpenTelemetry tracing exporter
Improved Chunker — new semantic strategy: heading-aware sentence splitting
Improved SchemaValidator — multi-step extraction: JSON → code fence → block heuristics fallback
Improved ModelRegistry — added deepseek-r1-70b and qwen-vl-7b (multimodal) entries
Improved Pipeline.run_many() — now returns dict keyed by filename for easier iteration
Fixed OCR parser encoding issue on non-ASCII (CJK, Arabic) documents
Fixed EmailParser failing on malformed MIME boundaries
Fixed VLLMProvider not sending system_prompt correctly on chat-template models
Fixed CSVWriter writing None as string "None" instead of empty field

v0.1.0

January 2026 Stable

View on GitHub ↗

Initial release. The full six-stage pipeline, 14 format parsers, five LLM providers, eight built-in schemas, model routing, source connectors, and a FastAPI service.

PIP

Core pipeline

Ingestion → Preprocessing → Routing → Inference → Validation → Storage. Fully async throughout.

FMT

14 format parsers

PyMuPDF, PDFPlumber, OCR-PDF, Tesseract image, HTML, DOCX, XLSX, PPTX, email, transcript, plain text.

SCH

Schema-driven

Eight built-in Pydantic schemas and a get_schema() registry. Bring any BaseModel as your extraction target.

All changes

Added Core six-stage pipeline: Ingestion → Preprocessing → Routing → Inference → Validation → Storage
Added Format parsers: pymupdf, pdfplumber, ocr_pdf, tesseract_image, html, docx, xlsx, pptx, email, transcript, plaintext
Added LLM providers: VLLMProvider (guided_json), OpenAIProvider, AnthropicProvider (tool-use), GeminiProvider, OllamaProvider
Added Built-in schemas: key_value, table, entity, classification, summary, form, document_structure, generic
Added ModelRegistry with 7 pre-registered models and task-based ModelRouter
Added Source connectors: LocalConnector, S3Connector, GCSConnector, AzureConnector, SFTPConnector
Added FastAPI service with /extract, /health, /models, /schemas, /formats endpoints
Added Storage writers: JSONLWriter, CSVWriter
Added BatchProcessor with configurable concurrency via asyncio.Semaphore
Added SchemaValidator + RetryHandler — up to max_retries corrective retry attempts
Added Chunker with fixed, sentence, heading, and semantic strategies
Added Settings and YAML config loader with SD_* environment variable overrides
Added 110 unit tests across pipeline, parsers, validators, storage, API, and schemas

Upcoming

What's next

The roadmap is open — contributions and feedback welcome on GitHub Discussions.

v0.3.0 Q2 2026

ETL Pipeline Framework

Scheduled batch processing with cron-style triggers
YAML workflow definitions (sources, transforms, destinations)
Per-field success rates, fill rates, and cost tracking
Resume failed batches from checkpoint
Native Airflow / Prefect operator

v0.4.0 Q3 2026

Human-in-the-Loop

HITL review queue for low-confidence extractions
API authentication (API key + JWT) + rate limiting
React-based monitoring dashboard
Versioned prompt template management
Feedback loop — approved corrections update model routing heuristics

v0.5.0 Q4 2026

Multi-Modal & Agents

Native vision models (GPT-4o, Claude Sonnet vision, Qwen-VL)
Table-from-image extraction without OCR
Agent-mode extraction — model requests specific page regions
Streaming output support

Stay updated

Watch the GitHub repository to get notified of new releases.

Watch on GitHub →