Changelog

Release history

All notable changes to Structure-D are documented here.
The format follows Keep a Changelog.

v0.2.0

Latest

Adds the full RAG indexing layer, a high-performance Rust CLI, cloud destination writers, provider fallback chaining, and observability with Prometheus + OpenTelemetry.

Retrieval-Augmented Generation

New DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine, and RAGPipeline — compatible with the LlamaIndex API.

Rust CLI

Native binary with extract, batch, formats, models, schemas, config, and providers subcommands. Zero Python overhead for batch jobs.

FallbackProvider

Chain two providers transparently — primary runs first, fallback takes over on InferenceError. Configurable via YAML or code.
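
The chaining behaviour described above can be sketched as follows. This is a minimal illustration, not Structure-D's implementation: the generate() method name, the FallbackSketch class, and both stub providers are hypothetical, and InferenceError is assumed to be an ordinary exception type.

```python
class InferenceError(Exception):
    """Stand-in for the library's InferenceError (assumed to be an Exception)."""


class FlakyProvider:
    """Hypothetical primary provider that always fails."""

    def generate(self, prompt: str) -> str:
        raise InferenceError("primary unavailable")


class EchoProvider:
    """Hypothetical fallback provider that always succeeds."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class FallbackSketch:
    """Try the primary provider first; use the fallback only on InferenceError."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def generate(self, prompt: str) -> str:
        try:
            return self.primary.generate(prompt)
        except InferenceError:
            # Only InferenceError triggers the fallback; any other
            # exception type propagates to the caller unchanged.
            return self.fallback.generate(prompt)


provider = FallbackSketch(FlakyProvider(), EchoProvider())
result = provider.generate("invoice total?")
```

Catching only InferenceError keeps programming errors (for example a TypeError in calling code) visible instead of silently rerouting them to the second provider.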

All changes

  • Added RAG indexing layer: DocumentReader, VectorStoreIndex, SummaryIndex, QueryEngine, RAGPipeline
  • Added LlamaIndex-compatible API — load_and_chunk(), as_query_engine(), insert_nodes()
  • Added Rust CLI with extract, batch, formats, models, schemas, config, providers subcommands
  • Added FallbackProvider for transparent two-provider chaining on InferenceError
  • Added Cloud destination writers: Snowflake, BigQuery, MySQL, Redshift
  • Added Monitoring: Prometheus metrics endpoint + OpenTelemetry tracing exporter
  • Improved Chunker — new semantic strategy: heading-aware sentence splitting
  • Improved SchemaValidator — multi-step extraction: JSON → code fence → block heuristics fallback
  • Improved ModelRegistry — added deepseek-r1-70b and qwen-vl-7b (multimodal) entries
  • Improved Pipeline.run_many() — now returns dict keyed by filename for easier iteration
  • Fixed OCR parser encoding issue on non-ASCII (CJK, Arabic) documents
  • Fixed EmailParser failing on malformed MIME boundaries
  • Fixed VLLMProvider not sending system_prompt correctly on chat-template models
  • Fixed CSVWriter writing None as the string "None" instead of an empty field
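
The multi-step extraction order listed for SchemaValidator (JSON, then code fence, then block heuristics) can be sketched roughly like this. The function name extract_json, the regular expression, and the "outermost braces" heuristic are assumptions made for illustration; the release notes do not specify how each step is implemented.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, built indirectly to keep this listing readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _try_json(text: str) -> Optional[Any]:
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def extract_json(raw: str) -> Optional[Any]:
    """Hypothetical three-step extraction from a model response.

    Note: a literal JSON null is indistinguishable from "nothing found" here.
    """
    # Step 1: the whole response is already JSON.
    parsed = _try_json(raw.strip())
    if parsed is not None:
        return parsed

    # Step 2: JSON wrapped in a fenced code block.
    for match in FENCE_RE.finditer(raw):
        parsed = _try_json(match.group(1).strip())
        if parsed is not None:
            return parsed

    # Step 3: heuristic fallback, taking the span between the
    # first "{" and the last "}" in the response.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        return _try_json(raw[start:end + 1])
    return None


fenced = "Here you go:\n" + FENCE + 'json\n{"a": 2}\n' + FENCE
```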

v0.1.0

Stable

Initial release with the full six-stage pipeline, 14 format parsers, five LLM providers, eight built-in schemas, model routing, source connectors, and a FastAPI service.

Core pipeline

Ingestion → Preprocessing → Routing → Inference → Validation → Storage. Fully async throughout.
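
As a rough picture of an async six-stage flow, the sketch below chains six coroutines in the order named above. Only the stage names come from the release notes; make_stage, run_pipeline, and the dict passed between stages are hypothetical placeholders, and each stage body merely records that it ran.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Awaitable[Document]]

STAGE_NAMES: List[str] = [
    "ingestion",
    "preprocessing",
    "routing",
    "inference",
    "validation",
    "storage",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that appends its name to a trace list."""

    async def stage(doc: Document) -> Document:
        await asyncio.sleep(0)  # yield to the event loop, standing in for real I/O
        return {**doc, "trace": doc.get("trace", []) + [name]}

    return stage


STAGES: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


async def run_pipeline(doc: Document) -> Document:
    """Pass one document through every stage in order."""
    for stage in STAGES:
        doc = await stage(doc)
    return doc


result = asyncio.run(run_pipeline({"source": "report.pdf"}))
```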

14 format parsers

Parsers include PyMuPDF, PDFPlumber, OCR-PDF, Tesseract image, HTML, DOCX, XLSX, PPTX, email, transcript, and plain text.

Schema-driven

Eight built-in Pydantic schemas and a get_schema() registry. Bring any BaseModel as your extraction target.

All changes

  • Added Core six-stage pipeline: Ingestion → Preprocessing → Routing → Inference → Validation → Storage
  • Added Format parsers: pymupdf, pdfplumber, ocr_pdf, tesseract_image, html, docx, xlsx, pptx, email, transcript, plaintext
  • Added LLM providers: VLLMProvider (guided_json), OpenAIProvider, AnthropicProvider (tool-use), GeminiProvider, OllamaProvider
  • Added Built-in schemas: key_value, table, entity, classification, summary, form, document_structure, generic
  • Added ModelRegistry with 7 pre-registered models and task-based ModelRouter
  • Added Source connectors: LocalConnector, S3Connector, GCSConnector, AzureConnector, SFTPConnector
  • Added FastAPI service with /extract, /health, /models, /schemas, /formats endpoints
  • Added Storage writers: JSONLWriter, CSVWriter
  • Added BatchProcessor with configurable concurrency via asyncio.Semaphore
  • Added SchemaValidator + RetryHandler — up to max_retries corrective retry attempts
  • Added Chunker with fixed, sentence, heading, and semantic strategies
  • Added Settings and YAML config loader with SD_* environment variable overrides
  • Added 110 unit tests across pipeline, parsers, validators, storage, API, and schemas
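
The BatchProcessor entry above mentions bounding concurrency with asyncio.Semaphore. A minimal sketch of that general pattern follows. The names process_batch and fake_extract are hypothetical and do not reflect the BatchProcessor API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence


async def process_batch(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency: int = 4,
) -> List[str]:
    """Run worker over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        # Each task waits here until one of the semaphore slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


# Demo: count how many workers run at once.
stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def fake_extract(name: str) -> str:
    """Hypothetical stand-in for an extraction call."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return name.upper()


files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
results = asyncio.run(process_batch(files, fake_extract, concurrency=2))
```

All tasks are created up front and only the semaphore limits how many run at once, which keeps the code short. For very large batches a worker-pool design with a queue would avoid creating one task per item.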

Upcoming

What's next

The roadmap is open — contributions and feedback welcome on GitHub Discussions.

v0.3.0 Q2 2026

ETL Pipeline Framework

  • Scheduled batch processing with cron-style triggers
  • YAML workflow definitions (sources, transforms, destinations)
  • Per-field success rates, fill rates, and cost tracking
  • Resume failed batches from checkpoint
  • Native Airflow / Prefect operator

v0.4.0 Q3 2026

Human-in-the-Loop

  • HITL review queue for low-confidence extractions
  • API authentication (API key + JWT) + rate limiting
  • React-based monitoring dashboard
  • Versioned prompt template management
  • Feedback loop — approved corrections update model routing heuristics

v0.5.0 Q4 2026

Multi-Modal & Agents

  • Native vision models (GPT-4o, Claude Sonnet vision, Qwen-VL)
  • Table-from-image extraction without OCR
  • Agent-mode extraction — model requests specific page regions
  • Streaming output support

Stay updated

Watch the GitHub repository to get notified of new releases.