# Core Pipeline API

Reference for `Pipeline`, `ExtractionResult`, `ParsedDocument`, `TextChunk`, and `Settings`.
## Pipeline

The primary entry point. `Pipeline` wires together all stages (ingestion → preprocessing → routing → inference → validation → storage).
Create a new pipeline instance.
**Parameters**

- (required) may be `None` when using RAG-only mode.
- (optional) defaults to `TaskType.EXTRACTION`.
- (optional) defaults to `configs/default.yaml`.
- (optional) defaults to `None`, resolved from config.
- (optional) defaults to `configs/models.yaml`.
- (optional) `vector_store`.

### Pipeline.run()
Extract structured data from a single file.
**Parameters**

- (optional) e.g. `"ocr_pdf"`.
- (optional) e.g. `"deepseek-r1-70b"`.

**Returns**

Check `result.is_valid` before using `result.structured_output`.

**Example**
```python
results = await pipeline.run(
    Path("report.pdf"),
    save_format="jsonl",
)
for r in results:
    if r.is_valid:
        print(r.structured_output)
```

### Pipeline.run_many()
Concurrently extract from multiple files.
**Parameters**

- (optional) forwarded to `run()` for each file.

**Returns**
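`run_many()` is, in effect, a concurrent gather over `run()`. A minimal self-contained sketch of that pattern — `fake_run` and `run_many` here are hypothetical stand-ins, not the library's actual code:

```python
import asyncio
from pathlib import Path

async def fake_run(path: Path) -> list[str]:
    # Hypothetical stand-in for Pipeline.run(): returns per-file results.
    await asyncio.sleep(0)  # yield control, as real I/O would
    return [f"result for {path.name}"]

async def run_many(paths: list[Path]) -> list[list[str]]:
    # Launch one run() per file and await them all concurrently;
    # results come back in the same order as the input paths.
    return await asyncio.gather(*(fake_run(p) for p in paths))

all_results = asyncio.run(run_many([Path("a.pdf"), Path("b.pdf")]))
```

The real method presumably bounds concurrency as well; this sketch only shows the gather-over-`run()` shape.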
### Pipeline.build_index()
Parse a document, chunk it, and insert into a vector or summary index.
**Parameters**

- (optional) defaults to `"vector"`.

## ExtractionResult
Returned by all pipeline methods. All fields are read-only.
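Downstream code typically filters on `is_valid` and aggregates `token_usage`. A self-contained sketch, with plain dicts standing in for `ExtractionResult` objects:

```python
# Plain dicts standing in for ExtractionResult instances (hypothetical data).
results = [
    {"is_valid": True, "structured_output": {"total": 42},
     "token_usage": {"prompt_tokens": 900, "completion_tokens": 100, "total_tokens": 1000}},
    {"is_valid": False, "structured_output": {},
     "token_usage": {"prompt_tokens": 500, "completion_tokens": 50, "total_tokens": 550}},
]

# Only trust structured_output when schema validation passed.
valid_outputs = [r["structured_output"] for r in results if r["is_valid"]]

# Total tokens spent across the whole batch, valid or not.
total_tokens = sum(r["token_usage"]["total_tokens"] for r in results)
```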
```python
class ExtractionResult(BaseModel):
    result_id: str                  # unique UUID
    document_id: str                # source document UUID
    chunk_id: str | None            # chunk UUID if chunked
    source_format: DocumentFormat
    task: TaskType
    model_used: str                 # "gpt-4o", "llama-3.1-8b", etc.
    raw_output: str                 # raw LLM text response
    structured_output: dict | list  # your schema's data
    is_valid: bool                  # True if schema validation passed
    validation_errors: list[str]    # empty if is_valid=True
    latency_ms: float               # end-to-end latency
    token_usage: dict[str, int]     # prompt_tokens, completion_tokens, total_tokens
    created_at: datetime
```

## ParsedDocument
Output of the ingestion stage, input to preprocessing.
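Most `DocumentMetadata` fields can be derived directly from the source path. A self-contained sketch using only the standard library — the field names follow the model below, while the `"pdf"` format value is a hypothetical placeholder for the real `DocumentFormat` enum:

```python
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Create a throwaway file to stand in for an ingested document.
tmp = Path(tempfile.mkdtemp()) / "report.pdf"
tmp.write_bytes(b"%PDF-1.7 stub")

metadata = {
    "document_id": uuid.uuid4().hex,          # uuid hex, auto-generated
    "filename": tmp.name,
    "source": "local",
    "file_extension": tmp.suffix,             # ".pdf"
    "format": "pdf",                          # placeholder for DocumentFormat
    "file_size_bytes": tmp.stat().st_size,
    "page_count": None,                       # unknown until parsing
    "ingested_at": datetime.now(timezone.utc),
    "extra": {},
}
```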
```python
class ParsedDocument(BaseModel):
    metadata: DocumentMetadata
    text: str           # full concatenated text
    pages: list[str]    # per-page text (empty for non-paginated formats)
    tables: list[dict]  # extracted tables
    images: list[str]   # base64-encoded images or paths

class DocumentMetadata(BaseModel):
    document_id: str    # uuid hex, auto-generated
    filename: str
    source: str         # "local", "s3://...", URL
    file_extension: str
    format: DocumentFormat
    file_size_bytes: int
    page_count: int | None
    ingested_at: datetime
    extra: dict[str, Any]
```

## TextChunk
Output of the preprocessing/chunking stage.
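A naive word-count chunker illustrates how `TextChunk` records are produced — a sketch only, with `chunk_text` as a hypothetical helper; the real chunking stage counts model tokens and tracks headings:

```python
import uuid

def chunk_text(text: str, document_id: str, max_tokens: int = 8) -> list[dict]:
    # Naive whitespace tokenization; the real pipeline counts model tokens.
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        piece = words[start:start + max_tokens]
        chunks.append({
            "text": " ".join(piece),
            "metadata": {
                "chunk_id": uuid.uuid4().hex,
                "document_id": document_id,
                "page_number": None,  # unknown in this sketch
                "heading": None,      # real chunker records the section heading
                "token_count": len(piece),
            },
        })
    return chunks

chunks = chunk_text("one two three four five six seven eight nine ten", "doc-1")
```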
```python
class TextChunk(BaseModel):
    text: str
    metadata: ChunkMetadata

class ChunkMetadata(BaseModel):
    chunk_id: str
    document_id: str
    source_format: DocumentFormat
    page_number: int | None
    heading: str | None  # section heading this chunk falls under
    token_count: int
```

## Settings
The settings object is a Pydantic `BaseSettings` model loaded from YAML plus environment variables.
Use `get_settings()` for the cached singleton or `load_settings()` for a fresh load.
```python
from structure_d.config import get_settings

settings = get_settings()
print(settings.inference.provider.provider)        # "vllm"
print(settings.preprocessing.chunking.max_tokens)  # 1024