Developer Documentation

Build Back Office Automation That Delivers Real Value

According to McKinsey's State of AI report, workflow redesign has the biggest impact on EBIT. Learn how to build reliable, scalable back office automation pipelines with BookWyrm's data pipeline API.

Transform unstructured documents (invoices, contracts, forms, and reports) into high-fidelity, AI-ready data for automated back office workflows you can trust.

Why Back Office Automation Matters

McKinsey's State of AI report reveals a critical insight: "The redesign of workflows has the biggest effect on an organization's ability to see EBIT impact from its use of gen AI."

Yet 80% of organizations aren't seeing tangible bottom-line value from their AI investments. For back office operations—invoicing, document processing, data entry, and compliance—the gap between success and failure comes down to one thing: the data pipeline.

The Problem

Inaccuracy is one of the top-three gen-AI-related risks. You cannot build a reliable, redesigned back office workflow on a foundation of unpredictable, inaccurate data. If the data is messy, the workflow is chaotic—and back office errors are costly.

The Solution

High-performing organizations build a central, governable, reusable data pipeline that all their AI models and agents can plug into. This is the fundamental change that allows them to scale back office automation.

  • Mitigate Risk: Solve inaccuracy at the source, before it reaches the AI model—critical for financial and compliance workflows
  • Measure What Matters: Track well-defined KPIs for gen AI solutions (the single adoption practice with the most impact on bottom line)
  • Build Trust: Reliable outputs build trust among employees and customers, especially in back office operations where accuracy is paramount

Read about a practical way to implement workflow automation in your business.

The Agentic Pipeline for Back Office Operations

An AI-driven back office workflow is, at its core, an agentic pipeline. It needs to ingest data (messy PDFs, invoices, contracts, emails, or Excel spreadsheets), understand it, and act on it reliably. Success requires a pipeline that can take any unstructured, messy document and transform it into high-fidelity, AI-ready data for automated processing.

Need help? Explore our consulting services or learn more about BookWyrm.

The BookWyrm Pipeline Architecture

BookWyrm provides a composable API pipeline that transforms unstructured back office documents into intelligent, actionable data for your automated workflows.

1. Extract

Extract structured text from PDFs, documents, CSVs, and web pages. Handles complex layouts, scanned documents, and multi-page files—perfect for invoices, contracts, and forms.

2. Process

Create semantic chunks with phrasal analysis. Preserve meaning across document boundaries with optimal token sizing for LLMs.
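Conceptually, the Process step turns raw text into bounded-size chunks that keep character offsets back into the source, so later steps can cite exactly where a fact came from. The sketch below is not BookWyrm's phrasal analysis; it is a plain-Python illustration (with a naive sentence splitter) of the output shape: chunk text plus real offsets.

```python
# Simplified illustration of the Process step's output shape: chunks of
# bounded size with (start, end) offsets into the original text. This is
# NOT BookWyrm's phrasal analysis, just a naive sentence-based sketch.
import re

def naive_chunks(text: str, chunk_size: int = 2000) -> list[dict]:
    """Pack sentences into chunks under chunk_size characters, keeping
    real (start, end) character offsets into the original text."""
    chunks = []
    cursor = 0
    buf_start = buf_end = None
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if not sent:
            continue
        start = text.find(sent, cursor)  # locate the sentence in the source
        end = start + len(sent)
        cursor = end
        if buf_start is None:
            buf_start, buf_end = start, end
        elif end - buf_start > chunk_size:
            # Current chunk is full: emit it and start a new one
            chunks.append({"text": text[buf_start:buf_end],
                           "start": buf_start, "end": buf_end})
            buf_start, buf_end = start, end
        else:
            buf_end = end
    if buf_start is not None:
        chunks.append({"text": text[buf_start:buf_end],
                       "start": buf_start, "end": buf_end})
    return chunks
```

Because every chunk carries offsets, any downstream answer can be traced back to the exact span of the source document, which is what makes source attribution possible.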

3. Structure

Transform chunks into structured JSON using Pydantic models. Generate consistent, validated output for automation.
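A minimal sketch of this pattern, assuming Pydantic v2 (`PurchaseOrder` is a hypothetical model for illustration, not part of the SDK): the model's JSON Schema is what you hand to the API as the target shape, and `model_validate` turns the returned JSON back into a typed object.

```python
# Sketch of the Structure step's contract, assuming Pydantic v2.
# PurchaseOrder is a hypothetical example model, not an SDK type.
from pydantic import BaseModel, Field
import json

class PurchaseOrder(BaseModel):
    po_number: str = Field(description="Purchase order identifier")
    total: float = Field(description="Order total")

# The JSON Schema string is what you would pass as the model schema
schema = json.dumps(PurchaseOrder.model_json_schema())

# Validation turns the API's JSON output back into a typed object
po = PurchaseOrder.model_validate({"po_number": "PO-7", "total": 99.5})
```

If the returned JSON is missing a field or has the wrong type, validation raises immediately, which is what keeps malformed output out of your downstream automation.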

4. Automate

Deploy back office workflows with citations, reasoning, and traceability. Build reliable automation you can measure and trust for invoicing, document processing, and data entry.
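One way to sketch the traceability idea in plain Python: every automated action carries the evidence (source spans, quality score, reasoning) that justified it, so decisions can be audited or gated on citation quality. The field names below are illustrative, not the SDK's `CitationResult` schema.

```python
# Sketch of the traceability pattern: each automated decision carries
# the citations that justified it. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Citation:
    text: str       # the supporting span from the source document
    start: int      # character offset into the extracted text
    end: int
    quality: float  # confidence score for the match

@dataclass
class RoutingDecision:
    action: str
    reasoning: str
    citations: list = field(default_factory=list)

    def is_auditable(self) -> bool:
        # Only act on decisions backed by at least one high-quality citation
        return any(c.quality >= 0.8 for c in self.citations)

decision = RoutingDecision(
    action="route_to_accounting",
    reasoning="Invoice total exceeds auto-approval threshold",
    citations=[Citation("Total: $15,420.50", 812, 830, 0.95)],
)
```

Gating actions on `is_auditable()` means low-confidence extractions fall back to human review instead of flowing silently into financial systems.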

Key Technical Benefits

  • Composable API: Mix and match endpoints for your specific back office use case
  • Streaming Responses: Process large documents efficiently with streaming endpoints
  • Source Attribution: Every fact traces back to original text with quality scores
  • Type Safety: Pydantic model validation ensures consistent structured output

BookWyrm API Code Examples

BookWyrm provides a Python SDK with streaming endpoints for efficient document processing. Each endpoint is designed to be composable, allowing you to build custom back office automation workflows for invoicing, document processing, data entry, and compliance.

Extract structured text from PDF documents

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Extract structured text from PDFs
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)

# pages now contains structured text blocks with coordinates
# and confidence scores for each page

Full API Documentation

For complete API reference, endpoint details, and additional examples, see the BookWyrm Client documentation.

Complete Workflow Example: Invoice Processing

This example demonstrates a complete end-to-end back office automation workflow for processing invoices. The pipeline extracts data from PDF invoices, processes it into semantic chunks, and transforms it into structured JSON for automated workflow routing.

Watch the video below to see this workflow in action, then review the code to understand the implementation.

Front-End Invoice Processing Demo

See how BookWyrm processes invoices in a real application interface.

The PDF Structured Output Server that powers this demonstration is a working example of the structured output pipeline and is available to clone from GitHub.

Complete Workflow

End-to-End Pipeline

This code shows the complete workflow from PDF extraction to structured output and workflow routing.

from bookwyrm import BookWyrmClient
from bookwyrm.models import SummaryResponse, ModelStrength, TextSpanResult
from pydantic import BaseModel, Field
import json

# Define your data model
class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    vendor_name: str = Field(description="Vendor or supplier name")
    total_amount: float = Field(description="Total invoice amount")
    due_date: str = Field(description="Payment due date")
    line_items: list[str] = Field(description="List of items or services")

# Initialize client
client = BookWyrmClient()

# Step 1: Extract PDF
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)

# Step 2: Process into chunks
full_text = "\n".join([page.text for page in pages])
chunks = []
for response in client.stream_process_text(
    text=full_text,
    chunk_size=2000,
    offsets=True
):
    if isinstance(response, TextSpanResult):
        chunks.append(response)

# Step 3: Summarize into structured data
model_schema = json.dumps(Invoice.model_json_schema())
final_summary: SummaryResponse | None = None

for response in client.stream_summarize(
    content=full_text,
    model_name="Invoice",
    model_schema_json=model_schema,
    model_strength=ModelStrength.SMART
):
    if hasattr(response, 'summary'):
        final_summary = response
        break

# Step 4: Validate and use
if final_summary:
    invoice_data = Invoice.model_validate(
        json.loads(final_summary.summary)
    )
    
    # Route to accounting system (route_to_accounting and
    # trigger_approval_workflow are your own integration hooks)
    route_to_accounting(invoice_data)
    
    # Trigger approval workflow
    if invoice_data.total_amount > 10000:
        trigger_approval_workflow(invoice_data)

Expected Output

The pipeline produces validated JSON that matches your Pydantic model, ready for integration with your workflow systems.

{
  "invoice_number": "INV-2024-001",
  "vendor_name": "Acme Corporation",
  "total_amount": 15420.50,
  "due_date": "2024-12-15",
  "line_items": [
    "Software License - Annual",
    "Support Services - Q4",
    "Training Sessions"
  ]
}
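As a sketch of the downstream step, the validated output above can be consumed with plain Python and Pydantic v2; the approval threshold mirrors the workflow code earlier on this page.

```python
# Validate the pipeline's JSON output and derive a routing decision.
# Assumes Pydantic v2; the Invoice model matches the workflow above.
from pydantic import BaseModel, Field
import json

class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    vendor_name: str = Field(description="Vendor or supplier name")
    total_amount: float = Field(description="Total invoice amount")
    due_date: str = Field(description="Payment due date")
    line_items: list[str] = Field(description="List of items or services")

raw = """{
  "invoice_number": "INV-2024-001",
  "vendor_name": "Acme Corporation",
  "total_amount": 15420.50,
  "due_date": "2024-12-15",
  "line_items": ["Software License - Annual", "Support Services - Q4",
                 "Training Sessions"]
}"""

invoice = Invoice.model_validate(json.loads(raw))
needs_approval = invoice.total_amount > 10000  # mirrors the workflow's threshold
```

Because the data is typed at this point, routing logic is ordinary application code rather than string parsing.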

Why This Workflow Works

Reliable Extraction

Handles various invoice formats, scanned documents, and complex layouts with confidence scores.

Semantic Processing

Preserves meaning across document boundaries, ensuring accurate data extraction.

Type Safety

Pydantic validation ensures consistent, structured output that integrates seamlessly with your systems.

Measurable Results

Track processing accuracy, workflow routing success, and ROI with reliable, consistent data.

API Reference

BookWyrm provides streaming endpoints for efficient document processing. All endpoints support streaming responses for handling large documents.

1. stream_extract_pdf

Extract structured text from PDF documents.

Parameters: pdf_bytes, start_page, num_pages
Returns: stream of PageData objects with text blocks and coordinates

2. stream_process_text

Process text into semantic chunks with phrasal analysis.

Parameters: text, chunk_size, offsets
Returns: stream of TextSpanResult objects

3. stream_summarize

Transform content into structured JSON using Pydantic models.

Parameters: content, model_name, model_schema_json, model_strength
Returns: stream of SummaryResponse with validated JSON

4. stream_citations

Find relevant citations with quality scores and reasoning.

Parameters: chunks, question
Returns: stream of CitationResult objects

Complete API Documentation

For detailed parameter descriptions, response schemas, error handling, and additional examples, see the complete BookWyrm Client documentation.

View Full Documentation

Python SDK

Full-featured Python client with type hints and streaming support

Streaming API

All endpoints support streaming for efficient processing of large documents

Examples

Comprehensive examples for common use cases and workflows

BookWyrm Delivers Your Agentic Workflow Strategy

Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.