Read more about workflow automation for business in our blog.
According to McKinsey's State of AI report, workflow redesign has the biggest impact on EBIT. Learn how to build reliable, scalable back office automation pipelines with BookWyrm's data pipeline API.
Transform unstructured documents (invoices, contracts, forms, and reports) into high-fidelity, AI-ready data for automated back office workflows you can trust.
McKinsey's State of AI report reveals a critical insight: "The redesign of workflows has the biggest effect on an organization's ability to see EBIT impact from its use of gen AI."
Yet 80% of organizations aren't seeing tangible bottom-line value from their AI investments. For back office operations—invoicing, document processing, data entry, and compliance—the gap between success and failure comes down to one thing: the data pipeline.
Inaccuracy is one of the top-three gen-AI-related risks. You cannot build a reliable, redesigned back office workflow on a foundation of unpredictable, inaccurate data. If the data is messy, the workflow is chaotic—and back office errors are costly.
High-performing organizations build a central, governable, reusable data pipeline that all their AI models and agents can plug into. This is the fundamental change that allows them to scale back office automation.
Read about a practical way to implement workflow automation in your business.
An AI-driven back office workflow is, at its core, an agentic pipeline. It needs to ingest data (messy PDFs, invoices, contracts, emails, or spreadsheets), understand it, and act on it reliably. Success requires a pipeline that can take any unstructured, messy document and instantly transform it into high-fidelity, AI-ready data for automated processing.
Need help? Explore our consulting services or learn more about BookWyrm.
BookWyrm provides a composable API pipeline that transforms unstructured back office documents into intelligent, actionable data for your automated workflows.
Extract structured text from PDFs, documents, CSVs, and web pages. Handles complex layouts, scanned documents, and multi-page files—perfect for invoices, contracts, and forms.
Create semantic chunks with phrasal analysis. Preserve meaning across document boundaries with optimal token sizing for LLMs.
Transform chunks into structured JSON using Pydantic models. Generate consistent, validated output for automation.
Deploy back office workflows with citations, reasoning, and traceability. Build reliable automation you can measure and trust for invoicing, document processing, and data entry.
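Conceptually, chunking groups text into pieces small enough for an LLM context window while keeping related sentences together. The naive sentence-based sketch below only illustrates the idea of sizing chunks; it is not BookWyrm's phrasal analysis, which goes further to preserve meaning across boundaries:

```python
# Naive illustration of chunking: group sentences into chunks under a
# target size. BookWyrm's phrasal analysis is more sophisticated; this
# only sketches the sizing idea.

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, on sentence boundaries."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding the sentence would exceed the limit
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A real pipeline would also track character offsets so each chunk can be traced back to its source page, which is what the SDK's `offsets=True` option provides.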
BookWyrm provides a Python SDK with streaming endpoints for efficient document processing. Each endpoint is designed to be composable, allowing you to build custom back office automation workflows for invoicing, document processing, data entry, and compliance.
Extract structured text from PDF documents
from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Extract structured text from PDFs
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, "page_data"):
        pages.append(response.page_data)

# pages now contains structured text blocks with coordinates
# and confidence scores for each page

For complete API reference, endpoint details, and additional examples, see the BookWyrm Client documentation.
This example demonstrates a complete end-to-end back office automation workflow for processing invoices. The pipeline extracts data from PDF invoices, processes it into semantic chunks, and transforms it into structured JSON for automated workflow routing.
Watch the video below to see this workflow in action, then review the code to understand the implementation.
See how BookWyrm processes invoices in a real application interface.
The PDF Structured Output Server that powers this demonstration is a working example of the structured output pipeline and is available to clone from GitHub.
This code shows the complete workflow from PDF extraction to structured output and workflow routing.
from bookwyrm import BookWyrmClient
from bookwyrm.models import SummaryResponse, ModelStrength, TextSpanResult
from pydantic import BaseModel, Field
import json

# Define your data model
class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    vendor_name: str = Field(description="Vendor or supplier name")
    total_amount: float = Field(description="Total invoice amount")
    due_date: str = Field(description="Payment due date")
    line_items: list[str] = Field(description="List of items or services")

# Initialize client
client = BookWyrmClient()

# Step 1: Extract PDF
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, "page_data"):
        pages.append(response.page_data)

# Step 2: Process into chunks
full_text = "\n".join(page.text for page in pages)

chunks = []
for response in client.stream_process_text(
    text=full_text,
    chunk_size=2000,
    offsets=True,
):
    if isinstance(response, TextSpanResult):
        chunks.append(response)

# Step 3: Summarize into structured data
model_schema = json.dumps(Invoice.model_json_schema())

final_summary: SummaryResponse | None = None
for response in client.stream_summarize(
    content=full_text,
    model_name="Invoice",
    model_schema_json=model_schema,
    model_strength=ModelStrength.SMART,
):
    if hasattr(response, "summary"):
        final_summary = response
        break

# Step 4: Validate and use
if final_summary:
    invoice_data = Invoice.model_validate(
        json.loads(final_summary.summary)
    )

    # Route to accounting system (route_to_accounting is your integration hook)
    route_to_accounting(invoice_data)

    # Trigger approval workflow (trigger_approval_workflow is your integration hook)
    if invoice_data.total_amount > 10000:
        trigger_approval_workflow(invoice_data)

The pipeline produces validated JSON that matches your Pydantic model, ready for integration with your workflow systems.
{
  "invoice_number": "INV-2024-001",
  "vendor_name": "Acme Corporation",
  "total_amount": 15420.50,
  "due_date": "2024-12-15",
  "line_items": [
    "Software License - Annual",
    "Support Services - Q4",
    "Training Sessions"
  ]
}

Handles various invoice formats, scanned documents, and complex layouts with confidence scores.
Preserves meaning across document boundaries, ensuring accurate data extraction.
Pydantic validation ensures consistent, structured output that integrates seamlessly with your systems.
Track processing accuracy, workflow routing success, and ROI with reliable, consistent data.
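The `route_to_accounting` and `trigger_approval_workflow` calls in the end-to-end example are integration points you supply, not part of the SDK. A minimal hypothetical sketch of the routing decision, assuming a simple threshold-based approval policy:

```python
# Hypothetical routing stub for the final workflow step. In production
# this would call your accounting system and approval tooling; here it
# just returns the routing decision.

APPROVAL_THRESHOLD = 10_000.0  # illustrative policy, not an SDK constant

def route_invoice(invoice: dict) -> str:
    """Decide where a validated invoice goes next."""
    if invoice["total_amount"] > APPROVAL_THRESHOLD:
        return "approval_queue"    # large invoices need sign-off
    return "accounting_system"     # small invoices post directly
```

Because the upstream data is validated against your Pydantic model, routing logic like this can trust field names and types instead of re-checking raw document text.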
BookWyrm provides streaming endpoints for efficient document processing. All endpoints support streaming responses for handling large documents.
Extract structured text from PDF documents. Returns a stream of PageData objects with text blocks and coordinates.
Process text into semantic chunks with phrasal analysis. Returns a stream of TextSpanResult objects.
Transform content into structured JSON using Pydantic models. Returns a stream of SummaryResponse objects with validated JSON.
Find relevant citations with quality scores and reasoning. Returns a stream of CitationResult objects.
For detailed parameter descriptions, response schemas, error handling, and additional examples, see the complete BookWyrm Client documentation.
View Full Documentation
Full-featured Python client with type hints and streaming support
All endpoints support streaming for efficient processing of large documents
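The streaming-consumption pattern used throughout the examples can be illustrated without the network: iterate over the stream and keep only the events that carry the payload you care about. The `fake_stream` generator below is a stand-in for a real endpoint such as `stream_extract_pdf`, and the event classes are simplified assumptions, not the SDK's actual models:

```python
# Illustration of the streaming-consumption pattern used by the SDK.
# fake_stream stands in for a real endpoint like stream_extract_pdf.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class PageEvent:
    page_data: str  # real PageData carries text blocks and coordinates

@dataclass
class ProgressEvent:
    percent: int

def fake_stream() -> Iterator[object]:
    yield ProgressEvent(percent=50)       # interleaved progress updates
    yield PageEvent(page_data="page 1 text")
    yield PageEvent(page_data="page 2 text")

# Same shape as the real loops: keep only events that carry page data
pages = [e.page_data for e in fake_stream() if hasattr(e, "page_data")]
```

This is why the examples filter with `hasattr` or `isinstance`: a stream can interleave progress and data events, and your loop picks out the ones it needs.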
Comprehensive examples for common use cases and workflows
Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.