BookWyrm Endpoint Overview

Build AI Workflows Without Pre-Processing Pain

BookWyrm is a Python SDK and API that provides a data pipeline for reliable AI workflows. Use the AI data preparation endpoints to automate complex document processing tasks such as extraction, semantic chunking, and citation grounding for agents, so you can focus on building agents and AI task automation that deliver real business value.

Extract PDF text with Python - no pre-processing needed

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Extract PDF text - no pre-processing needed
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        # Use extracted text directly in your AI workflow
        process_with_ai(response.page_data)
Endpoint Overview

A Flexible API for AI Workflows

We provide fully managed, production-ready API endpoints that solve specific, complex developer problems for building AI workflows right out of the box.

Start by extracting and processing text from documents, then use the endpoints to generate structured output or produce answers grounded in your data.

Extract from PDF/Excel/CSV

# CLI
bookwyrm extract-pdf document.pdf --output extracted.json

The Problem

Low-quality text extraction from native/image PDFs that breaks context and requires AI data cleaning.

Core Value

High-fidelity text output: Extract PDF text with Python and get clean, well-ordered text from any native or image-based PDF through unstructured data extraction. Includes crucial position information (e.g., page and coordinates) for advanced retrieval and UI highlighting.
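
A short sketch of consuming that output with the SDK's streaming extraction call (the same call appears in the pipeline examples below; here the page payload is simply handed to a placeholder handler, since the exact shape of page_data is not spelled out on this page):

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        # page_data carries the extracted text along with its position
        # information; hand it to your retrieval index or UI highlighter
        handle_page(response.page_data)  # placeholder for your own handler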

Text Processing (Chunking)

# CLI
bookwyrm phrasal --file document.txt --format with_offsets --output phrases.jsonl

The Problem

Generic, token-based splitting that breaks semantic meaning, leading to poor retrieval quality.

Core Value

Context-aware chunking: Semantic chunking for RAG - splits documents into meaningful, context-aware chunks and phrases instead of arbitrary token windows. Allows configurable sizing to fit your retrieval needs.
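
In the Python SDK this is the stream_process_text call used in the pipeline example further down; a minimal sketch (document_text and index_chunks are placeholders, and the attribute check is an assumption; the later example filters by the TextSpanResult type instead):

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Split already-extracted text into semantic, context-aware chunks
chunks = []
for response in client.stream_process_text(
    text=document_text,   # text produced by the extraction step
    chunk_size=2000,      # configurable sizing to fit your retrieval needs
    offsets=True          # keep character offsets for traceability
):
    if hasattr(response, 'text'):  # assumption: span results expose their text
        chunks.append(response)

index_chunks(chunks)  # placeholder for your own indexing step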

Structured Output (Summarization)

# CLI
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength smart \
  --output data/country-structured-summary.json \
  --verbose

The Problem

Obtaining reliable, accurate structured output for AI workflows and AI task automation takes time and resources.

Core Value

Generate user-specified JSON output from any document: Need to process invoices, receipts, vacation requests, or anything else you want to automate? Create a Pydantic model class and use it with the Summarize endpoint to get structured JSON from any document in a few lines of code.
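
As a sketch, reusing the stream_summarize call from the invoice example further down (the model and field names here are illustrative):

import json
from pydantic import BaseModel, Field
from bookwyrm import BookWyrmClient

class VacationRequest(BaseModel):
    employee: str = Field(description="Employee name")
    start_date: str = Field(description="First day of leave")
    end_date: str = Field(description="Last day of leave")

client = BookWyrmClient()

# document_text is the text extracted from the request document
for response in client.stream_summarize(
    content=document_text,
    model_name="VacationRequest",
    model_schema_json=json.dumps(VacationRequest.model_json_schema()),
):
    if hasattr(response, 'summary'):
        request = VacationRequest.model_validate(json.loads(response.summary))
        break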

Citation (Deep Reader)

# CLI
bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl

The Problem

AI hallucination, lack of trust, and an inability to trace the source of an answer.

Core Value

Source-grounded AI & traceability: Ask a question against your chunks and get back citations with the original reasoning context. Ground your RAG answers in the source material for total control and trust.
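
In the SDK this maps to the stream_citations call used in the knowledge-agent example below; a minimal sketch (chunks are the spans produced by the chunking step):

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

citations = []
for response in client.stream_citations(
    chunks=chunks,                       # output of the chunking step
    question="What is the main theme?",
):
    if hasattr(response, 'citation'):
        citations.append(response.citation)

# Each citation carries the source text span, a quality score,
# and the model's reasoning for why it is relevant
for citation in citations:
    print(citation.text, citation.quality, citation.reasoning)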

Summarization

# CLI
bookwyrm summarize phrases.jsonl --output summary.json

The Problem

Feeding large, noisy blocks of text into agents or models, consuming unnecessary tokens and budget.

Core Value

Concise, embeddable text: Collapse long or noisy text into concise summaries that are easier to embed, search, or feed into agents for efficient context injection.

Classify

# CLI
bookwyrm classify --file document.pdf --output results.json

The Problem

Difficulty routing documents correctly for different agentic pipelines, indexing, or processing logic.

Core Value

Intelligent routing: Classify documents or raw text by format, type, and structure for efficient downstream routing and indexing.
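
The CLI call above is the documented entry point; a hypothetical Python sketch of the same idea (the stream_classify method name and the response attribute are assumptions, not confirmed SDK API):

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

with open("document.pdf", "rb") as f:
    doc_bytes = f.read()

# Hypothetical: method name and response field are assumptions
for response in client.stream_classify(content=doc_bytes):
    if hasattr(response, 'classification'):
        # Route to the right agentic pipeline or index based on the label
        route_document(response.classification)  # placeholder routing logic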

Streaming

# CLI
bookwyrm cite "AI applications" chunks.jsonl --stream -v

The Problem

Dealing with long processing times in live applications, which degrades user experience.

Core Value

Real-time progress: Real-time processing with progress updates for all major operations, ideal for live applications where you need to surface results quickly.
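
Because every stream_* call yields intermediate responses, you can surface progress in a live application as results arrive; a sketch using the citation stream (the 'progress' attribute used to distinguish progress updates from results is an assumption):

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Surface progress and results as they arrive (chunks from the chunking step)
for response in client.stream_citations(chunks=chunks, question="AI applications"):
    if hasattr(response, 'progress'):            # assumption: progress updates
        update_progress_bar(response.progress)   # placeholder UI hook
    elif hasattr(response, 'citation'):
        render_citation(response.citation)       # placeholder UI hook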

Each endpoint is standalone, so you can slot BookWyrm into an existing AI data pipeline or stitch multiple pieces together for your AI workflows.

These capabilities work together to provide a complete AI data pipeline for document ingestion, AI data processing, and retrieval - the foundation of any RAG system and AI workflows.

Want to see BookWyrm in action?

The easiest way is to join our Discord server and ask for a demo. One of the team can then join you in a voice channel, show you BookWyrm's endpoints in action, and answer any questions you may have.

Code Examples

From Raw Documents to Intelligent Applications

Turn unstructured data into actionable intelligence with BookWyrm's composable API pipeline for back office automation.

Start by Processing your Documents

1. Extract PDF Text with Python

Start with unstructured data extraction from PDFs, documents, or web pages:

from bookwyrm import BookWyrmClient

client = BookWyrmClient()

# Extract PDF text with Python - unstructured data extraction
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)

# Send extracted pages directly to phrasal analysis
process_with_phrasal_analysis(pages)

2. Semantic Chunking for RAG

Process the extracted text into semantically meaningful, context-aware chunks for AI consumption:

# Context-aware chunking for RAG
# full_text is the combined text of the pages extracted in step 1;
# TextSpanResult is the SDK's span response model (assumed imported)
chunks = []
for response in client.stream_process_text(
    text=full_text,
    chunk_size=2000,  # Optimal for most LLMs
    offsets=True
):
    if isinstance(response, TextSpanResult):
        chunks.append(response)

# Result: List of TextSpanResult with semantic boundaries
# Each chunk contains complete phrases/sentences up to size limit

3. Then Build Intelligence on Top

Invoice Processing & Workflow Automation

Extract structured data from invoices for automated processing:

from pydantic import BaseModel, Field
import json

class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    vendor_name: str = Field(description="Vendor or supplier name")
    total_amount: float = Field(description="Total invoice amount")
    due_date: str = Field(description="Payment due date")
    line_items: list[str] = Field(description="List of items or services")

# Convert invoice text to structured data
model_schema = json.dumps(Invoice.model_json_schema())

for response in client.stream_summarize(
    content=full_text,
    model_name="Invoice",
    model_schema_json=model_schema
):
    if hasattr(response, 'summary'):
        invoice_data = Invoice.model_validate(json.loads(response.summary))
        # → Route to accounting system, trigger approval workflow
        break

Factual, Citable Knowledge Agent

Build source-grounded AI agents that cite their sources for every claim:

# User asks a question
user_question = "What were the Q3 revenue figures?"

# Find relevant citations from your document chunks
citations = []
for response in client.stream_citations(
    chunks=chunks,
    question=user_question
):
    if hasattr(response, 'citation'):
        citations.append(response.citation)

# Citations include:
# - Exact text span from source documents
# - Quality score (0-4) for relevance
# - Reasoning for why it's relevant
# - Chunk indices for traceability

# Build source-grounded AI response with grounded facts
for citation in sorted(citations, key=lambda c: c.quality, reverse=True)[:3]:
    print(f"Source: {citation.text}")
    print(f"Relevance: {citation.reasoning}
")

Data Enrichment & Enhancement

Define a Pydantic model to create structured JSON output from your documents:

class EnrichedProduct(BaseModel):
    name: str = Field(description="Product name")
    category: str = Field(description="Product category")
    sentiment: str = Field(description="Customer sentiment (positive/neutral/negative)")
    key_features: list[str] = Field(description="Main product features")
    target_audience: str = Field(description="Likely target customer segment")
    price_positioning: str = Field(description="Price positioning strategy")

# Transform customer reviews into enriched product intelligence
model_schema = json.dumps(EnrichedProduct.model_json_schema())

for response in client.stream_summarize(
    phrases=customer_review_chunks,  # From phrasal analysis
    model_name="EnrichedProduct",
    model_schema_json=model_schema
):
    if hasattr(response, 'summary'):
        enriched = EnrichedProduct.model_validate(json.loads(response.summary))
        # → Update product database with AI insights
        break

Why This Pipeline Works

  • Semantic Chunking for RAG: Context-aware chunking preserves meaning across document boundaries
  • Structured Output: Get consistent JSON for automation
  • Source Attribution: Every fact traces back to original text
  • Composable: Mix and match endpoints for your use case
Dev Experience

A Logical Developer Experience

Drop-in Integration

pip install bookwyrm
  • Composable & Modular: Each endpoint is standalone—use only what you need.
  • Slot BookWyrm into existing AI data pipelines or build end-to-end AI workflows.
  • API-first with native Python SDK (sync/async support).

Developer-Friendly Tooling

  • Rich CLI for quick testing and experimentation.
  • Clear request/response models with full type hints.
  • Async streaming support for all major operations.
  • Sane defaults that work out of the box.

Production-Ready from Day One

  • Move from prototype to production without rewriting.
  • Enterprise-grade throughput and reliability.
  • Built-in error handling and retry logic.
  • Streaming progress updates for long-running tasks.
Dev Help

Let's Co-Design Your First Agentic Pipeline. (For Free.)

We are looking for startup and small enterprise builders who need an AI strategy. We have deep expertise in building real agentic pipelines. To help you bootstrap, we're offering to build some of the elements you need that we don't currently have, for free.

This isn't a sales pitch. If you're serious about BookWyrm, we want to help you succeed. Typically, we'd start with a hands-on technical workshop where we will:

  • Help you map out a high-impact workflow for your business (RAG, citation extraction, enrichment, or something new).
  • Help you solve a specific data problem by advancing our tech to fit your needs.
  • Show you how BookWyrm's flexible pipeline can take your workflow from whiteboard to production, faster.

Security & Privacy

BookWyrm never stores or inspects your data. All processing happens in-memory during the request lifecycle, and results are returned directly to you. We don't log your documents, questions, or responses. Your data stays private, and only you control it.

BookWyrm Delivers Your Agentic Workflow Strategy.

Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.