BookWyrm is a Python SDK & API that provides a data pipeline for reliable AI workflows. Use the AI data preparation endpoints to automate complex document processing tasks like text extraction, semantic chunking, and citation grounding for agents, leaving you free to focus on building agents and AI task automation that deliver real business value.
Extract PDF text with Python - no pre-processing needed
from bookwyrm import BookWyrmClient
client = BookWyrmClient()

# Extract PDF text - no pre-processing needed
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        # Use extracted text directly in your AI workflow
        process_with_ai(response.page_data)
We provide fully managed, production-ready API endpoints that solve specific, complex developer problems for building AI workflows right out of the box.
Start by extracting and processing text from documents, then use the endpoints to generate structured output or produce answers grounded in your data.
Low-quality text extraction from native/image PDFs that breaks context and requires AI data cleaning.
Core Value
High-fidelity text output: extract clean, accurate text from any PDF with Python, native or image-based, with no pre-cleaning required. Includes crucial position information (e.g., page and coordinates) for advanced retrieval and UI highlighting.
Generic, token-based splitting that breaks semantic meaning, leading to poor retrieval quality.
Core Value
Context-aware chunking: semantic chunking for RAG splits documents into meaningful, context-aware chunks and phrases instead of arbitrary token windows, with configurable sizing to fit your retrieval needs.
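A minimal sketch using the stream_process_text endpoint shown in the walkthrough further down; the client, the document_text variable, and the TextSpanResult response type are assumed to be in scope:

# Split extracted text into semantic chunks instead of fixed token windows
chunks = []
for response in client.stream_process_text(
    text=document_text,   # text produced by the extraction step (assumed variable)
    chunk_size=1000,      # configurable: tune the maximum chunk size to your retrieval needs
    offsets=True          # include character offsets for each span
):
    if isinstance(response, TextSpanResult):  # response type provided by the SDK
        chunks.append(response)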
Obtaining reliable, accurate structured output for AI workflows and AI task automation takes time and resources.
Core Value
Generate user-specified JSON output from any document: whether you need to process invoices, receipts, or vacation requests, define a Pydantic model class and use it with the Summarize endpoint to get structured JSON from any document in a few lines of code.
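As a brief sketch of the pattern (the Receipt model and receipt_text variable are purely illustrative; the invoice walkthrough below shows the same flow end to end):

from pydantic import BaseModel, Field
import json

class Receipt(BaseModel):  # illustrative model - use whatever fields your workflow needs
    merchant: str = Field(description="Merchant name")
    total: float = Field(description="Total amount paid")

schema = json.dumps(Receipt.model_json_schema())
for response in client.stream_summarize(
    content=receipt_text,          # extracted document text (assumed variable)
    model_name="Receipt",
    model_schema_json=schema
):
    if hasattr(response, 'summary'):
        receipt = Receipt.model_validate(json.loads(response.summary))
        break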
Citation (Deep Reader)
# CLI
bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl
The Problem
AI hallucination, lack of trust, and an inability to trace the source of an answer.
Core Value
Source-grounded AI & traceability: ask a question against your chunks and get back citations with the original reasoning context. Ground your RAG answers in the source material for total control and trust.
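The same call from Python, as a minimal sketch assuming chunks holds the output of the chunking step:

# Ask a question against your chunks and collect source-grounded citations
for response in client.stream_citations(chunks=chunks, question="What is the main theme?"):
    if hasattr(response, 'citation'):
        citation = response.citation
        # Each citation carries the source text, a quality score, and the model's reasoning
        print(citation.quality, citation.text, citation.reasoning)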
Feeding large, noisy blocks of text into agents or models, consuming unnecessary tokens and budget.
Core Value
Concise, embeddable text: Collapse long or noisy text into concise summaries that are easier to embed, search, or feed into agents for efficient context injection.
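A rough sketch of collapsing a noisy chunk before embedding it; note that calling the Summarize endpoint without a model schema is an assumption here (the structured-output variant appears in the invoice example below), and long_noisy_text is a placeholder variable:

summary_text = None
for response in client.stream_summarize(content=long_noisy_text):  # schema-free call is assumed
    if hasattr(response, 'summary'):
        summary_text = response.summary  # concise text, cheaper to embed or inject into an agent
        break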
Dealing with long processing times in live applications, which degrades user experience.
Core Value
Real-time progress: streaming progress updates for all major operations, ideal for live applications where you need to surface results quickly.
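As a sketch, the streaming responses let you surface progress while extraction runs; the message attribute and the two helper functions below are hypothetical placeholders, not part of the documented API:

for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        handle_page(response.page_data)         # extracted content, ready for downstream steps
    elif hasattr(response, 'message'):          # hypothetical progress-update field
        update_progress_bar(response.message)   # e.g. surface "processing page 3 of 20" in the UI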
Each endpoint is standalone, so you can slot BookWyrm into an existing AI data pipeline or stitch multiple pieces together for your AI workflows.
These capabilities work together to provide a complete AI data pipeline for document ingestion, AI data processing, and retrieval - the foundation of any RAG system and AI workflows.
Want to see BookWyrm in action?
The easiest way is to join our Discord server and ask for a demo. One of the team can then join you in a voice channel, show you BookWyrm's endpoints in action, and answer any questions you may have.
Transform unstructured data into actionable intelligence with BookWyrm's composable API pipeline for back-office automation.
Start by Processing your Documents
1. Extract PDF Text with Python
Start with unstructured data extraction from PDFs, documents, or web pages:
from bookwyrm import BookWyrmClient
client = BookWyrmClient()

# Extract PDF text with Python - unstructured data extraction
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)

# Send extracted pages directly to phrasal analysis
process_with_phrasal_analysis(pages)
2. Semantic Chunking for RAG
Process the extracted text into semantically meaningful, context-aware chunks for AI consumption:
# Context-aware semantic chunking for RAG
chunks = []
for response in client.stream_process_text(
    text=full_text,
    chunk_size=2000,  # Optimal for most LLMs
    offsets=True
):
    if isinstance(response, TextSpanResult):
        chunks.append(response)

# Result: List of TextSpanResult with semantic boundaries
# Each chunk contains complete phrases/sentences up to size limit
3. Then Build Intelligence on Top
Invoice Processing & Workflow Automation
Extract structured data from invoices for automated processing:
from pydantic import BaseModel, Field
import json
class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    vendor_name: str = Field(description="Vendor or supplier name")
    total_amount: float = Field(description="Total invoice amount")
    due_date: str = Field(description="Payment due date")
    line_items: list[str] = Field(description="List of items or services")
# Convert invoice text to structured data
model_schema = json.dumps(Invoice.model_json_schema())

for response in client.stream_summarize(
    content=full_text,
    model_name="Invoice",
    model_schema_json=model_schema
):
    if hasattr(response, 'summary'):
        invoice_data = Invoice.model_validate(json.loads(response.summary))
        # → Route to accounting system, trigger approval workflow
        break
Factual, Citable Knowledge Agent
Build source-grounded AI agents that cite their sources for every claim:
# User asks a question
user_question = "What were the Q3 revenue figures?"

# Find relevant citations from your document chunks
citations = []
for response in client.stream_citations(
    chunks=chunks,
    question=user_question
):
    if hasattr(response, 'citation'):
        citations.append(response.citation)

# Citations include:
# - Exact text span from source documents
# - Quality score (0-4) for relevance
# - Reasoning for why it's relevant
# - Chunk indices for traceability

# Build source-grounded AI response with grounded facts
for citation in sorted(citations, key=lambda c: c.quality, reverse=True)[:3]:
    print(f"Source: {citation.text}")
    print(f"Relevance: {citation.reasoning}")
Data Enrichment & Enhancement
Define a Pydantic model to create structured JSON output from your documents:
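A sketch of the enrichment pattern, reusing the Summarize endpoint from the invoice example; the CompanyProfile model and its fields are illustrative, not a prescribed schema:

from pydantic import BaseModel, Field
import json

class CompanyProfile(BaseModel):  # illustrative enrichment model
    company_name: str = Field(description="Registered company name")
    industry: str = Field(description="Primary industry or sector")
    key_contacts: list[str] = Field(description="Named contacts mentioned in the document")

schema = json.dumps(CompanyProfile.model_json_schema())
for response in client.stream_summarize(
    content=full_text,                # text extracted from the source document
    model_name="CompanyProfile",
    model_schema_json=schema
):
    if hasattr(response, 'summary'):
        profile = CompanyProfile.model_validate(json.loads(response.summary))
        # → Merge into your CRM or data warehouse record
        break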
Let's Co-Design Your First Agentic Pipeline. (For Free.)
We are looking for startup and small-enterprise builders who need an AI strategy. We have deep expertise in building real agentic pipelines. To help you bootstrap, we're offering to build, for free, some of the elements you need that we don't yet offer.
This isn't a sales pitch. If you're serious about BookWyrm, we want to help you succeed. Typically, we'd start with a hands-on technical workshop where we will:
Help you map out a high-impact workflow for your business (RAG, citation extraction, enrichment, or something new).
Help you solve a specific data problem by advancing our tech to fit your needs.
Show you how BookWyrm's flexible pipeline can take your workflow from whiteboard to production, faster.
BookWyrm never stores or inspects your data. All processing happens in-memory during the request lifecycle, and results are returned directly to you. We don't log your documents, questions, or responses. Your data stays private, and only you control it.
BookWyrm Delivers Your Agentic Workflow Strategy.
Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.