RAG for Unstructured Data

Build agents you can trust without pre-processing pain

BookWyrm is a Python SDK & API that automates complex tasks like phrasal chunking, summarization, citation, and classification.

Wish your job wasn't filled with wrangling text from PDFs? Let BookWyrm do it.

Do you want to reduce agent hallucinations and increase user trust? Use BookWyrm's cite endpoint to answer questions with citations to the source context and the reasoning behind each one.

Join the Beta

Offer:

€20 in free credits for beta sign-ups, plus a €100 top-up for the first 100 sign-ups.

The €100 top-up offer expires November 1, 2025.

BookWyrm Features

Endpoints that enable rapid AI agent deployment

We provide fully managed, production-ready endpoints that solve specific, complex developer problems right out of the box.

Feature endpoint:

Extract from PDF


bookwyrm extract-pdf document.pdf --output extracted.json

The problem:


Low-quality text extraction from native/image PDFs that breaks context and requires manual cleanup.

Core value:


High-fidelity text output: Extracts clean text from native and image PDFs. Includes position information (e.g., page and coordinates) for advanced retrieval and UI highlighting.

Feature endpoint:

Citation (Deep Reader)


bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl

The problem:


AI hallucination, lack of trust, and an inability to trace the source of an answer.

Core value:


Hallucination control & traceability: Ask a question against your chunks and get back citations, each with the quoted source text and the reasoning for why it is relevant. Ground your RAG answers in the source material so every claim can be traced.

Feature endpoint:

Text Processing (Chunking)


bookwyrm phrasal --file document.txt --format with_offsets --output phrases.jsonl

The problem:


Generic, token-based splitting that breaks semantic meaning, leading to poor retrieval quality.

Core value:


Semantically balanced phrasing: Splits documents into meaningful, context-aware chunks and phrases instead of arbitrary token windows. Chunk size is configurable to fit your retrieval needs.
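To make the contrast concrete, here is a minimal Python sketch of the two splitting strategies. It is an illustration only and not BookWyrm's algorithm: the function names are hypothetical, the "phrase" boundary is approximated with simple sentence punctuation, and the size limit is counted in characters rather than tokens.

```python
import re


def fixed_windows(text, size):
    """Naive baseline: cut every `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text, max_chars):
    """Group whole sentences into chunks of roughly `max_chars` characters.

    Returns (start, end, chunk_text) tuples so that each chunk can be
    traced back to its character offsets in the source text. A single
    sentence longer than `max_chars` is kept whole rather than cut.
    """
    # Approximate a sentence as text up to ., ! or ? plus trailing space.
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]
    chunks = []
    start = end = None
    for s, e in spans:
        if start is None:
            start, end = s, e
        elif e - start <= max_chars:
            end = e  # sentence still fits: extend the current chunk
        else:
            chunks.append((start, end, text[start:end].strip()))
            start, end = s, e
    if start is not None:
        chunks.append((start, end, text[start:end].strip()))
    return chunks


sample = "The cat sat. The dog barked loudly! Did anyone hear it? Nobody did."
for start, end, chunk in sentence_chunks(sample, max_chars=40):
    print(start, end, repr(chunk))
```

The fixed-window version will happily cut a word in half, while the sentence-aware version only breaks at sentence ends and keeps offsets for later highlighting.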

Feature endpoint:

Summarization


bookwyrm summarize phrases.jsonl --output summary.json

The problem:


Feeding large, noisy blocks of text into agents or models, consuming unnecessary tokens and budget.

Core value:


Concise, embeddable text: Collapse long or noisy text into concise summaries that are easier to embed, search, or feed into agents for efficient context injection.

Feature endpoint:

Categorization


bookwyrm classify --file document.pdf --output results.json

The problem:


Difficulty routing documents correctly for different RAG pipelines, indexing, or processing logic.

Core value:


Intelligent routing: Classify documents or raw text by format, type, and structure so each one reaches the right downstream pipeline and index.
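Once a document has a classification label, routing can be as simple as a lookup table. The sketch below assumes the label arrives as a plain string; the label values ("pdf", "source_code", "prose") and the handler functions are hypothetical placeholders, not BookWyrm's actual output.

```python
def index_prose(path):
    """Hypothetical handler: send the document to a prose pipeline."""
    return f"prose-pipeline:{path}"


def index_code(path):
    """Hypothetical handler: send the document to a code pipeline."""
    return f"code-pipeline:{path}"


def extract_then_index(path):
    """Hypothetical handler: PDFs need text extraction first."""
    return f"pdf-pipeline:{path}"


def hold_for_review(path):
    """Fallback for labels we have no pipeline for."""
    return f"manual-review:{path}"


# Assumed label names, for illustration only.
ROUTES = {
    "prose": index_prose,
    "source_code": index_code,
    "pdf": extract_then_index,
}


def route_document(path, label):
    """Pick a handler by label, falling back to manual review."""
    handler = ROUTES.get(label, hold_for_review)
    return handler(path)


print(route_document("report.pdf", "pdf"))
print(route_document("notes.bin", "unknown"))
```

Keeping the table separate from the handlers means a new document type only needs one new entry.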

Feature endpoint:

Streaming


bookwyrm cite "AI applications" chunks.jsonl --stream -v

The problem:


Dealing with long processing times in live applications, which degrades user experience.

Core value:


Real-time progress: Streamed processing with progress updates for all major operations, ideal for live applications where you need to surface results quickly.

Each endpoint is standalone, so you can slot BookWyrm into an existing pipeline or stitch multiple pieces together.

These capabilities work together to provide a complete pipeline for document ingestion, processing, and retrieval: the foundation of any RAG system.

Pre-processing headaches eased with BookWyrm

API-first: Drop into your stack with a REST API or use the Python client (sync/async) with just a few lines of code.

Composable & modular: Use only what you need, whether that is a single endpoint inside an existing pipeline or several chained together.

Developer-friendly: Rich CLI, clear request/response models, async support, and sane defaults mean you get results in minutes, not days.

Production ready: Move directly from rapid prototyping to production with an API built for enterprise-grade throughput and reliability.

bookwyrm cite "who are the main antagonists?" data/dr_jekyll_and_mr_hyde.jsonl
...
  {
    "start_chunk": 40,
    "end_chunk": 40,
    "text": "Well, sir, the two ran into one another naturally enough at the corner; and then came the horrible part of the thing; for the man trampled calmly over the child’s body and left her screaming on the ground.",
    "reasoning": "This chunk directly identifies the man who trampled the child as the antagonist of the story.",
    "quality": 4
  }  
  ...
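The fields in the citation above (start_chunk, end_chunk, text, reasoning, quality) are enough to filter and display results in your own code. The following Python sketch assumes the citations have been collected into a JSON array and that a higher quality value means a stronger citation; neither assumption is confirmed by the output shown, and the function names are hypothetical.

```python
import json


def best_citations(raw_json, min_quality):
    """Keep citations at or above `min_quality`, strongest first.

    Assumes `raw_json` is a JSON array of citation objects and that
    a larger "quality" number is better.
    """
    citations = json.loads(raw_json)
    kept = [c for c in citations if c.get("quality", 0) >= min_quality]
    return sorted(kept, key=lambda c: c["quality"], reverse=True)


def format_citation(citation):
    """Render one citation as a single line for display."""
    first, last = citation["start_chunk"], citation["end_chunk"]
    span = f"chunk {first}" if first == last else f"chunks {first}-{last}"
    return f"[{span}] {citation['text']} ({citation['reasoning']})"


# Made-up sample data in the same shape as the output above.
raw = json.dumps([
    {"start_chunk": 3, "end_chunk": 4, "text": "First quote.",
     "reasoning": "Weak match.", "quality": 2},
    {"start_chunk": 40, "end_chunk": 40, "text": "Second quote.",
     "reasoning": "Direct match.", "quality": 4},
])

for citation in best_citations(raw, min_quality=3):
    print(format_citation(citation))
```

Showing the chunk range and reasoning next to each quote lets users check the answer against the source themselves.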

The Developer Experience

Install via pip install bookwyrm

Import and start chunking, summarizing, or citing in just a few lines of Python

Or explore interactively via the CLI for quick experiments

Flexible enough for rapid prototyping, robust enough for production
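If you prefer to script the CLI rather than call it by hand, a thin Python wrapper is one option. This sketch reuses the phrasal command exactly as shown in the features section and assumes that pip install bookwyrm places a bookwyrm executable on your PATH; the wrapper function names are hypothetical and this is not the SDK's own Python API.

```python
import shutil
import subprocess


def phrasal_command(source, output):
    """Build the argument list for the phrasal CLI command shown above."""
    return [
        "bookwyrm", "phrasal",
        "--file", source,
        "--format", "with_offsets",
        "--output", output,
    ]


def run_phrasal(source, output):
    """Run the command, failing loudly if the CLI is missing or errors."""
    if shutil.which("bookwyrm") is None:
        raise RuntimeError("bookwyrm CLI not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(phrasal_command(source, output), check=True)


print(" ".join(phrasal_command("document.txt", "phrases.jsonl")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.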

Security & Privacy

BookWyrm never stores or inspects your data. All processing happens in-memory during the request lifecycle, and results are returned directly to you. We don't log your documents, questions, or responses. Your data stays private, and only you control it.