Feature endpoint:
Extract from PDF
bookwyrm extract-pdf document.pdf --output extracted.json
The problem:
Low-quality text extraction from native and image-based PDFs, which breaks context and requires manual cleanup.
Core value:
High-fidelity text output: Extracts clean text from any PDF, native or image-based. Includes position information (e.g., page and coordinates) for retrieval and UI highlighting.
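As a rough illustration of how page and coordinate data can drive UI highlighting, the sketch below assumes a hypothetical record shape (`text`, `page`, `bbox`). The actual schema of `extracted.json` is not described here and may well differ, so treat the field names as placeholders.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    # Hypothetical shape: these field names are assumptions for illustration,
    # not the documented layout of extracted.json.
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def highlights_for(query: str, spans: list[TextSpan]) -> list[tuple[int, tuple[float, float, float, float]]]:
    """Return (page, bbox) pairs for every span whose text contains the query."""
    needle = query.casefold()
    return [(s.page, s.bbox) for s in spans if needle in s.text.casefold()]


spans = [
    TextSpan("Dragons hoard books.", page=1, bbox=(72.0, 700.0, 300.0, 712.0)),
    TextSpan("A wyrm reads slowly.", page=2, bbox=(72.0, 650.0, 280.0, 662.0)),
]
print(highlights_for("wyrm", spans))
```

A viewer could then draw one rectangle per returned pair on the matching page.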
Feature endpoint:
Citation (Deep Reader)
bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl
The problem:
AI hallucination, lack of trust, and an inability to trace the source of an answer.
Core value:
Hallucination control & traceability: Ask a question against your chunks and get back citations with the original reasoning context. Ground your RAG answers in the source material so every answer can be traced back to it and trusted.
Feature endpoint:
Text Processing (Chunking)
bookwyrm phrasal --file document.txt --format with_offsets --output phrases.jsonl
The problem:
Generic, token-based splitting that breaks semantic meaning, leading to poor retrieval quality.
Core value:
Semantically-balanced phrasing: Splits documents into meaningful, contextually-aware chunks and phrases instead of arbitrary token windows. Allows configurable sizing to fit your retrieval needs.
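To make the contrast concrete, here is a minimal sketch that compares fixed-size windows with a simple sentence-packing splitter. The sentence splitter is a stand-in chosen for illustration only; it is not BookWyrm's phrasal algorithm, and the function names and `max_chars` parameter are assumptions.

```python
import re


def fixed_windows(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Wyrms love books. They read at night. Libraries fear them."
print(fixed_windows(text, 20))    # first window ends mid-word: 'Wyrms love books. Th'
print(sentence_chunks(text, 40))  # every chunk ends on a sentence boundary
```

The fixed windows split "They" across two chunks, while the sentence-aware version keeps each sentence intact and still respects a configurable size limit.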
Feature endpoint:
Summarization
bookwyrm summarize phrases.jsonl --output summary.json
The problem:
Feeding large, noisy blocks of text into agents or models, consuming unnecessary tokens and budget.
Core value:
Concise, embeddable text: Collapse long or noisy text into concise summaries that are easier to embed, search, or feed into agents for efficient context injection.
Feature endpoint:
Categorization
bookwyrm classify --file document.pdf --output results.json
The problem:
Difficulty routing documents correctly for different RAG pipelines, indexing, or processing logic.
Core value:
Intelligent routing: Classify documents or raw text by format, type, and structure for efficient downstream routing and indexing.
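Downstream routing on a classification label can be as small as a dispatch table. The sketch below is illustrative only: the labels (`prose`, `source_code`) and handler names are hypothetical, and the real contents of `results.json` may look different.

```python
from typing import Callable


def index_prose(doc_id: str) -> str:
    return f"{doc_id} -> prose index"


def index_code(doc_id: str) -> str:
    return f"{doc_id} -> code index"


def hold_for_review(doc_id: str) -> str:
    return f"{doc_id} -> manual review"


# Hypothetical labels; the real classifier output may use different names.
ROUTES: dict[str, Callable[[str], str]] = {
    "prose": index_prose,
    "source_code": index_code,
}


def route(doc_id: str, label: str) -> str:
    """Send a document to the handler for its label, or to review if unknown."""
    handler = ROUTES.get(label, hold_for_review)
    return handler(doc_id)


print(route("report.pdf", "prose"))
print(route("scan.pdf", "unknown_label"))
```

Falling back to a review handler keeps unexpected labels from silently landing in the wrong index.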
Feature endpoint:
Streaming
bookwyrm cite "AI applications" chunks.jsonl --stream -v
The problem:
Dealing with long processing times in live applications, which degrades user experience.
Core value:
Real-time progress: Streams progress updates for all major operations as they run, ideal for live applications where you need to surface results quickly.
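One common way to consume streamed output is to parse it line by line as it arrives instead of waiting for the whole response. The sketch below assumes newline-delimited JSON and a hypothetical event shape (`type`, `done`, `total`, `text`); neither is confirmed as BookWyrm's streaming format.

```python
import io
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, as soon as it is read."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Simulated stream; the event fields below are hypothetical.
stream = io.StringIO(
    '{"type": "progress", "done": 1, "total": 2}\n'
    '{"type": "result", "text": "first citation"}\n'
    '{"type": "progress", "done": 2, "total": 2}\n'
)

results = []
for event in iter_events(stream):
    if event["type"] == "progress":
        print(f"{event['done']}/{event['total']} processed")
    else:
        results.append(event["text"])  # surface each result immediately
```

Because `iter_events` is a generator, the same loop works unchanged over a file handle, a subprocess pipe, or an HTTP response iterated line by line.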
Each endpoint is standalone, so you can slot BookWyrm into an existing pipeline or stitch multiple pieces together.
These capabilities work together to provide a complete pipeline for document ingestion, processing, and retrieval, which is the foundation of any RAG system.
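As one possible way to stitch two steps together, the sketch below chains the `phrasal` and `summarize` commands listed above, copying their flags verbatim. It assumes that `summarize` accepts the file written by `phrasal`, which the shared `phrases.jsonl` filename suggests but does not guarantee; the helper names are made up for this example.

```python
import subprocess


def pipeline_commands(source: str, phrases: str, summary: str) -> list[list[str]]:
    """Build the two CLI invocations; flags are copied from the commands listed above."""
    return [
        ["bookwyrm", "phrasal", "--file", source,
         "--format", "with_offsets", "--output", phrases],
        ["bookwyrm", "summarize", phrases, "--output", summary],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run each step in order; check=True stops at the first failing command."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = pipeline_commands("document.txt", "phrases.jsonl", "summary.json")
for cmd in commands:
    print(" ".join(cmd))  # preview only; call run_pipeline(commands) to execute
```

Building the commands as argument lists rather than shell strings avoids quoting problems with file names, and calling `run_pipeline(commands)` requires the `bookwyrm` CLI to be installed and on the PATH.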