Document Ingestion & Processing

Turn Documents into AI-Ready Data

Standardize raw documents into structured, type-safe data before they reach your LLM.

Complete workflow: Extract PDF → Create semantic chunks

from bookwyrm import BookWyrmClient
from bookwyrm.models import TextSpanResult

client = BookWyrmClient()

# Extract structured text from PDFs
with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

pages = []
for response in client.stream_extract_pdf(pdf_bytes=pdf_bytes):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)

# Create semantic chunks for RAG
full_text = "\n".join([page.text for page in pages])
chunks = []
for response in client.stream_process_text(
    text=full_text,
    chunk_size=2000,
    offsets=True
):
    if isinstance(response, TextSpanResult):
        chunks.append(response)

# Your chunks are now ready for RAG retrieval
print(f"Created {len(chunks)} semantic chunks")
Universal Document Extraction

Raw documents are noisy. BookWyrm normalizes PDFs, Excel, CSVs, and text files into a consistent JSON structure, handling the edge cases so you don't have to.

PDF Intelligence

Extracts text with bounding box coordinates and confidence scores.

Table Aware

Preserves tabular structures often lost in standard text extraction.

Noise Removal

Automatically filters artifacts to return clean, usable text.

Developer Implementation

# Extract structured data including coordinates and confidence scores
bookwyrm extract-pdf invoice_2024.pdf \
  --start-page 1 \
  --num-pages 5 \
  --output clean_data.json
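
Once extraction has run, the JSON output can be filtered before indexing, for example by dropping low-confidence blocks. The sketch below is illustrative only: the keys pages, blocks, confidence, and text are hypothetical stand-ins, not the documented output schema:

# Hypothetical post-processing of the CLI output (all schema keys assumed)
import json

with open("clean_data.json") as f:
    data = json.load(f)

for page in data.get("pages", []):        # "pages" key is an assumption
    for block in page.get("blocks", []):  # "blocks" key is an assumption
        if block.get("confidence", 1.0) >= 0.8:  # keep confident text only
            print(block.get("text", ""))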

Document Extraction Solutions

Explore specialized BookWyrm endpoints for different document types and use cases.

Semantic Text Chunking

Naive splitting breaks context. BookWyrm's phrasal engine splits text into semantic units, ensuring your RAG retrieval steps fetch complete thoughts, not broken sentences.

Context-Aware

Respects sentence boundaries and semantic shifts.

Precision Offsets

Returns character-level start_char and end_char indices.

Reconstructable

Offsets allow you to highlight exact citations in the original document later.

Developer Implementation

# Split text into 1000-char semantic chunks with position offsets
bookwyrm phrasal \
  --file contract_draft.txt \
  --chunk-size 1000 \
  --offsets \
  --output semantic_chunks.jsonl
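
Because chunks are reconstructable, you can highlight the exact passage a retrieval hit came from. A minimal sketch, assuming each JSONL record carries the documented start_char/end_char fields (reading only the first record here, purely for illustration):

# Highlight one chunk in the source document using its offsets
import json

with open("contract_draft.txt") as f:
    source = f.read()

with open("semantic_chunks.jsonl") as f:
    record = json.loads(f.readline())  # first chunk, for illustration

start, end = record["start_char"], record["end_char"]
print(f">>> {source[start:end]} <<<")  # exact passage, recovered by offset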
Intelligent Routing (Classification)

Don't guess file types. Route binary streams to the correct processor based on content, not just extensions.

Deep Inspection

Detects format, content type (e.g., python_code, json_data), and MIME type.

Confidence Scores

Returns a probability score (0.0-1.0) to help you decide when to automate and when to flag for review.

Developer Implementation

# Detect content type of unknown files before processing
bookwyrm classify --file unknown_upload.bin
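
The confidence score supports a simple routing rule: automate above a threshold, flag for review below it. A minimal sketch of that pattern; the result shape (content_type and confidence attributes) and the 0.85 threshold are assumptions, not the documented API:

# Route a classified file; result fields and threshold are assumptions
AUTO_THRESHOLD = 0.85

def route(result) -> str:
    if result.confidence < AUTO_THRESHOLD:
        return "human_review"                 # too uncertain to automate
    if result.content_type == "python_code":
        return "code_indexer"
    if result.content_type == "json_data":
        return "json_processor"
    return "generic_text_pipeline"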

Want to see BookWyrm in action?

The easiest way is to join our Discord server and ask for a demo. One of the team can then join you in a voice channel, show you BookWyrm's endpoints in action, and answer any questions you may have.

AI-Assisted Development

Don't write boilerplate. BookWyrm's library is designed with strict typing and clear signatures specifically to help AI coding assistants (like Cursor, Copilot, or Windsurf) generate implementation code instantly.

LLM-Optimized

Function signatures and docstrings are tuned for accurate AI code generation.

Agent-Ready Patterns

We provide pre-built patterns for RAG pipelines and function-calling tools that your AI can ingest and replicate.

Minimal Complexity

Key operations like citation finding or PDF extraction are designed to run in as few as 4 lines of code.

Developer Implementation

Point your coding assistant to our AI Integration Guide to generate robust pipelines in seconds.

# Example: 4-line citation search generated by AI (wrapped to run standalone)
import asyncio
from bookwyrm import AsyncBookWyrmClient

async def main(text_chunks):
    async with AsyncBookWyrmClient(api_key="key") as client:
        async for r in client.stream_citations(chunks=text_chunks, question="What is X?"):
            if hasattr(r, 'citation'):
                print(f"Verified Answer: {r.citation.text}")
asyncio.run(main(chunks))  # chunks: the semantic chunks created earlier

BookWyrm Delivers on Your Agentic Workflow Strategy.

Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.