BookWyrm: Production-Ready Multimodal Extraction
BookWyrm is designed to solve the last-mile problem of complex document processing: transforming your most difficult PDFs into clean, semantically rich data. We combine state-of-the-art document intelligence into a single, robust, fully automated API that works with our PDF extraction and RAG pipeline capabilities.
The result? You stop writing fragile preprocessing code and start building agents that matter. It is a natural fit for back-office automation.
Quick Start: Extract Structured PDF Data
BookWyrm makes the hardest part of document processing the easiest part of your pipeline.
Use the single extract-pdf endpoint (via SDK or REST) to get structured, multimodal data ready for your next-generation LLMs.
from typing import BinaryIO, List

from bookwyrm.models import PDFPage, PDFTextElement

# NOTE: `client` is assumed to be an already-configured BookWyrm SDK client.

# Load PDF file using raw bytes (recommended)
with open("document.pdf", "rb") as f:
    f: BinaryIO
    pdf_bytes: bytes = f.read()

# Stream the extraction and collect each page as it arrives
pages: List[PDFPage] = []
for response in client.stream_extract_pdf(
    pdf_bytes=pdf_bytes,
    filename="document.pdf"
):
    if hasattr(response, 'page_data'):
        pages.append(response.page_data)
    elif hasattr(response, 'total_pages') and hasattr(response, 'type') and response.type == "metadata":
        print(f"Starting extraction of {response.total_pages} pages")

print(f"Extracted {len(pages)} pages")

page: PDFPage
for page in pages:
    print(f"Page {page.page_number}: {len(page.text_blocks)} text elements")
    element: PDFTextElement
    for element in page.text_blocks[:3]:  # Show first 3 elements
        print(f"  - {element.text[:50]}...")

# pages is List[PDFPage] where each PDFPage has:
# - page_number: int (1-based)
# - text_blocks: List[PDFTextElement]
# - tables: List[dict] (placeholder)
# - images: List[dict] (placeholder)
#
# Each PDFTextElement has:
# - text: str (extracted text)
# - confidence: float (0.0-1.0 OCR confidence)
# - bbox: List[List[float]] (raw polygon coordinates)
# - coordinates: PDFBoundingBox (x1, y1, x2, y2 rectangle)
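
The field layout described above suggests two common post-processing steps: dropping low-confidence OCR text, and reducing a raw polygon to an axis-aligned rectangle. The following is a minimal sketch only. `TextElement` is a hypothetical stand-in that mirrors the documented `text`, `confidence`, and `bbox` fields so the example runs without the SDK; the helper names, the 0.8 threshold, and the sample values are illustrative assumptions, and the SDK may derive `coordinates` differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextElement:
    """Hypothetical stand-in mirroring the documented PDFTextElement fields."""
    text: str
    confidence: float          # 0.0-1.0 OCR confidence
    bbox: List[List[float]]    # raw polygon as a list of [x, y] points


def polygon_to_rect(bbox: List[List[float]]) -> Tuple[float, float, float, float]:
    """Return the smallest axis-aligned (x1, y1, x2, y2) rectangle enclosing the polygon."""
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    return (min(xs), min(ys), max(xs), max(ys))


def confident_text(elements: List[TextElement], min_confidence: float = 0.8) -> List[str]:
    """Keep only the text of elements at or above the confidence threshold."""
    return [e.text for e in elements if e.confidence >= min_confidence]


# Made-up sample values, for illustration only
elements = [
    TextElement("Invoice #1042", 0.98,
                [[10.0, 10.0], [120.0, 12.0], [119.0, 30.0], [9.0, 28.0]]),
    TextElement("T0tal due", 0.41,
                [[10.0, 40.0], [80.0, 40.0], [80.0, 55.0], [10.0, 55.0]]),
]

print(confident_text(elements))            # ['Invoice #1042']
print(polygon_to_rect(elements[0].bbox))   # (9.0, 10.0, 120.0, 30.0)
```

With real results, the same helpers could be applied to each page's `text_blocks`, provided the elements expose the fields listed above.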
Remove workplace friction and let BookWyrm handle the pain of multimodal pre-processing.

