The Developer Story

You've got data everywhere: PDFs from customer surveys, market reports in shared drives, research papers, and more. You want to build an agent that can actually use this information, but first you need to wrangle it into a usable form.

Step 1: Classify your documents

Start by figuring out what kind of files you’re dealing with. With BookWyrm, it’s one command:

bookwyrm classify --file customer-satisfaction.pdf --output satisfaction.json

Now you know exactly what you're working with, so you can decide how each file should be processed before it is embedded and indexed in your favorite vector database (e.g. Pinecone).

Step 2: Chunk your text the smart way

Instead of splitting text at arbitrary token lengths, BookWyrm uses phrasal models to break content into meaningful units. The command below assumes the document's text is already available as satisfaction.txt:

bookwyrm phrasal --file satisfaction.txt --format with_offsets --output phrases.jsonl
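To see why chunk boundaries matter, here is a minimal Python sketch that contrasts fixed-length splitting with boundary-aware splitting. It is an illustration only: a simple sentence regex stands in for BookWyrm's phrasal models, the function names fixed_chunks and sentence_chunks are hypothetical, and the start/end fields are an assumed shape, not the documented with_offsets schema.

```python
from __future__ import annotations

import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str) -> list[dict]:
    """Stand-in for phrase-aware chunking (NOT BookWyrm's model).

    Splits at sentence-ending punctuation and records character offsets
    so each chunk can be traced back to the source text.
    """
    chunks = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the offsets point at the chunk itself.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        chunks.append({"text": stripped, "start": start, "end": start + len(stripped)})
    return chunks


sample = "Customers liked the new dashboard. Support wait times were too long."

# Fixed-length cuts land mid-word and mid-sentence.
print(fixed_chunks(sample, 25))

# Boundary-aware chunks keep each statement whole, with offsets.
for chunk in sentence_chunks(sample):
    print(chunk)
```

The fixed-length version cuts the first sentence in the middle of "dashboard", while the boundary-aware version keeps each statement intact and lets you map every chunk back to its position in the source.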

Step 3: Summarize for clarity

Huge documents? No problem. Generate summaries that make the content easier to embed, search, or serve to users:

bookwyrm summarize phrases.jsonl --output summary.json

Step 4: Ground answers with citations

When your agent answers a question, you want sources you can trust. BookWyrm's citation endpoint finds and justifies them:

bookwyrm cite "What are the outcomes from the customer satisfaction survey?" phrases.jsonl --output results.json

In just a few steps, you've built a RAG pipeline: documents classified → text chunked → content summarized → citations retrieved.
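The four commands above can be chained from a script. The following Python sketch is one possible way to do that, not an official BookWyrm client: it reuses only the commands and file names shown in the steps above, the helper names build_commands and run_pipeline are hypothetical, and it defaults to a dry run that prints the commands instead of executing them. Running it for real assumes the bookwyrm CLI is installed and that satisfaction.txt already exists.

```python
from __future__ import annotations

import shlex
import subprocess


def build_commands(
    question: str = "What are the outcomes from the customer satisfaction survey?",
) -> list[list[str]]:
    """Return the four CLI invocations from the steps above, in order."""
    return [
        ["bookwyrm", "classify", "--file", "customer-satisfaction.pdf",
         "--output", "satisfaction.json"],
        ["bookwyrm", "phrasal", "--file", "satisfaction.txt",
         "--format", "with_offsets", "--output", "phrases.jsonl"],
        ["bookwyrm", "summarize", "phrases.jsonl", "--output", "summary.json"],
        ["bookwyrm", "cite", question, "phrases.jsonl", "--output", "results.json"],
    ]


def run_pipeline(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = shlex.join(cmd)  # shell-safe rendering for logging
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run_pipeline(build_commands(), dry_run=True)
```

Keeping the dry run as the default makes it easy to review the exact commands before spending credits; pass dry_run=False once the inputs are in place.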

You didn't need to test models, write regex, or repeat the same work for different file types. BookWyrm handled the grunt work, letting you focus on building agents and applications that deliver reliable results.

Supercharge Business Processes with AI

Get hands-on with BookWyrm and see how fast you can go from raw, messy text to production-ready pipelines.

Join our free workshop to learn best practices.

Request beta access and be one of the select few to start using BookWyrm.

Offer:

€20 in free credits for beta sign-ups, plus a €100 top-up for the first 100.

The €100 top-up offer expires November 1, 2025.