RAG for Unstructured Data

Plugin RAG

API endpoints that extract and transform unstructured data for agentic AI.

Endpoints work standalone or in combination.

Build using the tools you know.

A diagram showing how BookWyrm fits into a RAG pipeline

The Developer Story

You've got data everywhere: PDFs from customer surveys, market reports in shared drives, research papers, and more. You want to build an agent that can actually use this information, but first you need to wrangle it into something usable.

Step 1: Classify your documents

Start by figuring out what kind of files you’re dealing with. With BookWyrm, it’s one command:

bookwyrm classify --file customer-satisfaction.pdf --output satisfaction.json

Now you know exactly what you're working with, and your files are ready for embedding and indexing in your favorite vector database (e.g. Pinecone).
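If you have a whole folder of files rather than one, the same command can be looped. The sketch below is a minimal example that relies only on the `classify --file ... --output ...` form shown above; the function name `classify_dir` and the directory names are hypothetical, chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: classify every PDF in a directory, writing one JSON result per file.
# Only the `bookwyrm classify --file ... --output ...` form from the step
# above is used. `classify_dir` and its arguments are hypothetical names.

classify_dir() {
  local in_dir="$1" out_dir="$2" pdf name
  mkdir -p "$out_dir"
  for pdf in "$in_dir"/*.pdf; do
    [ -e "$pdf" ] || continue            # the glob matched nothing: skip
    name="$(basename "$pdf" .pdf)"       # report.pdf -> report
    bookwyrm classify --file "$pdf" --output "$out_dir/$name.json"
  done
}

# Example call (paths are placeholders):
#   classify_dir ./reports ./classified
```

Each input keeps its base name, so `reports/q3.pdf` would produce `classified/q3.json`, which makes it easy to match results back to their source files.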

Step 2: Chunk your text the smart way

Instead of splitting text at arbitrary token lengths, BookWyrm uses phrasal models to break content into meaningful units:

bookwyrm phrasal --file satisfaction.txt --format with_offsets --output phrases.jsonl
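To see why split points matter, here is a small illustration using only standard Unix tools. It contrasts fixed-width chunks with chunks that end at sentence boundaries. The sentence splitter is a crude stand-in for demonstration and is not BookWyrm's phrasal model; the sample text and function names are made up.

```shell
#!/usr/bin/env bash
# Illustration only: fixed-width splitting vs. boundary-aware splitting.
# The sentence splitter is a rough stand-in, NOT BookWyrm's phrasal model.

text='Customers rated support highly. Delivery times were the main complaint.'

# Fixed-width chunks: cut every 25 characters, even in the middle of a word.
fixed_chunks() {
  printf '%s\n' "$1" | fold -w 25
}

# Boundary-aware chunks: start a new line after sentence-ending punctuation,
# then trim the trailing spaces left behind by the split.
sentence_chunks() {
  printf '%s\n' "$1" | awk '{ gsub(/[.!?] +/, "&\n"); print }' | sed 's/ *$//'
}

echo "Fixed width:"
fixed_chunks "$text"
echo "Sentence boundaries:"
sentence_chunks "$text"
```

The fixed-width version cuts the first chunk partway through the word "highly", while the boundary-aware version keeps each sentence whole. A chunk that carries a complete thought is generally easier to embed and retrieve.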

Step 3: Summarize for clarity

Huge documents? No problem. Generate summaries that make the content easier to embed, search, or serve to users:

bookwyrm summarize phrases.jsonl --output summary.json

Step 4: Ground answers with citations

When your agent answers a question, you want sources you can trust. BookWyrm's citation endpoint finds and justifies them:

bookwyrm cite "What are the outcomes from the customer satisfaction survey?" phrases.jsonl --output results.json

In just a few steps, you've built a RAG pipeline: documents classified → text chunked → content summarized → citations retrieved.
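The four steps can be chained in one script so that a failure at any stage stops the run. This is a sketch that reuses the commands exactly as shown above; the function name `run_pipeline` is hypothetical. It also assumes the plain-text version of the document (here `satisfaction.txt`) already exists, since the steps above do not show how it is produced.

```shell
#!/usr/bin/env bash
# Sketch: run classify -> phrasal -> summarize -> cite in order.
# Commands and flags are copied from the steps above; `run_pipeline` is a
# hypothetical wrapper. Assumes the text file for step 2 already exists.

run_pipeline() {
  local pdf="$1" text="$2" question="$3"
  # `&&` stops the chain at the first command that fails.
  bookwyrm classify --file "$pdf" --output satisfaction.json &&
  bookwyrm phrasal --file "$text" --format with_offsets --output phrases.jsonl &&
  bookwyrm summarize phrases.jsonl --output summary.json &&
  bookwyrm cite "$question" phrases.jsonl --output results.json
}

# Example call:
#   run_pipeline customer-satisfaction.pdf satisfaction.txt \
#     "What are the outcomes from the customer satisfaction survey?"
```

Chaining with `&&` means `summarize` and `cite` never run against a missing or partial `phrases.jsonl`, and the function's exit status tells the caller whether the whole pipeline succeeded.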

You didn't need to test models, write regex, or repeat the same work for each file type. BookWyrm handled the grunt work, letting you focus on building agents and applications that deliver reliable results.