Automated Cataloguing

Turn digital stacks into structured archives.

Libraries and archives spend thousands of hours manually entering metadata. BookWyrm's structured summarization automates this, extracting standardized records from scanned texts, PDFs, and articles in seconds.

Zero-touch metadata extraction: Define your schema, get structured output

Zero-Touch Metadata

Define your standard (MARC, Dublin Core, or a custom schema) as a Pydantic model. BookWyrm scans the document content and populates the fields automatically, strictly adhering to your types.

Standardized

Enforce consistent formatting for authors, dates, and subjects.

Comprehensive

Extract abstract summaries, keywords, and ISBNs simultaneously.

Type-Safe

Integration-ready JSON output for your CMS or catalog software.
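The "Standardized" guarantee can be enforced in the schema itself: Pydantic validators normalize values before a record is emitted. A minimal sketch, assuming Pydantic v2's `field_validator`; the casing rule here is illustrative, not part of BookWyrm:

```python
from typing import List
from pydantic import BaseModel, field_validator

class NormalizedRecord(BaseModel):
    authors: List[str]

    @field_validator("authors")
    @classmethod
    def title_case_authors(cls, values: List[str]) -> List[str]:
        # Normalize "doe, jane" -> "Doe, Jane" so every record follows
        # one author-naming convention regardless of the source text.
        return [
            ", ".join(part.strip().title() for part in name.split(","))
            for name in values
        ]
```

Because the rule lives in the model, every record that validates is already consistent; no post-processing pass is needed.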

Developer Implementation

1. Define the Schema (catalog_model.py)

from pydantic import BaseModel, Field
from typing import List

class LibraryRecord(BaseModel):
    title: str = Field(description="Official title of the work")
    authors: List[str] = Field(description="List of primary authors")
    isbn: str | None = Field(description="ISBN-13 if available")
    dewey_class: str | None = Field(description="Suggested Dewey Decimal class")
    subjects: List[str] = Field(description="Library of Congress Subject Headings")
    abstract: str = Field(description="A concise 100-word summary")

2. Run Extraction

bookwyrm summarize incoming_scan.txt \
  --model-class-file catalog_model.py \
  --model-class-name LibraryRecord \
  --model-strength smart \
  --output record_metadata.json

Classification & Routing

Before cataloguing, you need to know what you have. Use the classify endpoint to sort incoming digital dumps into "Books", "Papers", "Correspondence", or "Administrative" buckets automatically.

Auto-Sort

Automatically categorize incoming files into predefined buckets.

Pre-Processing

Identify document types before cataloguing to streamline workflows.

Intelligent Routing

Route documents to appropriate processing pipelines based on classification.

Developer Implementation

# Auto-sort incoming files before processing
bookwyrm classify --file unknown_archive_item_01.pdf
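Once a label comes back, routing can be a simple lookup. A sketch, assuming the classification label is available as a plain string; the queue folder names are hypothetical, while the bucket names come from the section above:

```python
from pathlib import Path

# Hypothetical routing table: classification label -> intake folder.
ROUTES = {
    "Books": Path("queue/cataloguing"),
    "Papers": Path("queue/cataloguing"),
    "Correspondence": Path("queue/archival"),
    "Administrative": Path("queue/records"),
}

def route(item: Path, label: str) -> Path:
    # Move a classified file into its pipeline's intake folder;
    # unknown labels fall back to a manual-review queue.
    dest_dir = ROUTES.get(label, Path("queue/manual_review"))
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / item.name
    item.rename(dest)
    return dest
```

Each intake folder can then feed the summarize step shown earlier with the schema appropriate to that document type.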

Impact

Transform your library and archive operations with automated cataloguing that scales.

Backlog Clearance

Process thousands of digital items overnight.

Searchability

Generate rich keyword tags for items that previously had none.

Consistency

Remove human error from date formatting and author naming conventions.

BookWyrm Delivers Your Agentic Workflow Strategy.

Your data pipeline is the foundation for your agentic workflows. Build it right. Get started with the API that's fast to set up, easy to extend, and built for developers.