Complex PDF Example
For enterprise clients, we offer the ability to handle poor-quality scanned PDFs. The example below is taken from Heinrich Palaces, a scanned German book.

We used this command:
bookwyrm extract-pdf data/Heinrich_palaces.pdf --num-pages 1 --start-page 18

BookWyrm Output
Even from a poor-quality scanned PDF, BookWyrm outputs structured, AI-ready data: text blocks, each with a confidence score and coordinates. With these you can reconstruct the correct reading order, index content that would otherwise stay hidden, and surface it for real business use, which makes it a good fit for back-office automation. The same approach can be applied to other document types.
{
  "pages": [
    {
      "page_number": 18,
      "text_blocks": [
        ...
        {
          "text": "Taf. 15 im arabischen Teil); der Raum ist wahrscheinlich ein Bad. Auch das spricht dafür, da Raum 4 überdeckt",
          "confidence": 0.9871127605438232,
          "bbox": [
            [117.0, 87.0],
            [1349.0, 129.0],
            [1347.0, 162.0],
            [116.0, 120.0]
          ],
          "coordinates": {
            "x1": 116.0,
            "y1": 87.0,
            "x2": 1349.0,
            "y2": 162.0
          }
        },
        ...
      ]
    }
  ]
}
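
As a rough illustration of how the confidence scores and coordinates can be used to rebuild reading order, here is a minimal Python sketch. It assumes the JSON shape shown above and treats "coordinates" as an axis-aligned box with (x1, y1) at the top-left, which matches the sample values but is not confirmed here. The function names reading_order and page_text, the min_confidence parameter, and the first sample block are hypothetical and are not part of BookWyrm. Sorting top-to-bottom and then left-to-right is a naive heuristic that only suits single-column pages.

```python
import json

# Shape follows the output shown above. The first block is a made-up
# placeholder for illustration; the second copies the values from the example.
SAMPLE = json.loads("""
{
  "pages": [
    {
      "page_number": 18,
      "text_blocks": [
        {
          "text": "(hypothetical lower-confidence block)",
          "confidence": 0.42,
          "coordinates": {"x1": 120.0, "y1": 170.0, "x2": 900.0, "y2": 210.0}
        },
        {
          "text": "Taf. 15 im arabischen Teil); der Raum ist wahrscheinlich ein Bad. Auch das spricht dafür, da Raum 4 überdeckt",
          "confidence": 0.9871127605438232,
          "coordinates": {"x1": 116.0, "y1": 87.0, "x2": 1349.0, "y2": 162.0}
        }
      ]
    }
  ]
}
""")


def reading_order(page, min_confidence=0.0):
    """Return a page's text blocks sorted top-to-bottom, then left-to-right.

    Blocks whose confidence is below min_confidence are dropped first.
    """
    blocks = [b for b in page["text_blocks"] if b["confidence"] >= min_confidence]
    return sorted(
        blocks,
        key=lambda b: (b["coordinates"]["y1"], b["coordinates"]["x1"]),
    )


def page_text(page, min_confidence=0.0):
    """Join the ordered blocks into plain text, one block per line."""
    return "\n".join(b["text"] for b in reading_order(page, min_confidence))


if __name__ == "__main__":
    first_page = SAMPLE["pages"][0]
    # Keep only blocks the OCR step was fairly sure about (threshold is arbitrary).
    print(page_text(first_page, min_confidence=0.9))
```

Multi-column layouts or heavily skewed scans would need something more careful, for example grouping blocks into lines or columns using the four-point "bbox" before sorting.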
