Ask HN: What is the best method for turning a scanned book as a PDF into text?
I like reading philosophy, particularly from the authors rather than a secondhand account.
However, I often run into the problem that these come as scanned documents; Discourses on Livy and Politics Among Nations, for example.
I would greatly benefit from turning these into text. I can use the Snipping Tool on pages and put them in ChatGPT, and the result comes out perfect. If I use classic methods, they often screw up words. My final goal is to turn these into audiobooks (or even just make it easier to copy-paste for my personal notes).
Given the state of AI, I'm wondering what my options are. I don't mind paying.
I did this very recently for a 19th century book in German with occasionally some Greek. The method that produces the highest level of accuracy I've found is to use ImageMagick to extract each page as an image, then send each image file to Claude Sonnet (encoded as base64) with a simple user prompt like "Transcribe the complete text from this image verbatim with no additional commentary or explanations". The whole thing is completed in under an hour, and the result is near perfect and certainly much better than what standard OCR software produces.
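In case it's useful, here is a minimal sketch of that loop in Python, assuming the `anthropic` SDK and pages already extracted as individual PNGs; the model name is a placeholder, use whichever Sonnet version you have access to:

```python
import base64
import glob

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = ("Transcribe the complete text from this image verbatim "
          "with no additional commentary or explanations.")

pages = []
for path in sorted(glob.glob("page_*.png")):
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; pick your Sonnet version
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": data}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    pages.append(message.content[0].text)

with open("book.txt", "w") as f:
    f.write("\n\n".join(pages))
```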
> a 19th century book
If you're dealing with public domain material, you can just upload to archive.org. They'll OCR the whole thing and make it available to you and everyone else. (If you got it from archive.org, check the sidebar for the existing OCR files.)
I did try the full text OCR from archive.org, but unfortunately the error rate is too high. Here are some screenshots to show what I mean:
- Original book image: https://imgur.com/a8KxGpY
- OCR from archive.org: https://imgur.com/VUtjiON
- Output from Claude: https://imgur.com/keUyhjR
Ah, yeah, that's not uncommon. I was operating on an assumption, based on experience seeing language models make mistakes, that the two approaches would be within an acceptable range of each other for your texts, plus the idea that it's better to share the work than not.
Note if you're dealing with a work (or edition) that cannot otherwise be found on archive.org, though, then if you do upload it, you are permitted as the owner of that item to open up the OCRed version and edit it. So an alternative workflow might be better stated:
1. upload to archive.org
2. check the OCR results
3. correct a local copy by hand or use a language model to assist if the OCR error rate is too high
4. overwrite the autogenerated OCR results with the copy from step 3 in order to share with others
(For those unaware and wanting to go the collaborative route, there is also the Wikipedia-adjacent WMF project called Wikisource. It has the upside of being more open (at least in theory) than, say, a GitHub repo—since PRs are not required for others to get their changes integrated. One might find, however, it to be less open in practice, since it is inhabited by a fair few wikiassholes of the sort that folks will probably be familiar with from Wikipedia.)
Maybe I've just had bad luck, but their OCR butchered some of the books I've tried to get.
Is it really necessary to split it into pages? Not so bad if you automate it I suppose, but aren't there models that will accept a large PDF directly (I know Sonnet has a 32MB limit)?
They are limited in how much they can output, and there is generally an inverse relationship between the number of tokens you send and the output quality after the first 20-30 thousand tokens.
Are there papers on this effect? That quality of responses diminishes with very large inputs I mean. I observed the same.
I think these models all "cheat" to some extent with their long context lengths.
The original transformer had dense attention where every token attends to every other token, and the computational cost therefore grew quadratically with increased context length. There are other attention patterns that can be used though, such as only attending to recent tokens (sliding window attention), or only having a few global tokens that attend to all the others, or even attending to random tokens, or using combinations of these (e.g. Google's "Big Bird" attention from their Elmo/Bert muppet era).
I don't know what types of attention the SOTA closed source models are using, and they may well be using different techniques, but it'd not be surprising if there was "less attention" to tokens far back in the context. It's not obvious why this would affect a task like doing page-by-page OCR on a long PDF though, since there it's only the most recent page that needs attending to.
Necessary? No. Better? Probably. Despite today's expanded context windows, attention problems and hallucinations aren't completely a thing of the past. Splitting to individual pages likely helps ensure that you stay well within a normal context window size that seems to avoid most of these issues. Asking an LLM to maintain attention for a single page is much more achievable than for an entire book.
Also, PDF file size isn't a relevant measure of token length, since PDFs can range from a collection of high-quality JPEG images to thousand(s) of pages of text.
They all accept large PDFs (or any kind of input) but the quality of the output will suffer for various reasons.
I recently did some OCRing with OpenAI. I found o3-mini-high to be imagining and changing text, whereas the older (?) 4o was more accurate. It's a bit worrying that some of the models screw around with the text.
There's GPT-4, then GPT-4o (o for Omni, as in multimodal), then o1 (chain of thought / internal reasoning), then o3 (because o2 is a stadium in London that I guess is very litigious about its trademark?). o3-mini is the latest, but yes, optimized to be faster and cheaper.
O2 is the UK's largest mobile network operator. They bought naming rights to what was known as the Millennium Dome (not even a stadium).
Ahh makes sense :)
What is the o3 model good for? Is it just an evolution of o1 (chain of thought / internal reasoning)?
Yes
(albeit I believe o3-mini isn't natively multimodal)
I see, thank you.
Which one is the smartest, and most knowledgeable? (Like least likely to make up facts)
4o is going to be better for a straight up factual question
(But e.g. I asked it about something Martin Short / John Mulaney said on SNL and it needed 2 prompts to get the correct answer... the first answer wasn't making anything up, it was just reasonably misinterpreting something.)
It also has web search, which will be more accurate if the pages it reads are good (it uses Bing search, so if possible provide your own links and forcibly enable web search).
Similarly the latest Anthropic Claude Sonnet model (it's the new Sonnet 3.5 as of ~Oct) is very good.
The idea behind o3-mini is that it only knows as much as 4o-mini (the names suck, we know), but it will be able to consider its initial response and edit it if it doesn't meet the original prompt's criteria.
How big were the image files in terms of size/resolution that got you the level of accuracy you needed with Claude?
300dpi (`magick -density 300 book.pdf page_%03d.png` was the command I used). The PDF is from archive.org and is a very high-quality scan (https://ia601307.us.archive.org/5/items/derlgnertheori00rsuo...)
What about preserving the style like titles and subtitles?
You can request Markdown output, which takes care of text styling like italics and bold. For sections and subsections, in my own case they already have numerical labels (like "3.1.4") so I didn't feel the need to add extra formatting to make them stand out. Incidentally, even if you don't specify markdown output, Claude (at least in my case) automatically uses proper Unicode superscript numbers (like ¹, ², ³) for footnotes, which I find very neat.
Do you have a rough estimate of what the price per page was for this?
It must have been under $3 for the 150 or so API calls, possibly even under $2, though I'm less sure about that.
I made a high-quality scan of PAIP (Paradigms of Artificial Intelligence Programming), and worked on OCR'ing and incorporating that into an admittedly imperfect git repo of Markdown files. I used Scantailor to deskew and do other adjustments before applying Tesseract, via OCRmyPDF. I wrote notes for some of my process over at https://github.com/norvig/paip-lisp/releases/tag/v1.2 .
I'd also tried ocrit, which uses Apple's Vision framework for OCR, with some success - https://github.com/insidegui/ocrit
It's an ongoing, iterative process. I'll watch this thread with interest.
Some recent threads that might be helpful:
* https://news.ycombinator.com/item?id=42443022 - Show HN: Adventures in OCR
* https://news.ycombinator.com/item?id=43045801 - Benchmarking vision-language models on OCR in dynamic video environments - driscoll42 posted some stats from research
* https://news.ycombinator.com/item?id=43043671 - OCR4all
(Meaning, I have these browser tabs open, I haven't fully digested them yet)
Also this:
https://news.ycombinator.com/item?id=42952605 - Ingesting PDFs and why Gemini 2.0 changes everything
Was technology the right approach here? Is it essentially done now? I couldn’t tell if it was completed entirely.
I can’t help but think a few amateur humans could have read the pdf with their eyes and written the markdown by hand if the OCR was a little sketchy.
It's still in progress! It's looong - about a thousand pages. There's an ebook, but the printed book got more editing.
Copyright issues aside (e.g. if your thing is public domain), the galaxy-brain approach is to upload your raw scanned PDF to the Internet Archive (archive.org), fill in the appropriate metadata, wait about 24 hours for their post-upload format-conversion tasks to run automatically, and then download the size-optimized and OCR-ized PDF from them.
I've done this with a few documents from the French and Spanish national archives, which were originally provided as enormous non-OCRed PDFs but shrank to 10% the size (or less) after passage through archive.org and incidentally became full-text-searchable.
Last time I checked a few months ago, LLMs were more accurate than the OCR that the archive is using. The web archive version is/was not using context to figure out that, for example, “in the garden was a trge” should be “in the garden was a tree”. LLMs, depending on the prompt, do this.
Perhaps. My perhaps-curmudgeonly take on that is that it sounds a bit like "Xerox scanners/photocopiers randomly alter numbers in scanned documents" ( https://news.ycombinator.com/item?id=29223815 ). I'd much rather deal with "In the garden was a trge" than "In the garden was a tree," for example, if what the page actually said was "In the garden was a tiger." That said, of course you're right that context is useful for OCRing. See for example https://history.stackexchange.com/questions/50249/why-does-n...
Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.
Fair enough. Very valid points. I guess it boils down to “test both systems and see what works best for the task at hand”. I can indeed imagine cases where your approach would be the better option for sure.
The PDFs this process creates use MRC (Mixed Raster Content), which separates each page into multiple layers: a black and white foreground layer for text/line art, a color background layer for images/colors, and a binary mask layer that controls how they're combined. This smart layering is why you can get such small file sizes while maintaining crisp text and reasonable image quality.
If you want purely black and white output (e.g. if the PDF has yellowing pages and/or not-quite-black text, but doesn't have many illustrations), you can extract just the monochrome foreground layer from each page and ignore the color layers entirely.
First, extract the images with `mutool extract in.pdf`.
Then delete the sRGB images.
Then combine the remaining images with the ImageMagick command line: `convert -negate *.png out.pdf`.
This gives you a clean black and white PDF without any of the color information or artifacts from the background layer.
Here's a script that does all that. It worked with two different PDFs from IA. I haven't tested it with other sources of MRC PDFs. The script depends on mutool and imagemagick.
https://gist.github.com/rahimnathwani/44236eaeeca10398942d2c...
I have tried a bunch of things. This is what worked best for me: Surya [0]. It can run fully local on your laptop. I also tried EasyOCR [1], which is also quite good. I haven't tried this myself, but I will look at Paddle [2] if the previous two don't float your boat.
All of these are OSS, and you don't need to pay a dime to anyone.
[0]: https://github.com/VikParuchuri/surya
[1]: https://github.com/JaidedAI/EasyOCR
[2]: https://github.com/PaddlePaddle/Paddle
Got some questions (sorry for necro, but I only discovered this thread by accident because I left it open in a tab and it turns out to be super-relevant to me):
I have some out-of-print books that I want to convert into nice pdf's/epubs (like, reference-quality)
1) I don't mind destroying the binding to get the best quality. Any idea how I do so?
2) I have a multi-page double-sided scanner (Fujitsu ScanSnap). Would this be sufficient to do the scan portion?
3) Is there anything that determines the font of the book text and reproduces that somehow? and that deals with things like bold and italic and applies that either as markdown output or what have you?
4) how do you de-paginate the raw text to reflow into (say) an epub or pdf format that will paginate based on the output device (page size/layout) specification?
Hey there, I don't know the answers to most of your questions, honestly.
2. I think it would be enough. People do great work with much less.
3. I think Surya would handle it. I have done mostly flat text. I would also try some LLM OCR models like Google Gemini 2.0 Flash with different pipelines and different system prompts. I am yet to do this; it would be easy to check. About fonts - I never really worried about that myself. If it's something fancy, and you are crazy enough, you could create a font. Or you can use some handwriting mimicry tool built on another AI model; I can't think of a name off the top of my head. Look through OCR models. Indian college and HS kids still have to submit handwritten projects and assignments. Some crafty kids use such tools to type (or ChatGPT copy-paste) and then print in pen-ink color in their own handwriting, and fool the teacher, given there are a large number of assignments to check.
4. I am not sure I understand the question fully. Do you mean that the books' pages will have numbers, and they will be read as book text in your OCRed data? If you mean that, then I just used GOF regex to root page numbers out. When you have the full text without page numbers, there are multiple tools to create EPUBs and PDFs. You can also reformat documents, assuming you already have an EPUB or PDF, based on the target device, using just Calibre.
1. I don't understand the question. You mean any other kind of scan than regular scanning? I don't know at all. I just work with regularly scanned documents.
Wow, Surya looks legit! https://www.datalab.to/
I would like to pay a dime and more for any of these solutions discussed in the thread as a normal MacOS program with a graphical user interface.
For classic books like those you mentioned, Project Gutenberg has text versions along with pdfs/epubs/etc.
For instance, Discourses on Livy:
https://www.gutenberg.org/cache/epub/10827/pg10827-images.ht...
https://www.gutenberg.org/ebooks/10827
Even better is when Standard Ebooks publishes a version: https://standardebooks.org/ebooks/niccolo-machiavelli/discou...
My understanding is that Gemini OCR is now considered state of the art and a material step forward in OCR accuracy
Is this from the article that was on the front page a few days ago? If so, it's not true. The title was intentionally misleading: they said it's the best, but if you read the article, it's only the best at some subproblem, not the actual task.
Yes. It’s still just an LLM and that means it can alter the meaning of entire passages in ways that are difficult to detect. This technology absolutely should not be used for OCR in domains where correctness matters.
I have not seen this answer so I’ll chime in:
There is a lot of enthusiasm around language models for OCR, and I have found that generally they work well. However, I have had much better results, especially if there are tables etc., by sending the raw page image to the LLM along with the OCR'd page, and asking it to transcribe from the image and validate words/character sequences against the OCR.
This largely prevents numbers and other details from being jumbled or hallucinated.
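A rough sketch of that cross-check, assuming Tesseract (via pytesseract) for the first pass and the `anthropic` SDK for the validation step; the model name and prompt wording are placeholders:

```python
import base64

import anthropic
import pytesseract
from PIL import Image

client = anthropic.Anthropic()

def transcribe(page_png: str) -> str:
    # First pass: conventional OCR.
    ocr_text = pytesseract.image_to_string(Image.open(page_png))

    # Second pass: give the LLM both the image and the OCR text to reconcile.
    with open(page_png, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    prompt = (
        "Transcribe the attached page verbatim. Use the OCR output below as a "
        "reference, and check numbers, tables and unusual character sequences "
        "against the image rather than guessing.\n\nOCR output:\n" + ocr_text
    )
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=4096,
        messages=[{"role": "user", "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": prompt},
        ]}],
    )
    return msg.content[0].text
```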
I recently tested llamaparse after trying it a year prior and was very impressed. You may be able to do your project on the free tier, and it will do a lot of this for you.
We do this in our Text to speech app (Read4Me): https://apps.apple.com/us/app/read4me-talk-browser-pdf-doc/i...
You can scan a book and listen (also copy and paste the text extracted to other apps).
If you are looking to do this at a large scale in your own UI, I would recommend either of Google's solutions:
1. Google Cloud Vision API (https://cloud.google.com/vision?hl=en)
2. Using Gemini API OCR capabilities.(Start here: https://aistudio.google.com/prompts/new_chat)
I haven’t seen anyone else mention this tool yet, but I’ve found great accuracy and flexibility with [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF). It (usually) detects and fixes page rotation, and works quite well on slanted text or A | B pages in regards to copying and formatting. I believe it uses tesseract in the background, but using it is very simple and it has the just works factor.
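OCRmyPDF also has a Python API if you want to script it; a minimal sketch (keyword arguments per its docs, adjust to your source material):

```python
# Requires Tesseract installed on the system, plus `pip install ocrmypdf`.
import ocrmypdf

ocrmypdf.ocr(
    "scanned_book.pdf",
    "scanned_book_ocr.pdf",
    rotate_pages=True,   # detect and fix rotated pages
    deskew=True,         # straighten slightly slanted scans
    language="eng",      # Tesseract language pack(s) to use
)
```

The equivalent CLI call is `ocrmypdf --rotate-pages --deskew -l eng scanned_book.pdf scanned_book_ocr.pdf`.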
It’s surprising to me that no one has mentioned llamaparse. My team has been using it for a while and is quite satisfied. If other people think that other services are better then I’d be interested in hearing why.
Hey, I recommend checking out the previous HN threads [1] on why LLMs shouldn't be used in production-grade OCR, especially if accuracy is super important (as in the audiobook case).
we wrote the post and created Pulse [2] for these exact use cases, feel free to reach out for more info!
[1]: https://news.ycombinator.com/item?id=42966958 [2]: https://runpulse.com
For years I have been printing PDFs off on regular paper and then binding them into books.
1. Print it at work when no one is looking.
2. Get two rigid boards and squeeze the stack of paper together. I customarily use two wooden armrests that originally came from a garden-furniture lounger.
3. Squeeze the paper with just a 1/4-inch showing.
4. Use wood glue and, with your finger working like a toothbrush, work the glue into the pages at the gluing end.
5. Get a 14-inch x 4-inch strip of canvas. I use cutoff painter's canvas.
6. Hang all this by the boards and put glue also on top of the canvas strip.
7. When it dries, remove the boards and glue down the sides.
You have a strong, bound book out of those printed pages.
It’s unclear how this is related to the article, but I’m intrigued by your simple DIY bookbinding process.
It seems straightforward except for the canvas strip (I assume this is part of the binding?), and whether you add thicker pages/boards on each side as covers.
Do you have any photos of the process, or at least of a finished product? Thanks!
Docling is great for PDFs https://github.com/DS4SD/docling but if the input is really only images (in a PDF), then cloud AI based solutions (like the latest models from Google) may be better.
I made a site recently that works pretty well for this for a lot of sample scanned PDFs I tried, you might get good results:
https://fixmydocuments.com/
I also made a simple iOS app that basically just uses the built in OCR functionality on iPhones and automatically applies it to all the pages of a PDF. It won’t preserve formatting, but it’s quite accurate in terms of OCR:
https://apps.apple.com/us/app/super-pdf-ocr/id6479674248
Wow, I asked a similar question a few days ago, glad to see this is getting some traction! Archive.org OCR is new to me! Very interesting. I am working on a tool to do OCR, translation and layout recovery (like... all the apps nowadays), but focussed on running locally and processing thousands of pages/scans. Let me know if you want to collab!
More of a comment on your goal and something I’ve been thinking about with non-fiction books recently.
I've begun thinking I dislike the rigid format of non-fiction books: a few ideas bulked out and iterated, with lots of context and examples. It takes me ages to get through, and I have very little free time. Cliffs notes are awful because you need some iteration and emotional connection to make the content stick.
I’d love a version of a book that has variable rates of summarization and is navigable around points and themes, so I can hop about while ensuring I don’t miss a key fact or insight buried somewhere
There are strategies you can employ for this. Many are spelled out in an HN fave, How to Read a Book (1940), by Mortimer J. Adler (<https://archive.org/details/howtoreadabook1940edition>, Wikipedia: <https://en.wikipedia.org/wiki/How_to_Read_a_Book>). It describes the types of books (fiction, instructive, and others), levels of reading, and specific reading strategies.
You can also pre-digest many such books in audio form (increasingly using what are now fairly powerful and tolerable text-to-speech tools), and dive in to read specific passages of note.
Because there's a formula to the book structure, you'll often find theory/overview / solutions presented in the introductory and concluding chapters or sections, with the mid-bulk section largely consisting of illustrations. There's an exceptionally ill-conceived notion that 1) one must finish all books one begins and 2) one must read all of a book. Neither of these are true, and I find books most usefully engaged as conversations with an author (books are conversations over time, telecoms are communications over space), and to read with a view to addressing specific goals: understanding of specific topics / problems / solutions, etc. This can cut your overall interaction with tedious works.
There's also of course a huge marketing dynamic to book publishing, on which one of the best treatments is Arthur Schopenhauer's "On Authorship" (trans. 1897):
... Writing for money and reservation of copyright are, at bottom, the ruin of literature. No one writes anything that is worth writing, unless he writes entirely for the sake of his subject. What an inestimable boon it would be, if in every branch of literature there were only a few books, but those excellent! This can never happen, as long as money is to be made by writing. It seems as though the money lay under a curse; for every author degenerates as soon as he begins to put pen to paper in any way for the sake of gain. The best works of the greatest men all come from the time when they had to write for nothing or for very little. And here, too, that Spanish proverb holds good, which declares that honor and money are not to be found in the same purse—honora y provecho no caben en un saco. The reason why Literature is in such a bad plight nowadays is simply and solely that people write books to make money. A man who is in want sits down and writes a book, and the public is stupid enough to buy it. The secondary effect of this is the ruin of language. ...
<https://en.wikisource.org/wiki/The_Art_of_Literature/On_Auth...>
There was a thread here recently about OCR4All (I haven't used any of these tools recently, but I'm keeping track because I might be doing that soon).
https://news.ycombinator.com/item?id=43043671
https://www.ocr4all.org/
I recently used AWS Textract and had good results. There are accuracy benchmarks out there, I wish I had saved the links, but I recall Gemini 2.0 and Textract being towards the top in terms of accuracy. I also read that an LLM can extrapolate/conjure up cropped text, so my idea would be to combine traditional OCR with an LLM to determine conflicts.
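For the Textract side, the synchronous API is enough for single page images; a hedged boto3 sketch (multi-page PDFs need the asynchronous StartDocumentTextDetection flow instead):

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

with open("page_001.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# LINE blocks carry the recognized text, in reading order.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines))
```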
Google Cloud Document AI is amazing, I love it https://cloud.google.com/document-ai?hl=en
It can correctly read many languages other than English, if that is something you need. Previously I tried others and there were many errors in conversion. This does it well.
I’m curious about this api. I’d looked at it before but it didn’t seem like it could handle arbitrary input that didn’t fit one of the predefined schemas. I also wasn’t sure how much training data it needed. What has your experience been like?
I had to scan some historic books and papers for a research project and tried a couple of desktop apps and Python libraries. Still, the Android scan-to-document feature worked better, so I just photographed all the pages, which was faster, then wrote an unsupervised Python script to loop through the images in the camera folder and send them to Google Document AI for OCR, added some metadata and merged them. This outperformed the other solutions I tried, even commercial ones, by far, and was dirt cheap. If you're interested I'll clean up the code and put it on GitHub.
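The core of such a loop is only a few lines; a sketch with the google-cloud-documentai client, where PROJECT_ID and PROCESSOR_ID are placeholders for an OCR processor created in the Cloud console:

```python
import glob

from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
processor = client.processor_path("PROJECT_ID", "us", "PROCESSOR_ID")

for path in sorted(glob.glob("camera_folder/*.jpg")):
    with open(path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="image/jpeg")
    result = client.process_document(
        request=documentai.ProcessRequest(name=processor, raw_document=raw)
    )
    with open(path + ".txt", "w") as out:
        out.write(result.document.text)  # plain text of the page
```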
The article below compares 5 tools for a slightly different but close enough use case (more complex pdf than just running text). It concludes:
"Surprisingly, ChatGPT-4o gave the best Markdown output overall. Asking a multimodal LLM to simply convert a document to Markdown might be the best option if slow processing speed and token cost are not a problem."
https://ai.gopubby.com/benchmarking-pdf-to-markdown-document...
I was actually just working on a project like this to digitize an old manuscript. I used a PDF scanning app (there are plenty; I used NAPS2, simple but it works). And then I piped the images into `tesseract-ocr`. This will extract the text from the image but it won't deal with formatting or obvious typos. For that you're going to want to feed the text into an LLM with some prompt telling the model to correct errors, fix formatting, and provide clean text. Smaller local models (<70b parameters) do not work very well on this task for big documents, but I found ChatGPT's reasoning model does a fine job. My goal is to find a model that can run locally with similar performance.
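A sketch of that two-stage pipeline in Python, using pytesseract for the raw pass and the OpenAI API for the cleanup pass; the model name and prompt are placeholders:

```python
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stage 1: raw OCR of a single scanned page.
raw_text = pytesseract.image_to_string(Image.open("scan_page_012.png"))

# Stage 2: LLM cleanup of OCR errors and formatting.
cleanup = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any capable model
    messages=[
        {"role": "system",
         "content": "Correct OCR errors and fix formatting. Return the clean text "
                    "only; do not rewrite, summarize, or add commentary."},
        {"role": "user", "content": raw_text},
    ],
)
print(cleanup.choices[0].message.content)
```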
Recommend the ABBYY FineReader PDF app (their OCR engine is used by lots of other providers) - fantastic scanning and export to EPUB and other formats - used it for ~300 different books across a multitude of languages for translation using AI.
The smart LLMs are great at this (Gemini Flash seems accurate and cheap), but they can't be trusted not to engage in unexpected censorship, typically skipping parts they find objectionable without reliably telling you that that's what they did. That's annoying enough if you're dealing with, e.g., names that happen to spell something awkward, but it's a big problem if you're scanning medical notes or something else where the awkward text is legitimately needed.
Anyone have success with prompting them to "just give me the text verbatim?"
The API has safety configuration for this
This might be of interest: https://www.expectedparrot.com/content/johnjhorton/grant-let... It uses a collection of models to extract the text from a handwritten letter by US Grant. Probably overkill for something nicely printed.
I've found there's a big difference in OCR accuracy where it comes to handwriting. For printed text, I've used tesseract, but it seems to miss a lot for handwriting. In my experience, google cloud vision is far more accurate at transcribing handwriting. Haven't tried other cloud based tools, so I couldn't tell you if it's better, but I would say that overall, the cloud based ones seem to be much better at handwriting or oddly formed text, but that for basic typeset printed text, open source apps like tesseract do well.
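For reference, the Cloud Vision call that handles dense text and handwriting is document_text_detection; a minimal sketch with the google-cloud-vision client:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("handwritten_page.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)  # full transcription of the page
```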
Give LLMWhisperer a try. Here is a playground for testing https://pg.llmwhisperer.unstract.com/
Just had an LLM write some code to Go from pdf to image to tesseract, it wasn’t even English, all local, free. It did an excellent job.
I think you answered it yourself, stick it into a multimodal LLM.
I recently did this for Indo-Aryan languages (Hindi, Gujarati) (800+ PDFs containing scanned images). I used Google Document AI (OCR), and PyDocx.
ocrmypdf with some specific options, depending on the source (language, force, etc.), worked for me most of the time. The biggest issue for which I have not been able to find a solution yet is proper conversion of PDF to EPUB. I read a lot on my phone, and the inflexibility of the PDF format, with the ugliness of "reflow" as the only apparent option for giving reading on a phone the look of a true EPUB, is frustrating.
Calibre's ebook-convert is the best PDF to X that I've found. Of course, it's not perfect but PDF is a really hard format to convert from.
Quick and easy: Gemini Flash 2
More of a system: AWS Textract or Azure Document Intelligence. This option requires some coding and the cost is higher than using a vision model.
This open source tool would probably work well for splitting into pages, sending to an LLM of your choice, and getting it back into a structured markdown file: https://github.com/getomni-ai/zerox
I haven’t used it yet, since my use case ended up not needing more than just sending to an LLM directly.
We are working on a project to have original language on the left and translation on the facing page. Instead of perfect translations or OCR, we try to report error rates with random sampling. We plan to have the texts editable on a Wikimedia server. Curious if you know similar efforts!
See this [1]
[1] https://blog.medusis.com/38_Adventures+in+OCR.html
This is somewhat unrelated, but I am curious what hardware there is for small scale printing (like an independent press) or scanning (for automatically converting books to digitally archive them for personal use). Does anyone in HN have recommendations?
Seeing blind recommendations for AI slop is very disappointing for HN.
For OP, there is a library written in Rust that can do exactly what you need, with very high accuracy and performance [1].
You would need the OCR dependencies to get it to work on scanned books [2].
[1] https://github.com/yobix-ai/extractous
[2] https://github.com/yobix-ai/extractous?tab=readme-ov-file#-s...
That looks rather nice, actually. Thanks.
I especially like the approach to graalify Tika.
Hi, you can try our app, algodocs.com, to turn a scanned book into a PDF file. It's a free app. I hope this helps.
If you are using a Mac(Book), the Preview app has built-in OCR. If it detects text on the page, you can select, copy and paste it. But I don't know if it has a CLI or API to automate the process.
Just a thought.
I recently did this for Indo-Aryan languages (800+ PDFs containing scanned images). I used Google Gemini.
Mathpix:
https://mathpix.com/pdf-conversion
I had very good experience with `gemini-2.0-flash-exp`:
https://github.com/maurycy/gemini-json-ocr
It's hard to know what to make of this because while you've included the output JSON you haven't included the input PDF so I have no idea how to interpret what it's actually doing.
Give it a try on any PDF! This is just 100 LOC, easy to audit.
I had great results with "Azure Document Intelligence Studio", followed by OpenAI's LLM. But this was half a year back and I wanted it to work via API.
ABBYY FineReader worked well enough the times I used it.
Perhaps building an agent in something like Gumloop that loops page by page, does AI OCR, and then exports to a Google Doc? Should take like 10 minutes to set up.
A small script that just feeds every page to Gemini will likely beat every other method proposed in this thread.
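A minimal per-page loop of that kind, assuming the google-generativeai SDK and pages already exported as PNGs; the model name is an assumption, use whichever Gemini version you have access to:

```python
import glob

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model name

out = []
for path in sorted(glob.glob("page_*.png")):
    with open(path, "rb") as f:
        page = {"mime_type": "image/png", "data": f.read()}
    resp = model.generate_content(
        [page, "Transcribe the complete text from this page verbatim."]
    )
    out.append(resp.text)

with open("book.txt", "w") as f:
    f.write("\n\n".join(out))
```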
The excellent document converter pandoc unfortunately only supports conversion to PDF, not from it.
I'm biased as an employee, but who knows PDFs better than Adobe? Use their PDF text extraction API.
As someone who's been using KDE's Okular PDF reader for nearly twenty years, and also has to use Adobe's products - can confidently say that at least one answer to your question is 'The developers of KDE's Okular'.
Some time ago I was toying around with a library called [MuPDF](https://www.mupdf.com/) for something related, and with that library and a small Python script you can programmatically OCR any book you want.
That library is free for personal or open source projects, but paid for commercial ones
depending on the length of these texts — and your technical ability — you might want to check out AWS Textract
it would be easy to set up a pipeline like:
> drop pdf in s3 bucket > eventbridge triggers step function > sfn calls textract > output saved to s3 & emailed to you
Paperless uses the latest traditional method. There are LLM enhancements you can download
https://linux.die.net/man/1/pdftotext
is the simplest thing that might work.
It is free and mature.
That will not work for scanned PDFs without a text layer, and even if there is one, it's not guaranteed to work.
"Might work" comes with neither express nor implied warranty.
OCR is another thing that might work which is also simpler than an LLM.
On Linux, there is ocrmypdf in the Fedora repos.
Works quite well
Docling
As of now Google Gemini
Two possibilities are "top of mind" for me:
You could script it using Gemini via the API[1].
Or use Tesseract[2].
[1]: https://ai.google.dev/
[2]: https://github.com/tesseract-ocr/tesseract