OCR revolutions

Article by AI4Value CTO Pasi Karhu

I first got interested in AI about forty years ago, when I found a book about “intelligent machines” in the library. I remember my awe over the many wonderful things that machines could do even then, but only two specific things were imprinted in my brain for good. One was an algorithm used in chess programs (minimax), and the other was the name Raymond Kurzweil and how his machines could read printed text into computers. Kurzweil still works on AI (at Google) but is nowadays better known for his singularity predictions.

Some simple OCR (Optical Character Recognition) programs existed before Kurzweil’s, but they could read only a very specific font. His machine could read any normal, good-quality font, which at that time was revolutionary. OCR has since developed into a standard tool for everyone to use. Still, poor-quality prints and scans in particular produce results riddled with errors after basic OCR.

Six years ago, I had a project with thousands of scanned paper documents, mostly of poor quality. There were three main challenges in the text extraction: typewritten text from a worn ink ribbon (younger generation: google it :-), copies of copies from copy machines, and tilted scans. Combine all of those, and you get OCR results with half of the words somehow messed up.

I had to use all my existing tricks and invent some new ones. Especially useful was Ai4Value’s proprietary Automatic Ontology tool, which we still use successfully for many data cleaning tasks. While I finally managed to get most of the errors corrected and the resulting texts intelligible, many annoying smaller errors remained, and fixing them would have required way too much work.

Now, just six years later, generative AI models equipped with image recognition have revolutionized OCR once again. They can effortlessly perform the same tasks without the need for OCR and separate error corrections. Moreover, they can instantly extract desired information, such as all necessary details from a mix of paper invoices, into an easily processable digital format.

However, few tools work effectively unless you understand their limitations. For instance, scanned data may contain internal company terms and codes that generative AI models have rarely, if ever, encountered. This can lead them to “hallucinate” or generate incorrect information, necessitating the use of “old school” methods to produce reliable results.
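To make this a bit more concrete, here is a minimal Python sketch of one possible “old school” safeguard: checking a code extracted by a generative model against a list of codes known to be valid. This is only an illustration, not a description of any particular product. The code list, the function name `check_code`, and the similarity cutoff are all assumptions made up for the example; only the standard library's `difflib` is used.

```python
from difflib import get_close_matches

# Hypothetical internal codes, invented for this example. In practice the
# list would come from a company register or master data.
KNOWN_CODES = ["PRJ-1047", "PRJ-2210", "INV-0093"]


def check_code(extracted, known=KNOWN_CODES, cutoff=0.8):
    """Compare an extracted code with the list of known codes.

    Returns (code, status), where status is "exact", "corrected" or
    "unknown". The cutoff of 0.8 is an assumed value, not a tuned one.
    """
    if extracted in known:
        return extracted, "exact"
    # Fuzzy match: pick the single most similar known code, if any is
    # similar enough to be a plausible misreading.
    matches = get_close_matches(extracted, known, n=1, cutoff=cutoff)
    if matches:
        return matches[0], "corrected"
    # Nothing similar enough: flag for manual review instead of guessing.
    return extracted, "unknown"


# Example: a lowercase "l" misread in place of the digit "1".
print(check_code("PRJ-l047"))
print(check_code("XYZ-9999"))
```

The point of the design is that a value the model may have invented is never silently accepted: it is either confirmed, mapped to a close known value, or flagged for a human to check.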

In many cases, the best outcomes are achieved through hybrid approaches that combine both traditional, established tools and the rapidly evolving generative AI. At Ai4Value, we have extensive experience with both, whether it be for OCR projects or other AI applications.