
Showing posts with label OCR. Show all posts

Friday, February 01, 2008

White Paper: Optimizing OCR Accuracy on Older Documents

I received a question about OCR, which led me to find this document revised in 2006 and published by the U.S. Government Printing Office (GPO):
Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products, by Jon M. Booth and Jeremy Gelb, Revised June 2006 (v.2)
This document is technical and not an overview of OCR, so it is not for everyone. The conclusions, though, are interesting:
  1. Older and discolored documents must be scanned in RGB mode to capture all the image data, and to maximize OCR accuracy.
  2. The character accuracy produced by scanning older documents in RGB mode meets (GPO’s meeting of the experts) 99% OCR accuracy requirement, even without applying file enhancement.
  3. No single type of file enhancement, applied individually, improves character recognition rates for OCR.
  4. Specifically, the Downsampling enhancement type does not improve character recognition rates, despite OCR software manufacturers’ claims that a 300 dpi is optimal for recognition rates.
In conclusion, the combination of these facts demonstrate that file enhancement is not needed, because the recognition rates are already at an acceptable level, and more importantly, it does not improve the character recognition rates for OCR.
What is amazing is that they did achieve 98-99% without -- seemingly -- much fuss.
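For readers wondering what a figure like "99% character accuracy" means in practice, here is a minimal sketch of one common way to compute it: compare the OCR output against a hand-corrected transcription and count character-level edits. This is an assumption on my part, not necessarily the method the GPO study used; the function names `levenshtein` and `character_accuracy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length),
    clamped at 0 so very bad output does not go negative."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    truth = "Optimizing OCR Accuracy on Older Documents"
    ocr = "Optimizing 0CR Accuracy on 0lder Documents"
    print(f"{character_accuracy(truth, ocr):.3f}")
```

Under this definition, a 99% requirement allows roughly one wrong character per hundred, which is still several errors on a typical page.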



Thursday, September 06, 2007

Article: Using “captchas” to digitize books

Captchas are "those strings of distorted characters that websites force you to recognize and type in order to establish that you are a person and not a malevolent computer." Now the pioneer of Captchas has found a way to put all of those -- and us -- to use doing something productive: helping to decipher words that can't be read by OCR (optical character recognition) in old books.

According to the article, Luis von Ahn...
created a tool, called "recaptcha," that pairs an unknown word with a known one. He distorts them both and puts a line through them--standard techniques for creating captchas. A user must decipher both captchas to access a site. The accurate typing of the known word serves the security purpose of captchas and adds a measure of confidence that the unknown word was identified correctly and can be used in place of the OCR's gibberish. Volunteers have begun deploying recaptchas, and the technique has been used to decipher two million words for the Internet Archive's book digitization effort.
The article does not say where this technology is being used. It would be cool to know where. I'd actually be interested in doing them just for fun (and to help out).
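The pairing idea quoted above can be sketched in a few lines. This is only an illustration of the logic as the article describes it, not the actual recaptcha implementation: the class names, the `unknown_id` identifier, and the `min_votes` agreement threshold are all my own assumptions (the article does not say how many matching answers are needed before a word is accepted).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Challenge:
    known_word: str   # word whose correct reading is already known
    unknown_id: str   # hypothetical ID of the scanned word OCR could not read


@dataclass
class RecaptchaSketch:
    votes: dict = field(default_factory=dict)  # unknown_id -> Counter of answers
    min_votes: int = 3                         # assumed agreement threshold

    def submit(self, challenge: Challenge, typed_known: str, typed_unknown: str) -> bool:
        """Return True if the user passes. Only then is the answer for the
        unknown word recorded, since passing the known word is what gives
        us some confidence in the unknown one."""
        if typed_known.strip().lower() != challenge.known_word.lower():
            return False
        counter = self.votes.setdefault(challenge.unknown_id, Counter())
        counter[typed_unknown.strip().lower()] += 1
        return True

    def resolved(self, unknown_id: str):
        """Return the most common answer once enough users agree, else None."""
        counter = self.votes.get(unknown_id)
        if not counter:
            return None
        word, count = counter.most_common(1)[0]
        return word if count >= self.min_votes else None


if __name__ == "__main__":
    service = RecaptchaSketch(min_votes=2)
    challenge = Challenge(known_word="morning", unknown_id="img-001")
    service.submit(challenge, "morning", "overlooks")
    service.submit(challenge, "morning", "overlooks")
    print(service.resolved("img-001"))
```

The design choice worth noticing is that a failed known word discards the whole submission, so a bot guessing at random contributes nothing to the transcription.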


Addendum (2 p.m.): Thanks to Kathleen for checking out the web site and adding some info in a comment below. Yes...we can all help with this effort, although there is not a plug-in for Blogger yet.



Wednesday, April 11, 2007

OCRopus

Google is helping to develop OCRopus. The Google press release about OCRopus is here. The web site describes it as:

...a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

The web site goes on to say:

The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.

OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.

An alpha release of the product is scheduled for the third quarter of this year, so it looks like our benefiting from this may be a "ways off." However, it is good to see a major company working on this open source product.

