InftyReader - an OCR System for Math Documents

Masakazu Suzuki and Katsuhito Yamaguchi

the Infty Project and Science Accessibility Net

and John Gardner

ViewPlus Technologies, Inc.

Updated June 25, 2011


Introduction

Overview

Optical character recognition (OCR) technologies are invaluable for improving access to printed materials by people with print disabilities. However, most do not work well with scientific content. Scientific documents typically include mathematical and other special symbols that standard OCR does not recognize. In addition, no standard commercial OCR application can recognize a two-dimensionally structured math equation and convert it to a standard software format.

The InftyReader OCR application can properly recognize scientific documents scanned from paper or in PDF format. InftyReader recognizes complicated math expressions, tables, graphs and other technical notations and converts them to accessible formats.

InftyReader can be used by people with print disabilities in combination with the ChattyInfty accessible scientific editor application. ChattyInfty provides speech and braille access for reading and writing math and editing the output of InftyReader. Sighted people can use InftyReader with the free Infty Editor to edit InftyReader output and produce accessible scientific content.

InftyReader and ChattyInfty are made by the Infty Project and are sold through Science Accessibility Net.

InftyReader Features

Output Formats

InftyReader can output a recognition result in any of the following formats

IML is an XML file format developed expressly for Infty Editor and ChattyInfty. By default, InftyReader saves results in IML. The original image is retained and can be displayed with either Infty Editor or ChattyInfty. In ChattyInfty, the image can be accessed tactually through an on-line graphics display (available from KGS Japan) or by embossing it on a ViewPlus embosser. Consequently, Infty Editor and ChattyInfty users can compare results with the original image and make corrections as necessary. These editors can also convert the result into any of the formats listed above.

Apart from IML and HR-TeX, all InftyReader output formats are standard mainstream formats. HR-TeX (Human-Readable TeX) is an abbreviated LaTeX-like notation developed to be more easily readable than standard LaTeX.

System Requirements

InftyReader and ChattyInfty require Windows XP, Vista, or Windows 7 (32-bit or 64-bit). Microsoft Internet Explorer 7 or later must be installed. To correct OCR errors and edit documents, we strongly recommend installing the free Infty Editor for use by sighted people or the ChattyInfty editor for use by people with print disabilities.

Installation Procedure

The InftyReader or ChattyInfty download is a zip archive. Extract its contents into any convenient folder. One of the extracted files has a name ending in "setup.exe". Run this file to install InftyReader or ChattyInfty and all components necessary for their use. Note that administrator privileges are required to install applications.
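For users who prefer to script the extraction step, the following is a minimal Python sketch, not part of the Infty distribution. It extracts a downloaded archive and lists the files whose names end in "setup.exe". The function name and the paths passed to it are hypothetical, and the sketch only locates the installer; it does not run it.

```python
import zipfile
from pathlib import Path


def extract_and_find_setup(archive_path, target_dir):
    """Extract a zip archive and return the files ending in 'setup.exe'.

    archive_path and target_dir are chosen by the caller; nothing here
    assumes a particular archive name or folder layout.
    """
    target = Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Compare in lower case so "Setup.exe" and "setup.exe" both match.
    return sorted(
        path
        for path in target.rglob("*")
        if path.is_file() and path.name.lower().endswith("setup.exe")
    )
```

The returned list normally contains a single path, which can then be run from an account with administrator privileges.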

License

The InftyReader license is included in the download archive as "License_E.txt". Please read it!

Optimizing the Quality of OCR Recognition

InftyReader can recognize only high-quality black-and-white (binary) images. For paper documents, it is very important to scan the document in black-and-white (binary) mode and to use a resolution of at least 400 dpi. 600 dpi is recommended for best results. The paper should be flat and carefully aligned in the scanner to avoid images that are fuzzy, skewed, or slanted. If possible, pages in books should be cut from the binding so they will lie flat on the scanner. Save the scanned files in TIFF, GIF, or PNG format.
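The scanning requirements above can be checked before a file is given to InftyReader. The following is a minimal Python sketch, assuming the scan was saved as PNG. It reads the bit depth, color type, and resolution from the PNG header chunks and reports anything that falls short of the requirements. The function names are hypothetical, and the check treats only 1-bit grayscale as binary, which is a simplification (a 1-bit palette image can also be black and white).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_DPI = 400            # minimum resolution stated in the text
METERS_PER_INCH = 0.0254


def read_png_info(data):
    """Return bit depth, color type, and dpi (or None) from raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    info = {"bit_depth": None, "color_type": None, "dpi": None}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"IHDR":
            _w, _h, depth, color = struct.unpack(">IIBB", body[:10])
            info["bit_depth"], info["color_type"] = depth, color
        elif chunk_type == b"pHYs":
            x_ppu, _y_ppu, unit = struct.unpack(">IIB", body[:9])
            if unit == 1:  # unit 1 means pixels per meter
                info["dpi"] = round(x_ppu * METERS_PER_INCH)
        elif chunk_type in (b"IDAT", b"IEND"):
            break  # the chunks of interest precede the image data
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC
    return info


def scan_problems(info):
    """List the ways a scan falls short of the stated requirements."""
    problems = []
    # Simplification: only 1-bit grayscale (color type 0) counts as binary.
    if not (info["bit_depth"] == 1 and info["color_type"] == 0):
        problems.append("image is not 1-bit black and white")
    if info["dpi"] is None:
        problems.append("resolution is not recorded in the file")
    elif info["dpi"] < MIN_DPI:
        problems.append(f"resolution {info['dpi']} dpi is below {MIN_DPI} dpi")
    return problems
```

Typical use is `scan_problems(read_png_info(open(path, "rb").read()))`, where `path` is a scan file of your choosing. An empty list means the file passes these two checks; it says nothing about skew or fuzziness.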

Recognizing math characters requires much higher quality images than does standard OCR, and poor quality images will give correspondingly poor results. Heavy users of InftyReader can improve the quality of recognition by editing scanned images to remove small extraneous scan defects and artifacts. Recognition can be improved by optimizing the scanner threshold so that fewer than 1% of characters are broken or touch other characters.
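The 1% criterion can be applied by scanning a sample page at several threshold settings and counting the defective characters by eye. The following is a minimal Python sketch of that bookkeeping, assuming the counts come from your own inspection. The function names and the layout of the counts are hypothetical and are not part of InftyReader.

```python
MAX_DEFECT_RATE = 0.01  # "fewer than 1%" from the text


def defect_rate(broken, touching, total):
    """Share of characters that are broken or touch other characters.

    A character that is both broken and touching is counted twice here,
    so the result is a slight overestimate.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (broken + touching) / total


def best_threshold(counts):
    """Pick the threshold with the lowest defect rate under the 1% limit.

    counts maps a scanner threshold setting to a tuple
    (broken, touching, total) counted on a sample page.
    Returns None if no setting meets the limit.
    """
    rates = {t: defect_rate(*c) for t, c in counts.items()}
    acceptable = {t: r for t, r in rates.items() if r < MAX_DEFECT_RATE}
    if not acceptable:
        return None
    return min(acceptable, key=acceptable.get)
```

Low thresholds tend to break thin strokes and high thresholds tend to merge neighboring characters, so counting both kinds of defect at each setting shows where the balance lies for a given scanner.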

InftyReader subdivides the document into text, math, tables, and figures and then uses different procedures to recognize each. Users can improve the quality of recognition by hand-editing images to ensure that the content flows properly. For example, cutting columns apart and arranging them in proper sequence is recommended when pages are partially columnized.

One common problem to avoid with scanned images is a dark frame that can appear around the page. This problem is caused by a non-white area above the page during scanning. It can be avoided by placing a large sheet of white paper over the page being scanned. The sheet should be large enough to cover the entire scanner surface. Images with this problem can be repaired by removing the offending dark frame in a good image editor. Be careful not to reduce the image dpi during such a process.
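The repair can also be scripted. The following is a minimal Python sketch, assuming the binary image is already loaded as a non-empty list of rows in which 1 is black and 0 is white. It whitens rows and columns at the page edges that are mostly black and leaves the image size unchanged, so the dpi is not affected. The function name and the 50% cutoff are hypothetical choices, not values taken from InftyReader.

```python
def remove_dark_frame(image, dark_fraction=0.5):
    """Whiten mostly-black rows and columns at the edges of a binary image.

    image: non-empty list of equal-length rows of 0 (white) or 1 (black).
    dark_fraction: assumed cutoff; an edge row or column with more than
    this share of black pixels is treated as part of the frame.
    Returns a new image with the same dimensions.
    """
    height, width = len(image), len(image[0])

    def is_dark(pixels):
        return sum(pixels) > dark_fraction * len(pixels)

    # Walk inward from the top and bottom edges while rows are dark.
    top = 0
    while top < height and is_dark(image[top]):
        top += 1
    bottom = height
    while bottom > top and is_dark(image[bottom - 1]):
        bottom -= 1

    # Do the same from the left and right edges for columns.
    columns = [[row[x] for row in image] for x in range(width)]
    left = 0
    while left < width and is_dark(columns[left]):
        left += 1
    right = width
    while right > left and is_dark(columns[right - 1]):
        right -= 1

    # Keep pixels inside the clean rectangle; whiten everything outside it.
    return [
        [
            image[y][x] if top <= y < bottom and left <= x < right else 0
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Because the frame is painted white rather than cropped, the pixel dimensions stay the same. A frame that is ragged or only partly dark may still need manual cleanup in an image editor.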

InftyReader can also recognize PDF files. Normal PDF files have characters represented in fonts and are subject to fewer OCR problems than scanned images. Always obtain a PDF if possible. Articles in most scientific journals are available as PDF.

Step by Step Instructions for Using InftyReader

InftyReader is a GUI application. It can also be used in command mode in the Console Window by running Infty.exe from the Infty program folder. This tutorial is restricted to the GUI version. Command mode use is covered in the InftyHelpE.txt file included in the archive.

Feedback

The InftyReader developers welcome bug reports and suggestions for improvement. Please send these to support[at]mail.sciaccess.net (replace [at] with the @ sign).