An Overview of the Tesseract OCR Engine
Ray Smith
Google Inc.
theraysmith@gmail.com
Abstract
This paper provides a comprehensive overview of the
Tesseract OCR engine, which competed as the HP
Research Prototype in the UNLV Fourth Annual Test of
OCR Accuracy [1]. Emphasis is placed on aspects that
are novel or at least unusual in an OCR engine, in
particular the line finding, the features/classification
methods, and the adaptive classifier.
1. Introduction – Motivation and History
Tesseract is an open-source OCR engine that was
developed at HP between 1984 and 1994. Like a
supernova, it appeared from nowhere for the 1995 UNLV
Annual Test of OCR Accuracy [1], shone brightly with
its results, and then vanished back under the same
cloak of secrecy under which it had been developed.
Now for the first time, details of the architecture and
algorithms can be revealed.
Tesseract began as a PhD research project [2] in HP
Labs, Bristol, and gained momentum as a possible
software and/or hardware add-on for HP’s line of
flatbed scanners. Motivation was provided by the fact
that the commercial OCR engines of the day were in
their infancy, and failed miserably on anything but the
best quality print.
After a joint project between HP Labs Bristol and
HP’s scanner division in Colorado, Tesseract had a
significant lead in accuracy over the commercial
engines, but did not become a product. The next stage
of its development was back in HP Labs Bristol as an
investigation of OCR for compression. Work
concentrated more on improving rejection efficiency
than on base-level accuracy. At the end of this project,
at the end of 1994, development ceased entirely. The
engine was sent to UNLV for the 1995 Annual Test of
OCR Accuracy [1], where it proved its worth against
the commercial engines of the time. In late 2005, HP
released Tesseract as open source. It is now available
at http://code.google.com/p/tesseract-ocr.
2. Architecture
Since HP had independently developed page layout
analysis technology that was used in products (and was
therefore not released as open source), Tesseract never
needed its own page layout analysis. Tesseract
therefore assumes that its input is a binary image with
optional polygonal text regions defined.
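As a rough illustration of this input contract, the
following C++ sketch models a binary page image with
optional polygonal text regions; the type and field
names are hypothetical, not Tesseract's actual API.

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of Tesseract's input contract: a binary
    // (1 bit deep) page image plus optional polygonal text regions
    // supplied by an external layout-analysis stage. All names here
    // are illustrative, not Tesseract's actual API.
    struct Point { int x, y; };

    struct BinaryImage {
      int width = 0;
      int height = 0;
      std::vector<std::uint8_t> pixels;  // one byte per pixel: 0 = white, 1 = black
      bool IsBlack(int x, int y) const { return pixels[y * width + x] != 0; }
    };

    struct PageInput {
      BinaryImage image;
      // Each region is a polygon outlining one block of text to be
      // recognized; if empty, the whole page is one region.
      std::vector<std::vector<Point>> text_regions;
    };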
Processing follows a traditional step-by-step
pipeline, but some of the stages were unusual in their
day, and possibly remain so even now. The first step is
a connected component analysis in which outlines of
the components are stored. This was a computationally
expensive design decision at the time, but had a
significant advantage: by inspection of the nesting of
outlines, and the number of child and grandchild
outlines, it is simple to detect inverse text and
recognize it as easily as black-on-white text. Tesseract
was probably the first OCR engine able to handle
white-on-black text so trivially. At this stage, outlines
are gathered together, purely by nesting, into Blobs.
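A rough C++ sketch of that nesting heuristic follows;
the Outline type and the thresholds are illustrative
assumptions, not Tesseract's actual data structures or
constants.

    #include <vector>

    // Hypothetical sketch of the outline-nesting heuristic described
    // above.
    struct Outline {
      std::vector<const Outline*> children;  // outlines directly enclosed
    };

    // Count outlines nested exactly `depth` levels below `o`.
    static int CountAtDepth(const Outline& o, int depth) {
      if (depth == 0) return 1;
      int n = 0;
      for (const Outline* child : o.children)
        n += CountAtDepth(*child, depth - 1);
      return n;
    }

    // A black component whose outline encloses many child outlines
    // (the white character shapes), which in turn enclose grandchild
    // outlines (the characters' own holes), is likely a filled box of
    // white-on-black text rather than an ordinary character.
    bool LooksLikeInverseText(const Outline& region) {
      int children = CountAtDepth(region, 1);
      int grandchildren = CountAtDepth(region, 2);
      return children > 4 && grandchildren > 0;  // thresholds are guesses
    }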
Blobs are organized into text lines, and the lines and
regions are analyzed for fixed pitch or proportional
text. Text lines are broken into words differently
according to the kind of character spacing. Fixed pitch
text is chopped immediately into character cells.
Proportional text is broken into words using definite
spaces and fuzzy spaces.
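The following C++ sketch illustrates this word-breaking
step; the gap thresholds, expressed relative to the
x-height, are guesses for illustration, not Tesseract's
actual constants.

    #include <vector>

    enum class GapKind { kNoSpace, kFuzzySpace, kDefiniteSpace };

    // Classify the gap between two neighbouring blobs on a
    // proportional text line. Wide gaps are definite word breaks;
    // borderline gaps are left "fuzzy" to be resolved later.
    GapKind ClassifyGap(int gap_width, int x_height) {
      if (gap_width > x_height / 2) return GapKind::kDefiniteSpace;
      if (gap_width > x_height / 4) return GapKind::kFuzzySpace;
      return GapKind::kNoSpace;
    }

    // Fixed pitch text needs no such analysis: once the pitch is
    // known, the line is cut at every character cell boundary.
    std::vector<int> FixedPitchCuts(int line_left, int line_right, int pitch) {
      std::vector<int> cuts;
      for (int x = line_left + pitch; x < line_right; x += pitch)
        cuts.push_back(x);
      return cuts;
    }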
Recognition then proceeds as a two-pass process. In
the first pass, an attempt is made to recognize each
word in turn. Each word that is satisfactory is passed to
an adaptive classifier as training data. The adaptive
classifier then gets a chance to more accurately
recognize text lower down the page.
Since the adaptive classifier may have learned
something useful too late to make a contribution near
the top of the page, a second pass is run over the page,
in which words that were not recognized well enough
are recognized again.
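The two-pass control flow can be sketched in C++ as
follows; Word, AdaptiveClassifier, and their members are
illustrative stand-ins for Tesseract's internal types,
not its actual code.

    #include <string>
    #include <vector>

    struct Word {
      std::string text;
      bool satisfactory = false;  // recognized with enough confidence?
    };

    struct AdaptiveClassifier {
      // Accumulate the shapes of a confidently recognized word as
      // page-specific templates (stub).
      void Train(const Word& w) {}
      // (Re)classify a word, using any templates learned so far (stub).
      void Classify(Word* w) {}
    };

    void RecognizePage(std::vector<Word>* words) {
      AdaptiveClassifier adaptive;
      // Pass 1: recognize words in reading order, training the
      // adaptive classifier on the satisfactory ones so that it can
      // help on text lower down the page.
      for (Word& w : *words) {
        adaptive.Classify(&w);
        if (w.satisfactory) adaptive.Train(w);
      }
      // Pass 2: revisit words that were not recognized well enough,
      // now that the classifier has adapted to the whole page.
      for (Word& w : *words) {
        if (!w.satisfactory) adaptive.Classify(&w);
      }
    }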
A final phase resolves fuzzy spaces, and checks
alternative hypotheses for the x-height to locate small-
cap text.
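A minimal C++ sketch of this final phase, assuming a
simple score comparison for fuzzy spaces and a height
test for small caps (both illustrative, not Tesseract's
actual logic):

    #include <cmath>

    struct Reading { float score; };

    // A fuzzy space is resolved by keeping whichever alternative the
    // recognizer scores higher once full context is available.
    bool KeepSpace(const Reading& with_space, const Reading& without_space) {
      return with_space.score >= without_space.score;
    }

    // Small caps look like capitals rendered at roughly x-height, so
    // a "capital" letter whose height matches the x-height hypothesis
    // better than the cap-height hypothesis suggests small-cap text.
    bool LooksLikeSmallCaps(float letter_height, float x_height,
                            float cap_height) {
      return std::fabs(letter_height - x_height) <
             std::fabs(letter_height - cap_height);
    }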