©ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for
redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona,
Spain, July 25, 2009. http://doi.acm.org/10.1145/1577802.1577804.
Adapting the Tesseract Open Source OCR Engine for
Multilingual OCR
Ray Smith Daria Antonova Dar-Shyang Lee
Google Inc., 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA.
Abstract
We describe efforts to adapt the Tesseract open source OCR
engine for multiple scripts and languages. Effort has been
concentrated on enabling generic multi-lingual operation such
that negligible customization is required for a new language
beyond providing a corpus of text. Although change was required
to various modules, including physical layout analysis and
linguistic post-processing, no change was required to the
character classifier beyond changing a few limits. The Tesseract
classifier has adapted easily to Simplified Chinese. Test results on
English, a mixture of European languages, and Russian, taken
from a random sample of books, show a reasonably consistent
word error rate between 3.72% and 5.78%, and Simplified
Chinese has a character error rate of only 3.77%.
Keywords
Tesseract, Multi-Lingual OCR.
1. Introduction
Research interest in Latin-based OCR faded away more than a
decade ago, in favor of Chinese, Japanese, and Korean (CJK)
[1,2], followed more recently by Arabic [3,4], and then Hindi
[5,6]. These languages provide greater challenges specifically to
classifiers, and also to the other components of OCR systems.
Chinese and Japanese share the Han script, which contains
thousands of different character shapes. Korean uses the Hangul
script, which has several thousand more of its own, as well as
using Han characters. The number of characters is one or two
orders of magnitude greater than in Latin. Arabic is mostly written
with connected characters, and its characters change shape
according to the position in a word. Hindi combines a small
number of alphabetic letters into thousands of shapes that
represent syllables. As the letters combine, they form ligatures
whose shape only vaguely resembles the original letters. Hindi
thus combines the problems of CJK and Arabic, and joins all the
symbols in a word with a line called the shiro-rekha.
Research approaches have used language-specific work-arounds
to avoid the problems in some way, since that is simpler than
trying to find a solution that works for all languages. For instance,
the large character sets of Han, Hangul, and Hindi are mostly
made up of a much smaller number of components, known as
radicals in Han, Jamo in Hangul, and letters in Hindi. Since it is
much easier to develop a classifier for a small number of classes,
one approach has been to recognize the radicals [1, 2, 5] and infer
the actual characters from the combination of radicals. This
approach is easier for Hangul than for Han or Hindi, since the
radicals don't change shape much in Hangul characters, whereas
in Han, the radicals often are squashed to fit in the character and
mostly touch other radicals. Hindi takes this a step further by
changing the shape of the consonants when they form a conjunct
consonant ligature. Another example is a language-specific
work-around for Arabic, where it is difficult to determine the
character boundaries needed to segment connected components into
characters. A commonly used method is to classify individual
vertical pixel strips, each of which is a partial character, and
combine the classifications with a Hidden Markov Model that
models the character boundaries [3].
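The tractability of the component-based approach for Hangul can be made concrete: the roughly 11,000 precomposed Hangul syllables are generated algorithmically from about 70 Jamo letters, so a recognizer for the Jamo classes can cover the full syllable set. The sketch below (not part of the paper's system) shows the standard Unicode decomposition arithmetic, checked against Python's NFD normalization:

```python
import unicodedata

# Hangul syllable block U+AC00..U+D7A3 is composed algorithmically from
# 19 leading consonants (choseong), 21 vowels (jungseong), and
# 28 optional trailing consonants (jongseong, index 0 = none).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(syllable: str) -> list[str]:
    """Split one precomposed Hangul syllable into its Jamo letters."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < 19 * V_COUNT * T_COUNT:
        return [syllable]  # not a precomposed Hangul syllable
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    tail = index % T_COUNT
    jamo = [chr(lead), chr(vowel)]
    if tail:
        jamo.append(chr(T_BASE + tail))
    return jamo

# '한' decomposes into HIEUH + A + NIEUN; NFD normalization agrees.
assert ''.join(decompose_hangul('한')) == unicodedata.normalize('NFD', '한')
print(decompose_hangul('한'))
```

Because this mapping is deterministic and invertible, a Jamo-level classifier combined with the composition rule recovers the syllable exactly; the harder part, as noted above, is that in Han and Hindi the visual components deform and touch, so no such clean decomposition of the image exists.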
Google is committed to making its services available in as many
languages as possible [7], so we are also interested in adapting the
Tesseract Open Source OCR Engine [8, 9] to many languages.
This paper discusses our efforts so far in fully internationalizing
Tesseract, and the surprising ease with which some of it has been
possible. Our approach is to use language-generic methods, to
minimize the manual effort required to cover many languages.
2. Review Of Tesseract For Latin
Fig. 1 is a block diagram of the basic components of Tesseract.
The new page layout analysis for Tesseract [10] was designed
from the beginning to be language-independent, but the rest of the
engine was developed for English, without a great deal of thought
as to how it might work for other languages. After noting that the
commercial engines at the time were strictly for black-on-white
text, one of the original design goals of Tesseract was that it
should recognize white-on-black (inverse video) text as easily as
black-on-white. This led the design (fortuitously as it turned out)
in the direction of connected component (CC) analysis and
operating on outlines of the components. The first step after CC
[Figure 1 pipeline: Input: Binary Image → Page Layout Analysis →
(component outlines in text regions) → Blob Finding → (character
outlines in text regions) → Find Text Lines and Words → (character
outlines organized into words) → Recognize Word Pass 1 → Fuzzy
Space & x-height Fix-up → Recognize Word Pass 2 → Output: Text.]
Figure 1. Top-level block diagram of Tesseract.
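The connected component analysis mentioned above can be illustrated with a minimal sketch. Tesseract's actual CC finder is written in C++ and operates on component outlines; the hypothetical Python version below shows only the core idea, a breadth-first flood fill that labels 4-connected foreground pixels in a binary image. Note that swapping which polarity counts as foreground is all it takes to handle white-on-black text, which is the design property the paper highlights:

```python
from collections import deque

def connected_components(image):
    """Label 4-connected foreground components in a binary image.

    `image` is a list of rows of 0/1 ints; returns a parallel grid of
    component labels (0 = background) and the component count.
    """
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if image[y][x] and not labels[y][x]:
                next_label += 1            # start a new component
                labels[y][x] = next_label
                queue = deque([(y, x)])
                while queue:               # flood-fill its neighbors
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label

img = [[1, 1, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 1]]
labels, count = connected_components(img)
print(count)  # 2: one component top-left, one on the right edge
```

Operating on the outlines of such components, rather than on raw pixels, is what lets the rest of the engine remain indifferent to text polarity.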