An Overview of the Tesseract OCR Engine
Ray Smith
Google Inc.
theraysmith@gmail.com
Abstract
This paper provides a comprehensive overview of the
Tesseract OCR engine, which competed as the HP
Research Prototype in the UNLV Fourth Annual Test of
OCR Accuracy [1]. Emphasis is placed on aspects that
are novel or at least unusual in an OCR engine, in
particular the line finding, the features/classification
methods, and the adaptive classifier.
1. Introduction – Motivation and History
Tesseract is an open-source OCR engine that was
developed at HP between 1984 and 1994. Like a
supernova, it appeared from nowhere for the 1995 UNLV
Annual Test of OCR Accuracy [1], shone brightly with
its results, and then vanished back under the same
cloak of secrecy under which it had been developed.
Now for the first time, details of the architecture and
algorithms can be revealed.
Tesseract began as a PhD research project [2] in HP
Labs, Bristol, and gained momentum as a possible
software and/or hardware add-on for HP’s line of
flatbed scanners. Motivation was provided by the fact
that the commercial OCR engines of the day were in
their infancy, and failed miserably on anything but the
best quality print.
After a joint project between HP Labs Bristol and
HP’s scanner division in Colorado, Tesseract had a
significant lead in accuracy over the commercial
engines, but did not become a product. The next stage
of its development was back in HP Labs Bristol as an
investigation of OCR for compression. Work
concentrated more on improving rejection efficiency
than on base-level accuracy. At the end of this project,
at the end of 1994, development ceased entirely. The
engine was sent to UNLV for the 1995 Annual Test of
OCR Accuracy [1], where it proved its worth against
the commercial engines of the time. In late 2005, HP
released Tesseract as open source. It is now available
at http://code.google.com/p/tesseract-ocr.
2. Architecture
Since HP had independently developed page layout
analysis technology that was used in products (and was
therefore not released as open source), Tesseract never
needed its own page layout analysis. Tesseract
therefore assumes that its input is a binary image with
optional polygonal text regions defined.
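As a rough illustration of this input contract, the
following C++ sketch models a binary page image with
optional polygonal text regions; the type and field
names are hypothetical, not Tesseract's actual API.

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of Tesseract's input contract: a binary
    // (1 bit deep) page image plus optional polygonal text regions
    // supplied by an external layout-analysis stage. All names here
    // are illustrative, not Tesseract's actual API.
    struct Point { int x, y; };

    struct BinaryImage {
      int width = 0;
      int height = 0;
      std::vector<std::uint8_t> pixels;  // one byte per pixel: 0 = white, 1 = black
      bool IsBlack(int x, int y) const { return pixels[y * width + x] != 0; }
    };

    struct PageInput {
      BinaryImage image;
      // Each region is a polygon outlining one block of text to be
      // recognized; if empty, the whole page is one region.
      std::vector<std::vector<Point>> text_regions;
    };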
Processing follows a traditional step-by-step
pipeline, but some of the stages were unusual in their
day, and possibly remain so even now. The first step is
a connected component analysis in which outlines of
the components are stored. This was a computationally
expensive design decision at the time, but had a
significant advantage: by inspection of the nesting of
outlines, and the number of child and grandchild
outlines, it is simple to detect inverse text and
recognize it as easily as black-on-white text. Tesseract
was probably the first OCR engine able to handle
white-on-black text so trivially. At this stage, outlines
are gathered together, purely by nesting, into Blobs.
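A rough C++ sketch of that nesting heuristic follows;
the Outline type and the thresholds are illustrative
assumptions, not Tesseract's actual data structures or
constants.

    #include <vector>

    // Hypothetical sketch of the outline-nesting heuristic described
    // above.
    struct Outline {
      std::vector<const Outline*> children;  // outlines directly enclosed
    };

    // Count outlines nested exactly `depth` levels below `o`.
    static int CountAtDepth(const Outline& o, int depth) {
      if (depth == 0) return 1;
      int n = 0;
      for (const Outline* child : o.children)
        n += CountAtDepth(*child, depth - 1);
      return n;
    }

    // A black component whose outline encloses many child outlines
    // (the white character shapes), which in turn enclose grandchild
    // outlines (the characters' own holes), is likely a filled box of
    // white-on-black text rather than an ordinary character.
    bool LooksLikeInverseText(const Outline& region) {
      int children = CountAtDepth(region, 1);
      int grandchildren = CountAtDepth(region, 2);
      return children > 4 && grandchildren > 0;  // thresholds are guesses
    }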
Blobs are organized into text lines, and the lines and
regions are analyzed for fixed pitch or proportional
text. Text lines are broken into words differently
according to the kind of character spacing. Fixed pitch
text is chopped immediately into character cells.
Proportional text is broken into words using definite
spaces and fuzzy spaces.
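The following C++ sketch illustrates this word-breaking
step; the gap thresholds, expressed relative to the
x-height, are guesses for illustration, not Tesseract's
actual constants.

    #include <vector>

    enum class GapKind { kNoSpace, kFuzzySpace, kDefiniteSpace };

    // Classify the gap between two neighbouring blobs on a
    // proportional text line. Wide gaps are definite word breaks;
    // borderline gaps are left "fuzzy" to be resolved later.
    GapKind ClassifyGap(int gap_width, int x_height) {
      if (gap_width > x_height / 2) return GapKind::kDefiniteSpace;
      if (gap_width > x_height / 4) return GapKind::kFuzzySpace;
      return GapKind::kNoSpace;
    }

    // Fixed pitch text needs no such analysis: once the pitch is
    // known, the line is cut at every character cell boundary.
    std::vector<int> FixedPitchCuts(int line_left, int line_right, int pitch) {
      std::vector<int> cuts;
      for (int x = line_left + pitch; x < line_right; x += pitch)
        cuts.push_back(x);
      return cuts;
    }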
Recognition then proceeds as a two-pass process. In
the first pass, an attempt is made to recognize each
word in turn. Each word that is satisfactory is passed to
an adaptive classifier as training data. The adaptive
classifier then gets a chance to more accurately
recognize text lower down the page.
Since the adaptive classifier may have learned
something useful too late to make a contribution near
the top of the page, a second pass is run over the page,
in which words that were not recognized well enough
are recognized again.
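The two-pass control flow can be sketched in C++ as
follows; Word, AdaptiveClassifier, and their members are
illustrative stand-ins for Tesseract's internal types,
not its actual code.

    #include <string>
    #include <vector>

    struct Word {
      std::string text;
      bool satisfactory = false;  // recognized with enough confidence?
    };

    struct AdaptiveClassifier {
      // Accumulate the shapes of a confidently recognized word as
      // page-specific templates (stub).
      void Train(const Word& w) {}
      // (Re)classify a word, using any templates learned so far (stub).
      void Classify(Word* w) {}
    };

    void RecognizePage(std::vector<Word>* words) {
      AdaptiveClassifier adaptive;
      // Pass 1: recognize words in reading order, training the
      // adaptive classifier on the satisfactory ones so that it can
      // help on text lower down the page.
      for (Word& w : *words) {
        adaptive.Classify(&w);
        if (w.satisfactory) adaptive.Train(w);
      }
      // Pass 2: revisit words that were not recognized well enough,
      // now that the classifier has adapted to the whole page.
      for (Word& w : *words) {
        if (!w.satisfactory) adaptive.Classify(&w);
      }
    }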
A final phase resolves fuzzy spaces, and checks
alternative hypotheses for the x-height to locate small-
cap text.
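A minimal C++ sketch of this final phase, assuming a
simple score comparison for fuzzy spaces and a height
test for small caps (both illustrative, not Tesseract's
actual logic):

    #include <cmath>

    struct Reading { float score; };

    // A fuzzy space is resolved by keeping whichever alternative the
    // recognizer scores higher once full context is available.
    bool KeepSpace(const Reading& with_space, const Reading& without_space) {
      return with_space.score >= without_space.score;
    }

    // Small caps look like capitals rendered at roughly x-height, so
    // a "capital" letter whose height matches the x-height hypothesis
    // better than the cap-height hypothesis suggests small-cap text.
    bool LooksLikeSmallCaps(float letter_height, float x_height,
                            float cap_height) {
      return std::fabs(letter_height - x_height) <
             std::fabs(letter_height - cap_height);
    }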