"Machine Learning in Document Analysis and Recognition"


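Machine learning plays an increasingly important role in document analysis and recognition. Edited by Simone Marinai and Hiromichi Fujisawa, this book surveys the broad range of machine learning applications in the field. The series editor-in-chief, Professor Janusz Kacprzyk of the Systems Research Institute, Polish Academy of Sciences, has studied this area in depth and contributes valuable insight to the book.

Machine learning techniques are already widely used in document analysis and recognition. As computer vision has advanced, recognizing and understanding the information in documents has become an important task. For example, optical character recognition (OCR) systems use machine learning algorithms to recognize printed or handwritten text and convert it into editable documents. Machine learning also plays an important part in document classification and information extraction, helping users search and analyze large document collections more quickly.

The book also covers recent advances. For example, deep-learning-based methods have made marked progress in image and text recognition, and text analysis systems that combine natural language processing with machine learning can perform tasks such as sentiment analysis and topic modeling. Beyond new techniques, the book discusses practical applications. In finance, machine learning is used to analyze large volumes of financial statements and help investors make better-informed decisions. In medicine, it is used to analyze medical documents and images, supporting doctors in diagnosis and treatment planning.

Professor Kacprzyk also summarizes the challenges and future directions of machine learning in document analysis and recognition. With the rapid development of big data and cloud computing, document data analysis problems are becoming increasingly complex, so how to apply machine learning to these challenges will be a focus of future research. Emerging techniques such as deep learning and transfer learning are also expected to play an important role, leading to more accurate and intelligent solutions.

In summary, the book examines in depth the application and development of machine learning in document analysis and recognition. As an authoritative reference, it both summarizes recent research results and looks ahead to future directions. It should offer useful guidance to researchers and engineers working in document analysis and recognition, and help advance the field.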

48 D. Malerba et al.

2 Background and Related Works

In the literature there are already several publications on reading order detection. A pioneering work is reported in [5], where multi-column and multi-article documents (e.g., magazine pages) with figures and photographs are handled. Each document page is described as a tree, where each node, except the root, represents a set of adjacent blocks located in the same column, ordered so that the uppermost block precedes the others. Direct descendants of an internal node are also ordered according to their locations, so that the block to the left and on the top precedes the others. Reading order detection follows a preliminary rough classification of layout components into “title” and “body”. Titles (heads) are blocks with only a few text lines in a large font, while bodies are blocks with several text lines in a small font. The reading order is extracted by applying hand-coded rules which transform the trees representing layout structures (with associated “title” and “body” labels) into ordered structures. Once the correct reading order is detected, a further interpretation step attaches logical labels (e.g., title, abstract, sub-title, paragraph) to each item of the ordered structure.

A similar tree-structured representation of the page layout is adopted in the work by Ishitani [6]. The structure is derived by a recursive XY-cut approach [7], that is, a recursive horizontal/vertical partitioning of the input image. The XY-cut process naturally determines the reading order of the layout components, since for horizontal cuts the top-bottom ordering is applied to the derived sections, while for vertical cuts the right-left (i.e., Japanese style) ordering is applied to the derived columns.
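The recursive partitioning described above can be sketched in a few lines. This is a minimal illustration, not the algorithm of [6] or [7]: it assumes layout components are already available as axis-aligned bounding boxes rather than pixels, it always takes the widest empty gap as the cut, and the names `xy_cut_order` and `_widest_gap` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[int, int, int, int]


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Optional[Tuple[float, float]]:
    """Return (gap_width, cut_position) for the widest empty gap along one axis.

    `lo` and `hi` are the tuple indices of the interval start and end
    (0 and 2 for the x axis, 1 and 3 for the y axis). Returns None if the
    projections of the boxes leave no gap.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best = None
    reach = intervals[0][1]  # farthest end seen so far
    for start, end in intervals[1:]:
        if start > reach:
            gap = start - reach
            if best is None or gap > best[0]:
                best = (gap, (reach + start) / 2)
        reach = max(reach, end)
    return best


def xy_cut_order(boxes: List[Box], right_to_left: bool = False) -> List[Box]:
    """Order boxes by recursive XY-cut, taking the widest gap at each step."""
    if len(boxes) <= 1:
        return list(boxes)
    h = _widest_gap(boxes, 1, 3)  # horizontal cut: empty band along y
    v = _widest_gap(boxes, 0, 2)  # vertical cut: empty band along x
    if h is None and v is None:
        # No cut is possible: fall back to a plain top-left sort.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if v is None or (h is not None and h[0] >= v[0]):
        # Horizontal cut: the upper section is read before the lower one.
        upper = [b for b in boxes if b[3] <= h[1]]
        lower = [b for b in boxes if b[1] >= h[1]]
        return xy_cut_order(upper, right_to_left) + xy_cut_order(lower, right_to_left)
    # Vertical cut: column order depends on the writing style.
    left = [b for b in boxes if b[2] <= v[1]]
    right = [b for b in boxes if b[0] >= v[1]]
    first, second = (right, left) if right_to_left else (left, right)
    return xy_cut_order(first, right_to_left) + xy_cut_order(second, right_to_left)


# A title spanning the page above two columns of two blocks each.
page = [(55, 65, 100, 100), (0, 20, 45, 60), (0, 0, 100, 10),
        (55, 20, 100, 60), (0, 65, 45, 100)]
for box in xy_cut_order(page):
    print(box)
```

In this example the gap between the columns is wider than the gap between paragraphs, so the widest-gap choice yields a sensible order. If the paragraph spacing were the wider of the two, the same rule would cut across both columns first and interleave them.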

The main problem with this XY-cut approach is that at each recursion step, there are often multiple possible, and possibly conflicting, cuts. In the original algorithm, the widest cut is selected at each recursion. While this strategy works reasonably well for a page segmentation task, it is not always appropriate for a reading order detection task. For this reason, Ishitani proposed a bottom-up approach using three heuristics which take into account local geometric features, text orientation and distance among vertically adjacent layout objects in order to merge some layout objects before performing the XY-cut. As observed by Meunier [8], this aims at reducing the probability of having to face multiple cutting alternatives, but it does not truly prevent them from occurring. For this reason, he proposed to reformulate the problem of recursively cutting a page as an optimization problem, and defined both a scoring function for alternative cuts, and a computationally tractable method for choosing the best partitioning.

A common aspect of all these approaches is that they are based exclusively on the spatial information conveyed by a page layout. In contrast, Taylor et al. [9] propose the use of linguistic information to define the proper reading order. For instance, to determine whether an article published in a magazine continues on the next page, it is suggested to look for a text such as ‘continued on next page’.

ML for Reading Order Detection in Document Image Understanding 49

The use of linguistic information has also been proposed by Aiello et al. [10], who described a document analysis system for logical labelling and reading order extraction of broad classes of documents. Each document object is described by means of both attributes (i.e., aspect ratio, area ratio, font size ratio, font style, content size, number of lines) and spatial relations (defined as extensions of Allen’s interval relations [11]). Only objects labelled with some logical labels (title and body) are considered for reading order. More precisely, two distinct reading orders are first detected for the document object types Title and Body, and then they are combined using a Title-Body connection rule. This rule connects one Title with the left-most top-most Body object situated below the Title. Each reading order is determined in two steps. Initially, spatial information on the document objects is exploited by a spatial reasoner which solves a constraint-satisfaction problem, where constraints correspond to general document encoding rules (e.g., “in Western culture, documents are usually read top-bottom and left-right”). The output of the spatial reasoner is a (cyclic) graph where edges represent instances of the partial ordering relation BeforeInReading. A reading order is then defined as a full path in this graph, and is determined by means of an extension of a standard topological sort [12]. Due to the generality of the document encoding rule used by the spatial reasoner, it is likely that one obtains more than one reading order, especially for complex documents with many blocks. For this reason, a natural language processor is used in the second step of the proposed method. The goal is to disambiguate between different reading orders on the basis of the textual information of logical objects. This step works by computing probabilities of sequences of words obtained by joining document objects which are candidates to follow one another in reading. The main strength of this work is the generality of the approach, which follows from the generality of the knowledge adopted in reasoning.
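The two-step idea (spatial constraints first, textual disambiguation second) can be illustrated with a toy sketch. It is not the system of [10]: it assumes the BeforeInReading relation is acyclic, it enumerates candidate orders by brute force instead of using an extended topological sort, and it stands in for the natural language processor with a hypothetical table of word-pair probabilities. All function names and the sample data are invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def all_reading_orders(nodes: List[str],
                       before: Set[Tuple[str, str]]) -> List[List[str]]:
    """Enumerate every total order consistent with the relation `before`.

    Assumes `before` is acyclic; with a cycle the result is an empty list.
    """
    preds = {n: {a for a, b in before if b == n} for n in nodes}
    orders: List[List[str]] = []

    def extend(prefix: List[str], remaining: List[str]) -> None:
        if not remaining:
            orders.append(list(prefix))
            return
        placed = set(prefix)
        for n in remaining:
            # A node may come next only if all its predecessors are placed.
            if preds[n] <= placed:
                extend(prefix + [n], [m for m in remaining if m != n])

    extend([], list(nodes))
    return orders


def join_score(order: List[str], text: Dict[str, str],
               pair_prob: Dict[Tuple[str, str], float]) -> float:
    """Score an order by the word pairs formed where consecutive objects meet."""
    score = 1.0
    for a, b in zip(order, order[1:]):
        last_word = text[a].split()[-1].lower()
        first_word = text[b].split()[0].lower()
        # Unseen pairs get a small floor probability (an arbitrary choice).
        score *= pair_prob.get((last_word, first_word), 1e-6)
    return score


# Spatial reasoning only says that A precedes both B and C.
nodes = ["A", "B", "C"]
before = {("A", "B"), ("A", "C")}
text = {"A": "The reading order of a", "B": "page is detected", "C": "by rules."}
pair_prob = {("a", "page"): 0.2, ("detected", "by"): 0.3}  # hypothetical values

candidates = all_reading_orders(nodes, before)
best = max(candidates, key=lambda o: join_score(o, text, pair_prob))
print(candidates)
print(best)
```

Brute-force enumeration grows factorially with the number of unconstrained objects, so it is only workable for a handful of blocks.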

Topological sorting is also exploited in the approach proposed by Breuel [13]. In particular, reading order is defined on the basis of text line segments, which are pairwise compared on the basis of four simple rules in order to determine a partial order. Then a topological sorting algorithm is applied to find at least one global order consistent with this partial order. Columns, paragraphs, and other layout features are determined on the basis of the spatial arrangement of text line segments in reading order. For instance, paragraph boundaries are indicated by relative indentation of consecutive text lines in reading order.
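The pattern of pairwise rules followed by a topological sort can be sketched as follows. The text does not list the four rules of [13], so the two rules below are illustrative assumptions, as are the bounding-box format and the names `before` and `reading_order`. The sort itself uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# Assumed text line format: (x0, y0, x1, y1), with y growing downward.
Line = Tuple[float, float, float, float]


def x_overlap(a: Line, b: Line) -> bool:
    """True if the horizontal extents of the two lines overlap."""
    return a[0] < b[2] and b[0] < a[2]


def before(a: Line, b: Line, lines: List[Line]) -> bool:
    """Illustrative pairwise rules deciding whether a is read before b."""
    # Assumed rule 1: horizontally overlapping lines are read top to bottom.
    if x_overlap(a, b):
        return a[1] < b[1]
    # Assumed rule 2: a lies entirely to the left of b, and no line that
    # spans both of them sits vertically between them.
    if a[2] <= b[0]:
        lo, hi = min(a[1], b[1]), max(a[1], b[1])
        return not any(x_overlap(c, a) and x_overlap(c, b) and lo < c[1] < hi
                       for c in lines)
    return False


def reading_order(lines: Dict[str, Line]) -> List[str]:
    """Build the partial order from the pairwise rules, then sort it."""
    boxes = list(lines.values())
    sorter: TopologicalSorter = TopologicalSorter()
    for name in lines:
        sorter.add(name)
    for a_name, a in lines.items():
        for b_name, b in lines.items():
            if a_name != b_name and before(a, b, boxes):
                sorter.add(b_name, a_name)  # a_name must precede b_name
    # static_order() raises graphlib.CycleError if the rules contradict.
    return list(sorter.static_order())


page = {
    "R2": (50, 35, 90, 45),
    "L1": (0, 20, 40, 30),
    "T": (0, 0, 90, 10),    # a title spanning both columns
    "R1": (50, 20, 90, 30),
    "L2": (0, 35, 40, 45),
}
print(reading_order(page))
```

When the rules leave several lines unordered, `static_order()` returns one valid order among possibly many, which matches the "at least one global order" wording above.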

All approaches reported above reflect a clear domain specificity. For instance, the classification of blocks as “title” and “body” is appropriate for magazine articles, but not for administrative documents. Moreover, the document encoding rules appropriate for Western-style documents are different from those for Japanese papers. Surprisingly, there is no work, to the best of our knowledge, that handles the reading order problem by resorting to machine learning techniques, which can generate the required knowledge from a set of training layout structures whose correct reading order has been provided by the user. In previous works on document image analysis and understanding, we investigated the application of machine learning techniques to several knowledge-based document image processing tasks, such as classification of blocks according to their content type [14], automatic global layout analysis correction [15], classification of documents into a set of pre-defined classes [16], and logical labelling [17]. Experimental results always proved the feasibility of this approach, at least on a small scale, that is, for a few hundred training document images. Therefore, following this line of research, herein we consider the problem of learning the definition of reading order.

The proposed solution has been tested by processing documents with WISDOM++, a knowledge-based document image processing system originally developed to transform multi-page printed documents into XML format. WISDOM++ makes extensive use of knowledge and XML technologies for semantic indexing of paper documents. This is a complex process involving several steps:

1. The image is segmented into basic layout components (basic blocks), which are classified according to the type of content (e.g., text, pictures and graphics).
2. A perceptual organization phase (layout analysis) is performed to detect a tree-like layout structure, which associates the content of a document with a hierarchy of layout components.
3. The first page is classified to identify the membership class (or type) of the multi-page document (e.g., scientific paper or magazine).
4. The layout structure of each page is mapped into the logical structure, which associates the content with a hierarchy of logical components (e.g., title or abstract of a scientific paper).
5. OCR is applied only to those logical components of interest for the application domain (e.g., title).
6. The XML file that represents the layout structure, the logical structure, and the textual content returned by the OCR for some specific logical components is generated.
7. XML documents are stored in a repository for future retrieval purposes.

Four of the seven processing steps make use of explicit knowledge expressed in the form of decision trees and rules which are automatically learned by means of two distinct machine learning systems: ITI [18], which returns decision trees useful for block classification (first step), and ATRE [19], which returns rules for layout analysis correction (second step) [15], document image classification (third step) and document image understanding (fourth step) [4]. As explained in Section 4, ATRE is also used to learn the intensional definition of two

http://www.di.uniba.it/~malerba/wisdom++/
