A Robust Registration Method for Batch Table OCR Processing


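This document presents a robust table registration method for batch table OCR processing (A Robust Table Registration Method for Batch Table OCR Processing). When large numbers of scanned documents are digitized, table images can be heavily degraded by scanning effects, binarization, or the condition of the document itself, which makes the structural information hard to recover. For batches of images that share one table structure, a table model is normally provided in advance and is used to overcome most of these quality problems. Only rough precision is required for the cell dimensions, which makes supplying the table model a comparatively easy task. The main goal of the method is to improve table recognition performance in settings such as Multilingual Automatic Document Classification Analysis and Translation (MADCAT). The authors, Jinyu Zuo and Esin Darici, are with Polar Rain, Inc., 900 E. Hamilton, Suite 100, Campbell, California. Their work focuses on making table recognition more stable and accurate without requiring a highly precise table model, so that it can handle table data under a wide range of difficult conditions.

In practice, the method first preprocesses the input scanned table image, possibly including steps such as noise removal and edge detection, to make the table structure clearer. Feature extraction (such as detection of row lines, column lines, and cell boundaries) is then used to identify the basic frame of the table. To handle tables of different sizes and layouts, the method may use template matching or machine learning techniques, such as convolutional neural networks (CNNs), to match and score candidate regions and find the best table layout.

In the evaluation, although the precision of the table model is not the main concern, the paper indicates that the method achieves relatively good recognition results even when the table dimensions deviate from the model. The authors tested the method on the MADCAT data set, and the results show good performance in batch processing, with practical value for multilingual documents and complex table structures.

Keywords: table registration, MADCAT, document processing

This work offers a practical and robust solution for large, varied collections of table images and helps improve the efficiency and accuracy of OCR systems in real document processing. As the demand for digitization grows, this robust table registration method may play an important role in areas such as document management and information extraction.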

A Robust Table Registration Method for Batch Table OCR Processing

Jinyu Zuo

Polar Rain, Inc.

900 E. Hamilton, Suite 100

Campbell, CA, 95008

jinyu.zuo@polarrain.com

Esin Darici

Polar Rain, Inc.

900 E. Hamilton, Suite 100

Campbell, CA, 95008

esin.darici@polarrain.com

ABSTRACT

A robust table registration method is proposed in this paper for a better understanding of structured information from scanned table images. Scanned images can be heavily degraded because of scanning effects, binarization, or simply the document itself. For batch processing of images with the same table structure, the table model is normally provided and can be used to overcome most challenging quality factors. The given table model is used as the ground truth in this paper. However, only rough precision is needed on table cell dimensions, which makes providing the table model an easier task. The method was tested on Multilingual Automatic Document Classification Analysis and Translation (MADCAT) images, and promising performance was achieved.

Keywords

Table registration, MADCAT, document processing

1. INTRODUCTION

While digitizing existing documents using Optical Character Recognition (OCR) technology is already a challenging task, putting table contents back into the original table structure correctly is another active research topic. Many receipts, transcripts, library indexes, and employee personal information cards need to be digitized. How to put the recognized handwritten information back into the proper field, such as putting the name recognized from the name field of the image into the name field in the XML file, is the topic discussed in this paper.

Surveys of table processing techniques and algorithms can be found in [4] and [2]. Most techniques in those surveys address table detection or table structure analysis rather than registration. The most similar work in the literature is [3], where an unsupervised table registration algorithm was proposed to recognize tabular contents. However, their data were limited to census records.

For most normal tables, the table structure is defined by a set of horizontal and vertical lines. Those lines format table contents into rows and columns. It is possible to estimate the table structure directly from the image, but in this paper it is assumed that the table structure, which includes the table header, the number of rows and columns, and even the approximate cell size, is provided. The task of table registration is to match the table model back to the scanned image and then put the OCR results (words or paragraphs) back into the table structure, such as an XML file, for better information organization. Figure 1 is an example table image (a receipt) scanned at 300 dpi resolution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.

MOCR ’13, August 24, 2013, Washington, DC, USA

Figure 1: A typical receipt from China, scanned as a 300 dpi binary image.
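The last step of registration described in the introduction, putting OCR results back into a structured file, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the table cells have already been registered to the image as axis-aligned rectangles, that the OCR engine returns one bounding box per word, and that a word belongs to the cell containing the center of its box. The function names, the `<table>` and `<field>` element names, and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET


def assign_words_to_cells(cells, words):
    """Place each OCR word in the cell whose rectangle contains the word's center.

    cells: dict mapping a field name to (left, top, right, bottom) in pixels.
    words: list of (text, (x0, y0, x1, y1)) pairs in reading order.
    Words whose center falls outside every cell are ignored.
    """
    contents = {name: [] for name in cells}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for name, (left, top, right, bottom) in cells.items():
            if left <= cx < right and top <= cy < bottom:
                contents[name].append(text)
                break
    return contents


def to_xml(contents):
    """Serialize the per-field word lists as a flat XML document (hypothetical schema)."""
    root = ET.Element("table")
    for name, texts in contents.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = " ".join(texts)
    return ET.tostring(root, encoding="unicode")


# Hypothetical registered cells and OCR output for a two-field form.
cells = {"name": (0, 0, 200, 50), "date": (200, 0, 400, 50)}
words = [
    ("Li", (10, 10, 40, 40)),
    ("Wei", (50, 10, 100, 40)),
    ("2013-08-24", (220, 10, 380, 40)),
]
xml_text = to_xml(assign_words_to_cells(cells, words))
print(xml_text)
```

The center-in-rectangle rule is only one plausible choice; it tolerates word boxes that slightly cross a cell border, which is common in degraded scans.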

1.1 Table Models

Based on the assumption that the table can be divided into R rows and C columns, a table can be basically outlined using R + 1 row lines and C + 1 column lines, and all table cells are rectangular. If a cell spans more than one row or column, then the table line segments inside this cell are “erased”. Table models were provided as HTML (or XML) files listing a series of cells with their specifications: row index number, column index number, the number of rows spanned, the number of columns spanned, cell height, and cell width. Some extra information may also be provided, such as whether the cell is open on the left or right side. Based on those table model files, tables can be fully reproduced. For example, the given table model that matches the sample image provided in Fig. 1 is plotted in Fig. 2.
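The table model described above can be sketched as a small data structure. The following Python example is illustrative only and is not the authors' file format or code. It assumes 0-based row and column indices, assumes that every row and every column contains at least one non-spanning cell from which its height or width can be read, and uses hypothetical names (`Cell`, `grid_lines`, `erased_segments`). It derives the R + 1 row lines and C + 1 column lines and lists the line segments that are "erased" inside spanning cells.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Cell:
    row: int        # row index (0-based, an assumption of this sketch)
    col: int        # column index (0-based)
    row_span: int   # number of rows the cell spans
    col_span: int   # number of columns the cell spans
    height: float   # cell height in pixels
    width: float    # cell width in pixels


def grid_lines(cells, n_rows, n_cols):
    """Return the R + 1 row-line y positions and C + 1 column-line x positions.

    Row heights and column widths are read from non-spanning cells only.
    """
    row_h = [0.0] * n_rows
    col_w = [0.0] * n_cols
    for c in cells:
        if c.row_span == 1:
            row_h[c.row] = c.height
        if c.col_span == 1:
            col_w[c.col] = c.width
    row_lines = list(accumulate(row_h, initial=0.0))
    col_lines = list(accumulate(col_w, initial=0.0))
    return row_lines, col_lines


def erased_segments(cells):
    """List the grid-line segments hidden inside cells that span rows or columns.

    A horizontal segment is (row_line_index, col); a vertical one is
    (col_line_index, row).
    """
    horizontal, vertical = set(), set()
    for c in cells:
        for line in range(c.row + 1, c.row + c.row_span):
            for col in range(c.col, c.col + c.col_span):
                horizontal.add((line, col))
        for line in range(c.col + 1, c.col + c.col_span):
            for row in range(c.row, c.row + c.row_span):
                vertical.add((line, row))
    return horizontal, vertical


# Hypothetical 2 x 2 model whose first row is a single cell spanning both columns.
model = [
    Cell(row=0, col=0, row_span=1, col_span=2, height=40, width=300),
    Cell(row=1, col=0, row_span=1, col_span=1, height=60, width=100),
    Cell(row=1, col=1, row_span=1, col_span=1, height=60, width=200),
]
row_lines, col_lines = grid_lines(model, n_rows=2, n_cols=2)
horizontal, vertical = erased_segments(model)
print(row_lines, col_lines, horizontal, vertical)
```

In this example the only erased segment is the part of the middle column line that would otherwise cross the spanning header cell in row 0.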

1.2 Challenges

Even when the table model is provided, several factors make table registration a challenging task.

1.2.1 Cell Size Variance
