An End-to-End Trainable Neural Network for Image-based Sequence
Recognition and Its Application to Scene Text Recognition
Baoguang Shi, Xiang Bai and Cong Yao
School of Electronic Information and Communications
Huazhong University of Science and Technology, Wuhan, China
{shibaoguang,xbai}@hust.edu.cn, yaocong2010@gmail.com
Abstract
Image-based sequence recognition has been a long-
standing research topic in computer vision. In this pa-
per, we investigate the problem of scene text recognition,
which is among the most important and challenging tasks
in image-based sequence recognition. A novel neural net-
work architecture, which integrates feature extraction, se-
quence modeling and transcription into a unified frame-
work, is proposed. Compared with previous systems for
scene text recognition, the proposed architecture possesses
four distinctive properties: (1) It is end-to-end trainable,
in contrast to most of the existing algorithms whose compo-
nents are separately trained and tuned. (2) It naturally han-
dles sequences of arbitrary lengths, involving no character
segmentation or horizontal scale normalization. (3) It is not
confined to any predefined lexicon and achieves remarkable
performances in both lexicon-free and lexicon-based scene
text recognition tasks. (4) It generates an effective yet much
smaller model, which is more practical for real-world ap-
plication scenarios. The experiments on standard bench-
marks, including the IIIT-5K, Street View Text and ICDAR
datasets, demonstrate the superiority of the proposed algo-
rithm over prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which clearly demonstrates its generality.
1. Introduction
Recently, the community has seen a strong revival of
neural networks, which is mainly stimulated by the great
success of deep neural network models, specifically Deep
Convolutional Neural Networks (DCNN), in various vision
tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are con-
cerned with a classic problem in computer vision: image-
based sequence recognition. In the real world, a variety of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, rather than in isolation. Unlike general object recognition, recognizing such
sequence-like objects often requires the system to predict
a series of object labels, instead of a single label. There-
fore, recognition of such objects can be naturally cast as a
sequence recognition problem. Another unique property of
sequence-like objects is that their lengths may vary drasti-
cally. For instance, English words can consist of either 2 characters, such as “OK”, or 15 characters, such as “congratulations”. Consequently, the most popular deep models like
DCNN [25, 26] cannot be directly applied to sequence pre-
diction, since DCNN models often operate on inputs and
outputs with fixed dimensions, and thus are incapable of
producing a variable-length label sequence.
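To make the fixed-dimension limitation concrete, the following minimal PyTorch-style sketch (illustrative only; the layer sizes and class set are assumptions, not the architecture proposed in this paper) shows that a conventional CNN classifier ends in a fully connected layer whose output size is fixed at construction time, so it emits exactly one label per image and can never produce a label sequence of varying length.

```python
# Minimal sketch (not the proposed architecture): a plain CNN classifier
# whose head is a fixed-size fully connected layer. All shapes are
# illustrative assumptions.
import torch
import torch.nn as nn

class FixedOutputCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # collapses to a fixed spatial size
        )
        # The classifier maps to a *fixed* number of outputs decided up front.
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = FixedOutputCNN(num_classes=26)   # e.g. one of 26 character classes
word_image = torch.randn(1, 1, 32, 100)  # a word image of arbitrary width
logits = model(word_image)
print(logits.shape)  # torch.Size([1, 26]): always a single label,
                     # never a sequence such as the 15 labels of "congratulations"
```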
Some attempts have been made to address this problem
for a specific sequence-like object (e.g. scene text). For
example, the algorithms in [35, 8] first detect individual
characters and then recognize these detected characters with
DCNN models, which are trained using labeled character
images. Such methods often require training a strong char-
acter detector for accurately detecting and cropping each
character out from the original word image. Some other
approaches (such as [22]) treat scene text recognition as
an image classification problem, and assign a class label
to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese text and musical scores, because the number of basic combinations of such sequences can exceed one million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.
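As a rough back-of-envelope illustration (the feature dimension below is an assumption for illustration, not a figure reported in [22]), even the output layer alone of such a per-word classifier is very large, and the class count required for richer label sets grows far beyond what a single softmax can accommodate:

```python
# Back-of-envelope sketch with assumed numbers (not taken from [22]):
# one class per English word makes the final classification layer huge.
feature_dim = 4096          # assumed dimensionality of the last feature layer
num_word_classes = 90_000   # one class per word in a 90K-word English lexicon

final_layer_params = feature_dim * num_word_classes + num_word_classes  # weights + biases
print(f"~{final_layer_params / 1e6:.0f}M parameters in the output layer alone")
# ~369M parameters; for Chinese text or musical scores the number of distinct
# label sequences (i.e. classes) can exceed one million, so the
# one-class-per-sequence strategy does not scale.
```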
Recurrent neural network (RNN) models, another important branch of the deep neural network family, were
mainly designed for handling sequences. One of the ad-
vantages of RNN is that it does not require the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts