Zobeir Raisi et al.
Recently, several works [47, 48, 115, 116] have treated
scene text detection as an instance segmentation problem;
an example is shown in Fig. 4(b). Many of them apply the
Mask R-CNN [112] framework to improve the performance
of scene text detection, since it performs instance segmentation
of text regions together with semantic segmentation of
word-text instances, and thereby makes it possible to detect
text instances of arbitrary shape. For example, inspired
by Mask R-CNN, SPCNET [116] uses a text context
module to detect text of arbitrary shape and a re-scoring
mechanism to suppress false positives.
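One step such segmentation-based detectors share is turning a predicted binary text mask into one box per text instance. The stdlib-only Python sketch below illustrates that step under simplifying assumptions (4-connectivity, axis-aligned boxes, a toy hand-written mask); the function name and the mask are illustrative, not taken from any cited method:

```python
from collections import deque

def masks_to_boxes(mask):
    """Group foreground pixels (1s) of a binary mask into connected
    components (4-connectivity) and return one (x0, y0, x1, y1) box
    per component, i.e. per detected text instance."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS over one component, tracking its spatial extent
                x0 = x1 = sx
                y0 = y1 = sy
                q = deque([(sy, sx)])
                seen[sy][sx] = True
                while q:
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

# Toy mask containing two separate "text" regions
mask = [[0, 1, 1, 0, 0, 0],
        [0, 1, 1, 0, 1, 1],
        [0, 0, 0, 0, 1, 1]]
print(masks_to_boxes(mask))  # → [(1, 0, 2, 1), (4, 1, 5, 2)]
```

Real systems fit rotated rectangles or polygons to each component rather than axis-aligned boxes, which is where the border-pixel errors discussed next become harmful.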
However, the methods in [48, 115, 116] have some drawbacks
that may degrade their performance. First, they
suffer from bounding-box handling errors in complicated
backgrounds, where the predicted bounding box fails
to cover the whole text instance. Second, these methods
[48, 115, 116] aim at separating text pixels from background
pixels, which can lead to many mislabeled pixels at the
text borders [47].
Hybrid methods [91, 96, 97, 117] use a segmentation-based
approach to predict text score maps and then
obtain text bounding boxes through regression. For example,
the single-shot text detector (SSTD) [91] used an attention
mechanism to enhance text regions in the image and
reduce background interference at the feature level. Liao
et al. [96] proposed rotation-sensitive regression for oriented
scene text detection, which makes full use of rotation-invariant
features by actively rotating the convolutional filters.
However, this method is incapable of capturing all the
other possible text shapes that exist in scene images [46].
Lyu et al. [97] presented a method that detects and groups
corner points of text regions to generate text boxes. Besides
detecting long oriented text and handling considerable
variation in aspect ratio, this method requires only simple
post-processing. Liu et al. [47] proposed a new Mask R-CNN-based
framework, the pyramid mask text detector (PMTD),
which assigns a soft pyramid label, l ∈ [0, 1],
to each pixel in a text instance and then reinterprets the
resulting 2D soft mask in 3D space. They also introduced
a novel plane clustering algorithm that infers the optimal
text box from the 3D shape, achieving state-of-the-art
performance on the ICDAR LSVT dataset [118]. PMTD
also achieved superior performance on multi-oriented text
datasets [62, 119]. However, because its framework is designed
explicitly for multi-oriented text, it is hard to apply
to curved-text datasets [63, 120], and only bounding-box
results obtained with polygonal bounding boxes are reported
in the paper.
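To make the soft pyramid label concrete, the sketch below computes it for a single pixel in the simplified axis-aligned case: the label is 1 at the box centre and decays linearly to 0 at the border. This is an illustration of the labelling idea only; PMTD itself handles arbitrary quadrilaterals, and the function name is our own:

```python
def pyramid_label(x, y, x0, y0, x1, y1):
    """Soft pyramid label l in [0, 1] for pixel (x, y) in the
    axis-aligned box (x0, y0, x1, y1): 1 at the box centre,
    decaying linearly to 0 at the border (0 outside the box)."""
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    # Distance to the nearest vertical / horizontal edge,
    # normalised so the box centre maps to 1
    lx = min(x - x0, x1 - x) / ((x1 - x0) / 2)
    ly = min(y - y0, y1 - y) / ((y1 - y0) / 2)
    return min(lx, ly)

# A box spanning [0, 4] x [0, 4], centred at (2, 2)
print(pyramid_label(2, 2, 0, 0, 4, 4))  # → 1.0 (apex at the centre)
print(pyramid_label(1, 2, 0, 0, 4, 4))  # → 0.5 (halfway to an edge)
print(pyramid_label(0, 3, 0, 0, 4, 4))  # → 0.0 (on the border)
```

Stacking these labels over the box yields the 2D soft mask whose 3D interpretation (label as height) PMTD's plane clustering operates on.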
In the following section, we describe scene text recognition
algorithms.
2.2 Text Recognition
The scene text recognition task aims to convert detected text
regions into characters or words. Case-sensitive character
classes often consist of 10 digits, 26 lowercase letters, 26
uppercase letters, 32 ASCII punctuation marks, and an
end-of-sentence (EOS) symbol. However, the text recognition
models proposed in the literature use different choices of
character classes; Table 3 lists the number used by each model.
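The 95-class case-sensitive alphabet described above can be enumerated directly from Python's `string` module constants; this sketch builds the class list and the character-to-index map a recognition model's output layer would use (the `"<EOS>"` token is a placeholder of our choosing, not a standard symbol):

```python
import string

# 10 digits + 26 lowercase + 26 uppercase + 32 ASCII punctuation
# marks + 1 end-of-sentence (EOS) symbol = 95 classes
classes = (list(string.digits)            # 10
           + list(string.ascii_lowercase)  # 26
           + list(string.ascii_uppercase)  # 26
           + list(string.punctuation)      # 32
           + ["<EOS>"])                    # 1
print(len(classes))  # → 95

# Char-to-index map, e.g. for indexing a softmax output layer
char_to_idx = {c: i for i, c in enumerate(classes)}
```

Case-insensitive variants simply drop the uppercase block (and often the punctuation), which is why the class counts in Table 3 differ across models.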
Since the properties of scene text images differ
from those of scanned documents, it is difficult to develop
an effective text recognition method based on classical
machine learning, such as [121–126]. As mentioned
in Section 1, this is because images captured in the
wild tend to include text under various challenging conditions,
such as low resolution [64, 127], extreme lighting
[64, 127], and adverse environmental conditions [62, 114], as
well as varied fonts [62, 114, 128], orientation
angles [63, 128], languages [119], and lexicons [64, 127].
Researchers have proposed different techniques to address these
challenges, which can be categorized into classical
machine learning-based [20, 28, 29, 64, 84, 123] and deep
learning-based [32, 52–55, 129–140] text recognition
methods; both are discussed in the rest of this section.
2.2.1 Classical Machine Learning-based Methods
In the past two decades, traditional scene text recognition
methods [28, 29, 123] have used standard image features,
such as HOG [73] and SIFT [141], with a classical machine
learning classifier, such as an SVM [142] or k-nearest
neighbors [143], after which a statistical language model or visual
structure prediction is applied to prune out misclassified
characters [1, 80].
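The pipeline just described can be sketched as follows. This stdlib-only Python illustration keeps only the pipeline's shape: the feature extractor, character classifier, and word scorer are toy stand-ins for the HOG features, SVM or nearest-neighbor classifiers, and statistical language models used in the cited works:

```python
def classical_recognition_pipeline(word_image, win_w,
                                   extract_features, classify_char,
                                   score_word):
    """Sketch of the classical bottom-up pipeline: slide a window
    across the word image, classify each window's features into a
    character, then let a language model rescore candidate strings."""
    w = len(word_image[0])
    chars = []
    for x in range(0, w - win_w + 1, win_w):    # non-overlapping windows
        window = [row[x:x + win_w] for row in word_image]
        feats = extract_features(window)        # e.g. HOG in [20, 64]
        chars.append(classify_char(feats))      # e.g. SVM / nearest neighbor
    candidates = ["".join(chars)]               # real systems keep top-k per window
    return max(candidates, key=score_word)      # language-model pruning

# Toy stand-ins: "features" are the window's ink count, the
# "classifier" maps counts to letters, the "language model" is trivial.
image = [[1, 1, 0, 0, 1, 0],
         [1, 1, 0, 0, 0, 0]]                    # two 3-pixel-wide windows
extract = lambda win: sum(sum(row) for row in win)
classify = {4: "a", 1: "b"}.get
print(classical_recognition_pipeline(image, 3, extract, classify, len))  # → ab
```

Real systems additionally use overlapping windows and keep several character hypotheses per position, which is exactly where the language model earns its keep.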
Most classical machine learning-based methods follow a
bottom-up approach in which classified characters are linked
into words. For example, in [20, 64], HOG features are first
extracted from each sliding window, and then a pre-trained
nearest-neighbor or SVM classifier is applied to classify the
characters of the input word image. Neumann and Matas
[84] proposed a set of handcrafted features, which include
aspect and hole area ratios, used with an SVM classifier for
text recognition. However, these methods [20, 22, 64, 84]
neither achieve effective recognition accuracy, owing
to the low representational capability of handcrafted features,
nor yield models capable of handling text recognition
in the wild. Other works adopted a top-down approach,
where the word is recognized directly from the entire input
image rather than by detecting and recognizing individual
characters [144]. For example, Almazan et al. [144] treated
word recognition as a content-based image retrieval problem,
where word images and word labels are embedded into
a common Euclidean space and the embedding vectors are used to