Fig. 3. The architecture of DNet.
To make DNet focus on text regions, information that reflects the properties of text is used as supervisory information to train the CNN model. The shape of text regions is one of the most important cues for distinguishing text from background, and both the edges and the whole regions of text can represent this shape. As is well known, a CNN learns increasingly global features as we move from shallow to deep layers; for text, the edges can be considered local information and the regions global information. Therefore, in this work, the edges and regions of text are used as the supervisory information of the shallow and the deep layers of the CNN, respectively, for model training.
To obtain an accurate saliency prediction, the CNN architecture should be deep and have multi-scale stages with different strides, so that discriminative, multi-scale features can be learned for pixels. Training such a deep network from scratch is difficult when the number of available training samples is insufficient. This work therefore chooses VGGNet-16 [41], trained on the ImageNet dataset, as the pre-trained model for fine-tuning, as done in [53]. VGGNet-16 has six blocks. The first five blocks contain convolutional and pooling layers, and the last block contains a pooling layer and fully connected layers. The last block is removed in this work, since the pooling operation in this block makes the feature maps too small (about 1/32 of the input image size) to obtain a fine full-size prediction, and the fully connected layers are time and memory consuming. Based on the first five blocks of VGGNet-16, DNet is constructed as shown in Fig. 3. There are thirteen convolutional layers and four max-pooling layers in the five blocks. Specifically, the numbers of convolution kernels in the five blocks are 64, 128, 256, 512 and 512. For all convolutional layers, the convolution kernels are 3 × 3, and both the stride and padding are 1 pixel. For all max-pooling layers, the operation window size is 2 × 2 and the stride is 2 pixels.
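To make the backbone concrete, the following is a minimal PyTorch sketch of these five blocks, initialized from torchvision's ImageNet-pretrained VGGNet-16. The class name, the slicing indices, and the use of torchvision are our illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16


class DNetBackbone(nn.Module):
    """First five blocks of VGGNet-16: 13 conv layers and 4 max-pool
    layers. The sixth block (last pooling + fully connected layers) is
    dropped, as described in the text."""

    def __init__(self):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features
        # Indices below follow torchvision's vgg16.features layout.
        self.block1 = features[:4]     # 2 convs, 64 kernels
        self.pool1 = features[4]
        self.block2 = features[5:9]    # 2 convs, 128 kernels
        self.pool2 = features[9]
        self.block3 = features[10:16]  # 3 convs, 256 kernels
        self.pool3 = features[16]
        self.block4 = features[17:23]  # 3 convs, 512 kernels
        self.pool4 = features[23]
        self.block5 = features[24:30]  # 3 convs, 512 kernels

    def forward(self, x):
        c1 = self.block1(x)               # stride 1 features
        c2 = self.block2(self.pool1(c1))  # stride 2
        c3 = self.block3(self.pool2(c2))  # stride 4
        c4 = self.block4(self.pool3(c3))  # stride 8
        c5 = self.block5(self.pool4(c4))  # stride 16
        return c1, c2, c3, c4, c5
```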
As we know, the shallow CNN layers usually learn local features, especially edges, and natural scene images contain different kinds of edges, belonging to both text and background. Therefore, in this work, only the edges of text are used as supervisory information for the shallow layers, to make the CNN model pay more attention to the edges of text during early feature learning. The deep layers usually learn global object features. Therefore, the text regions are used as the supervisory information for the deep layers, so as to learn more discriminative global features to represent the text properties. As we can see, from shallow to deep, the whole DNet always focuses on learning the features of text. We investigated which layers should be supervised by text edges or regions and found that the best performance is obtained when the first two blocks are supervised by text edges and the last three blocks by text regions. He et al. [35] also used text regions as supervisory information for model training, but only in one convolutional layer. In the proposed DNet model (as shown in Fig. 3), the text regions are used as supervisory information in four convolutional layers, and another important property of text, i.e., the edges of text, is also used as supervisory information in the shallow layers, which makes the model pay more attention to the edges of text during early feature learning.
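The paper does not detail how the two kinds of supervisory maps are produced. One plausible sketch, assuming the edge map is derived from the binary text-region mask by a morphological gradient (our assumption, not the authors' procedure), is:

```python
import numpy as np
import cv2


def region_and_edge_targets(text_mask: np.ndarray, edge_width: int = 3):
    """Derive the two supervisory maps from a binary text-region mask
    (uint8, 1 = text). The region target is the mask itself; the edge
    target is sketched here as the morphological gradient of the mask."""
    kernel = np.ones((edge_width, edge_width), np.uint8)
    edges = cv2.morphologyEx(text_mask, cv2.MORPH_GRADIENT, kernel)
    return text_mask, edges
```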
To introduce the supervisory information into the CNN, a side-output is generated from the last convolutional layer of each block, as done in [53], using a convolutional layer and a deconvolutional layer. The additional convolutional layer with a 1 × 1 convolution kernel converts the feature maps to a probability map, and the additional deconvolutional layer renders the probability map the same size as the input image. To make the final probability map robust to size variation of the text, the side-outputs of the last three blocks are fused by concatenating them in the channel direction and using a convolution kernel of size 1 × 1 to convert the concatenated maps into the final probability map. In fact, convolution with a 1 × 1 kernel is a weighted fusion process. Here, the side-outputs of the first two blocks are not considered during fusion, since we hope to capture the global information of the text regions during text-aware saliency prediction, and we have found experimentally that adding the first two side-outputs during fusion does not improve performance. So far, the whole architecture of DNet has been constructed, as shown in Fig. 3. Each additional deconvolutional layer and the last convolutional layer are followed by a sigmoid activation function.
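Putting the side-outputs and the fusion together, a minimal sketch could look as follows. The deconvolution hyper-parameters are our choice (the paper does not specify them), and the sketch returns pre-sigmoid maps; applying torch.sigmoid to each output reproduces the sigmoid activations mentioned above.

```python
import torch
import torch.nn as nn


class SideOutput(nn.Module):
    """1 x 1 convolution to a single-channel map, followed by a
    deconvolution that upsamples back to the input image size."""

    def __init__(self, in_channels: int, stride: int):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        if stride > 1:
            # kernel = 2*stride, padding = stride//2 makes the output
            # exactly `stride` times larger (an assumption; the paper
            # does not give the deconvolution hyper-parameters).
            self.up = nn.ConvTranspose2d(1, 1, kernel_size=2 * stride,
                                         stride=stride, padding=stride // 2)
        else:
            self.up = nn.Identity()

    def forward(self, x):
        return self.up(self.score(x))


class DNetHead(nn.Module):
    """Five side-outputs (one per block); the last three are concatenated
    in the channel direction and fused by a 1 x 1 convolution, i.e. a
    weighted fusion, into the final probability map."""

    def __init__(self):
        super().__init__()
        channels = (64, 128, 256, 512, 512)  # per-block kernel counts
        strides = (1, 2, 4, 8, 16)           # per-block feature strides
        self.sides = nn.ModuleList(
            SideOutput(c, s) for c, s in zip(channels, strides))
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)

    def forward(self, feats):
        # Side-outputs 0-1 are supervised by text edges; side-outputs
        # 2-4 and the fused map are supervised by text regions.
        sides = [side(f) for side, f in zip(self.sides, feats)]
        fused = self.fuse(torch.cat(sides[2:], dim=1))
        return sides, fused
```

At inference, torch.sigmoid(fused) would give the text-aware saliency map.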
For DNet training, the errors between all side-outputs and the supervisory ground truth should be computed and backward propagated. Therefore, we need to define a loss function to compute these errors. For most scene images, the pixel numbers of text and background are heavily imbalanced. Here, given an image X and its ground truth Y, the cross-entropy loss function defined in [53] is used to balance the loss between the text and non-text classes as follows:
$$\ell_{\text{side}}^{m}\left(W, w^{m}\right) = -\,\alpha \sum_{i=1}^{|Y^{+}|} \log P\left(y_i = 1 \mid X; W, w^{m}\right) - (1-\alpha) \sum_{i=1}^{|Y^{-}|} \log P\left(y_i = 0 \mid X; W, w^{m}\right) \tag{1}$$
where $\alpha = |Y^{-}| / \left(|Y^{+}| + |Y^{-}|\right)$, $|Y^{+}|$ and $|Y^{-}|$ denote the numbers of text pixels and non-text pixels in the ground truth, $W$ denotes the parameters of all of the network layers in the five blocks, $w^{m}$ denotes the weights of the $m$-th side-output layer, including a convolutional layer and a deconvolutional layer, and $P\left(y_i = 1 \mid X; W, w^{m}\right) \in [0, 1]$ is computed using the sigmoid activation on the side-output.