Best Practices for Convolutional Neural Networks
Applied to Visual Document Analysis
Patrice Y. Simard, Dave Steinkraus, John C. Platt
Microsoft Research, One Microsoft Way, Redmond WA 98052
{patrice,v-davste,jplatt}@microsoft.com
Abstract
Neural networks are a powerful technology for
classification of visual inputs arising from documents.
However, there is a confusing plethora of different neural
network methods that are used in the literature and in
industry. This paper describes a set of concrete best
practices that document analysis researchers can use to
get good results with neural networks. The most
important practice is getting a training set as large as
possible: we expand the training set by adding a new
form of distorted data. The next most important practice
is to use convolutional neural networks, which are better
suited for visual document tasks than fully connected networks. We
propose that a simple “do-it-yourself” implementation of
convolution with a flexible architecture is suitable for
many visual document problems. This simple
convolutional neural network does not require complex
methods, such as momentum, weight decay, structure-
dependent learning rates, averaging layers, tangent prop,
or even fine-tuning of the architecture. The end result is a
very simple yet general architecture which can yield
state-of-the-art performance for document analysis. We
illustrate our claims on the MNIST set of English digit
images.
1. Introduction
After being extremely popular in the early 1990s,
neural networks have fallen out of favor in research in the
last 5 years. In 2000, it was even pointed out by the
organizers of the Neural Information Processing Systems
(NIPS) conference that the term “neural networks” in the
submission title was negatively correlated with
acceptance. In contrast, support vector machines (SVMs),
Bayesian networks, and variational methods were
positively correlated.
In this paper, we show that neural networks achieve
the best performance on a handwriting recognition task
(MNIST). MNIST [7] is a benchmark dataset of images
of segmented handwritten digits, each with 28x28 pixels.
There are 60,000 training examples and 10,000 testing
examples.
Our best results on MNIST with neural networks agree
with those of other researchers, who have found
that neural networks continue to yield state-of-the-art
performance on visual document analysis tasks [1][2].
The optimal performance on MNIST was achieved
using two essential practices. First, we created a new,
general set of elastic distortions that vastly expanded the
size of the training set. Second, we used convolutional
neural networks. The elastic distortions are described in
detail in Section 2. Sections 3 and 4 then describe a
generic convolutional neural network architecture that is
simple to implement.
We believe that these two practices are applicable
beyond MNIST, to general visual tasks in document
analysis. Applications range from FAX recognition to
analysis of scanned documents and cursive recognition
(using a visual representation) on the Tablet PC.
2. Expanding Data Sets through Elastic
Distortions
Synthesizing plausible transformations of data is
simple, but the “inverse” problem – transformation
invariance – can be arbitrarily complicated. Fortunately,
learning algorithms are very good at learning inverse
problems. Given a classification task, one may apply
transformations to generate additional data and let the
learning algorithm infer the transformation invariance.
This invariance is embedded in the parameters, so it is in
some sense free, since the computation at recognition
time is unchanged. If the data is scarce and if the
distribution to be learned has transformation-invariance
properties, generating additional data using
transformations may even improve performance [6]. In
the case of handwriting recognition, we postulate that the
distribution has some invariance with respect to not only
affine transformations, but also elastic deformations
corresponding to uncontrolled oscillations of the hand
muscles, dampened by inertia.
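As a concrete illustration of this practice, the following is a minimal Python sketch (not code from this paper; the names expand_training_set and random_shift are ours, and numpy is assumed). It expands a labeled training set with randomly distorted copies of each example, reusing the labels because the transformations are assumed to be label-preserving:

import numpy as np

def expand_training_set(images, labels, distort, copies_per_example=9, seed=0):
    # Return the original examples plus copies_per_example randomly
    # distorted copies of each one; labels are reused unchanged because
    # the distortion is assumed to be label-preserving.
    rng = np.random.default_rng(seed)
    new_images, new_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for _ in range(copies_per_example):
            new_images.append(distort(image, rng))
            new_labels.append(label)
    return np.stack(new_images), np.array(new_labels)

# Example distortion: a trivial label-preserving transformation that
# shifts each image horizontally by -1, 0, or +1 pixels.
def random_shift(image, rng):
    return np.roll(image, shift=int(rng.integers(-1, 2)), axis=1)

In the setting of this paper, the affine and elastic distortions described below would take the place of random_shift.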
Simple distortions such as translations, rotations, and
skewing can be generated by applying affine
displacement fields to images. This is done by computing,
for every pixel, a new target location with respect to its
original location: the displacement field (∆x(x,y), ∆y(x,y))
gives the offset of the new location of the pixel at position
(x,y) from its previous position. For instance, if ∆x(x,y)=1
and ∆y(x,y)=0, every pixel is shifted by 1 to the right. If
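To make the displacement-field mechanics concrete, the sketch below is a hypothetical Python illustration, not code from this paper; apply_displacement_field is our name, and scipy's map_coordinates is assumed for the bilinear interpolation. It warps an image with given fields ∆x and ∆y, under the convention above that the pixel at (x,y) moves to (x+∆x(x,y), y+∆y(x,y)):

import numpy as np
from scipy.ndimage import map_coordinates

def apply_displacement_field(image, dx, dy):
    # Move the pixel at (x, y) to (x + dx[y, x], y + dy[y, x]).
    # Implemented by backward sampling: each output pixel (x, y) is read
    # from (x - dx, y - dy) in the input, with bilinear interpolation.
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [ys - dy, xs - dx], order=1, mode="constant")

# The example from the text: dx = 1 and dy = 0 everywhere shift the
# image content one pixel to the right.
image = np.zeros((28, 28), dtype=np.float32)
image[10:18, 10:18] = 1.0
dx = np.ones((28, 28))
dy = np.zeros((28, 28))
shifted = apply_displacement_field(image, dx, dy)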