A New Training Principle for Stacked Denoising Autoencoders*
Qianhaozhe You
Department of Electronic Engineering
Tsinghua University
Beijing 100084, P.R.China
haozhe.yqhz@gmail.com
Yu-jin Zhang
Department of Electronic Engineering
Tsinghua University
Beijing 100084, P.R.China
zhang-yj@tsinghua.edu.cn
Abstract — In this work, a new training principle is introduced for unsupervised learning that makes the learned representations more efficient and useful. By training on partially corrupted inputs, the denoising Autoencoder obtains more robust and representative patterns of the inputs than traditional learning methods do. Moreover, denoising Autoencoders can be stacked to form a deep network. A complete framework for training stacked denoising Autoencoders, incorporating several supervised training methods, is given for image classification. Comparative experiments show that the model strongly resists noise in the training examples and achieves better image classification accuracy on the MNIST database.
Keywords-unsupervised learning; stacked denoising Autoencoders; image classification
I. INTRODUCTION
As one of the significant tasks in pattern recognition, image classification has been developed extensively in recent years. The most widely used framework was a kind of discriminative model [3]. Hard-assignment vector quantization [4] was the most popular method at that time, and it was extended with spatial pyramid matching [3] to compensate for the loss of spatial information. [5] adopted sparse coding algorithms for the dictionary learning step of the image classification framework. Following the emergence of sparse coding, many extensions, such as Laplacian sparse coding [6], kernel sparse representation [7], and sparse coding with manifold projections [8], were proposed and achieved better image classification performance.
All these existing recognition approaches extracted hand-designed features, which required time-consuming hand-tuning. With the development of Deep Learning methods [11], features can be learned by the machine itself during training instead of being computed by fixed, predefined rules. Moreover, the concept of a deep architecture of features was proposed to imitate the visual mechanisms of the human brain. The Stacked Denoising Autoencoders [10] studied in this paper, a type of deep neural network, aim at extracting hierarchical features from images while the network is trained and fine-tuned.
* This work was partially supported by the National Natural Science Foundation of China (NNSF: 61171118) and the Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP-20110002110057).
For visual recognition and pattern classification tasks, it is difficult to learn either deep generative models or deep discriminative models directly. Previous work has overcome this difficulty with an advanced unsupervised learning step that transforms the input data into related intermediate representations in a different space. In this paper, an unsupervised pretraining step is introduced to initialize the model before the supervised optimization step. In addition, training the model with elastic distortions has shown surprisingly better performance for image classification.
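As a rough illustration of this two-phase scheme, the following numpy sketch greedily pretrains each layer as a plain Autoencoder with tied weights and squared reconstruction error, then leaves the stack ready for a supervised fine-tuning step. The `Layer` class, learning rate, epoch count, and toy data are illustrative assumptions for this sketch, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

class Layer:
    """One sigmoid layer; pretrain() greedily fits it as an Autoencoder."""

    def __init__(self, d_in, d_out):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def encode(self, X):
        return sigmoid(X @ self.W + self.b)

    def pretrain(self, X, lr=0.5, epochs=50):
        """Minimize squared reconstruction error with tied weights W' = W^T."""
        b_prime = np.zeros(X.shape[1])
        for _ in range(epochs):
            Y = self.encode(X)                    # encode
            Z = sigmoid(Y @ self.W.T + b_prime)   # decode (tied weights)
            dZ = (Z - X) * Z * (1 - Z)            # delta at the reconstruction
            dY = (dZ @ self.W) * Y * (1 - Y)      # delta at the hidden layer
            self.W -= lr * (X.T @ dY + dZ.T @ Y) / len(X)
            self.b -= lr * dY.mean(axis=0)
            b_prime -= lr * dZ.mean(axis=0)

# Phase 1: greedy unsupervised pretraining, layer by layer.
X = rng.random((100, 64))          # toy unlabeled data in [0, 1]^64
layers = [Layer(64, 32), Layer(32, 16)]
h = X
for layer in layers:
    layer.pretrain(h)
    h = layer.encode(h)            # hidden codes feed the next layer

# Phase 2 (supervised fine-tuning with backpropagation from a classifier
# placed on top of the stack) would follow here, as described in the text.
```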
The rest of the paper is organized as follows. Section 2 introduces the related theory and models: the classical Autoencoder, the denoising Autoencoder, and the stacked denoising Autoencoders. The implementation of the models and the major algorithms are given in Section 3. Analysis and discussion of the experimental results follow in Section 4. Finally, conclusions are drawn in Section 5.
II. MODELS
In this section, the models used in this paper are defined and combined to form a deep network for image classification. First, the basic model, the classical Autoencoder, is introduced in detail. By training on corrupted inputs instead, the classical Autoencoder can be improved and extended to the denoising Autoencoder. Then, several denoising Autoencoders are stacked layer by layer to construct a deep network, called the stacked denoising Autoencoders.
A. The Classical Autoencoder
The classical Autoencoder applies a non-linear mapping from the visible input $x \in [0, 1]^d$ to a hidden representation $y \in [0, 1]^{d'}$. Commonly, the sigmoid function $s$ is used as the deterministic mapping:

$$y = f(x) = s(Wx + b) \quad (1)$$
The transformation above can be regarded as an encoding
step. Then, the hidden representation y is mapped back to
the input space through a similar transformation:
$$z = g(y) = s(W'y + b') \quad (2)$$
This reverse transformation can be regarded as a decoding step. Here, $W'$ can optionally be defined as the transpose of $W$ (tied weights), while $b'$ and $b$ are independent. The relationship among $x$, $y$, and $z$ and the model of the classical Autoencoder are shown in Figure 1.
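To make Eqs. (1) and (2) concrete, here is a minimal numpy sketch of the encoder/decoder pair, assuming the tied-weight option $W' = W^T$ mentioned above; the class name, dimensions, and initialization scale are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

class ClassicalAutoencoder:
    """Encoder/decoder pair of Eqs. (1)-(2), with W' tied to the transpose
    of W; the biases b and b' remain independent."""

    def __init__(self, d, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(d_hidden, d))
        self.b = np.zeros(d_hidden)     # encoder bias b
        self.b_prime = np.zeros(d)      # decoder bias b'

    def encode(self, x):
        # Eq. (1): y = f(x) = s(Wx + b)
        return sigmoid(self.W @ x + self.b)

    def decode(self, y):
        # Eq. (2): z = g(y) = s(W'y + b'), here with W' = W^T
        return sigmoid(self.W.T @ y + self.b_prime)

# Round-trip a toy input x in [0, 1]^d (784 matches a 28x28 MNIST image).
ae = ClassicalAutoencoder(d=784, d_hidden=200)
x = np.random.default_rng(1).random(784)
z = ae.decode(ae.encode(x))
print(x.shape, ae.encode(x).shape, z.shape)   # (784,) (200,) (784,)
```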