optimization algorithm based upon a supervised training
criterion is used to fine-tune the deep multilayer neural
network.
The deep learning (DL) paradigm brought about a revival
in deep multilayered neural network research, and has
attracted unprecedented attention because of its success in
several areas, including vision and language recognition [3–
5]. The objective of DL approaches is to learn a hierarchical
model from the input characteristics. Lower-layer charac-
teristics in the hierarchical model are combined to form
higher-layer characteristics. Deep learning has been demon-
strated to be able to learn many hierarchical characteristics
automatically, which are then combined within an integrated
network [6].
Although the ability of hierarchical neural networks [7,
8] to learn characteristics is useful for pattern analysis, there
are still many problems to be solved in the DL paradigm. For
example, the characteristics learned in hidden layers are not
always transparent in their meaning, particularly in the early
hidden layers; the discrimination ability may occasionally
decrease [9]; the vanishing gradient may make it hard to
train a deep network [10]; and overfitting may occur when
very little training data is available [11]. Recent techniques
such as dropout [11] and dropconnect [12] are used to
regularize deep networks and avoid overfitting. The idea behind
these techniques is to randomly drop units or connections
to prevent units from co-adapting, which has been shown to
improve classification performance in numerous studies.
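To make this idea concrete, the sketch below shows a minimal inverted-dropout mask in NumPy; the function name and default drop probability are illustrative only and are not taken from [11] or [12]:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=np.random):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so the expected activation matches test-time behaviour."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.binomial(1, keep_prob, size=activations.shape)  # drop each unit independently
    return activations * mask / keep_prob
```

Because each unit is dropped independently, no unit can rely on the presence of particular other units, which is what discourages co-adaptation.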
Because data generally come from nonlinear manifold
distributions, they are not linearly separable. To abstract
and capture a larger amount of information from the
receptive fields, the network in network (NIN)
[13] model uses an mlpconv layer, where a multilayer
perceptron (MLP) convolves the input to enhance the non-
linearity of local patches. Thus, the discrimination ability
of the model is improved. Companion objective functions
are used to constrain the weights in hidden layers in deeply
supervised nets (DSNs) [14], so that robust features can
be captured in the first few layers of a deep convolutional
neural network (CNN).
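As a rough illustration of the mlpconv structure, such a layer can be built as an ordinary convolution followed by 1 × 1 convolutions, which act as a small MLP applied to every local patch. The PyTorch sketch below is a minimal example; the kernel size, channel counts, and the helper name mlpconv_block are assumptions made for illustration rather than the exact configuration of [13]:

```python
import torch.nn as nn

def mlpconv_block(in_channels, num_filters, mlp_hidden):
    """A linear convolution followed by a two-layer MLP realized as 1x1
    convolutions, enhancing the nonlinearity of each local patch."""
    return nn.Sequential(
        nn.Conv2d(in_channels, num_filters, kernel_size=5, padding=2),
        nn.ReLU(inplace=True),
        nn.Conv2d(num_filters, mlp_hidden, kernel_size=1),  # first MLP layer
        nn.ReLU(inplace=True),
        nn.Conv2d(mlp_hidden, mlp_hidden, kernel_size=1),   # second MLP layer
        nn.ReLU(inplace=True),
    )
```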
The vanishing gradient problem is essentially the shrinkage
of the gradient as it propagates backwards through
the hidden layers. It is noteworthy that some successful
approaches have used the strategy of adding hidden layers
to networks in a constructive manner [1]. To construc-
tively formulate a desirable internal representation, using
a supervised criterion in each phase provides straightfor-
ward supervision. However, it has been reported that using
a supervised criterion in each phase may be too greedy and
may not obtain as good generalization performance as using
an unsupervised criterion [15]. Another issue is that the data
distribution can vary during the DL procedure. Variations in
the data distribution may shift the inputs of a hidden layer
into the saturation region of its activation function,
which reduces the learning speed.
This phenomenon is referred to as internal covariate shift
[16]. Ioffe et al. [17] addressed this issue by applying batch
normalization to the input of every hidden layer.
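For reference, the batch normalization transform of [17] normalizes each activation over a mini-batch $B = \{x_1, \ldots, x_m\}$ and then applies a learned scale and shift:
$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \quad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad
y_i = \gamma \hat{x}_i + \beta,
$$
where $\gamma$ and $\beta$ are learned parameters and $\epsilon$ is a small constant added for numerical stability.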
Adjusting the weights of the bottom layers requires
back-propagating the error signal through many layers,
which gives rise to the phenomenon of vanishing gradients. A variety of approaches
and parameter setting methods, such as pre-training, have
been proposed to achieve better training of deep neural
networks. In conventional greedy layer-wise supervised pre-
training methods, each new hidden layer is trained as the
hidden layer of a single-hidden-layer supervised neural net-
work, with the input being the output of the previously
trained layers [8, 15, 18]. The output layer is then discarded,
and the trained hidden layer is used as the pre-training ini-
tialization. It is expected that this approach will yield a
preferable representation. However, the greedy layer-wise
supervised pre-training method may be too greedy: the
learned hidden-unit representation could neglect some
important information about the learning target when this
information is not easily captured by a single-hidden-layer
neural network, whereas such information could be
successfully acquired using deeper structures. In this paper,
based on the NIN [13] structure, we present a new DL
approach called mlpconv-wise supervised pre-training NIN
(MPNIN).
The central idea of MPNIN is to use integrated direct
supervised training in the hidden layers, rather than the stan-
dard approach of implementing supervised training only in
the output layer and back propagating this supervision infor-
mation to earlier layers. We implement this integrated direct
hidden layer supervised training by introducing mlpconv-
wise supervised pre-training to each hidden layer. An
mlpconv layer consists of a linear convolutional layer and a
two-layered MLP. Each mlpconv layer that is pre-trained
with supervision is used as the hidden layer of a single-
hidden-layer supervised neural network. During the super-
vised pre-training, each new mlpconv layer takes as input
the output of the previously trained mlpconv layers. We
use batch normalization to normalize the inputs and reduce
the effects of internal covariate shift. The output layer is
then discarded, and the trained mlpconv layer is used as
the initialized hidden layer. The experimental results in this
paper verify the robustness and discrimination ability of the
features learned by the proposed MPNIN model.
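A minimal sketch of this pre-training procedure is given below, under the simplifying assumption that previously trained mlpconv layers are frozen while a new layer is pre-trained; the block structure, hyperparameters, and helper names (mlpconv_bn_block, train_briefly, pretrain_mpnin) are illustrative only and do not reflect the exact architecture used in our experiments:

```python
import torch
import torch.nn as nn

def mlpconv_bn_block(in_ch, out_ch):
    """Batch-normalized mlpconv block: a linear convolution followed by a
    two-layer MLP realized as 1x1 convolutions, with each layer normalized."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def train_briefly(module, loader, epochs=1, lr=0.01):
    """Short supervised training loop (cross-entropy, SGD) over the trainable parameters."""
    params = [p for p in module.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(module(x), y).backward()
            opt.step()

def pretrain_mpnin(block_specs, num_classes, loader):
    """Mlpconv-wise supervised pre-training: each new mlpconv block is trained as
    the hidden layer of a shallow supervised network whose input is the output of
    the previously trained blocks; the temporary output layer is then discarded."""
    trained = []
    for in_ch, out_ch in block_specs:
        for blk in trained:                       # assumption: earlier blocks stay frozen here
            for p in blk.parameters():
                p.requires_grad = False
        block = mlpconv_bn_block(in_ch, out_ch)
        head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(out_ch, num_classes))  # temporary output layer
        train_briefly(nn.Sequential(*trained, block, head), loader)
        trained.append(block)                     # keep the block, discard the head
    net = nn.Sequential(*trained)                 # pre-trained initialization of the network
    for p in net.parameters():
        p.requires_grad = True                    # unfreeze for whole-network fine-tuning
    return net
```

For example, pretrain_mpnin([(3, 192), (192, 160), (160, 96)], num_classes=10, loader=train_loader), with train_loader being a hypothetical data loader, would pre-train a three-stage stack for a ten-class problem; a final output layer is then attached and the whole network is fine-tuned with the supervised criterion.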
Our motivation for developing the proposed MPNIN
network, along with the novelty and contributions of this
research, can be summarized as follows.