N-fold Superposition: Improving Neural Networks by
Reducing the Noise in Feature Maps
Yang Liu, Qiang Qu, Chao Gao
National Digital Switching System Engineering & Technological R&D Center, Zhengzhou 450002, China
fabyangliu@hotmail.com
Abstract—Considering that the use of the Fully Connected (FC) layer limits the performance of Convolutional Neural Networks (CNNs), this paper develops a method to improve the coupling between the convolution layer and the FC layer by reducing the noise in Feature Maps (FMs). Our approach consists of three steps. First, we divide all the FMs equally into n blocks. Then, a new block of FMs is formed by the weighted summation of the FMs at the same position in each block. Finally, we replicate this new block into n copies and concatenate them as the input to the FC layer. This sharing of FMs noticeably reduces the noise in them and prevents any single FM from dominating a specific part of the hidden-layer weights, thereby mitigating overfitting to some extent. Using Fermat's Lemma, we prove that this method widens the range of global minima of the loss function, which makes it easier for neural networks to converge and accelerates the convergence process. The method does not significantly increase the number of network parameters (only a few coefficients are added), and the experiments demonstrate that it increases the convergence speed and improves the classification performance of neural networks.
Keywords—Convolutional Neural Networks; deep learning;
image classification; n-fold superposition; feature map sharing;
hidden layer weight sharing
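As a rough illustration of the three-step procedure described in the abstract, the following NumPy sketch divides the FMs into n blocks, forms a shared block by weighted summation, and tiles that block n times as the FC-layer input. The function name, the fixed (rather than learned) coefficients, and the tensor layout are illustrative assumptions, not the authors' implementation.

import numpy as np

def n_fold_superposition(fms, n, weights=None):
    # fms: feature maps of shape (channels, height, width)
    c, h, w = fms.shape
    assert c % n == 0, "channel count must divide evenly into n blocks"
    # Step 1: separate the FMs equally into n blocks.
    blocks = fms.reshape(n, c // n, h, w)
    # Step 2: weighted summation of FMs at the same position in all blocks.
    if weights is None:
        weights = np.full(n, 1.0 / n)  # in the paper these coefficients are learnable
    shared = np.tensordot(weights, blocks, axes=1)  # shape (c // n, h, w)
    # Step 3: replicate the shared block n times and flatten as FC-layer input.
    return np.tile(shared, (n, 1, 1)).reshape(-1)

# Toy usage: 8 feature maps of size 4x4, folded with n = 2.
fc_input = n_fold_superposition(np.random.randn(8, 4, 4), n=2)
print(fc_input.shape)  # (128,)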
I. INTRODUCTION
Neural networks, especially Convolutional Neural Networks (CNNs), have shown remarkable performance in diverse domains of computer vision [1]. The impressive success of CNNs is mainly due to their specially designed network structure, e.g., convolution and pooling. Feature Maps (FMs), the abstract representation of the input image, are obtained by the convolution operation. By stacking convolution layers, high-level abstract feature representations can be obtained to understand and identify the input image [2].
However, the coupling between the convolution layer and the fully connected (FC) layer is the main reason conventional CNNs overfit to the data or are easily trapped in local minima, yielding poor predictions [3, 4]. Many methods have been developed in recent years to address these problems and improve the performance of CNNs. These methods mainly focus on modifying the network structure and on regularization strategies.
References [4-6] replace the conventional convolution structure with a more powerful nonlinear function approximator, which helps convolution layers capture a higher level of abstraction. However, this increases the number of network parameters and the computational complexity. The pooling layer is usually used to abstract FMs and reduce overfitting; the commonly used pooling methods are max pooling and average pooling [7]. Global average pooling [4] has been successfully used in most well-known convolutional neural networks [8-10]. This method sums out the spatial information of each FM, thereby reinforcing similarities while reducing differences in the spatial information, and makes the model more robust to spatial translations of the input (which is very desirable). Reference [4] even tried to replace the FC layer with this method to remove the effect of the FC layer on the classification performance, but this makes it harder for neural networks to learn and thus slows down the convergence process.
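For concreteness, global average pooling reduces each FM to a single value by averaging over its spatial dimensions; a minimal sketch (a generic illustration, not the exact implementation of [4]):

import numpy as np

def global_average_pooling(fms):
    # fms: (channels, height, width) -> one averaged value per channel
    return fms.mean(axis=(1, 2))

scores = global_average_pooling(np.random.randn(10, 6, 6))
print(scores.shape)  # (10,) -- e.g., one value per category when it replaces the FC layer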
The softmax loss function is widely used in most CNNs [11]; nevertheless, it is biased toward the sample distribution, so it has also become a major target of improvement for researchers. By adding a decision variable to the softmax loss, the loss function can explicitly encourage intra-class compactness and inter-class separability of the learned features while avoiding overfitting [12]. However, this method requires repeated fine-tuning, which makes training difficult. A regularization term can also be added as a constraint to the loss function to prevent the model from overfitting, such as the squared L2 norm constraint on the weights [13]. This kind of regularization method aims to shrink the network weights and simplify the model to reduce overfitting. Dropout [14, 15] randomly drops units (along with their connections) from the network during training. Dropout forces randomly selected subsets of neurons to work together, which prevents units from co-adapting too much and improves the generalization ability of models.
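As a minimal sketch of the two regularizers mentioned above, the following functions show a squared L2 penalty added to the loss and (inverted) dropout masking during training; the function names and the inverted-dropout rescaling are illustrative assumptions rather than details taken from [13-15].

import numpy as np

def l2_penalty(weights, lam):
    # Squared L2 norm constraint: encourages smaller network weights.
    return lam * np.sum(weights ** 2)

def dropout(activations, p, training=True):
    # Randomly zero units with probability p during training;
    # inverted dropout rescales so no change is needed at test time.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask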
Other strategies also exist. Batch normalization [16] reduces internal covariate shift by normalizing the input distribution of every layer to zero mean and unit variance. Initialization methods [17, 18] derive more robust initializations that explicitly account for the nonlinearities in neural networks. Data augmentation methods [19, 20] can make the model more robust and prevent overfitting when the training set is limited. Besides the modification of the network structure and regularization, these methods also provide valid approaches to improving the performance of neural networks.
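For completeness, a minimal sketch of the batch normalization step mentioned above, normalizing each feature over a mini-batch; the learnable scale/shift parameters and running statistics of [16] are omitted here.

import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (batch, features); normalize each feature to zero mean, unit variance.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

normalized = batch_norm(np.random.randn(32, 64))  # mini-batch of 32 samples, 64 features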