introduce sampling noise, which appears in the training dataset but not in real test datasets, even if both are drawn from the same distribution. This scenario leads to overfitting, and several strategies [26] have been proposed to tackle the problem, such as early stopping of the training epochs and weight penalties (L1 and L2 regularization, soft weight sharing, and pooling). Ensembles of several CNNs with different configurations trained on the same dataset are known to reduce overfitting. However, this leads to extra computational and maintenance cost for training several models. Moreover, training a large network requires large datasets, but such datasets are rarely available in the field of medical imaging. Even if one can train large networks with a versatile setting of parameters, testing these networks is not feasible in a real-time situation due to the nature of medical imaging systems. Instead of an explicit ensemble, a single CNN model can simulate multiple configurations simply by probabilistically dropping out edges and nodes. Dropout is a regularization technique that reduces overfitting by temporarily dropping units out of the network [27]. This simple idea yields a significant improvement in CNN performance.
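As a minimal illustration of this idea (a NumPy sketch written for this discussion, not code from the cited works; the function name, drop probability, and array shapes are assumptions), inverted dropout zeroes a random subset of units during training and rescales the survivors so that expected activations match at test time:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero units during training and rescale
    the surviving units so the expected activation is unchanged at test time."""
    if not training or drop_prob == 0.0:
        return activations                      # keep all units at test time
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# Example: drop roughly half of the units of a hidden layer's output
hidden = np.random.randn(4, 8)                  # (batch, units), illustrative shape
dropped = dropout(hidden, drop_prob=0.5, training=True)
```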
Batch normalization: The input of each hidden layer changes dynamically during training because the parameters in the previous layer are updated at each training epoch. If these changes are large, the search for optimal hyperparameters becomes difficult for the network, and reaching an optimal value may be computationally expensive. This problem can be solved by an algorithm called batch normalization, proposed in [28]. Batch normalization allows the use of a higher learning rate and thereby reaches the optimal value in less time. It also facilitates the smooth training of deeper network architectures. Normalizing the data of a particular batch amounts to computing the mean and variance of the data points in the mini-batch and normalizing them to have zero mean and unit variance.
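For illustration only, the following NumPy sketch shows this core normalization step for one mini-batch; the learnable scale and shift parameters (gamma, beta) and the small epsilon constant are standard components of batch normalization that are assumed here rather than taken from the text:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of shape (batch, features) to zero mean and
    unit variance per feature, then apply a learnable scale and shift."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

batch = np.random.randn(32, 10) * 3.0 + 5.0   # illustrative mini-batch
normalized = batch_norm(batch)                 # ~zero mean, ~unit variance per feature
```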
In the backward pass, the CNN adjusts its weights and parameters according to the output by calculating the error through a loss function $e$ (also called a cost function or error function) and backpropagating the error towards the input according to certain rules. The loss gradient is calculated by taking the partial derivative of $e$ with respect to the output of each neuron in that layer, i.e., $\partial e/\partial y^{\ell}_{i,j,k}$ for the output $y^{\ell}_{i,j,k}$ of the $(i,j,k)$th unit in layer $\ell$. The chain rule allows us to write and add up the contribution of each variable as follows:
\frac{\partial e}{\partial x^{\ell}_{i,j,k}} = \frac{\partial e}{\partial y^{\ell}_{i,j,k}} \, \frac{\partial y^{\ell}_{i,j,k}}{\partial x^{\ell}_{i,j,k}} = \frac{\partial e}{\partial y^{\ell}_{i,j,k}} \, f'\!\left(x^{\ell}_{i,j,k}\right). \quad (3)
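As a hedged numerical sketch of Equation (3) (not the authors' code), the snippet below assumes $y^{\ell} = f(x^{\ell})$ with ReLU chosen as an example activation $f$, and propagates the error through the activation by multiplying the upstream gradient $\partial e/\partial y^{\ell}$ with $f'(x^{\ell})$:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # f'(x) for ReLU: 1 where x > 0, else 0
    return (x > 0).astype(x.dtype)

x_l = np.array([[-1.2, 0.5], [2.0, -0.3]])   # pre-activations x^l (illustrative)
de_dy = np.array([[0.4, -0.1], [0.2, 0.7]])  # upstream error de/dy^l (illustrative)

# Equation (3): de/dx^l = de/dy^l * f'(x^l), applied element-wise
de_dx = de_dy * relu_grad(x_l)
```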
Weights in the previous convolutional layer can be updated by backpropagating the error to the
previous layer according to the following equation:
\frac{\partial e}{\partial y^{\ell-1}_{i,j,k}} = \sum_{a=0}^{n_1-1} \sum_{b=0}^{n_2-1} \sum_{c=0}^{n_3-1} \frac{\partial e}{\partial x^{\ell}_{(i-a),(j-b),(k-c)}} \, \frac{\partial x^{\ell}_{(i-a),(j-b),(k-c)}}{\partial y^{\ell-1}_{i,j,k}} \quad (4)

= \sum_{a=0}^{n_1-1} \sum_{b=0}^{n_2-1} \sum_{c=0}^{n_3-1} \frac{\partial e}{\partial x^{\ell}_{(i-a),(j-b),(k-c)}} \, \omega_{a,b,c}. \quad (5)
Equation (5) allows us to calculate the error for the previous layer. Further, the above equation only makes sense for points that lie at least n positions away from each side of the input data. This situation can be avoided by simply padding each side of the input volume with zeros.
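To make Equations (4) and (5) concrete, here is a slow, loop-based NumPy sketch (assumed 3-D gradient tensor and filter sizes $n_1 \times n_2 \times n_3$; written for this discussion, not the authors' implementation) that accumulates the triple sum to obtain $\partial e/\partial y^{\ell-1}$, treating out-of-range terms near the borders as zero, which plays the role of the zero-padding mentioned above:

```python
import numpy as np

def backprop_to_previous_layer(de_dx, weights):
    """Compute de/dy^{l-1}[i,j,k] = sum_{a,b,c} de/dx^l[i-a, j-b, k-c] * w[a,b,c]
    as in Equations (4)-(5); terms whose indices fall outside the gradient map
    are treated as zero (equivalent to zero-padding the borders)."""
    n1, n2, n3 = weights.shape
    I, J, K = de_dx.shape
    de_dy_prev = np.zeros((I, J, K))
    for i in range(I):
        for j in range(J):
            for k in range(K):
                total = 0.0
                for a in range(n1):
                    for b in range(n2):
                        for c in range(n3):
                            ii, jj, kk = i - a, j - b, k - c
                            if ii >= 0 and jj >= 0 and kk >= 0:
                                total += de_dx[ii, jj, kk] * weights[a, b, c]
                de_dy_prev[i, j, k] = total
    return de_dy_prev

# Illustrative shapes: a 5x5x5 gradient map and a 3x3x3 filter
grad_next = np.random.randn(5, 5, 5)
w = np.random.randn(3, 3, 3)
grad_prev = backprop_to_previous_layer(grad_next, w)
```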
2.2. Breakthroughs in CNN Architectural Advances
Several different versions of CNN have been proposed in the literature to improve model
performance. In 2012, Krizhevsky et al. [14] presented a deep CNN architecture. A systematic
architecture of AlexNet is shown in Figure 4. AlexNet has five convolutional layers and three fully