groups of researchers used model averaging over different base DNNs and won different ILSVRC competitions [29, 48, 21]. This widely used unweighted averaging ensemble, however, is not data adaptive and is sensitive to the presence of excessively biased base learners. Ju, Bibaut, and van der Laan [25] recently investigated ensembles of DNNs built with many different ensemble methods, including unweighted averaging, majority voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition tasks. They concluded that the Super Learner achieves the best performance among all the studied ensemble algorithms.
Our work differs from the existing work on DNN ensembles and on feature and input smoothing in two major respects. First, we inject Gaussian noise into each residual mapping of the ResNet. Second, we train all components of the ensemble jointly instead of sequentially.
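To illustrate these two points concretely, the following PyTorch sketch wraps a generic residual mapping with additive Gaussian noise and updates all ensemble members jointly on each minibatch. This is a minimal sketch under our own naming: NoisyResidual, sigma, and joint_training_step are illustrative, and the precise noise scaling and training procedure are specified in section 3.

\begin{verbatim}
import torch
import torch.nn as nn

class NoisyResidual(nn.Module):
    """Residual mapping with additive Gaussian noise (illustrative sketch).

    `inner` plays the role of F(x_l, w_l); `sigma` is a hypothetical
    noise level, not a value prescribed by this paper.
    """
    def __init__(self, inner: nn.Module, sigma: float = 0.1):
        super().__init__()
        self.inner = inner
        self.sigma = sigma

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l, w_l) + Gaussian noise
        return x + self.inner(x) + self.sigma * torch.randn_like(x)

def joint_training_step(models, optimizer, criterion, x, y):
    """One joint update: back propagate the summed loss of all ensemble
    members on the same minibatch, instead of training the members one
    after another. `optimizer` is assumed to hold the parameters of
    every member."""
    optimizer.zero_grad()
    loss = sum(criterion(model(x), y) for model in models)
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}

Building a single optimizer over the union of all members' parameters makes the joint update a one-line change relative to training a single network.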
1.3. Organization. We organize this paper in the following way. In section 2, we model
the ResNet as a TE and give an explanation for ResNet's adversarial vulnerability. In section 3,
we present a new ResNet ensemble algorithm, motivated by the Feynman–Kac formula, for
adversarial defense. In section 4, we present the natural accuracy of the EnResNets and
their robust accuracy under both white-box and blind PGD and Carlini–Wagner attacks and
compare with the current state of the art. In sections 5 and 6, we generalize the algorithm
to ensembles of different neural nets and different noise injections, and we numerically verify
their efficiency. In section 7, we numerically study the sparsity pattern of EnResNets' weights.
The paper ends with some concluding remarks.
2. Theoretical motivation and guarantees.
2.1. TE modeling of ResNets. The connection between training ResNets and solving optimal control problems of the TE is investigated in [52, 53, 35, 54, 55]. In this section, we derive the TE model for ResNet and explain its adversarial vulnerability from a PDE viewpoint. The TE model enables us to understand the flow of the entire training and testing data sets through both the forward and backward propagation of ResNets, whereas ODE models focus on the dynamics of individual data points [10].
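To make this contrast explicit before the derivation (the equations here only anticipate the model made precise below, with $t \in [0, 1]$ parametrizing depth), the ODE models track a single trajectory,
\[
\frac{dx(t)}{dt} = F(x(t), w(t)), \qquad x(0) = \hat{x},
\]
whereas a terminal-value TE of the form
\[
\frac{\partial u}{\partial t}(x, t) + F(x, w(t)) \cdot \nabla u(x, t) = 0, \qquad u(x, 1) = f(x),
\]
evolves a function $u(x, t)$ over the whole input space, so that $u(\cdot, 0)$ encodes predictions for all possible inputs simultaneously.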
As shown in Figure 1(a), a residual mapping adds a skip connection to connect the input and output of the original mapping $F$, and the $l$th residual mapping can be written as
\[
x_{l+1} = F(x_l, w_l) + x_l,
\]
where $x_0 = \hat{x} \in T \subset \mathbb{R}^d$ is a data point in the set $T$, and $x_l$ and $x_{l+1}$ are the input and output tensors of the residual mapping. The parameters $w_l$ can be learned by back propagating the training error. For any $\hat{x} \in T$ with label $y$, the forward propagation of ResNet can be written as
\[
x_{l+1} = x_l + F(x_l, w_l), \quad l = 0, 1, \ldots, L-1, \quad \text{with } x_0 = \hat{x}, \qquad
\hat{y} \doteq f(x_L),
\tag{2.1}
\]
where $\hat{y}$ is the predicted label, $L$ is the number of layers, and $f(x) = \mathrm{softmax}(w_0 \cdot x)$ is the output activation with $w_0$ being the trainable parameters. For the widely used residual mapping in the preactivated ResNet [22], as shown in Figure 3(a), we have
\[
F(x_l, w_l) = w_l^{C_2} \otimes \sigma\bigl(w_l^{B_2}\, w_l^{C_1} \otimes \sigma(w_l^{B_1} x_l)\bigr),
\tag{2.2}
\]
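To make (2.1) and (2.2) concrete, the following PyTorch sketch implements the preactivated residual mapping and the forward propagation. It assumes the standard reading of (2.2), in which $\otimes$ denotes convolution, the $w^{B}$ factors are batch-normalization parameters, and $\sigma$ is the ReLU activation; the channel widths, kernel sizes, and pooling step are illustrative simplifications rather than the architectures evaluated later.

\begin{verbatim}
import torch.nn as nn
import torch.nn.functional as nnf  # aliased so as not to clash with the mapping F

class PreActBlock(nn.Module):
    """Preactivated residual mapping of (2.2):
    F(x_l, w_l) = w^{C2} (*) sigma(w^{B2} w^{C1} (*) sigma(w^{B1} x_l)),
    read as BN -> ReLU -> conv -> BN -> ReLU -> conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)  # w_l^{B1}
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # w_l^{C1}
        self.bn2 = nn.BatchNorm2d(channels)  # w_l^{B2}
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # w_l^{C2}

    def forward(self, x):
        out = self.conv1(nnf.relu(self.bn1(x)))    # w^{C1} (*) sigma(w^{B1} x_l)
        out = self.conv2(nnf.relu(self.bn2(out)))  # w^{C2} (*) sigma(w^{B2} ...)
        return x + out                             # skip connection, as in (2.1)

class TinyResNet(nn.Module):
    """Forward propagation of (2.1): L residual mappings, then y_hat = f(x_L)."""
    def __init__(self, channels: int = 16, depth: int = 4, num_classes: int = 10):
        super().__init__()
        self.blocks = nn.Sequential(*[PreActBlock(channels) for _ in range(depth)])
        self.fc = nn.Linear(channels, num_classes)  # w_0 in f(x) = softmax(w_0 . x)

    def forward(self, x):
        x = self.blocks(x)               # x_{l+1} = x_l + F(x_l, w_l), l = 0..L-1
        x = x.mean(dim=(2, 3))           # global average pooling
        return nnf.softmax(self.fc(x), dim=1)  # y_hat = f(x_L)
\end{verbatim}

Because the skip connection is the identity, each block only has to learn the residual $F$, which is the structural property the TE derivation below exploits.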