Esti. | VNS | VD
Exact | $0$ | $S_u^{(l)}$
NS | $(P_{uv_1}\mu_{v_1}^{(l)} - P_{uv_2}\mu_{v_2}^{(l)})^2$ | $\frac{n(u)}{D^{(l)}}\,S_u^{(l)}$
CV | $(P_{uv_1}\Delta\mu_{v_1}^{(l)} - P_{uv_2}\Delta\mu_{v_2}^{(l)})^2$ | $\big(3 + \frac{n(u)}{D^{(l)}}\big)\,S_u^{(l)}$
CVD | $(P_{uv_1}\Delta\mu_{v_1}^{(l)} - P_{uv_2}\Delta\mu_{v_2}^{(l)})^2$ | $S_u^{(l)}$

Table 2. Variance of different estimators. To save space we omit $\frac{C_u^{(l)}}{2D^{(l)}}\sum_{v_1,v_2\in n(u)}$ before all the VNS terms.
5.1. Control Variate for Dropout
With dropout, $\Delta h_v^{(l)} = h_v^{(l)} - \bar h_v^{(l)}$ is not necessarily small even if $\bar h_v^{(l)}$ and $h_v^{(l)}$ have the same distribution. We develop another stochastic approximation algorithm, control variate for dropout (CVD), which works well with dropout.
Our method is based on the weight scaling procedure (Srivastava et al., 2014) to approximately compute the mean $\mu_v^{(l)} := \mathbb{E}_M\big[h_v^{(l)}\big]$. That is, along with the dropout model, we can run a copy of the model without dropout to obtain the mean $\mu_v^{(l)}$, as illustrated in Fig. 1(d). We obtain a stochastic approximation by separating the mean and variance:
$$(PH^{(l)})_u = \sum_{v\in n(u)} P_{uv}\big(\mathring h_v^{(l)} + \Delta\mu_v^{(l)} + \bar\mu_v^{(l)}\big) \approx \mathrm{CVD}_u^{(l)} := \sqrt{R}\sum_{v\in\hat n} P_{uv}\mathring h_v^{(l)} + R\sum_{v\in\hat n} P_{uv}\Delta\mu_v^{(l)} + \sum_{v\in n(u)} P_{uv}\bar\mu_v^{(l)},$$
where $\hat n = \hat n^{(l)}(u)$ and $R = n(u)/D^{(l)}$ for short, $\mathring h_v^{(l)} = h_v^{(l)} - \mu_v^{(l)}$, $\bar\mu_v^{(l)}$ is the historical mean activation, obtained by storing $\mu_v^{(l)}$ instead of $h_v^{(l)}$, and $\Delta\mu_v^{(l)} = \mu_v^{(l)} - \bar\mu_v^{(l)}$. We separate $h_v^{(l)}$ into three terms: the latter two terms, which involve $\mu_v^{(l)}$, carry no randomness from dropout, and $\mu_v^{(l)}$ is treated just as $h_v^{(l)}$ is in the CV estimator. The first term has zero mean w.r.t. dropout, i.e., $\mathbb{E}_M \mathring h_v^{(l)} = 0$. We have $\mathbb{E}_{\hat n^{(l)}(u)}\mathbb{E}_M \mathrm{CVD}_u^{(l)} = 0 + \sum_{v\in n(u)} P_{uv}\big(\Delta\mu_v^{(l)} + \bar\mu_v^{(l)}\big) = \mathbb{E}_M (PH^{(l)})_u$, i.e., the estimator is unbiased, and we shall see in Sec. 5.2 that it eventually has the correct variance if the $h_v^{(l)}$'s are uncorrelated.
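To make the estimator concrete, the following NumPy sketch computes $\mathrm{CVD}_u^{(l)}$ for a single node and a single scalar feature; the function and argument names are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def cvd_estimate(P_row, h, mu, mu_bar, neighbors, sampled):
    """Sketch of the CVD estimator for one node u at one layer.

    P_row     : row P_{u,:} of the propagation matrix (1D array over all nodes).
    h         : activations h_v^{(l)} computed with dropout.
    mu        : mean activations mu_v^{(l)} from the dropout-free copy of the model.
    mu_bar    : historical mean activations stored from earlier iterations.
    neighbors : indices of n(u).
    sampled   : indices of the sampled neighbors n_hat^{(l)}(u), a subset of n(u).
    """
    R = len(neighbors) / len(sampled)      # R = n(u) / D^{(l)}
    h_ring = h - mu                        # zero-mean part w.r.t. dropout
    delta_mu = mu - mu_bar                 # change of the mean activation

    term_dropout = np.sqrt(R) * np.sum(P_row[sampled] * h_ring[sampled])
    term_delta   = R * np.sum(P_row[sampled] * delta_mu[sampled])
    term_hist    = np.sum(P_row[neighbors] * mu_bar[neighbors])
    return term_dropout + term_delta + term_hist
```

Averaging this quantity over both the dropout masks and the neighbor samples recovers $\sum_{v\in n(u)} P_{uv}\mu_v^{(l)} = \mathbb{E}_M(PH^{(l)})_u$, matching the unbiasedness claim above.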
5.2. Variance Analysis
We analyze the variance under the assumption that the node activations are uncorrelated, i.e., $\mathrm{Cov}_M\big[h_{v_1}^{(l)}, h_{v_2}^{(l)}\big] = 0, \forall v_1 \neq v_2$. We report the correlation between nodes empirically in Appendix G. To facilitate the analysis of the variance, we introduce two propositions, proven in Appendix A. The first helps the derivation of the dropout variance; the second implies that we can treat the variance introduced by neighbor sampling and by dropout separately.
Proposition 2. If $\hat n^{(l)}(u)$ contains $D^{(l)}$ samples from the set $n(u)$ without replacement, $x_1,\dots,x_V$ are random variables, $\forall v, \mathbb{E}[x_v] = 0$, and $\forall v_1\neq v_2, \mathrm{Cov}[x_{v_1}, x_{v_2}] = 0$, then
$$\mathrm{Var}_{X,\hat n^{(l)}(u)}\Big[\tfrac{n(u)}{D^{(l)}}\sum_{v\in\hat n^{(l)}(u)} x_v\Big] = \tfrac{n(u)}{D^{(l)}}\sum_{v\in n(u)}\mathrm{Var}[x_v].$$
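Proposition 2 is easy to verify numerically. The sketch below is a Monte Carlo check with Gaussian $x_v$ (any zero-mean, uncorrelated choice would do); the variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, trials = 10, 3, 200_000          # |n(u)|, D^{(l)}, number of Monte Carlo trials
sigma = rng.uniform(0.5, 2.0, size=n)  # per-node standard deviations of x_v

estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, sigma)                      # zero-mean, uncorrelated x_v
    idx = rng.choice(n, size=D, replace=False)      # n_hat^{(l)}(u): sample w/o replacement
    estimates[t] = (n / D) * x[idx].sum()           # (n(u)/D^{(l)}) * sum over the sample

print(estimates.var())                 # empirical variance
print((n / D) * (sigma ** 2).sum())    # predicted: (n(u)/D^{(l)}) * sum_v Var[x_v]
```

The two printed values agree up to Monte Carlo error.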
Proposition 3. $X$ and $Y$ are two random variables, and $f(X,Y)$ and $g(Y)$ are two functions. If $\mathbb{E}_X f(X,Y) = 0$, then
$$\mathrm{Var}_{X,Y}[f(X,Y) + g(Y)] = \mathrm{Var}_{X,Y} f(X,Y) + \mathrm{Var}_Y g(Y).$$
By Proposition 3, $\mathrm{Var}_{\hat n, M}\,\mathrm{CVD}_u^{(l)}$ can be written as the sum of $\mathrm{Var}_{\hat n, M}\big[\sqrt{R}\sum_{v\in\hat n} P_{uv}\mathring h_v^{(l)}\big]$ and $\mathrm{Var}_{\hat n}\big[R\sum_{v\in\hat n} P_{uv}\Delta\mu_v^{(l)} + \sum_{v\in n(u)} P_{uv}\bar\mu_v^{(l)}\big]$. We refer to the first term as the variance from dropout (VD) and to the second term as the variance from neighbor sampling (VNS). Ideally, VD should equal the variance of $(PH^{(l)})_u$ and VNS should be zero. VNS can be derived by replicating the analysis in Sec. 3.2, replacing $h$ with $\mu$. Let $s_v^{(l)} = \mathrm{Var}_M h_v^{(l)} = \mathrm{Var}_M \mathring h_v^{(l)}$ and $S_u^{(l)} = \mathrm{Var}_M (PH^{(l)})_u = \sum_{v\in n(u)} P_{uv}^2 s_v^{(l)}$. By Proposition 2, the VD of $\mathrm{CVD}_u^{(l)}$, $\sum_{v\in n(u)} P_{uv}^2 \mathrm{Var}\big[\mathring h_v^{(l)}\big] = S_u^{(l)}$, equals the VD of the exact estimator, as desired.
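Spelling out the application of Proposition 2 to the dropout term (with $x_v = P_{uv}\mathring h_v^{(l)}$), the VD of $\mathrm{CVD}_u^{(l)}$ is
$$\mathrm{Var}_{\hat n, M}\Big[\sqrt{R}\sum_{v\in\hat n} P_{uv}\mathring h_v^{(l)}\Big] = \frac{1}{R}\,\mathrm{Var}_{\hat n, M}\Big[R\sum_{v\in\hat n} P_{uv}\mathring h_v^{(l)}\Big] = \frac{1}{R}\cdot R\sum_{v\in n(u)} P_{uv}^2\, s_v^{(l)} = S_u^{(l)},$$
so the $\sqrt{R}$ scaling is exactly what cancels the $n(u)/D^{(l)}$ inflation seen in the NS row of Table 2.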
We summarize the estimators and their variances in Table 2,
where the derivations are in Appendix A. As in Sec. 3.2,
VNS of CV and CVD depends on
∆µ
v
, which converges to
zero as the training progresses, while VNS of NS depends
on the non-zero
µ
v
. On the other hand, CVD is the only
estimator except the exact one that gives correct VD.
5.3. Preprocessing Strategy
There are two possible models adopting dropout, $Z^{(l+1)} = P\,\mathrm{Dropout}_p(H^{(l)})W^{(l)}$ or $Z^{(l+1)} = \mathrm{Dropout}_p(PH^{(l)})W^{(l)}$. The difference is whether the dropout layer is placed before or after neighbor averaging. Kipf & Welling (2017) adopt the former and we adopt the latter; the two models perform similarly in practice, as we shall see in Sec. 6.1. The advantage of the latter model is that we can preprocess $U^{(0)} = PH^{(0)} = PX$ and take $U^{(0)}$ as the new input. In this way, the actual number of graph convolution layers is reduced by one: the first layer is merely a fully-connected layer instead of a graph convolution one. Since most GCNs only have two graph convolution layers (Kipf & Welling, 2017; Hamilton et al., 2017a), this gives a significant reduction of the receptive field size and speeds up the computation. We refer to this optimization as the preprocessing strategy; a sketch is given below.
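A minimal sketch of the preprocessing strategy, assuming a propagation matrix `P` (sparse or dense) and a dense feature matrix `X`; the function names and the inverted-dropout details are illustrative, not the paper's implementation. The product $U^{(0)} = PX$ is computed once before training, so the first layer no longer touches the graph.

```python
import numpy as np

def preprocess_features(P, X):
    """Compute U^(0) = P H^(0) = P X once, before training starts.
    P may be a scipy.sparse matrix or a dense array; @ works either way."""
    return P @ X

def first_layer(U0_batch, W0, keep_prob, rng, training=True):
    """First layer under Z^(1) = Dropout_p(P H^(0)) W^(0).

    Neighbor averaging is already folded into U0, so this reduces to
    dropout followed by a dense (fully-connected) transform.
    """
    if training:
        mask = rng.random(U0_batch.shape) < keep_prob
        U0_batch = U0_batch * mask / keep_prob   # inverted dropout
    return U0_batch @ W0
```

Because $U^{(0)}$ is fixed, only the remaining graph convolution layer(s) require neighbor sampling and historical activations, which is where the reduction in receptive field size comes from.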
6. Experiments
We examine the variance and convergence of our algorithms empirically on six datasets, including Citeseer, Cora,