Cluster-GCN: An Efficient Algorithm for Training Deep and
Large Graph Convolutional Networks
Wei-Lin Chiang∗
National Taiwan University
r06922166@csie.ntu.edu.tw
Xuanqing Liu∗
University of California, Los Angeles
xqliu@cs.ucla.edu
Si Si
Google Research
sisidaisy@google.com
Yang Li
Google Research
liyang@google.com
Samy Bengio
Google Research
bengio@google.com
Cho-Jui Hsieh
University of California, Los Angeles
chohsieh@cs.ucla.edu
ABSTRACT
Graph convolutional network (GCN) has been successfully applied
to many graph-based applications; however, training a large-scale
GCN remains challenging. Current SGD-based algorithms suffer
from either a high computational cost that grows exponentially with
the number of GCN layers, or a large space requirement for keeping
the entire graph and the embedding of each node in memory. In
this paper, we propose Cluster-GCN, a novel GCN algorithm that is
suitable for SGD-based training by exploiting the graph clustering
structure. Cluster-GCN works as follows: at each step, it samples
a block of nodes associated with a dense subgraph identified by
a graph clustering algorithm, and restricts the neighborhood search
to within this subgraph. This simple but effective strategy leads
to significantly improved memory and computational efficiency
while achieving test accuracy comparable to previous algorithms.
To test the scalability of our algorithm, we create a new Amazon2M
dataset with 2 million nodes and 61 million edges, which is more
than 5 times larger than the previous largest publicly available
dataset (Reddit). For training a 3-layer GCN on this data,
Cluster-GCN is faster than the previous state-of-the-art VR-GCN
(1523 seconds vs. 1961 seconds) while using much less memory (2.2GB
vs. 11.2GB). For training a 4-layer GCN on this data, our
algorithm can finish in around 36 minutes, while all the existing
GCN training algorithms fail to train due to out-of-memory
issues. Furthermore, Cluster-GCN allows us to train much deeper
GCNs without much time and memory overhead, which leads to
improved prediction accuracy—using a 5-layer Cluster-GCN, we
achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset,
while the previous best result was 98.71 by [16].
ACM Reference Format:
Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui
Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and
Large Graph Convolutional Networks. In The 25th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019,
Anchorage, AK, USA. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3292500.3330925
∗ This work was done during the first and the second author’s internship at Google
Research.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
KDD ’19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6201-6/19/08.
https://doi.org/10.1145/3292500.3330925
1 INTRODUCTION
Graph convolutional network (GCN) [9] has become increasingly
popular in addressing many graph-based applications, including
semi-supervised node classification [9], link prediction [17] and
recommender systems [15]. Given a graph, GCN uses a graph
convolution operation to obtain node embeddings layer by layer—at
each layer, the embedding of a node is obtained by gathering the
embeddings of its neighbors, followed by one or a few layers of
linear transformations and nonlinear activations. The final layer
embedding is then used for some end tasks. For instance, in node
classification problems, the final layer embedding is passed to a
classifier to predict node labels, and thus the parameters of GCN
can be trained in an end-to-end manner.
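To make the layer-wise operation concrete, below is a minimal sketch of a single graph convolution layer in NumPy. This is our illustration rather than code from the paper; the names A_hat (a normalized adjacency matrix with self-loops), X (the node embedding matrix), and W (the layer's weight matrix) are assumptions made for this example.

import numpy as np

def gcn_layer(A_hat, X, W):
    # Gather (aggregate) neighbor embeddings through the normalized adjacency: (N, F_in)
    H = A_hat @ X
    # Linear transformation to the next embedding dimension: (N, F_out)
    Z = H @ W
    # Nonlinear activation (ReLU)
    return np.maximum(Z, 0)

Stacking L such layers yields the final node embeddings; for node classification, the last layer's output is fed to a classifier whose loss is back-propagated through all layers.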
Since the graph convolution operator in GCN propagates embeddings
along the interactions between nodes in the graph, training is quite
challenging. Unlike other neural networks, whose training loss can
be perfectly decomposed into individual terms on each sample, a
loss term in GCN (e.g., the classification loss on a single node)
depends on a huge number of other nodes, especially when the GCN
goes deep. Due to this node dependence, GCN training is very slow
and requires a lot of memory: back-propagation needs to store all
the embeddings in the computation graph in GPU memory.
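To see why, consider the standard semi-supervised node classification objective (written here in our own notation as an illustration):

\mathcal{L} = \frac{1}{|\mathcal{Y}_L|} \sum_{i \in \mathcal{Y}_L} \mathrm{loss}\bigl(y_i, z_i^{(L)}\bigr),

where \mathcal{Y}_L is the set of labeled nodes and z_i^{(L)} is the final-layer embedding of node i. Since z_i^{(L)} is built from the layer-(L-1) embeddings of node i's neighbors, which in turn depend on their own neighbors, a single loss term involves the entire L-hop neighborhood of node i and cannot be computed from node i's features alone.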
Previous GCN Training Algorithms: To demonstrate the need
for developing a scalable GCN training algorithm, we first discuss
the pros and cons of existing approaches in terms of 1) memory
requirement¹, 2) time per epoch², and 3) convergence speed (loss
reduction) per epoch. These three factors are crucial for evaluating
a training algorithm. Note that the memory requirement directly
restricts the scalability of an algorithm, and the latter two factors
together determine the training speed. In the following discussion,
we denote by N the number of nodes in the graph, F the embedding
dimension, and L the number of layers, and use them to analyze
classic GCN training algorithms.
• Full-batch gradient descent is proposed in the first GCN paper [9].
To compute the full gradient, it requires storing all the
¹ Here we consider the memory for storing node embeddings, which is dense and
usually dominates the overall memory usage for deep GCN.
² An epoch means a complete data pass.