
Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Sujith Ravi
Google Inc., Mountain View, CA, USA
sravi@google.com

Qiming Diao 1
Carnegie Mellon University, Pittsburgh, PA, USA
Singapore Management University, Singapore
qiming.ustc@gmail.com
Abstract
Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node, thereby achieving a space reduction from O(m) to O(log m) under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.
1 Introduction
Semi-supervised learning (SSL) methods use small amounts of labeled data along with large amounts of unlabeled data to train prediction systems. Such approaches have gained widespread usage in recent years
1 Work done during an internship at Google.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.
and have been rapidly supplanting supervised systems in many scenarios owing to the abundant amounts of unlabeled data available on the Web and in other domains. Annotating and creating labeled training data for many prediction tasks is quite challenging because it is often an expensive and labor-intensive process. Unlabeled data, on the other hand, is readily available and can be leveraged by SSL approaches to improve the performance of supervised prediction systems.
Several surveys cover the various SSL methods in the literature [25, 37, 8, 6]. The majority of SSL algorithms are computationally expensive; for example, transductive SVM [16]. Graph-based SSL algorithms [38, 17, 33, 4, 26, 30] are a subclass of SSL techniques that has received a lot of attention recently, as they scale much better to large problems and data sizes. These methods construct and smooth a graph in which data points (both labeled and unlabeled) are represented by nodes, and edges link vertices that are related to each other. Edge weights are defined using a similarity function on node pairs and govern how strongly the labels of the nodes connected by an edge should agree. Graph-based methods based on label propagation [38, 29] work by taking the class label information associated with each labeled "seed" node and propagating these labels over the graph in a principled, iterative manner. These methods often converge quickly, and their time and space complexity scales linearly with the number of edges |E| and the number of labels m. Successful applications include a wide range of tasks in computer vision [36], information retrieval (IR) and social networks [34], and natural language processing (NLP); for example, class-instance acquisition and relation prediction, to name a few [30, 27, 19].
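As a concrete illustration, the iterative label propagation scheme described above can be sketched on a toy graph. The 4-node chain, unit edge weights, seed assignments, and iteration count below are illustrative choices for this sketch, not details taken from the paper:

```python
# Toy label propagation (in the spirit of Zhu & Ghahramani [38]) on a
# 4-node chain graph 0-1-2-3 with unit edge weights. Each node holds a
# distribution over m labels; seed nodes are clamped to their known label.
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # adjacency lists
m = 2                                            # number of labels
seed = {0: 0, 3: 1}                              # seed node -> known label

# Initialize: one-hot rows for seeds, uniform distributions elsewhere.
Y = {v: [1.0 / m] * m for v in edges}
for v, lab in seed.items():
    Y[v] = [1.0 if j == lab else 0.0 for j in range(m)]

for _ in range(50):                  # iterate until (near) convergence
    new_Y = {}
    for v, nbrs in edges.items():
        if v in seed:                # clamp seed labels on every pass
            new_Y[v] = Y[v]
            continue
        # Average the neighbors' label distributions, then renormalize.
        row = [sum(Y[u][j] for u in nbrs) for j in range(m)]
        z = sum(row)
        new_Y[v] = [x / z for x in row]
    Y = new_Y

pred = {v: max(range(m), key=lambda j: Y[v][j]) for v in edges}
print(pred)   # {0: 0, 1: 0, 2: 1, 3: 1}
```

Node 1 ends up with the label of its nearest seed (node 0) and node 2 with that of node 3, which is the smoothness behavior the edge weights are meant to enforce. Note that this dense per-node distribution is exactly the O(m) storage that the sketch-based and streaming approximations above are designed to avoid.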
Several classification and knowledge expansion types of problems involve a large number of labels in real-world scenarios. For instance, entity-relation classification over the widely used Freebase taxonomy requires learning over thousands of labels, which can grow further by orders of magnitude when extending to open-domain ex-
arXiv:1512.01752v2 [cs.LG] 16 May 2016