
Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Sujith Ravi
Google Inc., Mountain View, CA, USA
sravi@google.com

Qiming Diao 1
Carnegie Mellon University, Pittsburgh, PA, USA
Singapore Management University, Singapore
qiming.ustc@gmail.com
Abstract
Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node, thereby achieving a space reduction from O(m) to O(log m) under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.
1 Introduction
Semi-supervised learning (SSL) methods use small amounts of labeled data along with large amounts of unlabeled data to train prediction systems. Such approaches have gained widespread usage in recent years
1 Work done during an internship at Google.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.
and have been rapidly supplanting supervised systems in many scenarios owing to the abundant amounts of unlabeled data available on the Web and in other domains. Annotating and creating labeled training data for many prediction tasks is quite challenging because it is often an expensive and labor-intensive process. Unlabeled data, on the other hand, is readily available and can be leveraged by SSL approaches to improve the performance of supervised prediction systems.
Several surveys cover the various SSL methods in the literature [25, 37, 8, 6]. The majority of SSL algorithms are computationally expensive; for example, transductive SVM [16]. Graph-based SSL algorithms [38, 17, 33, 4, 26, 30] are a subclass of SSL techniques that has received a lot of attention recently, as they scale much better to large problems and data sizes. These methods construct and smooth a graph in which data points (both labeled and unlabeled) are represented by nodes, and edges link vertices that are related to each other. Edge weights are defined using a similarity function on node pairs and govern how strongly the labels of the nodes connected by an edge should agree. Graph-based methods based on label propagation [38, 29] work by taking the class label information associated with each labeled "seed" node and propagating these labels over the graph in a principled, iterative manner. These methods often converge quickly, and their time and space complexity scales linearly with the number of edges |E| and the number of labels m. Successful applications include a wide range of tasks in computer vision [36], information retrieval (IR) and social networks [34], and natural language processing (NLP); for example, class-instance acquisition and relation prediction, to name a few [30, 27, 19].
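As a concrete illustration, the iterative label propagation scheme described above can be sketched on a toy graph. The 4-node chain, unit edge weights, seed assignments, and iteration count below are illustrative choices for this sketch, not details taken from the paper:

```python
# Toy label propagation (in the spirit of Zhu & Ghahramani [38]) on a
# 4-node chain graph 0-1-2-3 with unit edge weights. Each node holds a
# distribution over m labels; seed nodes are clamped to their known label.
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # adjacency lists
m = 2                                            # number of labels
seed = {0: 0, 3: 1}                              # seed node -> known label

# Initialize: one-hot rows for seeds, uniform distributions elsewhere.
Y = {v: [1.0 / m] * m for v in edges}
for v, lab in seed.items():
    Y[v] = [1.0 if j == lab else 0.0 for j in range(m)]

for _ in range(50):                  # iterate until (near) convergence
    new_Y = {}
    for v, nbrs in edges.items():
        if v in seed:                # clamp seed labels on every pass
            new_Y[v] = Y[v]
            continue
        # Average the neighbors' label distributions, then renormalize.
        row = [sum(Y[u][j] for u in nbrs) for j in range(m)]
        z = sum(row)
        new_Y[v] = [x / z for x in row]
    Y = new_Y

pred = {v: max(range(m), key=lambda j: Y[v][j]) for v in edges}
print(pred)   # {0: 0, 1: 0, 2: 1, 3: 1}
```

Node 1 ends up with the label of its nearest seed (node 0) and node 2 with that of node 3, which is the smoothness behavior the edge weights are meant to enforce. Note that this dense per-node distribution is exactly the O(m) storage that the sketch-based and streaming approximations above are designed to avoid.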
Several classification and knowledge expansion types of problems involve a large number of labels in real-world scenarios. For instance, entity-relation classification over the widely used Freebase taxonomy requires learning over thousands of labels, which can grow further by orders of magnitude when extending to open-domain ex-
arXiv:1512.01752v2 [cs.LG] 16 May 2016