Jian Tang's LINE Model: Large-Scale Information Network Embedding


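LINE, proposed by Jian Tang and colleagues at Microsoft Research Asia, is an algorithm for embedding large-scale information networks. Its goal is to capture the structure of a large network in a low-dimensional vector space so that tasks such as visualization, node classification, and link prediction can be carried out efficiently. Many existing graph embedding methods hit performance bottlenecks on real-world networks with millions of nodes because they do not scale to data of that size. LINE is designed to work with different types of information networks: undirected, directed, and/or weighted.

At the core of LINE is a carefully designed objective function that preserves both the local and the global network structure. During embedding, the model balances the direct connections between nodes (adjacency) against the similarity of nodes across the wider network. Classical stochastic gradient descent can run into efficiency and effectiveness problems on large networks. To address this, Tang et al. propose an edge-sampling algorithm that reduces the computational cost on large datasets, so that LINE can update its parameters effectively with limited computing resources and train faster.

An implementation of LINE may involve the following key steps:

1. Network decomposition: split the large network into a local and a global subproblem, one concerned with direct links between nodes and the other with the structural patterns of the whole network.
2. Objective function design: build a loss function that measures the quality of the embedding vectors by minimizing their discrepancy from the true network structure. This may involve measures such as cosine similarity or edge occurrence probability.
3. Edge sampling: use a sampling strategy, such as random negative sampling, to reduce the computational load while staying representative of the network structure.
4. Gradient optimization: use an improved gradient descent method (batch or online) to iteratively adjust each node's embedding vector until it satisfies the objective.
5. Evaluation and tuning: assess performance on node classification or link prediction tasks and tune the model parameters until accuracy and efficiency are satisfactory.

LINE is a step forward for large-scale information network embedding. Its strengths are the balance it strikes between the global and local structure of complex networks and its efficient data-processing strategy. The model is useful not only for academic research but also in applications such as social network analysis, recommender systems, and community detection.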
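
The edge-sampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that edges are drawn with probability proportional to their weights, so that each sampled edge can then be treated as a unit-weight edge in a gradient update. The function name `sample_edges` and the toy edge list are hypothetical, and `random.choices` stands in for whatever faster sampling table a real implementation would use.

```python
import random
from collections import Counter


def sample_edges(edges, num_samples, seed=0):
    """Draw edges with probability proportional to their weights.

    `edges` is a list of (u, v, weight) triples. `random.choices` is a
    simple stand-in for a faster precomputed sampling table.
    """
    rng = random.Random(seed)
    weights = [w for _, _, w in edges]
    return rng.choices(edges, weights=weights, k=num_samples)


# Toy network: the edge (a, c) is nine times heavier than (a, b).
edges = [("a", "b", 1.0), ("a", "c", 9.0)]
counts = Counter((u, v) for u, v, _ in sample_edges(edges, 10_000))
```

In this toy run the heavier edge is drawn far more often than the lighter one, so large weights are reflected in how often an edge is visited rather than in the size of a single gradient step.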

The most recent work related to ours is DeepWalk [16], which deploys a truncated random walk for social network embedding. Although empirically effective, DeepWalk does not provide a clear objective that articulates what network properties are preserved. Intuitively, DeepWalk expects nodes with higher second-order proximity to yield similar low-dimensional representations, while LINE preserves both first-order and second-order proximities. DeepWalk uses random walks to expand the neighborhood of a vertex, which is analogous to a depth-first search. We use a breadth-first search strategy, which is a more reasonable approach to the second-order proximity. Practically, DeepWalk only applies to unweighted networks, while our model is applicable to networks with both weighted and unweighted edges.
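
To make the contrast between the two exploration strategies concrete, here is a small sketch. It is only an illustration of a truncated random walk versus a breadth-first expansion on a toy graph, not the procedure of either DeepWalk or LINE. The helper names `truncated_random_walk` and `bfs_within` and the adjacency dictionary are hypothetical.

```python
import random
from collections import deque


def truncated_random_walk(adj, start, length, rng):
    """Follow random neighbors from `start` until `length` vertices are visited."""
    walk = [start]
    while len(walk) < length and adj[walk[-1]]:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def bfs_within(adj, start, max_hops):
    """Return {vertex: hop distance} for every vertex within `max_hops` of `start`."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Toy undirected graph with edges a-b, a-c, b-d.
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
walk = truncated_random_walk(adj, "a", 5, random.Random(0))
hops = bfs_within(adj, "a", 2)
```

The walk follows a single path that may wander several hops away from the start, whereas the breadth-first expansion visits every vertex at distance one before any vertex at distance two.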

In Section 5, we empirically compare the proposed model with these methods using various real-world networks.

3. PROBLEM DEFINITION

We formally define the problem of large-scale information network embedding using first-order and second-order proximities. We first define an information network as follows:

Definition 1. (Information Network) An information network is defined as $G = (V, E)$, where $V$ is the set of vertices, each representing a data object, and $E$ is the set of edges between the vertices, each representing a relationship between two data objects. Each edge $e \in E$ is an ordered pair $e = (u, v)$ and is associated with a weight $w_{uv} > 0$, which indicates the strength of the relation. If $G$ is undirected, we have $(u, v) \equiv (v, u)$ and $w_{uv} \equiv w_{vu}$; if $G$ is directed, we have $(u, v) \not\equiv (v, u)$ and $w_{uv} \not\equiv w_{vu}$.
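
Definition 1 can be mirrored by a small data structure. The sketch below is one possible representation, assuming edges are stored in a dictionary keyed by ordered pairs. The class name `InformationNetwork` and its methods are hypothetical and are not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Weighted network G = (V, E) with edges stored as ordered pairs."""

    directed: bool = False
    weights: dict = field(default_factory=dict)  # maps (u, v) to the weight of that edge

    def add_edge(self, u, v, w=1.0):
        if w <= 0:
            raise ValueError("edge weights must be positive")
        self.weights[(u, v)] = w
        if not self.directed:
            # In an undirected network (u, v) and (v, u) are the same edge.
            self.weights[(v, u)] = w

    def weight(self, u, v):
        """Weight of the edge (u, v), or 0.0 if no such edge is observed."""
        return self.weights.get((u, v), 0.0)

    @property
    def vertices(self):
        return {x for edge in self.weights for x in edge}


undirected = InformationNetwork(directed=False)
undirected.add_edge("u", "v", 2.0)

directed = InformationNetwork(directed=True)
directed.add_edge("u", "v", 2.0)
```

In the undirected network the weight is the same in both directions, while the directed network records only the edge from `u` to `v`.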

In practice, information networks can be either directed (e.g., citation networks) or undirected (e.g., social network of users in Facebook). The weights of the edges can be either binary or take any real value. Note that while negative edge weights are possible, in this study we only consider non-negative weights. For example, in citation networks and social networks, $w_{uv}$ takes binary values; in co-occurrence networks between different objects, $w_{uv}$ can take any non-negative value. The weights of the edges in some networks may diverge as some objects co-occur many times while others may just co-occur a few times.

Embedding an information network into a low-dimensional space is useful in a variety of applications. To conduct the embedding, the network structures must be preserved. The first intuition is that the local network structure, i.e., the local pairwise proximity between the vertices, must be preserved. We define the local network structures as the first-order proximity between the vertices:

Definition 2. (First-order Proximity) The first-order proximity in a network is the local pairwise proximity between two vertices. For each pair of vertices linked by an edge $(u, v)$, the weight on that edge, $w_{uv}$, indicates the first-order proximity between $u$ and $v$. If no edge is observed between $u$ and $v$, their first-order proximity is 0.

The first-order proximity usually implies the similarity of two nodes in a real-world network. For example, people who are friends with each other in a social network tend to share similar interests; pages linking to each other on the World Wide Web tend to talk about similar topics. Because of this importance, many existing graph embedding algorithms such as IsoMap, LLE, Laplacian eigenmap, and graph factorization have the objective of preserving the first-order proximity.

However, in a real-world information network, the links observed are only a small proportion, with many others missing [10]. A pair of nodes on a missing link has a zero first-order proximity, even though they are intrinsically very similar to each other. Therefore, first-order proximity alone is not sufficient for preserving the network structures, and it is important to seek an alternative notion of proximity that addresses the problem of sparsity. A natural intuition is that vertices that share similar neighbors tend to be similar to each other. For example, in social networks, people who share similar friends tend to have similar interests and thus become friends; in word co-occurrence networks, words that always co-occur with the same set of words tend to have similar meanings. We therefore define the second-order proximity, which complements the first-order proximity and preserves the network structure.

Definition 3. (Second-order Proximity) The second-order proximity between a pair of vertices $(u, v)$ in a network is the similarity between their neighborhood network structures. Mathematically, let $p_u = (w_{u,1}, \ldots, w_{u,|V|})$ denote the first-order proximity of $u$ with all the other vertices, then the second-order proximity between $u$ and $v$ is determined by the similarity between $p_u$ and $p_v$. If no vertex is linked from/to both $u$ and $v$, the second-order proximity between $u$ and $v$ is 0.
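
A worked example may help here. The definition only says that second-order proximity is determined by the similarity of the two first-order proximity vectors and does not fix a similarity measure. The sketch below assumes cosine similarity purely for illustration. The function names and the toy network are hypothetical.

```python
import math


def proximity_vector(weights, u, vertices):
    """First-order proximities of u with every vertex (0.0 where no edge is observed)."""
    return [weights.get((u, v), 0.0) for v in vertices]


def second_order_proximity(weights, u, v, vertices):
    """Cosine similarity between the proximity vectors of u and v (an assumed measure)."""
    p_u = proximity_vector(weights, u, vertices)
    p_v = proximity_vector(weights, v, vertices)
    dot = sum(a * b for a, b in zip(p_u, p_v))
    norm = math.sqrt(sum(a * a for a in p_u)) * math.sqrt(sum(b * b for b in p_v))
    return dot / norm if norm else 0.0


vertices = ["a", "b", "c", "d"]
# Undirected toy network with edges a-c, b-c and a-d, stored in both directions.
weights = {
    ("a", "c"): 1.0, ("c", "a"): 1.0,
    ("b", "c"): 1.0, ("c", "b"): 1.0,
    ("a", "d"): 1.0, ("d", "a"): 1.0,
}

sim_ab = second_order_proximity(weights, "a", "b", vertices)  # a and b share neighbor c
sim_bd = second_order_proximity(weights, "b", "d", vertices)  # b and d share no neighbor
```

Vertices `a` and `b` are not linked, so their first-order proximity is 0, yet they share the neighbor `c` and therefore have a positive second-order proximity. Vertices `b` and `d` share no neighbor, so theirs is 0.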

We investigate both first-order and second-order proximity for network embedding, which is defined as follows.

Definition 4. (Large-scale Information Network Embedding) Given a large network $G = (V, E)$, the problem of Large-scale Information Network Embedding aims to represent each vertex $v \in V$ in a low-dimensional space $R^d$, i.e., learning a function $f_G : V \to R^d$, where $d \ll |V|$. In the space $R^d$, both the first-order proximity and the second-order proximity between the vertices are preserved.
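
In code, the function in Definition 4 is often just a lookup table with one $d$-dimensional vector per vertex. The sketch below shows only the shape of such a table. The values are random placeholders, so it preserves no proximity at all, and the helper name `init_embedding`, the initialization range, and the sizes are arbitrary assumptions.

```python
import random


def init_embedding(vertices, dim, seed=0):
    """Create a table mapping each vertex to a small random vector of length `dim`."""
    rng = random.Random(seed)
    return {v: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for v in vertices}


vertices = [f"v{i}" for i in range(1000)]  # |V| = 1000
embedding = init_embedding(vertices, dim=8)  # d = 8, much smaller than |V|


def f(v):
    """The embedding function: look up the vector of vertex v."""
    return embedding[v]
```

Learning the embedding then amounts to adjusting the entries of this table so that proximities in the network are reflected by the vectors.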

Next, we introduce a large-scale network embedding model that preserves both first- and second-order proximities.

4. LINE: LARGE-SCALE INFORMATION NETWORK EMBEDDING

A desirable embedding model for real-world information networks must satisfy several requirements: first, it must be able to preserve both the first-order proximity and the second-order proximity between the vertices; second, it must scale for very large networks, say millions of vertices and billions of edges; third, it can deal with networks with arbitrary types of edges: directed, undirected and/or weighted. In this section, we present a novel network embedding model called the "LINE," which satisfies all the three requirements.

4.1 Model Description

We describe the LINE model to preserve the first-order proximity and second-order proximity separately, and then introduce a simple way to combine the two proximities.

4.1.1 LINE with First-order Proximity

The first-order proximity refers to the local pairwise proximity between the vertices in the network. To model the
