Probase+：解决概念分类中缺失链接的问题

48 浏览量更新于2024-07-15 收藏 1.21MB PDF 举报

"Probase +：推断概念分类法中的缺失链接"这篇研究论文深入探讨了在构建大规模概念分类体系或语义网络时遇到的一个关键问题——缺失链接。当前，许多研究集中在自动从大量文本语料库中生成这些知识结构，但Probase +的研究者发现，这些分类法中存在的缺失链接成为阻碍它们在实际应用中广泛采纳的主要障碍。因为缺失的连接会破坏概念分类法声称支持的推理能力。传统的Probase是一个先进的数据驱动型概念分类系统，它试图通过自动化的手段来构建和扩展。然而，论文指出，尽管Probase在初始构建中取得了显著成就，但在实际应用中，如知识库的完整性提升或推理过程中，那些未被捕捉的链接关系至关重要。为了克服这一挑战，作者提出了一个协作过滤框架，该框架旨在利用文本语料库中的信息来推断出缺失的链接。这种方法不仅能够填充这些空白，还力求在推断过程中保持高准确度，例如，他们将Probase扩展到包含约510万条额外的联系（大约增加了30%），且准确性达到了90%以上。实验部分展示了修订后的概念分类法的质量，这表明通过Probase +技术，即使在面对缺失链接时，也能够有效地增强知识表示的完整性和实用性。研究的关键术语包括知识库完成（knowledge base completion）和协作过滤（collaborative filtering），这两种技术在本工作中起到了核心作用，旨在提升概念分类法在现实世界中的实用价值和影响力。这篇论文提供了一个有效的策略来解决概念分类法中的缺失链接问题，对于推动此类知识图谱在实际场景中的应用具有重要意义。

to a term c (representing an entity or a concept). Speciﬁcally, if

most similar terms of c have h as the hypernym, c is likely to have

the hypernym h. We propose an iterative framework to realize

this idea. Algorithm 1 outlines the framework. In each itera-

tion, we use CF-based approach to ﬁnd missing hypernyms

for each term c in the current taxonomy. Speciﬁcally, for each

c, we ﬁrst ﬁnd top-K terms that are similar to c (line 3). Each

hypernym (h)ofthesetop-K terms but c is a candidate hyper-

nym (line 4). We rank candidate hypernyms by aggregating

the votes from the top-K similar terms of c by a scoring func-

tion (line 5). We add the candidates with a score larger than

the threshold u as the missing hypernyms of c (line 6). Fig. 3

illustrates the key roles in the CF-based recommendation

model, where TðcÞ is the intersection of c’s existing hyper-

nyms and c’s similar terms’ hypernyms. TðcÞ will be used as

prior knowledge to estimate the frequency of newly discov-

ered isA relations (Section 3.3) and to tune a threshold for the

selection of the ﬁnal hypernyms (Section 4.2).

Algorithm 1. CF-Based Missing isA Relationship Finding

Input: Taxonomy T , parameters K; u

1: while iteration < max

iteration do

2: for term c 2 T do

3: SimðcÞ top-K similar terms of c;

4: CandidateðcÞ ½

s2SimðcÞ

hypeðsÞ  hype ðcÞ;

5: Rank candidates in CandidateðcÞ by a scoring

function f;

6: Update T by attaching c to any x 2 CandidateðcÞ s.t.

fðx Þu;

7: end for

8: iteration iteration þ 1;

9: end while

Relationship to User-Based KNN Recommendation. Our solu-

tion is essentially the user-based KNN approach, which is

widely used in recommender systems. There are two types of

recommendation algorithms: user-based [24], [25] and item-

based [26], [27]. In out settings, the user is the term for which

we need to ﬁnd its missing hypernyms as well as its similar

terms, while item means the missing hypernyms. Item-based

algorithms in general are not applicable in our problem

because a similar hypernym is not necessarily an appropriate

hypernym for a speciﬁc instance, as illustrated in Example 2.

Although recommender systems have been extensively stud-

ied, we still need great efforts to tailor the generic user-based

KNN approach to our problems because of a completely dif-

ferent setting.

1.4 Challenges and Contributions

Using CF to solve our problem is not trivial. In general, we

still have the following obstacles to overcome.

C1: Similarity metric. We need an effective semantic simi-

larity metric to ﬁnd similar terms. Since Probase has ambigu-

ous words or phrases, and noisy or missing isA relationships,

it is not easy to ensure the effectiveness of the metric. We

will address this issue in Section 2.

C2: Relationship frequency. In Probase, each isA relationship

is associated with a frequency that the relationship is observed

from corpora. The frequency is critical for the successful usage

of Probase in real applications [23], [28]. While the generic CF

framework can only give a likelihood that a concept is a right

hypernym of a term. We still need great efforts to estimate an

appropriate frequency for the predicted missing relationships.

In Section 3 we propose a regression model to estimate the

frequency of newly found missing relationships.

C3: Parameter tuning. There are two key parameters in our

solution (as shown in Algorithm 1): K (used for the selection

of the similar concepts of c)andu (used for the selection of

ﬁnal missing hypernyms). In general, a ﬁxed setting is not an

optimal choice due to the heterogeneity of data. We propose

heuristics to set a best K and u for different c in Section 4.

C4: Efﬁciency. Probase has millions of terms and tens of

millions of relationships. A straightforward solution is not

efﬁcient. In Section 2.5, we identify the bottleneck (the com-

putation of similarity metrics). We further propose a sum-

mary based pruning strategy to speedup the computation

in Section 5.

In summary, our contributions are as follows:

 We propose a collaborative ﬁltering based inferenc-

ing framework to ﬁnd missing isA relationships in a

data-driven taxonomy.

 We propose effective similarity metrics (to measure a

pair of terms) and ranking mechanisms (of candidate

missing hypernyms) to enable the effective inference

of missing isA relationships under the CF framework.

 We identify the performance bottleneck and propose a

corresponding speedup strategy. We also give param-

eter tuning solutions to improve the effectiveness.

 We conduct extensive experiments to justify the pro-

posed models and solutions. We build a taxonomy

Probase+ containing 5.1 million (about 30 percent)

more isA relationships than its prototype Probase, and

its accuracy is above 90 percent. As far as we know, it

is the largest conceptual taxonomy ever known.

The rest of the paper is organized as follows. In Section 2,

we give the similarity metrics to ﬁnd similar terms. In

Section 3, we elaborate the ranking scores of candidate

hypernyms. In Section 4, we describe our method of param-

eter tuning. In Section 5, we optimize the performance of

our solution. In Section 6, we report the effectiveness of our

approach based on comprehensive experiments. We discuss

related works in Section 7, and conclude in Section 8.

2SIMILARITY METRICS

To apply collaborative ﬁltering, we need to ﬁnd terms that

are similar to a given term. In general, our framework is

Fig. 2. Probase fragment: “car” and “automobile.”

Fig. 3. The CF model.

LIANG ET AL.: PROBASE+: INFERRING MISSING LINKS IN CONCEPTUAL TAXONOMIES 1283

剩余14页未读，继续阅读

weixin_38735119

粉丝: 7
资源: 876

Probase+：解决概念分类中缺失链接的问题

probase.pptx

probase-manual.pdf

知识图谱 概念与技术 第4章：概念图谱构建.pdf

大规模词法分类中基于图的错误IsA关系检测

大规模分类体系构建

用集体实体填充知识库：基于图的方法

KBQA: Learning Question Answering over QA Corpora and Knowledge Bases.pdf

复旦大学肖仰华教授讲解：大规模概念图谱构建与应用

知识图谱驱动的问答系统：构建高效推理模型

知识图谱问答系统研究：模板方法与语义社团挖掘

最新资源

知识图谱概念与技术第4章：概念图谱构建.pdf