在线可比实体挖掘：解决比较问题的关键

72 浏览量更新于2024-08-26 收藏 1.34MB PDF 举报

"比较问题中的可比实体挖掘"研究专注于解决用户在决策过程中如何有效地找到可供比较的事物这一挑战。在日常生活中，人们经常需要对两个或多个选项进行对比，但确定比较对象和选择可能并不直观。为此，作者Shasha Li、Chin-Yew Lin、Young-In Song和Zhoujun Li提出了一个创新的方法，目标是自动从用户在线发布的比较问题中提取出可比较实体。他们的工作采用了弱监督学习策略，即通过利用大规模的在线问题数据库，构建了一种引导式方法来识别比较问题并抽取关键的可比实体。这种方法的关键在于利用序列模式挖掘（Sequential Pattern Mining）技术，通过分析用户提问的模式来发现潜在的比较意图。弱监督意味着系统依赖于初始数据集进行训练，而不需要大量标注的数据，这降低了人工标注成本，提高了算法的实用性。实验结果显示，该方法在比较问题识别方面的F1分数达到了82.5%，这表明它在准确性和查全率上都有优秀的表现，相比现有最先进的方法有显著的优势。而在可比实体提取任务中，F1分数更是达到了83.3%，这进一步证实了其在实际应用中的高效性。此外，研究人员还强调了他们的模型对于理解用户网络中的比较意图的重要性。排名结果表明，该方法能够精准地捕捉到用户在提问时的比较需求，这对于提供个性化的信息推荐或辅助决策支持具有重大价值。总结来说，这项研究为信息提取领域提供了新的视角，特别是针对比较类问题的处理，它不仅提升了自动分析和理解用户需求的能力，还为后续的研究和实际应用中更智能的问答系统开发奠定了基础。通过结合弱监督学习、序列模式挖掘和可比实体挖掘技术，该方法展示了在解决用户在线比较问题上的显著效果。

the feature compared with label $FT for each sentence.

J&L’s method was only applied to noun and pronoun. To

differentiate noun and pronoun that are not comparators or

features, they added the fourth label $NEF, i.e., nonentity-

feature. These labels were used as pivots together with

special tokens li & rj

(tokens at specific positions), #start

(beginning of a sentence), and #end (end of a sentence) to

generate sequence data, sequences with single label only

and minimum support greater than 1 percent are retained,

and then LSRs were created. When applying the learned

LSRs for extraction, LSRs with higher confidence were

applied first.

J&L’s method have been proved effective in their

experimental setups. However, it has the following

weaknesses:

. The performance of J&L’s method relies heavily on a

set of comparative sentence indicative keywords.

These keywords were manually created and they

offered no guidelines to select keywords for inclu-

sion. It is also difficult to ensure the completeness of

the keyword list.

. Users can express comparative sentences or ques-

tions in many different ways. To have high recall, a

large annotated training corpus is necessary. This is

an expensive process.

. Example CSRs and LSRs given in Jindal and Liu [7]

are mostly a combination of POS tags and keywords.

It is a surprise that their rules achieved high

precision but low recall. They attributed most errors

to POS tagging errors. However, we suspect that

their rules might be too specific and overfit their

small training set (about 2,600 sentences). We would

like to increase recall, avoid overfitting, and allow

rules to include discriminative lexical tokens to

retain precision.

In the next section, we introduce our method to address

these shortcomings.

3WEAKLY SUPERVISED METHOD FOR

COMPARATOR MINING

Our weakly supervised method is a pattern-based

approach similar to J&L’s method, but it is different in

many aspects: instead of using separate CSRs and LSRs,

our method aims to learn sequential patterns which can

be used to identify comparative question and extract

comparators simultaneously.

In our approach, a sequential pattern is defined as a

sequence Sðs

...s

Þ where s

can be a word, a POS

tag, or a symbol denoting either a comparator ($C), or the

beginning (#start) or the end of a question (#end). A

sequential pattern is called an indicative extraction pattern

(IEP) if it can be used to identify comparative questions

and extract comparators in them with high reliability. We

will formally define the reliability score of a pattern in the

next section.

Once a question matches an IEP, it is classified as a

comparative question and the token sequences correspond-

ing to the comparator slots in the IEP are extracted as

comparators. When a question can match multiple IEPs, the

longest IEP is used.

Therefore, instead of manually creating

a list of indicative keywords, we create a set of IEPs. We will

show how to acquire IEPs automatically using a boot-

strapping procedure with minimum supervision by taking

advantage of a large unlabeled question collection in the

following sections. The evaluations shown in Section 5

confirm that our weakly supervised method can achieve

high recall while retain high precision.

This pattern definition is inspired by the work of

Ravichandran and Hovy [14]. Table 1 shows some

examples of such sequential patterns. We also allow POS

constraint on comparators as shown in the pattern

“<; $C=NN or $C=NN ? #endj>”. It means that a valid

comparator must have an NN POS tag.

3.1 Mining Indicative Extraction Patterns

Our weakly supervised IEP mining approach is based on

two key assumptions.

. If a sequential pattern can be used to extract many

reliable comparator pairs, it is very likely to be an

IEP.

. If a comparator pair can be extracted by an IEP, the

pair is reliable.

Based on these two assumptions, we design our boot-

strapping algorithm as shown in Fig 1. The bootstrapping

process starts with a single IEP. From it, we extract a set of

1500 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 7, JULY 2013

1. l

marks a token is at the ith position to the left of the pivot and r

marks a token is at jth position to the right of the pivot where i and j are

between 1 and 4 in J&L [7].

2. It is because the longest IEP is likely to be the most specific and

relevant pattern for the given question.

TABLE 1

Candidate Indicative Extraction Pattern (IEP) Examples of the

Question “Which City Is Better, NYC or Paris?”

Fig. 1. Overview of the bootstrapping algorithm.

剩余11页未读，继续阅读

weixin_38577378

粉丝: 4

在线可比实体挖掘：解决比较问题的关键

MSRA中文文本标注规范——文本挖掘与NLP学习资源

交互式抽取可比语料与双语词典技术方法研究

医疗文本实体关系标注平台构建与应用研究

中老双语命名实体对位研究

电信数据挖掘数据准备过程的规范化设计

数据挖掘概念、技术－－数据预处理.ppt

跨语言命名实体翻译对抽取的研究综述

一种用于语言认知的过去时和复数问题的隐藏知识发现方法

构建统一客户信息模型：解决跨系统难题与数据挖掘

文本挖掘中的ANOVA运用：从文本到统计分析的桥梁（数据处理高级教程）

最新资源