Unsupervised Word and Dependency Path Embeddings for Aspect Term Extraction

Yichun Yin¹, Furu Wei², Li Dong³, Kaimeng Xu¹, Ming Zhang¹*, Ming Zhou²
¹School of EECS, Peking University
²Microsoft Research
³Institute for Language, Cognition and Computation, University of Edinburgh
{yichunyin,1300012834,mzhang_cs}@pku.edu.cn, {fuwei,mingzhou}@microsoft.com, li.dong@ed.ac.uk
Abstract

In this paper, we develop a novel approach to aspect term extraction based on unsupervised learning of distributed representations of words and dependency paths. The basic idea is to connect two words (w1 and w2) with the dependency path (r) between them in the embedding space. Specifically, our method optimizes the objective w1 + r ≈ w2 in the low-dimensional space, where multi-hop dependency paths are treated as sequences of grammatical relations and modeled by a recurrent neural network. We then design embedding features that capture linear context and dependency context information for conditional random field (CRF) based aspect term extraction. Experimental results on the SemEval datasets show that (1) with only embedding features, we achieve state-of-the-art results, and (2) our embedding method, which incorporates syntactic information among words, yields better performance in aspect term extraction than other representative methods.
1 Introduction

Aspect term extraction [Hu and Liu, 2004; Pontiki et al., 2014; 2015] aims to identify, from a review sentence, the aspect expressions that refer to a product's or service's properties (or attributes). It is a fundamental step toward obtaining the fine-grained sentiment of specific aspects of a product, beyond the coarse-grained overall sentiment. To date, there have been two major approaches to aspect term extraction. Unsupervised (or rule-based) methods [Qiu et al., 2011] rely on a set of manually defined opinion words as seeds, together with rules derived from syntactic parse trees, to iteratively extract aspect terms. Supervised methods [Jakob and Gurevych, 2010; Li et al., 2010; Chernyshevich, 2014; Toh and Wang, 2014; San Vicente et al., 2015] usually treat aspect term extraction as a sequence labeling problem, and the conditional random field (CRF) has been the mainstream method in the SemEval aspect term extraction task.
Representation learning has been introduced to and has achieved success in natural language processing (NLP) [Bengio et al., 2013]; examples include word embeddings [Mikolov et al., 2013b] and structured embeddings of knowledge bases [Bordes et al., 2011]. It learns distributed representations of text at different granularities, such as words, phrases and sentences, and reduces data sparsity compared with the conventional one-hot representation. Distributed representations have been reported to be useful in many NLP tasks [Turian et al., 2010; Collobert et al., 2011].

* Corresponding author: Ming Zhang
In this paper, we focus on representation learning for aspect term extraction under an unsupervised framework. Besides words, we also take dependency paths into consideration, as they have been shown to be important clues in aspect term extraction [Qiu et al., 2011]. Inspired by the representation learning of knowledge bases [Bordes et al., 2011; Neelakantan et al., 2015; Lin et al., 2015], which embeds both entities and relations into a low-dimensional space, we learn distributed representations of words and dependency paths from the text corpus. Specifically, the optimization objective is formalized as w1 + r ≈ w2. In the triple (w1, w2, r), w1 and w2 are words, and r is the corresponding dependency path, consisting of a sequence of grammatical relations. A recurrent neural network [Mikolov et al., 2010] is used to learn the distributed representations of dependency paths. Furthermore, the word embeddings are enhanced with linear context information in a multi-task learning manner.
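The idea of composing a multi-hop dependency path with a recurrent network and scoring the triple under w1 + r ≈ w2 can be sketched as follows. All names, dimensions, vocabularies and the recurrent weight matrix W here are illustrative placeholders, not the paper's trained parameters or exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimensionality (illustrative)

# Hypothetical vocabularies of words and grammatical relations.
word_emb = {w: rng.normal(size=DIM) for w in ["I", "love", "screen"]}
rel_emb = {r: rng.normal(size=DIM) for r in ["nsubj", "dobj", "amod"]}

# Recurrent composition of a multi-hop path: the path is a sequence of
# grammatical relations, consumed one relation per step.
W = rng.normal(scale=0.1, size=(DIM, DIM))

def compose_path(relations):
    """h_t = tanh(W h_{t-1} + rel_emb[r_t]); the final state represents r."""
    h = np.zeros(DIM)
    for rel in relations:
        h = np.tanh(W @ h + rel_emb[rel])
    return h

def triple_distance(w1, path, w2):
    """Distance || w1 + r - w2 ||; smaller means the triple (w1, r, w2)
    better satisfies the objective w1 + r ≈ w2."""
    r = compose_path(path)
    return float(np.linalg.norm(word_emb[w1] + r - word_emb[w2]))

# A one-hop and a two-hop path between the same word pair.
d1 = triple_distance("love", ["dobj"], "screen")
d2 = triple_distance("love", ["dobj", "amod"], "screen")
```

In training, such distances would be pushed down for observed triples and up for corrupted ones; this sketch only shows the compositional scoring, not the optimization.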
The learned embeddings of words and dependency paths are utilized as features in a CRF for aspect term extraction. Because embedding values are real numbers that are not necessarily in a bounded range [Turian et al., 2010], we first map the continuous embeddings to discrete embeddings to make them more appropriate for the CRF model. We then construct the embedding features, which include the target word embedding, the linear context embedding and the dependency context embedding, for aspect term extraction. We conduct experiments on the SemEval datasets and obtain performance comparable with the top systems. To demonstrate the effectiveness of the proposed embedding method, we also compare it with other state-of-the-art models; with the same feature settings, our approach achieves better results. Moreover, we perform a qualitative analysis to show the effectiveness of the learned word and dependency path embeddings.
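The exact continuous-to-discrete mapping is not specified in this excerpt; the following is one plausible sketch, assuming per-dimension equal-frequency binning, where each embedding dimension becomes a categorical feature such as "dim3=2" for the CRF (the function name, bin count and feature format are assumptions):

```python
import numpy as np

def discretize(embeddings, n_bins=10):
    """Map each continuous embedding dimension to a bin index so the
    values can serve as categorical CRF features.  Bin edges are the
    empirical quantiles of that dimension (equal-frequency binning)."""
    emb = np.asarray(embeddings, dtype=float)
    discrete = np.empty(emb.shape, dtype=int)
    for d in range(emb.shape[1]):
        # Interior quantile edges only: n_bins - 1 cut points.
        edges = np.quantile(emb[:, d], np.linspace(0, 1, n_bins + 1)[1:-1])
        discrete[:, d] = np.searchsorted(edges, emb[:, d])
    return discrete

# 100 hypothetical 5-dimensional word embeddings.
emb = np.random.default_rng(1).normal(size=(100, 5))
disc = discretize(emb, n_bins=4)

# Categorical features for the first word, one per embedding dimension.
features = [f"dim{d}={disc[0, d]}" for d in range(emb.shape[1])]
```

Equal-frequency bins keep every discrete value roughly equally populated, which avoids the sparsity that fixed-width bins would produce on heavy-tailed embedding dimensions.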
The contributions of this paper are two-fold. First, we use
the dependency path to link words in the embedding space
for distributed representation learning of words and depen-
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)