classified into two categories: bootstrapping-based (Hearst, 1992; Brin, 1998; Agichtein and Gravano, 2000;
Zhang, 2004; Etzioni et al., 2005; Xu et al., 2007) and LP (label propagation)-based (Chen et al., 2006).
Currently, bootstrapping-based methods dominate semi-supervised learning in semantic relation extraction. Generally, they work by iteratively classifying unlabeled instances with a model learnt in the previous loop, adding the confidently classified ones to the labeled data, and retraining on the augmented labeled data.
As a pioneer, Hearst (1992) used a small set of seed patterns in a bootstrapping fashion to mine pairs of
hypernym–hyponym nouns. Brin (1998) proposed a bootstrapping-based method on top of a self-developed
pattern matching-based classifier to exploit the duality between patterns and relations. Agichtein and
Gravano (2000) shared much in common with Brin (1998), but employed an existing pattern matching-based
classifier (i.e., SNoW, as proposed in Carlson et al. (1999)) instead. Zhang (2004) approached relation
classification by bootstrapping on top of SVM. For a given target relation, Etzioni et al. (2005) bootstrapped a
rule template containing words that describe the class of the arguments (e.g. ‘‘the company”) and a small set
of seed patterns (e.g. ‘‘has acquired”). Xu et al. (2007) bootstrapped from a small set of n-ary relation
instances as seeds to automatically learn pattern rules from parsed data, using a bottom-up pattern
extraction method with a new rule representation composed on top of the rules for projections of the relation. Although bootstrapping-based methods have achieved certain success in the literature, one problem is that they may fail to well capture the natural clustering structure inherent in the unlabeled data.
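The generic bootstrapping loop described above can be sketched as follows. This is a minimal illustration, not the code of any of the cited systems; the function name, the use of scikit-learn's `LinearSVC` as the underlying classifier, and the confidence threshold are all assumptions for the sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap(X_labeled, y_labeled, X_unlabeled, threshold=1.0, max_iters=10):
    """Iteratively classify unlabeled instances, move the confidently
    classified ones into the labeled pool, and retrain on the augmented
    labeled data (illustrative self-training loop)."""
    X_l, y_l, X_u = X_labeled, y_labeled, X_unlabeled
    clf = LinearSVC()
    for _ in range(max_iters):
        clf.fit(X_l, y_l)                        # retrain on augmented data
        if len(X_u) == 0:
            break
        scores = clf.decision_function(X_u)      # signed margins
        confident = np.abs(scores) >= threshold  # high-confidence instances
        if not confident.any():
            break
        # add confidently classified instances with their predicted labels
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.predict(X_u[confident])])
        X_u = X_u[~confident]
    return clf
```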
As an alternative to the bootstrapping-based methods, Chen et al. (2006) employed an LP-based method in
semantic relation extraction. Compared with bootstrapping, the LP algorithm can effectively exploit the natural clustering structure in both the labeled and unlabeled data. The rationale behind this algorithm is that instances in high-density areas tend to carry the same labels. The LP algorithm has also been successfully
applied in other NLP applications, such as word sense disambiguation (Niu et al., 2005), text classification
(Szummer and Jaakkola, 2001; Blum and Chawla, 2001; Belkin and Niyogi, 2002; Zhu and Ghahramani,
2002; Zhu et al., 2003; Blum et al., 2004), and information retrieval (Yang et al., 2006). However, one problem
is its huge computational burden, especially when a large amount of labeled and unlabeled data is taken into
consideration.
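A minimal sketch of iterative label propagation, in the style of Zhu and Ghahramani (2002), makes both the rationale and the cost concrete: labels flow to neighboring instances through a row-normalized affinity matrix, so high-density regions settle on a shared label, but building and multiplying the full n-by-n affinity matrix costs O(n^2) per iteration, which is the computational burden noted above. The Gaussian-kernel bandwidth and variable names here are illustrative assumptions.

```python
import numpy as np

def label_propagation(X, y, labeled_mask, sigma=1.0, n_iters=100):
    """X: (n, d) features; y: (n, k) one-hot labels (zero rows for
    unlabeled instances); labeled_mask: boolean vector marking
    labeled instances. Returns predicted class indices."""
    # pairwise Gaussian affinities: O(n^2) memory and time
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    T = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    F = y.astype(float).copy()
    for _ in range(n_iters):
        F = T @ F                          # propagate labels to neighbors
        F[labeled_mask] = y[labeled_mask]  # clamp the labeled instances
    return F.argmax(axis=1)
```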
Besides, Bunescu and Mooney (2007) learnt to extract relations from the Web using minimal supervision: given a few pairs of relation instances (both positive and negative), they extracted bags of sentences containing those pairs. They transformed this weakly-labeled multiple instance learning problem into a standard supervised
problem by properly controlling the relative influence of false negatives vs. false positives and by eliminating
two types of bias, due to words that are correlated with the arguments of a relation instance and to words that
are specific to a relation instance. Therefore, the minimal supervision proposed by Bunescu and Mooney
(2007) actually employed a supervised learning method.
To take advantage of both bootstrapping and label propagation, our proposed method propagates labels via the bootstrapped support vectors and the hard unlabeled instances that remain after SVM bootstrapping. Evaluation on the ACE RDC corpora shows that our method not only significantly reduces
the computational burden of the normal LP algorithm, which propagates via all the available data, but also well captures the natural clustering structure inherent in both the labeled and unlabeled data via the bootstrapped support vectors
and hard unlabeled instances alone.
3. Label propagation with SVM bootstrapping
The idea behind our LP algorithm with SVM bootstrapping is that, instead of propagating labels through
all the available labeled and unlabeled data, our method propagates labels only through critical instances in
both the labeled and unlabeled data. Similar to support vectors in SVM, critical instances in this paper refer to
the support vectors in label propagation, which play a critical role in propagating labels. Fig. 1 shows a brief
flow chart of training and testing in our LP algorithm with SVM bootstrapping. The key question behind our LP algorithm is how to find the critical instances in both the labeled and unlabeled data.
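One plausible way to select such critical instances can be sketched as follows: take the support vectors of an SVM trained on the labeled data, plus the unlabeled instances that fall inside the SVM margin (the "hard" instances the classifier is unsure about), and run label propagation only over this much smaller set. The function name, the use of scikit-learn's `SVC`, and the margin threshold are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.svm import SVC

def select_critical_instances(X_labeled, y_labeled, X_unlabeled, margin=1.0):
    """Return (support-vector indices in the labeled data,
    indices of low-margin 'hard' instances in the unlabeled data)."""
    clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    sv_idx = clf.support_                            # support vectors
    scores = clf.decision_function(X_unlabeled)      # signed margins
    hard_idx = np.where(np.abs(scores) < margin)[0]  # hard unlabeled instances
    return sv_idx, hard_idx
```

Because label propagation then operates only on these critical instances, the n in its O(n^2) affinity matrix shrinks from the full data size to the number of support vectors plus hard instances.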
In this paper, we use SVM as the underlying classifier to bootstrap a moderate number of weighted support
vectors for this purpose. This is based on the assumption that the natural clustering structure in both the
labeled and unlabeled data can be well preserved through the critical instances, including the weighted support
466 Z. GuoDong et al. / Computer Speech and Language 23 (2009) 464–478