IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020
Precision measures the ability of a NER system to present
only correct entities, and Recall measures the ability of a
NER system to recognize all entities in a corpus.
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
F-score is the harmonic mean of precision and recall, and
the balanced F-score is most commonly used:

F-score = 2 × (Precision × Recall) / (Precision + Recall)
As most NER systems involve multiple entity types, it is
often required to assess the performance across all entity
classes. Two measures are commonly used for this purpose:
macro-averaged F-score and micro-averaged F-score.
Macro-averaged F-score computes the F-score independently
for each entity type, then takes the average (hence treating
all entity types equally). Micro-averaged F-score aggregates
the contributions of entities from all classes to compute the
average (treating all entities equally). The latter can be
heavily affected by the quality of recognizing entities in
large classes in the corpus.
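The contrast between the two averages can be sketched with per-class TP/FP/FN counts; the classes and counts below are hypothetical, chosen so that one large class dominates:

```python
# Sketch: macro- vs micro-averaged F-score from per-class TP/FP/FN counts.
# The entity classes and counts are hypothetical, for illustration only.

def f_score(tp, fp, fn):
    """Balanced F-score from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A large PER class and a much smaller, poorly recognized MISC class.
counts = {
    "PER":  {"tp": 900, "fp": 100, "fn": 100},
    "MISC": {"tp": 10,  "fp": 40,  "fn": 40},
}

# Macro: F-score per class, then average (classes weighted equally).
macro_f = sum(f_score(**c) for c in counts.values()) / len(counts)

# Micro: pool TP/FP/FN over all classes, then one F-score
# (entities weighted equally).
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_f = f_score(tp, fp, fn)

print(round(macro_f, 3), round(micro_f, 3))  # prints: 0.55 0.867
```

The micro average sits close to the large class's score, illustrating why it is heavily affected by large classes.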
2.3.2 Relaxed-match Evaluation
MUC-6 [10] defines a relaxed-match evaluation: a correct
type is credited if an entity is assigned its correct type,
regardless of its boundaries, as long as there is an overlap
with the ground-truth boundaries; a correct boundary is
credited regardless of an entity's type assignment. Then
ACE [12] proposes a more complex evaluation procedure. It
resolves a few issues such as partial match and wrong type,
and considers subtypes of named entities. However, it is
problematic because the final scores are comparable only
when parameters are fixed [1], [22], [23]. Complex
evaluation methods are not intuitive and make error analysis
difficult. Thus, complex evaluation methods are not widely
used in recent studies.
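The MUC-style relaxed-match criterion above can be sketched as follows, assuming entities are represented as hypothetical `(start, end, type)` tuples with end-exclusive offsets (a simplifying assumption, not the MUC scorer itself):

```python
# Sketch of MUC-style relaxed matching; entity representation and
# function names are illustrative assumptions, not the official scorer.

def spans_overlap(pred_span, gold_span):
    """True if two (start, end) spans share at least one position."""
    return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]

def relaxed_match(pred, gold):
    """Credit TYPE if types match and spans overlap; credit BOUNDARY
    if spans are identical, regardless of the type assignment."""
    type_ok = pred[2] == gold[2] and spans_overlap(pred[:2], gold[:2])
    boundary_ok = pred[:2] == gold[:2]
    return type_ok, boundary_ok

# Right type but shifted boundary: type credit only.
print(relaxed_match((0, 5, "PER"), (0, 8, "PER")))  # (True, False)
# Exact boundary but wrong type: boundary credit only.
print(relaxed_match((0, 8, "ORG"), (0, 8, "PER")))  # (False, True)
```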
2.4 Traditional Approaches to NER
Traditional approaches to NER are broadly classified into
three main streams: rule-based, unsupervised learning, and
feature-based supervised learning approaches [1], [26].
2.4.1 Rule-based Approaches
Rule-based NER systems rely on hand-crafted rules. Rules
can be designed based on domain-specific gazetteers [9],
[42] and syntactic-lexical patterns [43]. Kim [44] proposed
to use the Brill rule inference approach for speech input.
This system generates rules automatically based on Brill's
part-of-speech tagger. In the biomedical domain, Hanisch et
al. [45] proposed ProMiner, which leverages a pre-processed
synonym dictionary to identify protein mentions and
potential gene names in biomedical text. Quimbaya et
al. [46] proposed a dictionary-based approach for NER in
electronic health records. Experimental results show the
approach improves recall while having limited impact on
precision.
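The core of such dictionary-based systems is gazetteer lookup; a minimal sketch, with a hypothetical toy gazetteer and a greedy longest-match strategy (one common design choice, not the specific matching used by ProMiner or [46]):

```python
# Minimal sketch of gazetteer (dictionary) lookup for rule-based NER.
# The gazetteer entries and the greedy longest-match policy are
# illustrative assumptions.

gazetteer = {
    ("new", "york"): "LOC",
    ("acme", "corp"): "ORG",
    ("aspirin",): "DRUG",
}
max_len = max(len(key) for key in gazetteer)

def gazetteer_match(tokens):
    """Greedy longest-match lookup of token n-grams against the gazetteer."""
    tokens = [t.lower() for t in tokens]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                hits.append((" ".join(key), gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts here
    return hits

print(gazetteer_match("He took Aspirin in New York".split()))
# [('aspirin', 'DRUG'), ('new york', 'LOC')]
```

The lookup only finds what the dictionary contains, which is exactly why incomplete dictionaries yield high precision but low recall.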
Some other well-known rule-based NER systems include
LaSIE-II [47], NetOwl [48], Facile [49], SAR [50],
FASTUS [51], and LTG [52]. These systems are mainly based
on hand-crafted semantic and syntactic rules to recognize
entities. Rule-based systems work very well when the
lexicon is exhaustive. Due to domain-specific rules and
incomplete dictionaries, high precision and low recall are
often observed from such systems, and the systems cannot be
transferred to other domains.
2.4.2 Unsupervised Learning Approaches
A typical approach in unsupervised learning is
clustering [1]. Clustering-based NER systems extract named
entities from the clustered groups based on context
similarity. The key idea is that lexical resources, lexical
patterns, and statistics computed on a large corpus can be
used to infer mentions of named entities. Collins et
al. [53] observed that the use of unlabeled data reduces the
requirements for supervision to just seven simple "seed"
rules. The authors then presented two unsupervised
algorithms for named entity classification. Similarly,
KNOWITALL [9] leveraged a set of predicate names as input
and bootstrapped its recognition process from a small set of
generic extraction patterns.
Nadeau et al. [54] proposed an unsupervised system for
gazetteer building and named entity ambiguity resolution.
This system combines entity extraction and disambiguation
based on simple yet highly effective heuristics. In
addition, Zhang and Elhadad [43] proposed an unsupervised
approach to extracting named entities from biomedical text.
Instead of supervision, their model resorts to terminolo-
gies, corpus statistics (e.g., inverse document frequency
and context vectors) and shallow syntactic knowledge (e.g.,
noun phrase chunking). Experiments on two mainstream
biomedical datasets demonstrate the effectiveness and gen-
eralizability of their unsupervised approach.
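The seed-rule bootstrapping idea described above can be sketched with a toy corpus of (left-context, word) pairs; the corpus, seed set, and the simple promote-everything rule are all illustrative assumptions (real systems such as [53] rank and filter candidate rules rather than accepting them all):

```python
# Toy sketch of seed bootstrapping in the spirit of unsupervised NER:
# seed names label occurrences, their contexts become extraction
# patterns, and patterns in turn label new names. All data hypothetical.

corpus = [
    ("mr", "cooper"), ("mr", "smith"), ("president", "cooper"),
    ("president", "obama"), ("the", "table"), ("the", "idea"),
]
seeds = {"cooper"}                 # seed entity names
names, patterns = set(seeds), set()

for _ in range(3):                 # a few bootstrapping rounds
    # Promote left-contexts of known names to extraction patterns...
    patterns |= {ctx for ctx, word in corpus if word in names}
    # ...then label any word occurring with a known pattern as a name.
    names |= {word for ctx, word in corpus if ctx in patterns}

print(sorted(names))  # ['cooper', 'obama', 'smith']
```

Note that "table" and "idea" are never labeled, because their context "the" was never promoted; unranked promotion like this would drift on real corpora, which is why practical systems score candidate patterns.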
2.4.3 Feature-based Supervised Learning Approaches
Applying supervised learning, NER is cast as a multi-class
classification or sequence labeling task. Given annotated
data samples, features are carefully designed to represent
each training example. Machine learning algorithms are then
utilized to learn a model to recognize similar patterns from
unseen data.
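The sequence-labeling formulation can be illustrated with the widely used BIO scheme (B-/I- prefixes mark entity beginnings and insides, O marks non-entities); the sentence and labels below are a hypothetical training example:

```python
# Sketch: NER cast as sequence labeling with the common BIO scheme.
# The sentence and gold labels are a hypothetical example.

tokens = ["Michael", "Jordan", "was", "born", "in", "Brooklyn", "."]
labels = ["B-PER",   "I-PER",  "O",   "O",    "O",  "B-LOC",    "O"]

def decode(tokens, labels):
    """Recover (entity_text, type) spans from a BIO label sequence."""
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)           # continue the open entity
        else:                             # O label closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(decode(tokens, labels))
# [('Michael Jordan', 'PER'), ('Brooklyn', 'LOC')]
```

A sequence model (HMM, CRF, etc.) is trained to predict one such label per token; decoding then turns label sequences back into entities.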
Feature engineering is critical in supervised NER systems.
Feature vector representation is an abstraction over text
where a word is represented by one or many Boolean, numeric,
or nominal values [1], [55]. Word-level features (e.g.,
case, morphology, and part-of-speech tag) [56]–[58], list
lookup features (e.g., Wikipedia gazetteer and DBpedia
gazetteer) [59]–[62], and document and corpus features
(e.g., local syntax and multiple occurrences) [63]–[66] have
been widely used in various supervised NER systems. More
feature designs are discussed in [1], [28], [67].
Based on these features, many machine learning algorithms
have been applied in supervised NER, including Hidden Markov
Models (HMM) [68], Decision Trees [69], Maximum Entropy
Models [70], Support Vector Machines (SVM) [71], and
Conditional Random Fields (CRF) [72].
Bikel et al. [73], [74] proposed the first HMM-based NER
system, named IdentiFinder, to identify and classify names,
dates, time expressions, and numerical quantities. In
addition, Szarvas et al. [75] developed a multilingual NER
system using the C4.5 decision tree and the AdaBoostM1
learning algorithm. A major merit is that it provides an
opportunity to train several independent decision tree
classifiers through different subsets of features, then
combine their decisions