ERNIE: Enhanced Language Representation with Informative Entities
Zhengyan Zhang 1,2,3∗, Xu Han 1,2,3∗, Zhiyuan Liu 1,2,3†, Xin Jiang 4, Maosong Sun 1,2,3, Qun Liu 4
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
3 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
4 Huawei Noah's Ark Lab, Huawei Technologies
∗ Indicates equal contribution.
† Corresponding author: Zhiyuan Liu (liuzy@tsinghua.edu.cn)
Abstract
Neural language representation models such as BERT, pre-trained on large-scale corpora, can capture rich semantic patterns from plain text and be fine-tuned to consistently improve the performance of various NLP tasks. However, existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which takes full advantage of lexical, syntactic, and knowledge information simultaneously. Experimental results demonstrate that ERNIE achieves significant improvements on various knowledge-driven tasks while remaining comparable with the state-of-the-art model BERT on other common NLP tasks. The source code of this paper is available at https://github.com/thunlp/ERNIE.
1 Introduction
Pre-trained language representation models, including feature-based (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2017, 2018) and fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018) approaches, can capture rich language information from text and then benefit many NLP applications. BERT (Devlin et al., 2018), as one of the most recently proposed models, obtains state-of-the-art results on various NLP applications by simple fine-tuning, including named entity recognition (Sang and De Meulder, 2003), question answering (Rajpurkar et al., 2016; Zellers et al., 2018), natural language inference (Bowman et al., 2015), and text classification (Wang et al., 2018).
[Figure 1: a knowledge graph over the sentence "Bob Dylan wrote Blowin' in the Wind in 1962, and wrote Chronicles: Volume One in 2004.", with is_a edges typing Blowin' in the Wind as Song, Chronicles: Volume One as Book, and Bob Dylan as Songwriter and Writer, plus composer and author edges from Bob Dylan to the two works.]
Figure 1: An example of incorporating extra knowledge information for language understanding. The solid lines present the existing knowledge facts. The red dotted lines present the facts extracted from the sentence in red. The blue dot-dash lines present the facts extracted from the sentence in blue.
Although pre-trained language representation models have achieved promising results and worked as a routine component in many NLP tasks, they neglect to incorporate knowledge information for language understanding. As shown in Figure 1, without knowing that Blowin' in the Wind and Chronicles: Volume One are a song and a book respectively, it is difficult to recognize the two occupations of Bob Dylan, i.e., songwriter and writer, in the entity typing task. Furthermore, it is nearly impossible to extract the fine-grained relations, such as composer and author, in the relation classification task. For the existing pre-trained language representation models, these two sentences are syntactically ambiguous, like "UNK wrote UNK in UNK". Hence, considering rich knowledge information can lead to better language understanding and accordingly benefit various knowledge-driven applications, e.g., entity typing and relation classification.
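To make the motivating example concrete, the knowledge facts in Figure 1 can be written as (head, relation, tail) triples. The following minimal Python sketch is purely illustrative and not part of the paper; the entity and relation names are taken from Figure 1, and the helper functions are hypothetical stand-ins for how a KG exposes entity types and fine-grained relations that a text-only model cannot recover from "UNK wrote UNK in UNK":

# Illustrative sketch only (not from the paper): the knowledge facts of Figure 1
# written as (head, relation, tail) triples, with toy helpers for entity typing
# and relation classification lookups.

KG_TRIPLES = [
    ("Blowin' in the Wind", "is_a", "Song"),
    ("Chronicles: Volume One", "is_a", "Book"),
    ("Bob Dylan", "is_a", "Songwriter"),
    ("Bob Dylan", "is_a", "Writer"),
    ("Bob Dylan", "composer", "Blowin' in the Wind"),
    ("Bob Dylan", "author", "Chronicles: Volume One"),
]

def entity_types(entity):
    """Types attached to an entity via is_a edges (entity typing)."""
    return {t for h, r, t in KG_TRIPLES if h == entity and r == "is_a"}

def relations_between(head, tail):
    """Fine-grained relations between two entities (relation classification)."""
    return {r for h, r, t in KG_TRIPLES if h == head and t == tail and r != "is_a"}

print(entity_types("Bob Dylan"))                                # e.g. {'Songwriter', 'Writer'}
print(relations_between("Bob Dylan", "Blowin' in the Wind"))    # {'composer'}
print(relations_between("Bob Dylan", "Chronicles: Volume One")) # {'author'}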