NEZHA: NEURAL CONTEXTUALIZED REPRESENTATION FOR
CHINESE LANGUAGE UNDERSTANDING
TECHNICAL REPORT
Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao,
Yasheng Wang, Jiashu Lin∗, Xin Jiang, Xiao Chen, Qun Liu
Noah's Ark Lab, ∗HiSilicon, Huawei Technologies
{wei.junqiu1, renxiaozhe, lixiaoguang11, wenyong.huang, liao.yi,
wangyasheng, linjiashu, jiang.xin, chen.xiao2, qun.liu}@huawei.com
September 6, 2019
ABSTRACT
Pre-trained language models have achieved great success in various natural language understanding
(NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training
on large-scale corpora. In this technical report, we present our practice of pre-training language
models named NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding)
on Chinese corpora and fine-tuning for Chinese NLU tasks. The current version of NEZHA is based
on BERT [1] with a collection of proven improvements, which include Functional Relative Positional
Encoding as an effective positional encoding scheme, the Whole Word Masking strategy, Mixed
Precision Training and the LAMB optimizer in training the models. The experimental results show
that NEZHA achieves state-of-the-art performance when fine-tuned on several representative Chinese
tasks, including named entity recognition (People's Daily NER), sentence matching (LCQMC),
Chinese sentiment classification (ChnSenti) and natural language inference (XNLI).
Keywords Pre-trained Language Models · NEZHA · Chinese Language Understanding
1 Introduction
Pre-trained language models such as ELMo [2], BERT [1], ERNIE-Baidu [3, 4], ERNIE-Tsinghua [5], XLNet [6],
RoBERTa [7] and MegatronLM¹ have demonstrated remarkable success in modeling contextualized word representations
by utilizing massive amounts of training text. As a fundamental technique in natural language processing (NLP),
language models pre-trained on text can be easily transferred to downstream NLP tasks with fine-tuning, achieving
state-of-the-art performance on many tasks including sentiment analysis, machine reading comprehension, sentence
matching, named entity recognition and natural language inference.
The existing pre-trained language models are mostly learned from English corpora (e.g., BooksCorpus and English
Wikipedia). There have been several attempts to train models specifically for the Chinese language, including Google's
BERT [1] for Chinese, ERNIE-Baidu [3, 4] and BERT-WWM [8]. All of these models are based on the Transformer [9]
and trained on two unsupervised tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In the
MLM task, the model learns to recover the masked words in the training sentences. In the NSP task, it tries to predict
whether one sentence is the next sentence of the other. One of the main differences among the Chinese models lies in
their word masking strategy in the MLM task. Google's BERT masks each Chinese character or WordPiece token [10]
independently. ERNIE-Baidu further makes the MLM task more challenging by masking the entities or phrases in a
sentence as a whole, where each entity or phrase may contain multiple characters or tokens. BERT-WWM takes a
similar strategy called Whole Word Masking (WWM), which enforces that all the tokens belonging to a Chinese word
should be masked together. In addition, the most recently published ERNIE-Baidu 2.0 [4] incorporates further
pre-training tasks such as Token-Document Relation Prediction and Sentence Reordering.
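To make these masking strategies concrete, the following minimal sketch contrasts BERT-style independent token
masking with Whole Word Masking. The tokenization, word segmentation and 15% masking rate are illustrative
assumptions, not the exact implementation used by any of the cited models.

import random

MASK, PROB = "[MASK]", 0.15  # illustrative masking rate, as in BERT

def char_level_masking(tokens):
    # BERT-style: each character/WordPiece token is masked independently.
    return [MASK if random.random() < PROB else t for t in tokens]

def whole_word_masking(tokens, word_spans):
    # WWM: if a segmented word is selected, all of its tokens are masked together.
    # word_spans holds (start, end) token indices produced by a Chinese word segmenter.
    masked = list(tokens)
    for start, end in word_spans:
        if random.random() < PROB:
            masked[start:end] = [MASK] * (end - start)
    return masked

# Toy example: "华为" spans two character tokens and is masked as a unit under WWM.
tokens = ["华", "为", "发", "布", "预", "训", "练", "模", "型"]
word_spans = [(0, 2), (2, 4), (4, 7), (7, 9)]  # hypothetical segmentation
print(whole_word_masking(tokens, word_spans))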
¹ https://nv-adlr.github.io/MegatronLM