Input/Output Representations To make BERT
handle a variety of down-stream tasks, our input
representation is able to unambiguously represent
both a single sentence and a pair of sentences
(e.g., ⟨Question, Answer⟩) in one token sequence.
Throughout this work, a “sentence” can be an arbi-
trary span of contiguous text, rather than an actual
linguistic sentence. A “sequence” refers to the in-
put token sequence to BERT, which may be a sin-
gle sentence or two sentences packed together.
We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. The first
token of every sequence is always a special clas-
sification token ([CLS]). The final hidden state
corresponding to this token is used as the ag-
gregate sequence representation for classification
tasks. Sentence pairs are packed together into a
single sequence. We differentiate the sentences in
two ways. First, we separate them with a special
token ([SEP]). Second, we add a learned embed-
ding to every token indicating whether it belongs
to sentence A or sentence B. As shown in Figure 1,
we denote the input embedding as E, the final hidden
vector of the special [CLS] token as C ∈ R^H, and
the final hidden vector for the i-th input token as
T_i ∈ R^H.
For a given token, its input representation is
constructed by summing the corresponding token,
segment, and position embeddings. A visualiza-
tion of this construction can be seen in Figure 2.
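For concreteness, the following is a minimal NumPy sketch of this construction, not the released implementation; the build_input helper, the embedding-table stand-ins, and the [CLS]/[SEP] ids are assumptions for illustration, with H = 768 taken from the BERT-Base configuration and the 30,000-entry vocabulary from the WordPiece setting above.

import numpy as np

# Sketch-only hyperparameters: hidden size, vocabulary size, maximum length.
H, VOCAB, MAX_LEN = 768, 30000, 512
rng = np.random.default_rng(0)

# Stand-ins for the learned token, segment (A/B), and position embedding tables.
token_emb = rng.normal(0.0, 0.02, (VOCAB, H))
segment_emb = rng.normal(0.0, 0.02, (2, H))
position_emb = rng.normal(0.0, 0.02, (MAX_LEN, H))

CLS_ID, SEP_ID = 101, 102  # illustrative WordPiece ids for [CLS] and [SEP]

def build_input(tokens_a, tokens_b=None):
    """Pack one sentence (or a sentence pair) as [CLS] A [SEP] (B [SEP]) and
    sum the token, segment, and position embeddings at each position."""
    ids = [CLS_ID] + tokens_a + [SEP_ID]
    segments = [0] * len(ids)                      # sentence A
    if tokens_b is not None:
        ids += tokens_b + [SEP_ID]
        segments += [1] * (len(tokens_b) + 1)      # sentence B
    positions = list(range(len(ids)))
    return token_emb[ids] + segment_emb[segments] + position_emb[positions]

# Example: a hypothetical question/answer pair of WordPiece ids.
embeddings = build_input([2054, 2003, 14324], [1037, 2653, 2944])
print(embeddings.shape)  # (9, 768)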
3.1 Pre-training BERT
Unlike Peters et al. (2018a) and Radford et al.
(2018), we do not use traditional left-to-right or
right-to-left language models to pre-train BERT.
Instead, we pre-train BERT using two unsuper-
vised tasks, described in this section. This step
is presented in the left part of Figure 1.
Task #1: Masked LM Intuitively, it is reason-
able to believe that a deep bidirectional model is
strictly more powerful than either a left-to-right
model or the shallow concatenation of a left-to-
right and a right-to-left model. Unfortunately,
standard conditional language models can only be
trained left-to-right or right-to-left, since bidirec-
tional conditioning would allow each word to in-
directly “see itself”, and the model could trivially
predict the target word in a multi-layered context.
In order to train a deep bidirectional representa-
tion, we simply mask some percentage of the input
tokens at random, and then predict those masked
tokens. We refer to this procedure as a “masked
LM” (MLM), although it is often referred to as a
Cloze task in the literature (Taylor, 1953). In this
case, the final hidden vectors corresponding to the
mask tokens are fed into an output softmax over
the vocabulary, as in a standard LM. In all of our
experiments, we mask 15% of all WordPiece to-
kens in each sequence at random. In contrast to
denoising auto-encoders (Vincent et al., 2008), we
only predict the masked words rather than recon-
structing the entire input.
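As a rough sketch of that prediction step (an assumption for illustration, not the paper's code): the final hidden vectors are projected onto the vocabulary and only the masked positions contribute to the cross-entropy loss. Here targets[i] is assumed to hold the original token id at predicted positions and -1 elsewhere.

import numpy as np

def mlm_loss(final_hidden, output_weights, targets):
    """Softmax over the vocabulary at every position, but average the
    cross-entropy only over masked (predicted) positions."""
    logits = final_hidden @ output_weights.T              # [seq_len, vocab]
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    losses = [-log_probs[i, t] for i, t in enumerate(targets) if t >= 0]
    return float(np.mean(losses)) if losses else 0.0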
Although this allows us to obtain a bidirec-
tional pre-trained model, a downside is that we
are creating a mismatch between pre-training and
fine-tuning, since the [MASK] token does not ap-
pear during fine-tuning. To mitigate this, we do
not always replace “masked” words with the ac-
tual [MASK] token. The training data generator
chooses 15% of the token positions at random for
prediction. If the i-th token is chosen, we replace
the i-th token with (1) the [MASK] token 80% of
the time (2) a random token 10% of the time (3)
the unchanged i-th token 10% of the time. Then,
T_i will be used to predict the original token with
cross entropy loss. We compare variations of this
procedure in Appendix C.2.
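A minimal sketch of such a data generator follows (assumed, not the authors' code); it approximates the 15% selection by sampling each position independently, and MASK_ID is an illustrative placeholder for the [MASK] token id.

import random

MASK_ID = 103  # illustrative id for the [MASK] token

def mask_tokens(token_ids, vocab_size=30000, select_prob=0.15):
    """Select ~15% of positions for prediction, then apply the 80/10/10
    [MASK] / random-token / keep-unchanged replacement rule.
    Returns the corrupted inputs and per-position targets (-1 = not predicted)."""
    inputs = list(token_ids)
    targets = [-1] * len(inputs)
    for i in range(len(inputs)):
        if random.random() < select_prob:
            targets[i] = inputs[i]                # original token is the label
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the unchanged i-th token
    return inputs, targets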
Task #2: Next Sentence Prediction (NSP)
Many important downstream tasks such as Ques-
tion Answering (QA) and Natural Language Infer-
ence (NLI) are based on understanding the rela-
tionship between two sentences, which is not di-
rectly captured by language modeling. In order
to train a model that understands sentence rela-
tionships, we pre-train for a binarized next sen-
tence prediction task that can be trivially gener-
ated from any monolingual corpus. Specifically,
when choosing the sentences A and B for each pre-
training example, 50% of the time B is the actual
next sentence that follows A (labeled as IsNext),
and 50% of the time it is a random sentence from
the corpus (labeled as NotNext). As we show
in Figure 1, C is used for next sentence prediction
(NSP).^5 Despite its simplicity, we demonstrate in
Section 5.1 that pre-training towards this task is
very beneficial to both QA and NLI.^6
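A sketch of how such pairs might be drawn from a monolingual corpus (assumed for illustration; corpus is a hypothetical list of documents, each a list of sentences):

import random

def make_nsp_example(corpus):
    """Return (sentence_a, sentence_b, label): 50% of the time B is the
    sentence that actually follows A (IsNext), otherwise a random sentence
    from the corpus (NotNext)."""
    doc = random.choice([d for d in corpus if len(d) >= 2])
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:
        return sent_a, doc[idx + 1], "IsNext"
    other_doc = random.choice(corpus)
    return sent_a, random.choice(other_doc), "NotNext"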
^5 The final model achieves 97%-98% accuracy on NSP.
^6 The vector C is not a meaningful sentence representation
without fine-tuning, since it was trained with NSP.