BERT Pre-trained Model: A Breakthrough in Deep Bidirectional Transformers for Language Understanding


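BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing (NLP) technique proposed by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova of the Google AI Language team in the 2018 paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Its core idea is to pre-train deep bidirectional representations from unlabeled text alone, in contrast to earlier models such as Peters et al. (2018) and Radford et al. (2018), which are unidirectional or only shallowly bidirectional.

The pre-training procedure is the key: BERT conditions jointly on left and right context in every layer, so each token's representation reflects the text on both sides of it. As a result, the pre-trained model is general-purpose. Adding a single output layer and fine-tuning is enough to reach state-of-the-art results on a wide range of tasks, without substantial task-specific architecture changes.

BERT obtained state-of-the-art results on 11 NLP tasks. It raised the GLUE benchmark score to 80.5%, a 7.7-point absolute improvement over the previous best, and reached 86.7% accuracy on MultiNLI (Multi-Genre Natural Language Inference). In short, BERT combines a deep bidirectional encoder with the Transformer architecture to enable effective transfer learning across tasks, and it has been widely adopted for text classification, question answering, and other language understanding tasks.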

Input/Output Representations. To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as $E$, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^H$, and the final hidden vector for the $i$-th input token as $T_i \in \mathbb{R}^H$, where $H$ is the hidden size.
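
As a rough illustration of the packing scheme described above, the following minimal Python sketch builds the token sequence and per-token segment ids for a single sentence or a sentence pair. The function name `pack_sequence`, the closing [SEP] after sentence B, and the choice to give each [SEP] the segment id of the sentence before it are assumptions of this sketch rather than details stated in the text. The input is assumed to be already split into WordPiece tokens.

```python
from typing import List, Optional, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def pack_sequence(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Pack one sentence or a sentence pair into a single token sequence.

    Returns the tokens and one segment id per token
    (0 for sentence A, 1 for sentence B).
    """
    # [CLS] always comes first; sentence A is closed by [SEP].
    tokens = [CLS] + tokens_a + [SEP]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Assumption: sentence B is also closed by a [SEP] token.
        tokens += tokens_b + [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_sequence(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(tokens)
print(segment_ids)
```

The segment ids play the role of the learned A/B indicator: each id would select one of two segment embeddings for its token.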

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
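
The summation can be sketched in a few lines of plain Python. This is only a toy illustration: the embedding tables are filled with random numbers in place of learned parameters, and the sizes (`H`, `VOCAB`, `MAX_LEN`) and helper names are hypothetical values chosen to keep the example small.

```python
import random
from typing import List


def make_table(rows: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Create a small random embedding table (a stand-in for learned parameters)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def input_representation(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Sum token, segment, and position embeddings element-wise at each position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])
        ]
        out.append(vec)
    return out


rng = random.Random(0)
H = 8         # toy hidden size (hypothetical)
VOCAB = 20    # toy vocabulary size (the text uses 30,000 WordPiece tokens)
MAX_LEN = 16  # toy maximum sequence length (hypothetical)

token_emb = make_table(VOCAB, H, rng)
segment_emb = make_table(2, H, rng)       # one row each for sentence A and B
position_emb = make_table(MAX_LEN, H, rng)

token_ids = [1, 7, 3, 2, 9, 2]            # made-up ids for a packed pair
segment_ids = [0, 0, 0, 0, 1, 1]
E = input_representation(token_ids, segment_ids, token_emb, segment_emb, position_emb)
print(len(E), len(E[0]))
```

Each position ends up with a single vector of size `H`, so the same token id yields different inputs depending on where it appears and which sentence it belongs to.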

3.1 Pre-training BERT

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

Task #1: Masked LM. Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

(Footnote: In the literature, the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.)

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.
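
To make "only predict the masked words" concrete, here is a minimal sketch of a loss that applies a softmax over the vocabulary at the masked positions and ignores every other position. The function names, the toy logits, and the choice to average over masked positions are assumptions made for illustration; in a real model the logits would come from projecting the final hidden vectors onto the vocabulary.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_lm_loss(
    logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
    masked_positions: Sequence[int],
) -> float:
    """Average cross-entropy over the masked positions only.

    Positions that were not selected for prediction contribute nothing.
    """
    if not masked_positions:
        return 0.0
    total = 0.0
    for pos in masked_positions:
        probs = softmax(logits[pos])
        total += -math.log(probs[target_ids[pos]])
    return total / len(masked_positions)


# Toy example: 3 positions, a vocabulary of 4 ids, only position 1 is masked.
logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
target_ids = [3, 0, 1]
loss = masked_lm_loss(logits, target_ids, masked_positions=[1])
print(round(loss, 4))
```

A denoising auto-encoder would instead sum a reconstruction loss over all positions, which is the contrast the paragraph above draws.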

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, (3) the unchanged i-th token 10% of the time. Then, $T_i$ will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
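
The 80%/10%/10% replacement rule can be sketched as a small data-generation function. The names `mask_tokens` and `SPECIAL`, the decision to skip [CLS] and [SEP] when choosing positions, and the rule of always selecting at least one position are assumptions of this sketch and are not stated in the text.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")  # assumption: special tokens are never selected


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[str], List[int]]:
    """Return the corrupted tokens and the positions chosen for prediction."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    if not candidates:
        return list(tokens), []
    # Choose roughly 15% of positions (at least one, an assumption of this sketch).
    num_to_predict = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    chosen = sorted(rng.sample(candidates, num_to_predict))

    corrupted = list(tokens)
    for i in chosen:
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, chosen


vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing", "cat", "runs"]
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
          "he", "likes", "play", "##ing", "[SEP]"]
corrupted, chosen = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(chosen)
```

The original tokens at the `chosen` positions serve as prediction targets in all three cases, including the one where the input token is left unchanged.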

Task #2: Next Sentence Prediction (NSP). Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP). Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.

The final model achieves 97%-98% accuracy on NSP.

The vector C is not a meaningful sentence representation without fine-tuning, since it was trained with NSP.
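
The next sentence prediction pairs described above can be generated from a small corpus along the following lines. This is a sketch under stated assumptions: the function name `make_nsp_examples` is hypothetical, the corpus is assumed to be a list of documents that are already split into sentences, and the negative sentence is drawn from the whole corpus while excluding sentence A and the true next sentence.

```python
import random
from typing import List, Tuple


def make_nsp_examples(
    documents: List[List[str]], rng: random.Random
) -> List[Tuple[str, str, str]]:
    """Build (sentence_a, sentence_b, label) triples for next sentence prediction."""
    all_sentences = [sent for doc in documents for sent in doc]
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            sent_a, true_next = doc[i], doc[i + 1]
            # Assumption: a valid negative is any sentence other than A or the true next one.
            negatives = [s for s in all_sentences if s not in (sent_a, true_next)]
            if negatives and rng.random() < 0.5:
                examples.append((sent_a, rng.choice(negatives), "NotNext"))
            else:
                examples.append((sent_a, true_next, "IsNext"))
    return examples


documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere", "most eat fish"],
]
examples = make_nsp_examples(documents, random.Random(0))
for sent_a, sent_b, label in examples:
    print(label, "|", sent_a, "|", sent_b)
```

Each pair would then be packed into one sequence, and the final hidden vector C of the [CLS] token would feed a two-way classifier over the IsNext and NotNext labels.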
