optimization hyperparameters, given in Section 2,
except for the peak learning rate and number of
warmup steps, which are tuned separately for each
setting. We additionally found training to be very
sensitive to the Adam epsilon term, and in some
cases we obtained better performance or improved
stability after tuning it. Similarly, we found setting
β₂ = 0.98 to improve stability when training with
large batch sizes.
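For concreteness, the following is a minimal PyTorch sketch of this optimizer setup; the model, peak learning rate, warmup length, total steps, and decay shape are illustrative placeholders, since the peak learning rate and warmup steps are tuned separately for each setting.

import torch

# Sketch of the optimizer configuration described above (values are examples).
model = torch.nn.Linear(768, 768)  # stand-in for the pretraining model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,            # peak learning rate (tuned per setting)
    betas=(0.9, 0.98),  # beta_2 = 0.98 improves stability with large batches
    eps=1e-6,           # Adam epsilon; training can be sensitive to this term
    weight_decay=0.01,
)

warmup_steps, total_steps = 24_000, 500_000  # example values only

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then linear decay (one common choice).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)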
We pretrain with sequences of at most T = 512
tokens. Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train
with a reduced sequence length for the first 90% of
updates. We train only with full-length sequences.
We train with mixed precision floating point
arithmetic on DGX-1 machines, each with 8 ×
32GB Nvidia V100 GPUs interconnected by Infiniband (Micikevicius et al., 2018).
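For reference, a minimal sketch of one mixed-precision training step using PyTorch's torch.cuda.amp utilities is shown below; the model, optimizer, and batch names are assumptions, and this stands in for, rather than reproduces, the authors' training loop.

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, input_ids, labels):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in float16 where safe
        loss = model(input_ids, labels=labels).loss
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)               # unscale gradients, then apply the update
    scaler.update()
    return loss.item()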
3.2 Data
BERT-style pretraining crucially relies on large
quantities of text.
Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison.
We consider five English-language corpora of
varying sizes and domains, totaling over 160GB
of uncompressed text. We use the following text
corpora:
• BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB).
• CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019. (76GB after filtering). We use news-please (Hamborg et al., 2017) to collect and extract CC-NEWS; CC-NEWS is similar to the REALNEWS dataset described in Zellers et al. (2019).
• OPENWEBTEXT (Gokaslan and Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019). The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB). The authors and their affiliated institutions are not in any way affiliated with the creation of the OpenWebText dataset.
• STORIES, a dataset introduced in Trinh and Le (2018), containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB).
3.3 Evaluation
Following previous work, we evaluate our pretrained models on downstream tasks using the following three benchmarks.
GLUE The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of 9 datasets for evaluating natural language understanding systems: CoLA (Warstadt et al., 2018), Stanford Sentiment Treebank (SST) (Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Semantic Textual Similarity Benchmark (STS) (Agirre et al., 2007), Quora Question Pairs (QQP) (Iyer et al., 2016), Multi-Genre NLI (MNLI) (Williams et al., 2018), Question NLI (QNLI) (Rajpurkar et al., 2016), Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and Winograd NLI (WNLI) (Levesque et al., 2011). Tasks are framed as either single-sentence classification or sentence-pair classification tasks. The GLUE organizers provide training and development data splits as well as a submission server and leaderboard that allow participants to evaluate and compare their systems on private held-out test data.
For the replication study in Section 4, we report results on the development sets after finetuning the pretrained models on the corresponding single-task training data (i.e., without multi-task training or ensembling). Our finetuning procedure follows the original BERT paper (Devlin et al., 2019).
In Section 5 we additionally report test set results obtained from the public leaderboard. These results depend on several task-specific modifications, which we describe in Section 5.1.
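A hedged sketch of what single-task finetuning on one GLUE task (RTE here) could look like, using the Hugging Face datasets and transformers libraries as stand-ins; the library choice, hyperparameter values, and task are assumptions for illustration, the paper's experiments use their own codebase, and the task-specific modifications of Section 5.1 are not shown.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = "rte"  # one of the 9 GLUE tasks
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode(batch):
    # RTE is a sentence-pair task; single-sentence tasks pass only one text field.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=512)

data = load_dataset("glue", task).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rte-finetune",
                           per_device_train_batch_size=16,
                           learning_rate=2e-5,
                           num_train_epochs=10),
    train_dataset=data["train"],
    eval_dataset=data["validation"],  # GLUE test labels are private; report dev results
    tokenizer=tokenizer,              # enables dynamic padding during collation
)
trainer.train()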
SQuAD The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question. The task is to answer the question by extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1 and V2.0 (Rajpurkar et al., 2016, 2018). In V1.1 the context always contains an answer, whereas in V2.0 some questions are not answered in the provided context, making the task more challenging.
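To make the span-extraction formulation concrete, below is a minimal sketch of selecting the best answer span from per-token start and end logits; the function name, tensor shapes, and maximum answer length are illustrative assumptions, not the authors' implementation.

import torch

def best_span(start_logits: torch.Tensor, end_logits: torch.Tensor,
              max_answer_len: int = 30) -> tuple[int, int]:
    """Return the (start, end) token indices of the highest-scoring valid span."""
    # Score every (start, end) pair as the sum of its start and end logits.
    scores = start_logits[:, None] + end_logits[None, :]
    # Mask spans that end before they start or exceed the maximum answer length.
    seq_len = start_logits.size(0)
    idx = torch.arange(seq_len)
    valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < max_answer_len)
    scores = scores.masked_fill(~valid, float("-inf"))
    flat = scores.argmax().item()
    return flat // seq_len, flat % seq_len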