Parameter-Efficient Transfer Learning Strategies for the BERT Model


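"Parameter-Efficient Transfer Learning for NLP - Transfer Learning + BERT1"

In natural language processing (NLP), transfer learning has become an effective technique: a large model is pre-trained once and then reused to improve performance on downstream tasks. BERT (Bidirectional Encoder Representations from Transformers) in particular has achieved strong results on many NLP tasks thanks to its language-understanding ability. However, full fine-tuning for every new task is parameter inefficient, because each task requires an entire new model, which consumes substantial storage and compute. To address this, the researchers propose a parameter-efficient transfer method: adapter modules.

Adapter modules yield a compact and extensible model. They add only a small number of trainable parameters per task, and new tasks can be added without retraining earlier ones. The parameters of the original network stay fixed, which gives a high degree of parameter sharing and lowers the per-task storage cost. In their experiments, the researchers transfer BERT to 26 different text classification tasks, including the GLUE (General Language Understanding Evaluation) benchmark. With adapters, they stay close to state-of-the-art performance while adding only a few parameters per task. On GLUE, adapters come within 0.4% of full fine-tuning while adding only 3.6% parameters per task; by contrast, fine-tuning trains 100% of the parameters for every task, which is inefficient when many tasks must be handled.

Adapters thus make a model more efficient in multi-task settings and let it adapt to new tasks with little loss in performance. This is especially valuable where resources are limited or many tasks must be served. Adapters also open possibilities for continual learning: new tasks can be added as they arrive without substantially affecting what has already been learned. Parameter-efficient transfer methods such as adapter modules lower the cost of using large pre-trained models while keeping accuracy high, and they are an important direction for future NLP research and applications.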

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu (1, *), Stanisław Jastrzębski (2, *), Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly

Abstract

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapters' effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
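The storage argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper name `stored_params_relative` is hypothetical, the task count of nine GLUE tasks is an assumption made for the example, and the model assumes that adapter tuning stores one frozen copy of the base network plus the per-task additions.

```python
def stored_params_relative(num_tasks: int, per_task_fraction: float,
                           shared_base: bool) -> float:
    """Total stored parameters, as a multiple of one pre-trained model's size.

    per_task_fraction is the share of the base model's parameter count that
    is trained (and therefore stored separately) for each task.
    """
    if shared_base:
        # One frozen copy of the base model plus a small add-on per task.
        return 1.0 + num_tasks * per_task_fraction
    # No sharing: every task keeps its own trained copy.
    return num_tasks * per_task_fraction


NUM_TASKS = 9  # assumed number of GLUE tasks for this illustration

# Full fine-tuning trains 100% of the parameters per task.
fine_tuning_total = stored_params_relative(NUM_TASKS, 1.0, shared_base=False)

# Adapters add 3.6% parameters per task on top of a shared, fixed base.
adapter_total = stored_params_relative(NUM_TASKS, 0.036, shared_base=True)

print(f"fine-tuning: {fine_tuning_total:.2f}x the base model")
print(f"adapters:    {adapter_total:.2f}x the base model")
```

Under these assumptions, nine fine-tuned models cost nine times the base model's storage, while nine adapter sets cost roughly 1.3 times, and the gap widens as tasks are added.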

* Equal contribution. 1 Google Research. 2 Jagiellonian University. Correspondence to: Neil Houlsby <neilhoulsby@google.com>.

Proceedings of the International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Transfer from pre-trained models yields strong performance on many NLP tasks (Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018). BERT, a Transformer network trained on large text corpora with an unsupervised loss, attained state-of-the-art performance on text classification and extractive question answering (Devlin et al., 2018).

Figure 1. [Plot: x-axis, number of trainable parameters per task; y-axis, accuracy delta (%); series: Adapters (ours) and Fine-tune top layers.] Trade-off between accuracy and number of trained task-specific parameters, for adapter tuning and fine-tuning. The y-axis is normalized by the performance of full fine-tuning, details in Section 3. The curves show three performance percentiles across nine tasks from the GLUE benchmark. Adapter-based tuning attains a similar performance to full fine-tuning with two orders of magnitude fewer trained parameters.

In this paper we address the online setting, where tasks arrive in a stream. The goal is to build a system that performs well on all of them, but without training an entire new model for every new task. A high degree of sharing between tasks is particularly useful for applications such as cloud services, where models need to be trained to solve many tasks that arrive from customers in sequence. For this, we propose a transfer learning strategy that yields compact and extensible downstream models. Compact models are those that solve many tasks using a small number of additional parameters per task. Extensible models can be trained incrementally to solve new tasks, without forgetting previous ones. Our method yields such models without sacrificing performance.
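The two properties named above, compactness and extensibility, can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names `Adapter` and `MultiTaskModel` are hypothetical, the "base network" is a single fixed matrix, and the adapter is assumed to be a small bottleneck with a skip connection (this excerpt does not describe the adapter's internal design). Training is omitted; the point is only that adding a task leaves the shared weights and earlier tasks untouched.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Hypothetical bottleneck adapter: h + W_up * relu(W_down * h)."""

    def __init__(self, hidden, bottleneck, rng):
        self.w_down = random_matrix(bottleneck, hidden, rng)
        self.w_up = random_matrix(hidden, bottleneck, rng)

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.w_down, h)]
        return [a + b for a, b in zip(h, matvec(self.w_up, z))]

    def num_params(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


class MultiTaskModel:
    """One shared base matrix (never updated) plus one adapter per task."""

    def __init__(self, hidden, rng):
        self.hidden = hidden
        self.base = random_matrix(hidden, hidden, rng)
        self.adapters = {}

    def add_task(self, name, bottleneck, rng):
        # Only new parameters are created; nothing existing is modified.
        self.adapters[name] = Adapter(self.hidden, bottleneck, rng)

    def forward(self, name, x):
        return self.adapters[name](matvec(self.base, x))


rng = random.Random(0)
model = MultiTaskModel(hidden=16, rng=rng)
model.add_task("task_a", bottleneck=2, rng=rng)

x = [1.0] * 16
outputs_before = model.forward("task_a", x)
base_snapshot = [row[:] for row in model.base]

# A new task arrives later in the stream.
model.add_task("task_b", bottleneck=2, rng=rng)
outputs_after = model.forward("task_a", x)

base_params = sum(len(row) for row in model.base)
print("task_a unchanged:", outputs_before == outputs_after)
print("base unchanged:  ", model.base == base_snapshot)
print("base params:", base_params,
      "| params per task:", model.adapters["task_b"].num_params())
```

Because each task owns its adapter and the base is shared and fixed, the earlier task's outputs are identical before and after the new task is added, which is the sense in which such a model cannot forget previous tasks.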

The two most common transfer learning techniques in NLP are feature-based transfer and fine-tuning. Instead, we present an alternative transfer method based on adapter modules (Rebuffi et al., 2017). Feature-based transfer involves pre-training real-valued embedding vectors. These embeddings may be at the word (Mikolov et al., 2013), sentence (Cer et al., 2019), or paragraph level (Le & Mikolov, 2014). The embeddings are then fed to custom downstream models. Fine-tuning involves copying the weights from a pre-trained network and tuning them on the downstream task. Recent work shows that fine-tuning often enjoys better performance than feature-based transfer (Howard & Ruder, 2018).
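The difference between the two standard techniques comes down to which parameters receive updates. The toy example below is a sketch with assumed names and values (`sgd_step`, the scalar "encoder" weight `enc`, and the downstream weight `head` are all hypothetical): a one-parameter pre-trained encoder feeds a one-parameter downstream model, and a single gradient step on a squared error is taken under each regime.

```python
def sgd_step(enc, head, x, y, lr, train_encoder):
    """One gradient step on the loss (head * enc * x - y) ** 2.

    train_encoder=False mimics feature-based transfer: the pre-trained
    weight only produces a feature for the downstream model.
    train_encoder=True mimics fine-tuning: the copied pre-trained weight
    is tuned together with the downstream weight.
    """
    feature = enc * x
    error = head * feature - y
    grad_head = 2.0 * error * feature
    grad_enc = 2.0 * error * head * x
    new_head = head - lr * grad_head
    new_enc = enc - lr * grad_enc if train_encoder else enc
    return new_enc, new_head


PRETRAINED_ENC = 0.5  # stands in for weights copied from a pre-trained network
INIT_HEAD = 1.0       # downstream, task-specific weight

fb_enc, fb_head = sgd_step(PRETRAINED_ENC, INIT_HEAD, x=2.0, y=3.0,
                           lr=0.1, train_encoder=False)
ft_enc, ft_head = sgd_step(PRETRAINED_ENC, INIT_HEAD, x=2.0, y=3.0,
                           lr=0.1, train_encoder=True)

print("feature-based:", "encoder changed =", fb_enc != PRETRAINED_ENC)
print("fine-tuning:  ", "encoder changed =", ft_enc != PRETRAINED_ENC)
```

In the feature-based run only the downstream weight moves, so the pre-trained weight can be shared across tasks; in the fine-tuning run the pre-trained weight itself changes, so each task ends up with its own copy.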
