Generative Pre-Training for Natural Language Understanding: A Semi-Supervised Breakthrough

"Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever of OpenAI, examines how generative pre-training can improve performance on natural language understanding tasks. Tasks such as textual entailment, question answering, semantic similarity assessment, and document classification can draw on abundant unlabeled text, but labeled data for any particular task is scarce, which makes it difficult for discriminatively trained models to perform well.

The authors propose first pre-training a language model, unsupervised, on a large corpus. Through this the model learns general-purpose representations that adapt to a wide range of linguistic structures and semantics. Unlike prior approaches, which extract the pre-trained model's features for use in a separate task-specific supervised model, they fine-tune the pre-trained model itself, applying task-aware input transformations for discriminative fine-tuning. This keeps the model architecture simple while still achieving effective transfer learning.

In experiments, the method yields large gains on a broad range of language understanding benchmarks, significantly improving textual entailment, question answering, semantic similarity assessment, and document classification. Combining generative pre-training with task-specific adaptation addresses the problem of model generalization when labeled data is scarce, demonstrating substantial practical potential.

In summary, the paper's core contribution is a new NLP framework that pairs generative pre-training with task-aware fine-tuning, markedly improving performance on natural language understanding and offering a compelling way to balance unsupervised and supervised learning. The result is significant for the field and likely to see wide use in future work.
pre-trained language or machine translation model as auxiliary features while training a supervised
model on the target task. This involves a substantial amount of new parameters for each separate
target task, whereas we require minimal changes to our model architecture during transfer.
Auxiliary training objectives
Adding auxiliary unsupervised training objectives is an alternative
form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of
auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling
to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling
objective to their target task objective and demonstrated performance gains on sequence labeling
tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training
already learns several linguistic aspects relevant to target tasks.
3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.
3.1 Unsupervised pre-training
Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{1}$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained using stochastic gradient descent [51].
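To make Eq. 1 concrete, here is a minimal PyTorch sketch of the windowed log-likelihood; the model `lm` (any module mapping a $k$-token context window to next-token logits) and the function name are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def l1_objective(lm, tokens, k):
    """Eq. 1: sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)."""
    total = torch.tensor(0.0)
    for i in range(k, len(tokens)):                 # tokens: 1-D LongTensor of ids
        context = tokens[i - k:i].unsqueeze(0)      # (1, k) context window
        log_probs = F.log_softmax(lm(context), dim=-1)
        total = total + log_probs[0, tokens[i]]     # log P(u_i | context)
    return total  # maximize via SGD on Theta, i.e. minimize -total
```

In practice the sum is batched and computed for all positions in parallel rather than with an explicit Python loop.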
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top}) \tag{2}$$

where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
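A minimal sketch of the forward pass in Eq. 2, assuming hypothetical hyperparameter names and substituting `nn.TransformerEncoderLayer` with a causal mask for the paper's decoder block (the paper's exact block internals differ):

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    """Eq. 2: h_0 = U W_e + W_p; h_l = transformer_block(h_{l-1});
    P(u) = softmax(h_n W_e^T)."""
    def __init__(self, vocab_size, k, d_model, n_layers, n_heads):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)      # token embedding matrix
        self.W_p = nn.Parameter(torch.zeros(k, d_model))  # position embedding matrix
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))

    def forward(self, U):                        # U: (batch, k) context token ids
        h = self.W_e(U) + self.W_p[: U.size(1)]  # h_0
        mask = nn.Transformer.generate_square_subsequent_mask(U.size(1))
        for block in self.blocks:                # h_l for l in [1, n]
            h = block(h, src_mask=mask)          # mask keeps attention left-to-right
        return h @ self.W_e.weight.T             # logits; softmax gives P(u)
```

Projecting the final states back through $W_e^{\top}$ mirrors the embedding-weight sharing in Eq. 2.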
3.2 Supervised fine-tuning
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y) \tag{3}$$
This gives us the following objective to maximize:
$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m) \tag{4}$$
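A sketch of Eqs. 3 and 4 under the same assumptions; `lm` here is taken to return the final transformer block's activations $h_l$ (shape `(batch, m, d_model)`), and `n_classes` is an illustrative name:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    """Eq. 3: P(y | x^1..x^m) = softmax(h_l^m W_y)."""
    def __init__(self, lm, d_model, n_classes):
        super().__init__()
        self.lm = lm                                          # pre-trained body
        self.W_y = nn.Linear(d_model, n_classes, bias=False)  # added output layer

    def forward(self, x):            # x: (batch, m) input token ids
        h_l = self.lm(x)             # (batch, m, d_model) final-block activations
        return self.W_y(h_l[:, -1])  # h_l^m: activation at the last position

def l2_loss(clf, x, y):
    """Eq. 4 as a loss: minimizing cross-entropy maximizes sum log P(y|x)."""
    return F.cross_entropy(clf(x), y)
```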
We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \tag{5}$$
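A sketch of Eq. 5 as a training loss, reusing the classifier from the previous sketch; `lm_logits` (next-token logits over the fine-tuning inputs) is an assumed helper, and the default λ = 0.5 matches the value reported in the paper:

```python
import torch.nn.functional as F

def l3_loss(clf, lm_logits, x, y, lam=0.5):
    """Eq. 5 as a loss: -L3(C) = -L2(C) - lambda * L1(C)."""
    clf_loss = F.cross_entropy(clf(x), y)             # supervised term (-L2)
    logits = lm_logits(x)                             # (batch, m, vocab_size)
    lm_loss = F.cross_entropy(                        # auxiliary LM term (-L1)
        logits[:, :-1].reshape(-1, logits.size(-1)),  # position t predicts token t+1
        x[:, 1:].reshape(-1))
    return clf_loss + lam * lm_loss
```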
Overall, the only extra parameters we require during fine-tuning are $W_y$, and embeddings for delimiter tokens (described below in Section 3.3).