Generative Pre-Training for Natural Language Understanding: A Semi-Supervised Breakthrough

"Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever of OpenAI, examines how generative pre-training can improve performance on natural language understanding tasks. Tasks such as textual entailment, question answering, semantic similarity assessment, and document classification can draw on abundant unlabeled text, but labeled data for any particular task is scarce, which makes it difficult for discriminatively trained models to perform well.

The authors propose first pre-training a language model, unsupervised, on a large corpus. Through this the model learns general-purpose representations that adapt to a wide range of linguistic structures and semantics. Unlike prior approaches, which extract the pre-trained model's features for use in a separate task-specific supervised model, they fine-tune the pre-trained model itself, applying task-aware input transformations for discriminative fine-tuning. This keeps the model architecture simple while still achieving effective transfer learning.

In experiments, the method yields large gains on a broad range of language understanding benchmarks, significantly improving textual entailment, question answering, semantic similarity assessment, and document classification. Combining generative pre-training with task-specific adaptation addresses the problem of model generalization when labeled data is scarce, demonstrating substantial practical potential.

In summary, the paper's core contribution is a new NLP framework that pairs generative pre-training with task-aware fine-tuning, markedly improving performance on natural language understanding and offering a compelling way to balance unsupervised and supervised learning. The result is significant for the field and likely to see wide use in future work.
pre-trained language or machine translation model as auxiliary features while training a supervised
model on the target task. This involves a substantial amount of new parameters for each separate
target task, whereas we require minimal changes to our model architecture during transfer.
Auxiliary training objectives
Adding auxiliary unsupervised training objectives is an alternative
form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of
auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling
to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling
objective to their target task objective and demonstrated performance gains on sequence labeling
tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training
already learns several linguistic aspects relevant to target tasks.
3 Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.
3.1 Unsupervised pre-training
Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{1}$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained using stochastic gradient descent [51].
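To make Eq. 1 concrete, here is a minimal PyTorch sketch of the windowed log-likelihood; the model `lm` (any module mapping a $k$-token context window to next-token logits) and the function name are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def l1_objective(lm, tokens, k):
    """Eq. 1: sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)."""
    total = torch.tensor(0.0)
    for i in range(k, len(tokens)):                 # tokens: 1-D LongTensor of ids
        context = tokens[i - k:i].unsqueeze(0)      # (1, k) context window
        log_probs = F.log_softmax(lm(context), dim=-1)
        total = total + log_probs[0, tokens[i]]     # log P(u_i | context)
    return total  # maximize via SGD on Theta, i.e. minimize -total
```

In practice the sum is batched and computed for all positions in parallel rather than with an explicit Python loop.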
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top}) \tag{2}$$

where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
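A minimal sketch of the forward pass in Eq. 2, assuming hypothetical hyperparameter names and substituting `nn.TransformerEncoderLayer` with a causal mask for the paper's decoder block (the paper's exact block internals differ):

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    """Eq. 2: h_0 = U W_e + W_p; h_l = transformer_block(h_{l-1});
    P(u) = softmax(h_n W_e^T)."""
    def __init__(self, vocab_size, k, d_model, n_layers, n_heads):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)      # token embedding matrix
        self.W_p = nn.Parameter(torch.zeros(k, d_model))  # position embedding matrix
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))

    def forward(self, U):                        # U: (batch, k) context token ids
        h = self.W_e(U) + self.W_p[: U.size(1)]  # h_0
        mask = nn.Transformer.generate_square_subsequent_mask(U.size(1))
        for block in self.blocks:                # h_l for l in [1, n]
            h = block(h, src_mask=mask)          # mask keeps attention left-to-right
        return h @ self.W_e.weight.T             # logits; softmax gives P(u)
```

Projecting the final states back through $W_e^{\top}$ mirrors the embedding-weight sharing in Eq. 2.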
3.2 Supervised fine-tuning
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y) \tag{3}$$
This gives us the following objective to maximize:
$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m) \tag{4}$$
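A sketch of Eqs. 3 and 4 under the same assumptions; `lm` here is taken to return the final transformer block's activations $h_l$ (shape `(batch, m, d_model)`), and `n_classes` is an illustrative name:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    """Eq. 3: P(y | x^1..x^m) = softmax(h_l^m W_y)."""
    def __init__(self, lm, d_model, n_classes):
        super().__init__()
        self.lm = lm                                          # pre-trained body
        self.W_y = nn.Linear(d_model, n_classes, bias=False)  # added output layer

    def forward(self, x):            # x: (batch, m) input token ids
        h_l = self.lm(x)             # (batch, m, d_model) final-block activations
        return self.W_y(h_l[:, -1])  # h_l^m: activation at the last position

def l2_loss(clf, x, y):
    """Eq. 4 as a loss: minimizing cross-entropy maximizes sum log P(y|x)."""
    return F.cross_entropy(clf(x), y)
```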
We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \tag{5}$$
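A sketch of Eq. 5 as a training loss, reusing the classifier from the previous sketch; `lm_logits` (next-token logits over the fine-tuning inputs) is an assumed helper, and the default λ = 0.5 matches the value reported in the paper:

```python
import torch.nn.functional as F

def l3_loss(clf, lm_logits, x, y, lam=0.5):
    """Eq. 5 as a loss: -L3(C) = -L2(C) - lambda * L1(C)."""
    clf_loss = F.cross_entropy(clf(x), y)             # supervised term (-L2)
    logits = lm_logits(x)                             # (batch, m, vocab_size)
    lm_loss = F.cross_entropy(                        # auxiliary LM term (-L1)
        logits[:, :-1].reshape(-1, logits.size(-1)),  # position t predicts token t+1
        x[:, 1:].reshape(-1))
    return clf_loss + lam * lm_loss
```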
Overall, the only extra parameters we require during fine-tuning are $W_y$, and embeddings for delimiter tokens (described below in Section 3.3).