(3) meta-training with an in-context learning objec-
tive (Min et al., 2021b) magnifies these effects—the
models almost exclusively exploit simpler aspects
of the demonstrations like the format rather than
the input-label mapping.
In summary, our analysis provides a new way
of understanding the role of the demonstrations in
in-context learning. We empirically show that the
model (1) counter-intuitively does not rely on the
ground truth input-label mapping provided in the
demonstrations as much as we thought (Section 4),
and (2) nonetheless still benefits from knowing the
label space and the distribution of inputs specified
by the demonstrations (Section 5). We also include
a discussion of broader implications, e.g., what we
can say about the model learning at test time, and
avenues for future work (Section 6).
2 Related Work
Large language models have been key to strong per-
formance in a wide range of downstream tasks (De-
vlin et al., 2019; Radford et al., 2019; Liu et al.,
2019; Raffel et al., 2020; Lewis et al., 2020). While
finetuning has been a popular approach to transfer
to new tasks (Devlin et al., 2019), it is often imprac-
tical to finetune a very large model (e.g., 10B
parameters). Brown et al. (2020) propose in-context
learning as an alternative way to learn a new task.
As depicted in Figure 2, the LM learns a new task
via inference alone by conditioning on a concatena-
tion of the training data as demonstrations, without
any gradient updates.
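As a concrete sketch, this conditioning step amounts to plain prompt construction. The "input, newline, label" format and blank-line separator below are illustrative assumptions, not a prescribed template:

```python
# Sketch of in-context learning as inference alone: the k labeled training
# examples are concatenated into a single prompt, with no gradient updates.
# The "input \n label" layout is an illustrative assumption.

def build_prompt(demonstrations, test_input):
    """Concatenate k input-label pairs, then append the unlabeled test input."""
    parts = [f"{x}\n{y}" for x, y in demonstrations]
    parts.append(test_input)  # the LM is asked to continue with a label
    return "\n\n".join(parts)

demos = [
    ("Circulation revenue has increased by 5% in Finland.", "Positive"),
    ("Panostaja did not disclose the purchase price.", "Neutral"),
    ("Paying off the national debt will be extremely painful.", "Negative"),
]
prompt = build_prompt(demos, "The acquisition will have an immediate positive impact.")
```

The LM then continues the prompt, and its continuation (here, ideally "Positive") is taken as the prediction.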
In-context learning has been the focus of signif-
icant study since its introduction. Prior work pro-
poses better ways of formulating the problem (Zhao
et al., 2021; Holtzman et al., 2021; Min et al.,
2021a), better ways of choosing labeled exam-
ples for the demonstrations (Liu et al., 2021; Lu
et al., 2021; Rubin et al., 2021), meta-training
with an explicit in-context learning objective (Chen
et al., 2021; Min et al., 2021b), and learning to
follow instructions as a variant of in-context learn-
ing (Mishra et al., 2021b; Efrat and Levy, 2020;
Wei et al., 2022a; Sanh et al., 2022). At the
same time, some work reports brittleness and over-
sensitivity for in-context learning (Lu et al., 2021;
Zhao et al., 2021; Mishra et al., 2021a).
Relatively less work has been done to understand
why in-context learning works. Xie et al. (2022)
provide a theoretical analysis that in-context learning
can be formalized as Bayesian inference that uses the
demonstrations to recover latent concepts. Razeghi
et al. (2022) show that in-context learning performance
is highly correlated with term frequencies in the
pretraining data. To the best of our knowledge, this
paper is the first that provides an empirical analysis
that investigates why in-context learning achieves
performance gains over zero-shot inference. We find
that the ground truth input-label mapping in the
demonstrations has only a marginal effect, and measure
the impact of finer-grained aspects of the demonstrations.

Figure 2: An overview of in-context learning. The
demonstrations consist of k input-label pairs from the
training data (k = 3 in the figure). [The figure shows the
LM conditioning on three demonstrations ("Circulation
revenue has increased by 5% in Finland. \n Positive",
"Panostaja did not disclose the purchase price. \n Neutral",
"Paying off the national debt will be extremely painful.
\n Negative") followed by the test input "The acquisition
will have an immediate positive impact.", and predicting
"Positive".]

Model           # Params   Public   Meta-trained
GPT-2 Large     774M       ✓        ✗
MetaICL         774M       ✓        ✓
GPT-J           6B         ✓        ✗
fairseq 6.7B†   6.7B       ✓        ✗
fairseq 13B†    13B        ✓        ✗
GPT-3‡          175B       ✗        ✗

Table 1: A list of LMs used in the experiments:
GPT-2 (Radford et al., 2019), MetaICL (Min et al.,
2021b), GPT-J (Wang and Komatsuzaki, 2021), fairseq
LMs (Artetxe et al., 2021) and GPT-3 (Brown et al.,
2020). 'Public' indicates whether the model weights
are public; 'Meta-trained' indicates whether the model
is meta-trained with an in-context learning objective.
†We use the dense models in Artetxe et al. (2021) and
refer to them as fairseq LMs for convenience. ‡We use
the Davinci API (the base version, not the instruct
version) and assume it to be 175B, following Gao et al.
(2021) and Artetxe et al. (2021).
3 Experimental Setup
We describe the experimental setup used in our
analysis (Sections 4 and 5).
Models.
We experiment with 12 models in to-
tal. We include 6 language models (Table 1), all
of which are decoder-only, dense LMs. We use
each LM with two inference methods, direct and
channel, following Min et al. (2021a). The sizes
of LMs vary from 774M to 175B. We include the
largest dense LM (GPT-3) and the largest publicly
released dense LM (fairseq 13B) at the time of con-
ducting experiments. We also include MetaICL,
which is initialized from GPT-2 Large and then
meta-trained on a collection of supervised datasets
with an in-context learning objective, and ensure
that our evaluation datasets do not overlap with
those used at meta-training time.

Figure 3: Results when using no demonstrations, demonstrations with gold labels, and demonstrations with random
labels in classification (top) and multi-choice tasks (bottom). The first eight models are evaluated on 16
classification and 10 multi-choice datasets, and the last four models are evaluated on 3 classification and 3 multi-
choice datasets. See Figure 11 for numbers comparable across all models. Model performance with random
labels is very close to performance with gold labels (more discussion in Section 4.1).
Evaluation Data.
We evaluate on 26 datasets,
including sentiment analysis, paraphrase detection,
natural language inference, hate speech detection,
question answering, and sentence completion (full
list and references provided in Appendix A).¹ All
datasets are classification and multi-choice tasks.
We use these datasets because they (1) are true
low-resource datasets with fewer than 10K train-
ing examples, (2) include well-studied bench-
marks from GLUE (Wang et al., 2018) and Super-
GLUE (Wang et al., 2019a), and (3) cover diverse
domains including science, social media, finance,
and more.
Other Details.
We use k = 16 examples as demonstrations by
default for all experiments in the paper, unless
otherwise specified. Examples are sampled uniformly
at random from the training data. We choose a set of
k training examples using 5 different random seeds
and run experiments 5 times. For fairseq 13B and
GPT-3, due to limited resources, we experiment with
a subset of 6 datasets² and 3 random seeds. We report
Macro-F1³ for classification tasks and Accuracy for
multi-choice tasks. We compute the per-dataset average
over seeds, and then report the macro-average over
datasets. We use minimal templates in forming an
input sequence from an example; we refer to Appendix B
for more details. All experiments are reproducible
from github.com/Alrope123/rethinking-demonstrations.

¹ For convenience, we use 'labels' to refer to the output for
the task, though our datasets include non-classification tasks.
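The two-level averaging described here, first over seeds within a dataset and then over datasets, can be sketched as follows. The dataset names and scores are invented for illustration:

```python
# Sketch of the reported aggregation: average each dataset's metric over
# random seeds, then macro-average over datasets. All numbers below are
# made up for illustration.

def aggregate(results):
    """results: {dataset: [metric per seed]} -> macro-average over datasets."""
    per_dataset = {d: sum(v) / len(v) for d, v in results.items()}
    return sum(per_dataset.values()) / len(per_dataset)

scores = {
    "dataset_a": [0.80, 0.82, 0.81, 0.79, 0.83],  # mean 0.81
    "dataset_b": [0.55, 0.57, 0.56, 0.54, 0.58],  # mean 0.56
}
macro = aggregate(scores)  # (0.81 + 0.56) / 2 = 0.685
```

Averaging over seeds first keeps datasets with many seeds from dominating, so each dataset contributes equally to the final number.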
4 Ground Truth Matters Little
4.1 Gold labels vs. random labels
To see the impact of correctly-paired inputs and
labels in the demonstrations—which we call the
ground truth input-label mapping—we compare the
following three methods.⁴

No demonstrations is a typical zero-shot method
that does not use any labeled data. A prediction
is made via argmax_{y∈C} P(y|x), where x is the test
input and C is a small discrete set of possible labels.

Demonstrations w/ gold labels are used in a typi-
cal in-context learning method with k labeled ex-
amples (x_1, y_1), ..., (x_k, y_k). A concatenation of
k input-label pairs is used to make a prediction via
argmax_{y∈C} P(y|x_1, y_1, ..., x_k, y_k, x).
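A minimal sketch of this prediction rule, where `lm_score` is a hypothetical stand-in for a real LM's conditional probability P(y | prompt):

```python
# Minimal sketch of the direct prediction rule: choose the label in C that
# the LM finds most probable after the concatenated demonstrations and the
# test input. `lm_score` is a hypothetical LM interface, not a real API.

def predict(lm_score, demonstrations, test_input, label_set):
    """argmax over y in C of P(y | x_1, y_1, ..., x_k, y_k, x)."""
    prompt = "".join(f"{x}\n{y}\n\n" for x, y in demonstrations) + test_input + "\n"
    return max(label_set, key=lambda y: lm_score(prompt, y))

# With an empty demonstration list, the same rule reduces to the zero-shot
# case, argmax over y of P(y | x). Toy scorer for illustration only:
toy_score = lambda prompt, y: 1.0 if y == "Positive" else 0.5
pred = predict(toy_score, [("Great movie!", "Positive")], "Loved it.",
               ["Positive", "Negative"])
```

Note that the same function covers both methods: the demonstrations only change the conditioning prefix, never the candidate set C.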
² Three classification and three multi-choice: MRPC, RTE,
Tweet_eval-hate, OpenbookQA, CommonsenseQA, COPA.
³ Known to be better for imbalanced classes.
⁴ Without loss of generality, all methods in Sections 4 and 5
are described based on the direct method, but can be trivially
converted to the channel method by flipping x and y.
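The direct-to-channel conversion mentioned in the footnote above amounts to swapping which side of the conditional the LM scores. `lm_score(context, continuation)` is again a hypothetical LM interface, not an API from the paper:

```python
# Sketch of the direct vs. channel scoring directions. `lm_score(context,
# continuation)` is a hypothetical interface returning the conditional
# probability of `continuation` given `context`.

def direct_predict(lm_score, x, labels):
    # direct: argmax over y in C of P(y | x)
    return max(labels, key=lambda y: lm_score(x, y))

def channel_predict(lm_score, x, labels):
    # channel: argmax over y in C of P(x | y) -- x and y are flipped
    return max(labels, key=lambda y: lm_score(y, x))

# Toy probability table standing in for an LM, for illustration only.
toy = {("in", "A"): 0.2, ("in", "B"): 0.8, ("A", "in"): 0.7, ("B", "in"): 0.3}
lm = lambda context, cont: toy[(context, cont)]
direct_choice = direct_predict(lm, "in", ["A", "B"])    # "B"
channel_choice = channel_predict(lm, "in", ["A", "B"])  # "A"
```

The two methods can disagree, as the toy table shows, which is why the experiments run every LM under both inference methods.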