(3) meta-training with an in-context learning objec-
tive (Min et al., 2021b) magnifies these effects—the
models almost exclusively exploit simpler aspects
of the demonstrations like the format rather than
the input-label mapping.
In summary, our analysis provides a new way
of understanding the role of the demonstrations in
in-context learning. We empirically show that the
model (1) counter-intuitively does not rely on the
ground truth input-label mapping provided in the
demonstrations as much as we thought (Section 4),
and (2) nonetheless still benefits from knowing the
label space and the distribution of inputs specified
by the demonstrations (Section 5). We also include
a discussion of broader implications, e.g., what we
can say about the model learning at test time, and
avenues for future work (Section 6).
2 Related Work
Large language models have been key to strong per-
formance in a wide range of downstream tasks (De-
vlin et al., 2019; Radford et al., 2019; Liu et al.,
2019; Raffel et al., 2020; Lewis et al., 2020). While
finetuning has been a popular approach to transfer
to new tasks (Devlin et al., 2019), it is often imprac-
tical to finetune a very large model (e.g., 10B
parameters). Brown et al. (2020) propose in-context
learning as an alternative way to learn a new task.
As depicted in Figure 2, the LM learns a new task
via inference alone by conditioning on a concatena-
tion of the training data as demonstrations, without
any gradient updates.
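As a concrete sketch, this conditioning step amounts to plain prompt construction. The "input, newline, label" format and blank-line separator below are illustrative assumptions, not a prescribed template:

```python
# Sketch of in-context learning as inference alone: the k labeled training
# examples are concatenated into a single prompt, with no gradient updates.
# The "input \n label" layout is an illustrative assumption.

def build_prompt(demonstrations, test_input):
    """Concatenate k input-label pairs, then append the unlabeled test input."""
    parts = [f"{x}\n{y}" for x, y in demonstrations]
    parts.append(test_input)  # the LM is asked to continue with a label
    return "\n\n".join(parts)

demos = [
    ("Circulation revenue has increased by 5% in Finland.", "Positive"),
    ("Panostaja did not disclose the purchase price.", "Neutral"),
    ("Paying off the national debt will be extremely painful.", "Negative"),
]
prompt = build_prompt(demos, "The acquisition will have an immediate positive impact.")
```

The LM then continues the prompt, and its continuation (here, ideally "Positive") is taken as the prediction.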
In-context learning has been the focus of signif-
icant study since its introduction. Prior work pro-
poses better ways of formulating the problem (Zhao
et al., 2021; Holtzman et al., 2021; Min et al.,
2021a), better ways of choosing labeled exam-
ples for the demonstrations (Liu et al., 2021; Lu
et al., 2021; Rubin et al., 2021), meta-training
with an explicit in-context learning objective (Chen
et al., 2021; Min et al., 2021b), and learning to
follow instructions as a variant of in-context learn-
ing (Mishra et al., 2021b; Efrat and Levy, 2020;
Wei et al., 2022a; Sanh et al., 2022). At the
same time, some work reports brittleness and over-
sensitivity for in-context learning (Lu et al., 2021;
Zhao et al., 2021; Mishra et al., 2021a).
Relatively less work has been done to understand
why in-context learning works. Xie et al. (2022)
provide a theoretical analysis that in-context learning
can be formalized as Bayesian inference that uses the
demonstrations to recover latent concepts. Razeghi
et al. (2022) show that in-context learning performance
is highly correlated with term frequencies in the
pretraining data. To the best of our knowledge, this
paper is the first that provides an empirical analysis
that investigates why in-context learning achieves
performance gains over zero-shot inference. We find
that the ground truth input-label mapping in the
demonstrations has only a marginal effect, and measure
the impact of finer-grained aspects of the demonstrations.

Figure 2: An overview of in-context learning. The
demonstrations consist of k input-label pairs from the
training data (k = 3 in the figure). [The figure shows the
LM conditioning on three demonstrations ("Circulation
revenue has increased by 5% in Finland. \n Positive",
"Panostaja did not disclose the purchase price. \n Neutral",
"Paying off the national debt will be extremely painful.
\n Negative") followed by the test input "The acquisition
will have an immediate positive impact.", and predicting
"Positive".]

Model           # Params   Public   Meta-trained
GPT-2 Large     774M       ✓        ✗
MetaICL         774M       ✓        ✓
GPT-J           6B         ✓        ✗
fairseq 6.7B†   6.7B       ✓        ✗
fairseq 13B†    13B        ✓        ✗
GPT-3‡          175B       ✗        ✗

Table 1: A list of LMs used in the experiments:
GPT-2 (Radford et al., 2019), MetaICL (Min et al.,
2021b), GPT-J (Wang and Komatsuzaki, 2021), fairseq
LMs (Artetxe et al., 2021) and GPT-3 (Brown et al.,
2020). 'Public' indicates whether the model weights
are public; 'Meta-trained' indicates whether the model
is meta-trained with an in-context learning objective.
†We use the dense models in Artetxe et al. (2021) and
refer to them as fairseq LMs for convenience. ‡We use
the Davinci API (the base version, not the instruct
version) and assume it to be 175B, following Gao et al.
(2021) and Artetxe et al. (2021).
3 Experimental Setup
We describe the experimental setup used in our
analysis (Sections 4 and 5).
Models.
We experiment with 12 models in to-
tal. We include 6 language models (Table 1), all
of which are decoder-only, dense LMs. We use
each LM with two inference methods, direct and
channel, following Min et al. (2021a). The sizes
of LMs vary from 774M to 175B. We include the
largest dense LM (GPT-3) and the largest publicly
released dense LM (fairseq 13B) at the time of con-
ducting experiments. We also include MetaICL,
which is initialized from GPT-2 Large and then
meta-trained on a collection of supervised datasets
with an in-context learning objective, and ensure
that our evaluation datasets do not overlap with
those used at meta-training time.

Figure 3: Results when using no demonstrations, demonstrations with gold labels, and demonstrations with random
labels in classification (top) and multi-choice tasks (bottom). The first eight models are evaluated on 16
classification and 10 multi-choice datasets, and the last four models are evaluated on 3 classification and 3 multi-
choice datasets. See Figure 11 for numbers comparable across all models. Model performance with random
labels is very close to performance with gold labels (more discussion in Section 4.1).
Evaluation Data.
We evaluate on 26 datasets,
including sentiment analysis, paraphrase detection,
natural language inference, hate speech detection,
question answering, and sentence completion (full
list and references provided in Appendix A).¹ All
datasets are classification and multi-choice tasks.
We use these datasets because they (1) are true
low-resource datasets with fewer than 10K train-
ing examples, (2) include well-studied bench-
marks from GLUE (Wang et al., 2018) and Super-
GLUE (Wang et al., 2019a), and (3) cover diverse
domains including science, social media, finance,
and more.
Other Details.
We use k = 16 examples as demonstrations by
default for all experiments in the paper, unless
otherwise specified. Examples are sampled uniformly
at random from the training data. We choose a set of
k training examples using 5 different random seeds
and run experiments 5 times. For fairseq 13B and
GPT-3, due to limited resources, we experiment with
a subset of 6 datasets² and 3 random seeds. We report
Macro-F1³ for classification tasks and Accuracy for
multi-choice tasks. We compute the per-dataset average
over seeds, and then report the macro-average over
datasets. We use minimal templates in forming an
input sequence from an example; we refer to Appendix B
for more details. All experiments are reproducible
from github.com/Alrope123/rethinking-demonstrations.

¹ For convenience, we use 'labels' to refer to the output for
the task, though our datasets include non-classification tasks.
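The two-level averaging described here, first over seeds within a dataset and then over datasets, can be sketched as follows. The dataset names and scores are invented for illustration:

```python
# Sketch of the reported aggregation: average each dataset's metric over
# random seeds, then macro-average over datasets. All numbers below are
# made up for illustration.

def aggregate(results):
    """results: {dataset: [metric per seed]} -> macro-average over datasets."""
    per_dataset = {d: sum(v) / len(v) for d, v in results.items()}
    return sum(per_dataset.values()) / len(per_dataset)

scores = {
    "dataset_a": [0.80, 0.82, 0.81, 0.79, 0.83],  # mean 0.81
    "dataset_b": [0.55, 0.57, 0.56, 0.54, 0.58],  # mean 0.56
}
macro = aggregate(scores)  # (0.81 + 0.56) / 2 = 0.685
```

Averaging over seeds first keeps datasets with many seeds from dominating, so each dataset contributes equally to the final number.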
4 Ground Truth Matters Little
4.1 Gold labels vs. random labels
To see the impact of correctly-paired inputs and
labels in the demonstrations—which we call the
ground truth input-label mapping—we compare the
following three methods.⁴

No demonstrations is a typical zero-shot method
that does not use any labeled data. A prediction
is made via argmax_{y∈C} P(y|x), where x is the test
input and C is a small discrete set of possible labels.

Demonstrations w/ gold labels are used in a typi-
cal in-context learning method with k labeled ex-
amples (x_1, y_1), ..., (x_k, y_k). A concatenation of
k input-label pairs is used to make a prediction via
argmax_{y∈C} P(y|x_1, y_1, ..., x_k, y_k, x).
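A minimal sketch of this prediction rule, where `lm_score` is a hypothetical stand-in for a real LM's conditional probability P(y | prompt):

```python
# Minimal sketch of the direct prediction rule: choose the label in C that
# the LM finds most probable after the concatenated demonstrations and the
# test input. `lm_score` is a hypothetical LM interface, not a real API.

def predict(lm_score, demonstrations, test_input, label_set):
    """argmax over y in C of P(y | x_1, y_1, ..., x_k, y_k, x)."""
    prompt = "".join(f"{x}\n{y}\n\n" for x, y in demonstrations) + test_input + "\n"
    return max(label_set, key=lambda y: lm_score(prompt, y))

# With an empty demonstration list, the same rule reduces to the zero-shot
# case, argmax over y of P(y | x). Toy scorer for illustration only:
toy_score = lambda prompt, y: 1.0 if y == "Positive" else 0.5
pred = predict(toy_score, [("Great movie!", "Positive")], "Loved it.",
               ["Positive", "Negative"])
```

Note that the same function covers both methods: the demonstrations only change the conditioning prefix, never the candidate set C.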
² Three classification and three multi-choice: MRPC, RTE,
Tweet_eval-hate, OpenbookQA, CommonsenseQA, COPA.
³ Known to be better for imbalanced classes.
⁴ Without loss of generality, all methods in Sections 4 and 5
are described based on the direct method, but can be trivially
converted to the channel method by flipping x and y.
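The direct-to-channel conversion mentioned in the footnote above amounts to swapping which side of the conditional the LM scores. `lm_score(context, continuation)` is again a hypothetical LM interface, not an API from the paper:

```python
# Sketch of the direct vs. channel scoring directions. `lm_score(context,
# continuation)` is a hypothetical interface returning the conditional
# probability of `continuation` given `context`.

def direct_predict(lm_score, x, labels):
    # direct: argmax over y in C of P(y | x)
    return max(labels, key=lambda y: lm_score(x, y))

def channel_predict(lm_score, x, labels):
    # channel: argmax over y in C of P(x | y) -- x and y are flipped
    return max(labels, key=lambda y: lm_score(y, x))

# Toy probability table standing in for an LM, for illustration only.
toy = {("in", "A"): 0.2, ("in", "B"): 0.8, ("A", "in"): 0.7, ("B", "in"): 0.3}
lm = lambda context, cont: toy[(context, cont)]
direct_choice = direct_predict(lm, "in", ["A", "B"])    # "B"
channel_choice = channel_predict(lm, "in", ["A", "B"])  # "A"
```

The two methods can disagree, as the toy table shows, which is why the experiments run every LM under both inference methods.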