Table 1: List of emergent abilities of large language models and the scale (both training FLOPs and number of model parameters) at which the abilities emerge.

| Ability | Train. FLOPs | Params. | Model | Reference |
|---|---|---|---|---|
| **Few-shot prompting abilities** | | | | |
| Addition/subtraction (3 digit) | 2.3E+22 | 13B | GPT-3 | Brown et al. (2020) |
| Addition/subtraction (4-5 digit) | 3.1E+23 | 175B | GPT-3 | Brown et al. (2020) |
| MMLU Benchmark (57 topic avg.) | 3.1E+23 | 175B | GPT-3 | Hendrycks et al. (2021a) |
| Toxicity classification (CivilComments) | 1.3E+22 | 7.1B | Gopher | Rae et al. (2021) |
| Truthfulness (TruthfulQA) | 5.0E+23 | 280B | Gopher | Rae et al. (2021) |
| MMLU Benchmark (26 topics) | 5.0E+23 | 280B | Gopher | Rae et al. (2021) |
| Grounded conceptual mappings | 3.1E+23 | 175B | GPT-3 | Patel & Pavlick (2022) |
| MMLU Benchmark (30 topics) | 5.0E+23 | 70B | Chinchilla | Hoffmann et al. (2022) |
| Word in Context (WiC) benchmark | 2.5E+24 | 540B | PaLM | Chowdhery et al. (2022) |
| Many BIG-Bench tasks (see Appendix E) | Many | Many | Many | BIG-Bench (2022) |
| **Augmented prompting abilities** | | | | |
| Instruction following (finetuning) | 1.3E+23 | 68B | FLAN | Wei et al. (2022a) |
| Scratchpad: 8-digit addition (finetuning) | 8.9E+19 | 40M | LaMDA | Nye et al. (2021) |
| Using open-book knowledge for fact checking | 1.3E+22 | 7.1B | Gopher | Rae et al. (2021) |
| Chain-of-thought: Math word problems | 1.3E+23 | 68B | LaMDA | Wei et al. (2022b) |
| Chain-of-thought: StrategyQA | 2.9E+23 | 62B | PaLM | Chowdhery et al. (2022) |
| Differentiable search index | 3.3E+22 | 11B | T5 | Tay et al. (2022b) |
| Self-consistency decoding | 1.3E+23 | 68B | LaMDA | Wang et al. (2022b) |
| Leveraging explanations in prompting | 5.0E+23 | 280B | Gopher | Lampinen et al. (2022) |
| Least-to-most prompting | 3.1E+23 | 175B | GPT-3 | Zhou et al. (2022) |
| Zero-shot chain-of-thought reasoning | 3.1E+23 | 175B | GPT-3 | Kojima et al. (2022) |
| Calibration via P(True) | 2.6E+23 | 52B | Anthropic | Kadavath et al. (2022) |
| Multilingual chain-of-thought reasoning | 2.9E+23 | 62B | PaLM | Shi et al. (2022) |
| Ask me anything prompting | 1.4E+22 | 6B | EleutherAI | Arora et al. (2022) |
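As a rough sanity check on the training-FLOPs column, the compute of a dense transformer is often approximated by the rule of thumb C ≈ 6·N·D, where N is the parameter count and D the number of training tokens. The sketch below applies this approximation to a few of the models above, using the token counts reported in the cited papers; it is illustrative only, and the exact figures in Table 1 may have been computed differently.

```python
# Rough sanity check (illustrative, not how Table 1 was necessarily computed):
# training compute for a dense transformer is commonly approximated as 6 * N * D,
# where N is the parameter count and D the number of training tokens.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via the common 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

# Parameter and token counts as reported in the cited papers.
models = {
    "GPT-3 175B":  (175e9, 300e9),   # Brown et al. (2020): ~300B training tokens
    "Gopher 280B": (280e9, 300e9),   # Rae et al. (2021): ~300B training tokens
    "PaLM 540B":   (540e9, 780e9),   # Chowdhery et al. (2022): ~780B training tokens
}

for name, (n_params, n_tokens) in models.items():
    print(f"{name}: ~{approx_train_flops(n_params, n_tokens):.2E} FLOPs")
# Approximate output:
#   GPT-3 175B:  ~3.15E+23 FLOPs   (Table 1 lists 3.1E+23)
#   Gopher 280B: ~5.04E+23 FLOPs   (Table 1 lists 5.0E+23)
#   PaLM 540B:   ~2.53E+24 FLOPs   (Table 1 lists 2.5E+24)
```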
5 Discussion
We have seen that a range of abilities—in the few-shot prompting setup or otherwise—have thus far only
been observed when evaluated on a sufficiently large language model. Hence, their emergence cannot be
predicted by simply extrapolating performance on smaller-scale models. Emergent few-shot prompted tasks
are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely
do not know the full scope of few-shot prompted tasks that language models can perform. This raises the
question of whether further scaling could potentially endow even-larger language models with new emergent
abilities. Tasks that language models cannot currently do are prime candidates for future emergence; for
instance, there are dozens of tasks in BIG-Bench for which even the largest GPT-3 and PaLM models do not
achieve above-random performance (see Appendix E.4).
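To make this unpredictability concrete, the following sketch fits a simple trend to the near-random accuracies of smaller models on a hypothetical four-choice task and extrapolates it to the compute of a much larger model. All numbers are made up for illustration and are not results from any cited evaluation; the point is that an extrapolated fit stays near the chance level even when the larger model in fact jumps well above it.

```python
import numpy as np

# Illustrative only: hypothetical accuracies on a 4-way multiple-choice task
# (random chance = 25%). These are made-up numbers, not results from any paper.
log_flops_small = np.array([19.0, 20.0, 21.0, 22.0])   # log10(training FLOPs)
acc_small       = np.array([0.24, 0.25, 0.26, 0.26])   # hovering near chance

# Fit a linear trend in log-compute and extrapolate to a much larger model.
slope, intercept = np.polyfit(log_flops_small, acc_small, deg=1)
log_flops_large = 24.0
predicted = slope * log_flops_large + intercept

observed = 0.70  # hypothetical: the large model actually jumps far above chance

print(f"Extrapolated accuracy at 1e24 FLOPs: {predicted:.2f}")  # ~0.28, near chance
print(f"Hypothetical observed accuracy:      {observed:.2f}")   # far above the fit
```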
The ability for scale to unpredictably enable new techniques is not just theoretical. Consider the Word in
Context (WiC) benchmark (Pilehvar & Camacho-Collados, 2019) shown in Figure 2H, as a historical example.
Here, scaling GPT-3 to around 3 · 10^23 training FLOPs (175B parameters) failed to unlock above-random one-shot prompting performance.³ Regarding this negative result, Brown et al. (2020) cited the model architecture of GPT-3 or the use of an autoregressive language modeling objective (rather than using a denoising training objective) as potential reasons, and suggested training a model of comparable size with bidirectional architecture as a remedy. However, later work found that further scaling a decoder-only language model was actually enough to enable above-random performance on this task. As shown in Figure 2H, scaling PaLM (Chowdhery et al., 2022) from 3 · 10^23 training FLOPs (62B parameters) to 3 · 10^24 training
³ GPT-3 does achieve slightly above-random performance on the dev set with few-shot instead of one-shot prompting (∼55%), but this above-random performance did not appear to be a result of scale and did not hold on the test set server.