Table 1: List of emergent abilities of large language models and the scale (both training FLOPs and number of model parameters) at which the abilities emerge.

| Ability | Train. FLOPs | Params. | Model | Reference |
|---|---|---|---|---|
| **Few-shot prompting abilities** | | | | |
| Addition/subtraction (3 digit) | 2.3E+22 | 13B | GPT-3 | Brown et al. (2020) |
| Addition/subtraction (4-5 digit) | 3.1E+23 | 175B | GPT-3 | Brown et al. (2020) |
| MMLU Benchmark (57 topic avg.) | 3.1E+23 | 175B | GPT-3 | Hendrycks et al. (2021a) |
| Toxicity classification (CivilComments) | 1.3E+22 | 7.1B | Gopher | Rae et al. (2021) |
| Truthfulness (TruthfulQA) | 5.0E+23 | 280B | Gopher | Rae et al. (2021) |
| MMLU Benchmark (26 topics) | 5.0E+23 | 280B | Gopher | Rae et al. (2021) |
| Grounded conceptual mappings | 3.1E+23 | 175B | GPT-3 | Patel & Pavlick (2022) |
| MMLU Benchmark (30 topics) | 5.0E+23 | 70B | Chinchilla | Hoffmann et al. (2022) |
| Word in Context (WiC) benchmark | 2.5E+24 | 540B | PaLM | Chowdhery et al. (2022) |
| Many BIG-Bench tasks (see Appendix E) | Many | Many | Many | BIG-Bench (2022) |
| **Augmented prompting abilities** | | | | |
| Instruction following (finetuning) | 1.3E+23 | 68B | FLAN | Wei et al. (2022a) |
| Scratchpad: 8-digit addition (finetuning) | 8.9E+19 | 40M | LaMDA | Nye et al. (2021) |
| Using open-book knowledge for fact checking | 1.3E+22 | 7.1B | Gopher | Rae et al. (2021) |
| Chain-of-thought: Math word problems | 1.3E+23 | 68B | LaMDA | Wei et al. (2022b) |
| Chain-of-thought: StrategyQA | 2.9E+23 | 62B | PaLM | Chowdhery et al. (2022) |
| Differentiable search index | 3.3E+22 | 11B | T5 | Tay et al. (2022b) |
| Self-consistency decoding | 1.3E+23 | 68B | LaMDA | Wang et al. (2022b) |
| Leveraging explanations in prompting | 5.0E+23 | 280B | Gopher | Lampinen et al. (2022) |
| Least-to-most prompting | 3.1E+23 | 175B | GPT-3 | Zhou et al. (2022) |
| Zero-shot chain-of-thought reasoning | 3.1E+23 | 175B | GPT-3 | Kojima et al. (2022) |
| Calibration via P(True) | 2.6E+23 | 52B | Anthropic | Kadavath et al. (2022) |
| Multilingual chain-of-thought reasoning | 2.9E+23 | 62B | PaLM | Shi et al. (2022) |
| Ask me anything prompting | 1.4E+22 | 6B | EleutherAI | Arora et al. (2022) |
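As a rough sanity check on the training-FLOPs column, the compute of a dense transformer is often approximated by the rule of thumb C ≈ 6·N·D, where N is the parameter count and D the number of training tokens. The sketch below applies this approximation to a few of the models above, using the token counts reported in the cited papers; it is illustrative only, and the exact figures in Table 1 may have been computed differently.

```python
# Rough sanity check (illustrative, not how Table 1 was necessarily computed):
# training compute for a dense transformer is commonly approximated as 6 * N * D,
# where N is the parameter count and D the number of training tokens.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via the common 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

# Parameter and token counts as reported in the cited papers.
models = {
    "GPT-3 175B":  (175e9, 300e9),   # Brown et al. (2020): ~300B training tokens
    "Gopher 280B": (280e9, 300e9),   # Rae et al. (2021): ~300B training tokens
    "PaLM 540B":   (540e9, 780e9),   # Chowdhery et al. (2022): ~780B training tokens
}

for name, (n_params, n_tokens) in models.items():
    print(f"{name}: ~{approx_train_flops(n_params, n_tokens):.2E} FLOPs")
# Approximate output:
#   GPT-3 175B:  ~3.15E+23 FLOPs   (Table 1 lists 3.1E+23)
#   Gopher 280B: ~5.04E+23 FLOPs   (Table 1 lists 5.0E+23)
#   PaLM 540B:   ~2.53E+24 FLOPs   (Table 1 lists 2.5E+24)
```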
5 Discussion
We have seen that a range of abilities—in the few-shot prompting setup or otherwise—have thus far only
been observed when evaluated on a sufficiently large language model. Hence, their emergence cannot be
predicted by simply extrapolating performance on smaller-scale models. Emergent few-shot prompted tasks
are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely
do not know the full scope of few-shot prompted tasks that language models can perform. This raises the
question of whether further scaling could potentially endow even-larger language models with new emergent
abilities. Tasks that language models cannot currently do are prime candidates for future emergence; for
instance, there are dozens of tasks in BIG-Bench for which even the largest GPT-3 and PaLM models do not
achieve above-random performance (see Appendix E.4).
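To make this unpredictability concrete, the following sketch fits a simple trend to the near-random accuracies of smaller models on a hypothetical four-choice task and extrapolates it to the compute of a much larger model. All numbers are made up for illustration and are not results from any cited evaluation; the point is that an extrapolated fit stays near the chance level even when the larger model in fact jumps well above it.

```python
import numpy as np

# Illustrative only: hypothetical accuracies on a 4-way multiple-choice task
# (random chance = 25%). These are made-up numbers, not results from any paper.
log_flops_small = np.array([19.0, 20.0, 21.0, 22.0])   # log10(training FLOPs)
acc_small       = np.array([0.24, 0.25, 0.26, 0.26])   # hovering near chance

# Fit a linear trend in log-compute and extrapolate to a much larger model.
slope, intercept = np.polyfit(log_flops_small, acc_small, deg=1)
log_flops_large = 24.0
predicted = slope * log_flops_large + intercept

observed = 0.70  # hypothetical: the large model actually jumps far above chance

print(f"Extrapolated accuracy at 1e24 FLOPs: {predicted:.2f}")  # ~0.28, near chance
print(f"Hypothetical observed accuracy:      {observed:.2f}")   # far above the fit
```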
The ability for scale to unpredictably enable new techniques is not just theoretical. Consider the Word in
Context (WiC) benchmark (Pilehvar & Camacho-Collados, 2019) shown in Figure 2H, as a historical example.
Here, scaling GPT-3 to around 3 · 10^23 training FLOPs (175B parameters) failed to unlock above-random one-shot prompting performance.³ Regarding this negative result, Brown et al. (2020) cited the model architecture of GPT-3 or the use of an autoregressive language modeling objective (rather than using a denoising training objective) as potential reasons, and suggested training a model of comparable size with bidirectional architecture as a remedy. However, later work found that further scaling a decoder-only language model was actually enough to enable above-random performance on this task. As shown in Figure 2H, scaling PaLM (Chowdhery et al., 2022) from 3 · 10^23 training FLOPs (62B parameters) to 3 · 10^24 training
³ GPT-3 does achieve slightly above-random performance on the dev set with few-shot instead of one-shot prompting (∼55%), but this above-random performance did not appear to be a result of scale and did not hold on the test set server.