Emergent Abilities of Large Language Models: Prediction and Unexpected Phenomena

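
"Emergent Abilities of Large Language Models" is a paper published in Transactions on Machine Learning Research in August 2022, written jointly by researchers from Google Research, Stanford University, UNC Chapel Hill, and DeepMind. The study looks at what happens as language models are scaled up substantially: beyond the expected gains in performance and sample efficiency, some abilities appear that were not anticipated, the so-called "emergent abilities". These abilities are not designed or programmed in advance; they show up only once a model is large enough. Examples listed in the paper's Table 1 include multi-digit arithmetic, the MMLU benchmark, instruction following, and chain-of-thought prompting. The findings challenge the assumption that the capabilities of a large model can be read off from smaller models. The paper was reviewed openly on OpenReview (<https://openreview.net/forum?id=yzkSU5zdwD>), where other experts could evaluate it and give feedback. Its core contribution is to identify and analyze emergent abilities, discuss their implications for artificial intelligence, and outline directions for future research, which offers a useful perspective for understanding and controlling the behavior of large language models.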

Published in Transactions on Machine Learning Research (08/2022)

Table 1: List of emergent abilities of large language models and the scale (both training FLOPs and number of model parameters) at which the abilities emerge.

                                              Emergent scale
Ability                                       Train. FLOPs  Params.  Model       Reference

Few-shot prompting abilities
Addition/subtraction (3 digit)                2.3E+22       13B      GPT-3       Brown et al. (2020)
Addition/subtraction (4-5 digit)              3.1E+23       175B
MMLU Benchmark (57 topic avg.)                3.1E+23       175B     GPT-3       Hendrycks et al. (2021a)
Toxicity classification (CivilComments)       1.3E+22       7.1B     Gopher      Rae et al. (2021)
Truthfulness (TruthfulQA)                     5.0E+23       280B
MMLU Benchmark (26 topics)                    5.0E+23       280B
Grounded conceptual mappings                  3.1E+23       175B     GPT-3       Patel & Pavlick (2022)
MMLU Benchmark (30 topics)                    5.0E+23       70B      Chinchilla  Hoffmann et al. (2022)
Word in Context (WiC) benchmark               2.5E+24       540B     PaLM        Chowdhery et al. (2022)
Many BIG-Bench tasks (see Appendix E)         Many          Many     Many        BIG-Bench (2022)

Augmented prompting abilities
Instruction following (finetuning)            1.3E+23       68B      FLAN        Wei et al. (2022a)
Scratchpad: 8-digit addition (finetuning)     8.9E+19       40M      LaMDA       Nye et al. (2021)
Using open-book knowledge for fact checking   1.3E+22       7.1B     Gopher      Rae et al. (2021)
Chain of thought: Math word problems          1.3E+23       68B      LaMDA       Wei et al. (2022b)
Chain of thought: StrategyQA                  2.9E+23       62B      PaLM        Chowdhery et al. (2022)
Differentiable search index                   3.3E+22       11B      T5          Tay et al. (2022)
Self-consistency decoding                     1.3E+23       68B      LaMDA       Wang et al. (2022b)
Leveraging explanations in prompting          5.0E+23       280B     Gopher      Lampinen et al. (2022)
Least-to-most prompting                       3.1E+23       175B     GPT-3       Zhou et al. (2022)
Zero-shot chain of thought reasoning          3.1E+23       175B     GPT-3       Kojima et al. (2022)
Calibration via P(True)                       2.6E+23       52B      Anthropic   Kadavath et al. (2022)
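
As a rough reading aid for Table 1, the sketch below estimates how many training tokens each emergent scale would imply. It relies on the common rule of thumb that training FLOPs C is roughly 6 × N × D for N parameters and D training tokens; that approximation is an assumption brought in here and is not stated in the table. The name implied_tokens is hypothetical, and only a few rows of the table are reproduced.

```python
# Rough estimate only: assumes training FLOPs C ~= 6 * N * D, where N is the
# parameter count and D the number of training tokens. This rule of thumb is
# an assumption added for illustration; it is not taken from Table 1.

# (ability, training FLOPs, parameters, model) copied from Table 1.
TABLE_1_ROWS = [
    ("Addition/subtraction (3 digit)", 2.3e22, 13e9, "GPT-3"),
    ("MMLU Benchmark (57 topic avg.)", 3.1e23, 175e9, "GPT-3"),
    ("MMLU Benchmark (30 topics)", 5.0e23, 70e9, "Chinchilla"),
    ("Word in Context (WiC) benchmark", 2.5e24, 540e9, "PaLM"),
]


def implied_tokens(train_flops: float, params: float) -> float:
    """Return D = C / (6 * N), the token count implied by the approximation."""
    if params <= 0:
        raise ValueError("params must be positive")
    return train_flops / (6.0 * params)


if __name__ == "__main__":
    # Print rows in order of increasing training compute.
    for ability, flops, params, model in sorted(TABLE_1_ROWS, key=lambda r: r[1]):
        tokens = implied_tokens(flops, params)
        print(f"{model:<10} {ability:<34} ~{tokens:.2e} tokens")
```

The resulting token counts are order-of-magnitude estimates at best; they mainly show that rows with similar training FLOPs can correspond to very different parameter counts.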

5 Discussion

We have seen that a range of abilities, in the few-shot prompting setup or otherwise, have thus far only been observed when evaluated on a sufficiently large language model. Hence, their emergence cannot be predicted by simply extrapolating performance on smaller-scale models. Emergent few-shot prompted tasks are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely do not know the full scope of few-shot prompted tasks that language models can perform. This raises the question of whether further scaling could potentially endow even-larger language models with new emergent abilities. Tasks that language models cannot currently do are prime candidates for future emergence; for instance, there are dozens of tasks in BIG-Bench for which even the largest GPT-3 and PaLM models do not achieve above-random performance (see Appendix E.4).
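
The working notion of emergence in the paragraph above (near-random performance at smaller scales, above-random performance at larger ones) can be sketched as a simple check over a scaling curve. This is a minimal illustration, not the authors' procedure: the function name emergent_scale, the margin, and every number in the usage example are hypothetical placeholders rather than results from the paper.

```python
from typing import Optional, Sequence, Tuple


def emergent_scale(
    points: Sequence[Tuple[float, float]],
    random_baseline: float,
    margin: float = 0.05,
) -> Optional[float]:
    """Return the smallest scale from which accuracy stays above random.

    points: (training FLOPs, accuracy) pairs, in any order.
    random_baseline: accuracy of random guessing on the task.
    margin: how far above the baseline counts as "above random"
        (an arbitrary illustrative threshold).
    Returns None if the largest model is still at or near random.
    """
    ordered = sorted(points)
    threshold = random_baseline + margin
    scale = None
    # Walk from the largest model down; stop at the first near-random model.
    for flops, acc in reversed(ordered):
        if acc > threshold:
            scale = flops
        else:
            break
    return scale


if __name__ == "__main__":
    # Hypothetical accuracies for a binary task (random baseline 0.5).
    curve = [(1e21, 0.49), (1e22, 0.51), (1e23, 0.50), (1e24, 0.63)]
    print(emergent_scale(curve, random_baseline=0.5))
```

A check like this only labels a curve after the fact. It says nothing about where the threshold will fall for a task that is still at random, which is the point the paragraph makes about extrapolating from smaller models.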

The ability for scale to unpredictably enable new techniques is not just theoretical. Consider the Word in Context (WiC) benchmark (Pilehvar & Camacho-Collados, 2019) shown in Figure 2H, as a historical example. Here, scaling GPT-3 to around 3 × 10^23 training FLOPs (175B parameters) failed to unlock above-random one-shot prompting performance. Regarding this negative result, Brown et al. (2020) cited the model architecture of GPT-3 or the use of an autoregressive language modeling objective (rather than using a denoising training objective) as potential reasons, and suggested training a model of comparable size with bidirectional architecture as a remedy. However, later work found that further scaling a decoder-only language model was actually enough to enable above-random performance on this task. As is shown in Figure 2H, scaling PaLM (Chowdhery et al., 2022) from 3 × 10^23 training FLOPs (62B parameters) to 3 × 10^24 training FLOPs (540B parameters) led to a significant jump in performance, without the significant architectural changes suggested by Brown et al. (2020).

GPT-3 does achieve slightly above-random performance on the dev set with few-shot instead of one-shot prompting (~55%), but this above-random performance did not appear to be a result of scale and did not hold on the test set server.
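
Whether an accuracy such as ~55% on a binary task counts as "above random" depends on how many examples it was measured on. The sketch below computes a one-sided exact binomial tail probability against a 50% baseline. It is an illustration under stated assumptions, not an analysis from the paper: the function name binomial_tail is hypothetical, and the evaluation-set size in the usage example is a made-up placeholder because the text above does not give one.

```python
import math


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), computed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i), using lgamma for the binomial
        # coefficient so large n does not overflow.
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    n_examples = 600  # hypothetical evaluation-set size
    n_correct = round(0.55 * n_examples)  # ~55% accuracy, the figure quoted above
    print(binomial_tail(n_correct, n_examples, p=0.5))
```

A small tail probability on one split would still not address the other observation above, that the result did not hold on the test set server.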
