is the same in the former. Surprisingly, chain-of-thought examples from a different domain
(commonsense to arithmetic) but with the same answer (multiple-choice) format provide a substantial
performance gain over Zero-shot (on AQUA-RAT), measured relative to the possible improvements
from Zero-shot-CoT or Few-shot-CoT. In contrast, the performance gain becomes much smaller when
using examples with a different answer type (on MultiArith), confirming prior work [Min et al., 2022]
suggesting that LLMs mostly leverage the few-shot examples to infer the repeated answer format rather
than the task itself in-context. Nevertheless, in both cases the results are worse than Zero-shot-CoT,
affirming the importance of task-specific sample engineering in Few-shot-CoT.
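To make the setup concrete, the sketch below shows how such a cross-domain Few-shot-CoT prompt might be assembled. The `complete` callable is a stand-in for any text-completion LLM API, and the exemplar text is illustrative (in the style of Wei et al. [2022]), not the exact exemplar used in our experiments.

```python
from typing import Callable

# Illustrative multiple-choice commonsense exemplar (assumed text, in the
# style of Wei et al. [2022]); reused verbatim when prompting an arithmetic
# multiple-choice task such as AQUA-RAT.
COMMONSENSE_EXEMPLAR = (
    "Q: What do people use to absorb extra ink from a fountain pen? "
    "Answer Choices: (A) shirt pocket (B) calligrapher's hand (C) inkwell "
    "(D) desk drawer (E) blotter\n"
    "A: The answer must be an item that can absorb ink. Blotters are made "
    "to absorb liquids. So the answer is (E).\n\n"
)

def few_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Prepend a chain-of-thought exemplar (here, from another domain)
    so the model imitates its step-by-step, multiple-choice format."""
    prompt = COMMONSENSE_EXEMPLAR + "Q: " + question + "\nA:"
    return complete(prompt)
```

Because the exemplar fixes the multiple-choice output format, it transfers to AQUA-RAT questions sharing that format, consistent with the result above.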
5 Discussion and Related Work
Table 6: Summary of related work on arithmetic/commonsense reasoning tasks. Category denotes the
training strategy, CoT denotes whether the method outputs a chain of thought, and the Task column
lists the tasks performed in the corresponding papers. AR: Arithmetic Reasoning, CR: Commonsense
Reasoning.
| Method | Category | CoT | Task | Model |
| --- | --- | --- | --- | --- |
| Rajani et al. [2019] | Fine-Tuning | ✓ | CR | GPT |
| Cobbe et al. [2021] | Fine-Tuning | ✓ | AR | GPT-3 |
| Zelikman et al. [2022] | Fine-Tuning | ✓ | AR, CR | GPT-3, etc. |
| Nye et al. [2022]^5 | Fine-Tuning | ✓ | AR | Transformer (decoder) |
| Brown et al. [2020] | Few/Zero-Shot | | CR | GPT-3 |
| Smith et al. [2022] | Few/Zero-Shot | | AR, CR | MT-NLG |
| Rae et al. [2021] | Few-Shot | | AR, CR | Gopher |
| Wei et al. [2022] | Few-Shot | ✓ | AR, CR | PaLM, LaMDA, GPT-3 |
| Wang et al. [2022] | Few-Shot | ✓ | AR, CR | PaLM, etc. |
| Chowdhery et al. [2022] | Few-Shot | ✓ | AR, CR | PaLM |
| Shwartz et al. [2020] | Zero-Shot | ✓ | CR | GPT-2, etc. |
| Reynolds and McDonell [2021] | Zero-Shot | ✓ | AR | GPT-3 |
| Zero-shot-CoT (Ours) | Zero-Shot | ✓ | AR, CR | PaLM, Instruct-GPT3, GPT-3, etc. |
Reasoning Ability of LLMs
Several studies have shown that pre-trained models are usually not
good at reasoning [Brown et al., 2020, Smith et al., 2022, Rae et al., 2021], but their ability can be
substantially improved by making them produce step-by-step reasoning, either by fine-tuning [Rajani
et al., 2019, Cobbe et al., 2021, Zelikman et al., 2022, Nye et al., 2022] or by few-shot prompting [Wei
et al., 2022, Wang et al., 2022, Chowdhery et al., 2022] (see Table 6 for a summary of related work).
Unlike most prior work, we focus on zero-shot prompting and show that a single fixed trigger prompt
substantially increases the zero-shot reasoning ability of LLMs across a variety of tasks requiring
complex multi-hop thinking (Table 1), especially when the model is scaled up (Figure 3). It also
generates reasonable and understandable chains of thought across diverse tasks (Appendix B), even
when the final prediction is wrong (Appendix C). Similar to our work, Reynolds and McDonell
[2021] demonstrate that the prompt “Let’s solve this problem by splitting it into steps” facilitates
multi-step reasoning on a simple arithmetic problem. However, they treated it as a task-specific
example and did not evaluate it quantitatively against baselines on diverse reasoning tasks. Shwartz
et al. [2020] propose to decompose a commonsense question into a series of information-seeking
questions, such as “what is the definition of [X]”. Their method does not require demonstrations but
requires substantial manual prompt engineering for each reasoning task. Our results strongly suggest
that LLMs are decent zero-shot reasoners, whereas prior work [Wei et al., 2022] often emphasizes
only few-shot learning and task-specific in-context learning, e.g., no zero-shot baselines were reported.
Our method does
not require time-consuming fine-tuning or expensive sample engineering, and can be combined with
any pre-trained LLM, serving as the strongest zero-shot baseline for all reasoning tasks.
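For concreteness, the following minimal sketch shows the two-stage prompting that Zero-shot-CoT uses: a single fixed reasoning trigger followed by an answer-extraction prompt. The `complete` callable is again an assumed stand-in for any LLM completion API, and the exact extraction wording varies with the answer format of each task.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# The extraction wording is format-dependent (e.g., "Therefore, the answer
# (arabic numerals) is" for arithmetic tasks); this is one variant.
ANSWER_TRIGGER = "Therefore, the answer is"

def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Two-stage Zero-shot-CoT: (1) elicit a reasoning chain with a single
    fixed trigger and no exemplars; (2) append the chain plus an extraction
    trigger and read off the final answer."""
    reasoning_prompt = "Q: " + question + "\nA: " + REASONING_TRIGGER
    chain = complete(reasoning_prompt)           # 1st prompt: reasoning extraction
    extraction_prompt = reasoning_prompt + chain + "\n" + ANSWER_TRIGGER
    return complete(extraction_prompt)           # 2nd prompt: answer extraction
```

Note that the same two prompts are used unchanged across all tasks; only the extraction wording is adapted to the answer format, which is what makes the method task-agnostic.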
Zero-shot Abilities of LLMs
Radford et al. [2019] show that LLMs have excellent zero-shot
abilities in many system-1 tasks, including reading comprehension, translation, and summarization.
^5 Nye et al. [2022] also evaluate few-shot settings, but the few-shot performance on their domains is
worse than the fine-tuning results.