model self-improve. In the following sections, we detail important designs within our method, along
with additional approaches for the model to self-improve without supervised data.
3.1 GENERATING AND FILTERING MULTIPLE REASONING PATHS
Self-consistency (Wang et al., 2022b) brings large improvements on reasoning tasks (e.g., 56.5% → 74.4% on the GSM8K test set), and the gap between greedy decoding and diverse decoding shows there is potential for further improving the reasoning ability of $M$ by using the self-selected high-confidence reasoning paths as training data.
Question: Stefan goes to a restaurant with his family. They order an appetizer that costs $10 and 4 entrees that are $20 each. If they tip 20% of the total, what is the total amount of money that they spend?

Multiple Path Decoding:

Output 1: The appetizer costs $10. The entrees cost $20 each so in total 4 * $20 = $80. This means the total cost is $10 + $80 = $90. They tip 20% of it, so the total amount they spend is $90 * 1.2 = $108. The answer is 108.

Output 2: The appetizer costs $10 and the entrees are $20 each. There are 4 entrees so the sum is $20 * 4 = $80. The waiter gets 20% of the total. 20% of $80 is $80 * .2 = $16. The answer is $80 + $16 = $96. (Incorrect reasoning path)

Output 3: The appetizer costs $10. The entrees cost 4 * $20 = $80. The tip is 20% of the total, so it is 20% of the $90 they have spent. The tip is 0.2 * 90 = $18. The total they spent is $90 + $18 = $108. The answer is 108.

Table 1: Examples of 3 self-generated CoT reasoning paths given a question. Outputs 1 and 3 are the most consistent reasoning paths based on majority voting and are kept as self-training data.
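To make the decoding step concrete, the following is a minimal sketch of multiple-path decoding in Python. The `generate` callable, the prompt format, and the default values of `m` and `temperature` are illustrative assumptions, not details specified by this section.

```python
# Minimal sketch of multiple-path decoding. `generate` is a hypothetical
# callable that returns one sampled completion for a prompt; the few-shot
# CoT prompt format follows Wei et al. (2022b).

def sample_cot_paths(generate, few_shot_prompt, question, m=32, temperature=0.7):
    """Sample m diverse CoT reasoning paths for a single question."""
    prompt = f"{few_shot_prompt}\n\nQ: {question}\nA:"
    # Temperature sampling (rather than greedy decoding) produces the
    # diverse reasoning paths that self-consistency aggregates over.
    return [generate(prompt, temperature=temperature) for _ in range(m)]
```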
For each training question $x_i$, we sample $m$ CoT reasoning paths, denoted as $\{r_{i_1}, r_{i_2}, \ldots, r_{i_m}\}$ (see Table 1 for examples). Since $M$ is prompted with the CoT examples from Wei et al. (2022b), we apply the same output parsing with "The answer is" to generate their predicted answers $\{y_{i_1}, y_{i_2}, \ldots, y_{i_m}\}$. The most consistent answer, which is not necessarily a correct answer, is selected by majority voting, denoted as $\tilde{y}_i = \arg\max_{y_{i_j}} \sum_{k=1}^{m} \mathbb{I}(y_{i_j} = y_{i_k})$. For all the training questions, we filter the CoT reasoning paths that reach $\tilde{y}_i$ as the final answer to be put into the self-training data, denoted as $D_{\text{self-consistent}} = \{x_i, \tilde{r}_i\}$, where $\tilde{r}_i = \{r_{i_j} \mid 1 \le j \le m,\; y_{i_j} = \tilde{y}_i\}$.
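A short sketch of this majority-voting filter follows; the answer-parsing regex and the helper names are assumptions for illustration, not code from the paper.

```python
import re
from collections import Counter

def parse_answer(path: str):
    """Extract the prediction following the 'The answer is' pattern (Wei et al., 2022b)."""
    match = re.search(r"The answer is (.+?)\.?\s*$", path.strip())
    return match.group(1) if match else None

def filter_self_consistent(question: str, paths: list[str]):
    """Keep only the paths whose parsed answer equals the majority-voted answer."""
    answers = [parse_answer(r) for r in paths]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # no parsable answer; skip this question
    y_tilde, _ = votes.most_common(1)[0]  # the most consistent answer
    r_tilde = [r for r, y in zip(paths, answers) if y == y_tilde]
    return {"question": question, "paths": r_tilde, "answer": y_tilde}
```

Note that no ground-truth label is consulted: the filter keeps whichever answer the sampled paths agree on most often, correct or not.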
Figure 2: The relation of accuracy and confidence of the majority-voted answer after multiple path decoding on GSM8K training-set questions (x-axis: confidence; y-axis: accuracy; circle area/color: number of questions). Predicted confidence from self-consistency (Wang et al., 2022b) is well calibrated (Guo et al., 2017).
Since we do not use any ground-truth labels to filter out cases where $\tilde{y}_i \neq y_i$, it is important that the self-generated CoT reasoning paths are mostly reliable and that incorrect answers do not hurt the self-improvement of the model. We plot the relation between the accuracy and the confidence of self-generated CoT paths for each question in the GSM8K training set in Fig. 2. The confidence is the number of CoT paths leading to $\tilde{y}_i$ divided by the total number of paths $m$. The y-axis shows the accuracy of $\tilde{y}_i$ at a given confidence, and the circle area and color darkness show the number of questions at that confidence. We observe that confident answers are more likely to be correct: when a question has many consistent CoT paths, the corresponding $\tilde{y}_i$ is more likely to be correct. Conversely, when $\tilde{y}_i$ is wrong, it tends to be supported by fewer CoT paths, and thus introduces little noise into the training samples.
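The confidence statistic plotted in Fig. 2 can be computed directly from the same parsed answers; a minimal sketch, reusing the hypothetical `parse_answer` helper from above:

```python
def vote_confidence(paths, y_tilde):
    """Confidence of the majority answer: fraction of the m paths that reach it."""
    answers = [parse_answer(r) for r in paths]
    return sum(y == y_tilde for y in answers) / len(paths)
```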