多模态思维链提升大语言模型复杂推理能力

需积分: 0 156 浏览量更新于2024-06-26 收藏 827KB PDF 举报

本文档《Multimodal Chain-of-Thought Reasoning in Language Models》由亚马逊的研究团队提出，针对大型语言模型（LLMs）在处理复杂推理任务时的局限性进行深入探讨。传统的链-of-thought（CoT）方法主要依赖于文本（语言）模式，通过生成中间的推理链条来辅助答案的推断。然而，研究者意识到仅限于语言模态可能限制了模型的推理能力。该论文创新地引入了Multimodal-CoT（多模态思维链），这是一个两阶段的框架，旨在整合语言和视觉（图像）模态。首先，这个框架将理性生成（rationale generation）与答案推理（answer inference）分离，允许模型利用更丰富的多模态信息生成更准确的推理链条。这种方法的优势在于，通过结合文本和图像，模型能够获取更全面的信息支持，从而提高推理的准确性和深度。作者Zhuosheng Zhang、Aston Zhang、Mu Li、Hai Zhao、George Karypis和Alex Smola合作，展示了他们基于超过10亿参数的语言模型在科学问答基准（ScienceQA）上的显著提升。与之前最先进的LLM（如GPT-3.5）相比，Multimodal-CoT的准确性从75.17%提升到了91.68%，甚至超过了人类的表现，显示出在处理跨模态问题时的巨大潜力。论文的主要贡献包括： 1. **多模态融合**：通过将文本理解和图像理解结合，拓宽了推理的视角，提高了模型的理解深度。 2. **两阶段架构**：通过先生成理由再进行答案推理，减少了信息处理的冗余，提高了效率。 3. **性能提升**：实验结果表明，Multimodal-CoT在多项任务上超越了先前的最佳实践，证明了多模态思考对于复杂推理任务的显著优势。 4. **开源代码**：研究人员分享了模型和代码，为后续研究者提供了可复现和扩展的基础。这篇论文不仅展示了在语言模型中引入多模态思维链的重要性和实用性，也为未来AI领域的研究者提供了一个改进现有模型，特别是在处理需要深度理解和综合分析的任务时，如何更好地利用多模态信息的参考框架。

Multimodal Chain-of-Thought Reasoning in Language Models

Generated Rationale: Magnets can pull or push on each other

without touching. When magnets attract, they pull together. When

magnets repel, they push apart. Whether a magnet attracts or repels

other magnets depends on the positions of its poles, or ends. Every

magnet has two poles, called north and south. Here are some

examples of magnets. The north pole of each magnet is marked N,

and the south pole is marked S. If different poles are closest to each

other, the magnets attract. The magnets in the pair below attract. If the

same poles are closest to each other, the magnets repel. The magnets

in both pairs below repel. Will these magnets attract or repel? To find

out, look at which poles are closest to each other. The south pole of

one magnet is closest to the south pole of the other magnet. Poles that

are the same repel. So, these magnets will repel each other.

Answer: The answer is (B).

Options:

(B) repel

(A) attract

Problem

Baseline

Question: Will these magnets attract or repel each other?

Context: Two magnets are placed as shown. Hint: Magnets that attract pull together. Magnets that repel push apart.

Gold Rationale: Magnets can pull or push on each other without touching. When magnets attract, they pull together. When magnets repel, they push

apart. Whether a magnet attracts or repels other magnets depends on the positions of its poles, or ends. Every magnet has two poles, called north and

south. Here are some examples of magnets. The north pole of each magnet is marked N, and the south pole is marked S. If different poles are closest

to each other, the magnets attract. The magnets in the pair below attract. If the same poles are closest to each other, the magnets repel. The magnets

in both pairs below repel. Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet

is closest to the south pole of the other magnet. Poles that are different attract. So, these magnets will attract each other.

Answer: The answer is (A).

Generated Rationale: Magnets can pull or push on each other

without touching. When magnets attract, they pull together. When

magnets repel, they push apart. Whether a magnet attracts or repels

other magnets depends on the positions of its poles, or ends. Every

magnet has two poles, called north and south. Here are some

examples of magnets. The north pole of each magnet is marked N,

and the south pole is marked S. If different poles are closest to each

other, the magnets attract. The magnets in the pair below attract. If the

same poles are closest to each other, the magnets repel. The magnets

in both pairs below repel. Will these magnets attract or repel? To find

out, look at which poles are closest to each other. The north pole of

one magnet is closest to the south pole of the other magnet. Poles that

are different attract. So, these magnets will attract each other.

Answer: The answer is (A).

+ Vision Features

Vision

Figure 2.

Example of the two-stage framework without vision features (baseline) and with vision features (ours) for generating rationales

and predicting answers. The upper part presents the problem details with a gold rationale, and the lower part shows the outputs of the

baseline and our method incorporated with vision features. We observe that the baseline fails to predict the right answer due to the

misleading by hallucinated rationales. More examples are shown in Appendix A.1.

3.2. Misleading by Hallucinated Rationales

To dive into how the rationales affect the answer prediction,

we separate the CoT problem into two stages, rationale

generation and answer inference. We report the RougeL

score and accuracy for the rationale generation and answer

inference, respectively. Table 3 shows the results based

on the two-stage framework. Although the two-stage base-

line model achieves a 91.76 RougeL score of the rationale

generation, the answer inference accuracy is only 70.53%.

Compared with the QCM

→

A variant (80.40%) in Table 2,

the result shows that the generated rationale in the two-stage

framework does not improve answer accuracy.

Table 3.

Two-stage setting of (i) rationale generation (RougeL) and

(ii) answer inference (Accuracy).

Method (i) QCM→ R (ii) QCMR→ A

Two-Stage Framework 91.76 70.53

w/ Captions 91.85 71.12

w/ Vision Features 96.97 84.91

Then, we randomly sample 50 error cases and ﬁnd that the

model tends to generate hallucinated rationales that mislead

the answer inference. As an example shown in Figure 2, the

model (left part) hallucinates that, “The south pole of one

magnet is closest to the south pole of the other magnet”, due

to the lack of reference to the vision content. We ﬁnd that

such mistakes occur at a ratio of 64% among the error cases

Others

(36%)

Resolved

(62.5%)

Unresolved

(37.5%)

Hallucination

(64%)

(a) ratio of hallucination mistakes

(b) correction rate w/ vision features

Figure 3.

The ratio of hallucination mistakes (a) and correction

rate w/ vision features (b).

(Figure 3(a)).

3.3. Multimodality Contributes to Effective Rationales

We speculate that such a phenomenon of hallucination is

due to a lack of necessary vision contexts for performing

effective Multimodal-CoT. To inject vision information, a

simple way is to transform the paired image into a caption

(Lu et al., 2022a) and then append the caption in the input of

both stages. However, as shown in Table 3, using captions

only yields marginal performance gains (

↑

0.59%). Then,

we explore an advanced technique by incorporating vision

features into the language model. Concretely, we feed the

paired image to the DETR model (Carion et al., 2020) to

extract vision features. Then we fuse the vision features

剩余18页未读，继续阅读

未来在这儿

粉丝: 4131
资源: 264

多模态思维链提升大语言模型复杂推理能力

Language Models are Multilingual Chain-of-Thought Reasoners.pdf

unimodal models and multimodal models

Multimodal MRI-based classification of Alzheimer's disease using hierarchical convolutional neural networks全文

Can you find out the datasets used in this paper "Knowledge-driven Egocentric Multimodal Activity Recognition"

Multimodal-GPT

Knowledge-driven Egocentric Multimodal Activity Recognition

什么是MUSIQ 评价指标

基于多参数PET-CT的多模态列线图评估肺癌淋巴结转移 翻译成英语

能推荐几篇遗传算法优化的文献吗

最新资源

基于多参数PET-CT的多模态列线图评估肺癌淋巴结转移翻译成英语