Apple Releases MM1, a 30B-Parameter Multimodal Large Model: Architecture and Pre-training Insights


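This document covers Apple's recent work on Multimodal Large Language Models (MLLMs), published as the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training". With multimodal generation drawing heavy attention in the AI market, notably OpenAI's Sora project, Apple has joined the competition with a family of high-performance multimodal models of up to 30 billion (30B) parameters. The study focuses on how to build efficient, well-performing multimodal models, which comes down to the key components of the model architecture and the choice of data. The author team, which includes several core and senior authors, carefully analyzes and compares image encoders, vision-language connectors, and different types of pre-training data. They find that, for large-scale multimodal pre-training, mixing image-caption data, interleaved image-text data, and text-only data is crucial for reaching state-of-the-art performance. Specifically, they highlight the following design points:

1. **Image encoder**: a well-chosen image encoder matters for how the model understands and integrates information across modalities; it shapes the model's ability to interpret and process visual input.
2. **Vision-language connector**: the connector design determines how the model links text and visual elements; an efficient connector helps fuse knowledge across modalities.
3. **Data diversity**: a mixed-data strategy helps the model learn broader language patterns and contextual understanding, and avoids the bias or limitations that a single data type can introduce.
4. **Text-image interaction**: training on interleaved image-text data lets the model grasp the relationship between the two modalities even when handling each one separately, which improves overall performance.
5. **Quality and quantity of pre-training data**: high-quality image-text pairs and diverse data sources have a marked effect on generalization and transfer.
6. **Model scale**: the 30B-parameter model shows strong potential on multimodal tasks, but also brings greater compute cost and data requirements.

Through this analysis and the accompanying experiments, Apple lays out its methodology and technical insights for developing multimodal large models. The work describes strategies for building efficient multimodal models and offers a useful reference for other researchers and developers. Apple can be expected to continue exploring this area.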

8 B. McKinzie et al.

models. After instruction tuning, all three architectures achieve very similar results at the 336px and 114 token setting. (See Appendix Figure 10 for fine-tuning results.)

3.3 Pre-training Data Ablation

Large-scale and task-appropriate data is of paramount importance in training performant models. Typically, models are trained in two stages, pre-training and instruction tuning. In the former stage, web-scale data is used, while in the latter stage task-specific curated data is utilized. In the following, we focus on the pre-training stage and elaborate our data choices (see Figure 3, right).

| Data Type | Sources | Size |
| --- | --- | --- |
| Captioned Images | CC3M [101], CC12M [13], HQIPT-204M [94], COYO [11], Web Image-Text-1B (Internal) | 2B image-text pairs |
| Captioned Images (Synthetic) | VeCap [56] | 300M image-text pairs |
| Interleaved Image-Text | OBELICS [58], Web Interleaved (Internal) | 600M documents |
| Text-only | Webpages, Code, Social media, Books, Encyclopedic, Math | 2T tokens |

Table 2: List of datasets for pre-training multimodal large language models.

Two types of data are commonly used to train MLLMs: captioning data consisting of images with paired text descriptions; and interleaved image-text documents from the web (see Appendix A.1 for details). Note that captioning data tends to contain relatively short text with high relevance to the image. On the contrary, interleaved data has substantially longer and more diverse text with less relevance, on average, to the surrounding images. Finally, we include text-only data to help preserve the language understanding capabilities of the underlying LLM. The full list of datasets is summarized in Table 2.
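The difference between the two image data types can be made concrete with a small sketch. This is only an illustration, not the paper's data format: the class names `ImageRef`, `CaptionExample`, and `InterleavedDocument` and the helpers `num_images` and `text_tokens` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    """Hypothetical stand-in for an image (a real pipeline would hold pixels or a path)."""
    uri: str


@dataclass
class CaptionExample:
    """Captioning data: one image paired with a short, highly relevant description."""
    image: ImageRef
    caption: str


@dataclass
class InterleavedDocument:
    """Interleaved data: a web document whose text and images appear in reading order."""
    segments: List[Union[ImageRef, str]]


def num_images(example: Union[CaptionExample, InterleavedDocument]) -> int:
    """Count the images in an example of either type."""
    if isinstance(example, CaptionExample):
        return 1
    return sum(isinstance(s, ImageRef) for s in example.segments)


def text_tokens(example: Union[CaptionExample, InterleavedDocument]) -> int:
    """Rough text length, using whitespace splitting as an assumed proxy for tokens."""
    if isinstance(example, CaptionExample):
        return len(example.caption.split())
    return sum(len(s.split()) for s in example.segments if isinstance(s, str))


caption = CaptionExample(ImageRef("img_0"), "a dog on a beach")
doc = InterleavedDocument([
    "Our trip started at the harbor.",
    ImageRef("img_1"),
    "By noon we had reached the lighthouse.",
    ImageRef("img_2"),
    "The view from the top was worth the climb.",
])
```

In this toy pair the interleaved document carries several images and more text than the caption example, which mirrors the contrast drawn above between the two data types.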

We use the same model setup for ablations described in Section 3.1, with the only exception that we train 200k steps here to fully leverage the large-scale data training. We also incorporate a set of commonly employed text tasks, referred to as TextCore, as part of the evaluation to better assess the effects of data mixture. These lead to the following lessons:

Data lesson 1: interleaved data is instrumental for few-shot and text-only performance, while captioning data lifts zero-shot performance. In Figure 5a, we present results across different mixes of interleaved and captioned data. Zero-shot performance increases consistently, from 25.8% to 39.3%, as we increase the amount of captioned data. At the same time, however, for 4- and 8-shot performance, having at least 50% of the data being interleaved is crucial to maintain over 61% for 8-shot or 58% for 4-shot. Without it, performance drops drastically to 45% and 43.7%, respectively. Since interleaved data naturally contains multiple images and accompanying text which are often interrelated, such data is inherently similar to few-shot test inputs, which aligns well with empirical results. However, due to the nature of common evaluation being heavily tailored to captioning problems (3 out of the 8 benchmarks are captioning), captioning data notably lifts zero-shot performance. Interestingly, the use of interleaved data further boosts performance on these very same captioning benchmarks in few-shot settings. Similarly, text-only performance benefits from interleaved data, likely as interleaved data contains long-form text as well.

Footnote: TextCore tasks include ARC [22], PIQA [7], LAMBADA [89], WinoGrande [97], HellaSWAG [128], SciQ [118], TriviaQA [50], and WebQS [6].

[Figure 5: bar charts of average performance on four metrics (TextCore, 0-shot, 4-shot, 8-shot). Panels: (a) Caption/Interleaved Mixing at ratios 100/0, 66/33, 50/50, 33/66, 0/100; (b) Importance of Text Only Data, comparing Caption, Caption+Text, Interleaved, Interleaved+Text; (c) ratios 100/0, 91/9, 86/14, 66/33; (d) Impact of VeCap Data, comparing w/o VeCap and w/ VeCap.]

Fig. 5: Data Ablations. For each ablation, we present four different metrics: TextCore, 0-shot, 4-shot, and 8-shot. (a) Results with image data where we present five different mixing ratios between interleaved and captioned data. (b) Results with and without text-only data. We mix the text-only data separately with captioned and interleaved data. (c) Results with different mixing ratios between image data (caption and interleaved) and text-only data. (d) Results with and without including VeCap as part of caption data.
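The caption/interleaved mixing ratios in Data lesson 1 can be sketched as weighted sampling over data sources. The excerpt does not say how the mix is implemented, so this is a minimal sketch under the assumption that each training example is drawn from a source with probability proportional to its mixing weight; `make_mixture_sampler` and the source names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict


def make_mixture_sampler(weights: Dict[str, float], seed: int = 0) -> Callable[[], str]:
    """Return a function that picks a source name with probability
    proportional to its weight (assumed per-example mixing)."""
    if not weights:
        raise ValueError("at least one data source is required")
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    names = list(weights)
    probs = [weights[n] for n in names]
    rng = random.Random(seed)  # seeded so the draw sequence is reproducible

    def sample() -> str:
        return rng.choices(names, weights=probs, k=1)[0]

    return sample


# A 50/50 caption/interleaved mix, one of the ratios on the Figure 5a axis.
sampler = make_mixture_sampler({"captioned": 50, "interleaved": 50}, seed=0)
counts = Counter(sampler() for _ in range(10_000))

# A 100/0 mix degenerates to captioned data only.
caption_only = make_mixture_sampler({"captioned": 100, "interleaved": 0}, seed=0)
caption_only_draws = {caption_only() for _ in range(1_000)}
```

Weights are given as ratios rather than probabilities so that labels such as 66/33 can be passed through unchanged; `random.choices` normalizes them internally.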

Data lesson 2: text-only data helps with few-shot and text-only performance. We utilize text-only data as a way to maintain the language understanding capabilities of the model. As seen in Figure 5b, combining text-only and captioned data boosts few-shot performance. In other words, long text does allow the model to utilize multiple image and text examples as context to perform better question answering and captioning. On the other side, combining text-only with interleaved data leads to a drop in performance, albeit a minor
