Gemini: A Family of Highly Capable Multimodal Models
5.1.6. Human Preference Evaluations
Human preference for model outputs provides an important indication of quality that complements
automated evaluations. We evaluated the Gemini models in side-by-side blind evaluations in which
human raters judge the responses of two models to the same prompt. We instruction-tune (Ouyang et al.,
2022) the pretrained model using the techniques discussed in Section 6.4.2. The instruction-tuned
version of the model is evaluated on a range of specific capabilities, such as following instructions,
creative writing, multimodal understanding, long-context understanding, and safety. These capabili-
ties encompass a range of use cases inspired by current user needs and research-inspired potential
future use cases.
Instruction-tuned Gemini Pro models show large improvements across a range of capabilities: as
shown in Table 6, the Gemini Pro model is preferred over the PaLM 2 model API 65.0% of the time
in creative writing, 59.2% of the time in following instructions, and 68.5% of the time for safer
responses. These improvements translate directly into a more helpful and safer user experience.
                     Creativity        Instruction Following   Safety
Win-rate             65.0%             59.2%                   68.5%
95% Conf. Interval   [62.9%, 67.1%]    [57.6%, 60.8%]          [66.0%, 70.8%]
Table 6 | Win rate of Gemini Pro over PaLM 2 (text-bison@001) with 95% confidence intervals.
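Intervals of this form can be computed with a standard binomial confidence interval over the rated comparisons. A minimal sketch using the Wilson score interval; the sample size below is a hypothetical placeholder, not the study's actual rater count:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial win rate."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# Hypothetical: 1300 wins out of 2000 rated comparisons.
lo, hi = wilson_interval(1300, 2000)
print(f"win rate 65.0%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The interval narrows as the number of rated comparisons grows, which is why the instruction-following column (likely rated on more prompts) shows a tighter interval than the others would at the same rate.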
5.1.7. Complex Reasoning Systems
Gemini can also be combined with additional techniques such as search and tool-use to create
powerful reasoning systems that can tackle more complex multi-step problems. One example of such
a system is AlphaCode 2, a new state-of-the-art agent that excels at solving competitive programming
problems (Leblond et al., 2023). AlphaCode 2 uses a specialized version of Gemini Pro – tuned on
competitive programming data similar to the data used in Li et al. (2022) – to conduct a massive
search over the space of possible programs. This is followed by a tailored filtering, clustering and
reranking mechanism. Gemini Pro is fine-tuned both to be a coding model to generate proposal
solution candidates, and to be a reward model that is leveraged to recognize and extract the most
promising code candidates.
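The generate-filter-cluster-rerank loop described above can be sketched as follows. This is a hedged illustration, not the AlphaCode 2 implementation: `generate`, `run`, and `score` are hypothetical stand-ins for the coding model, a sandboxed program runner, and the reward model, and `solve`, `n_samples`, and the problem dictionary keys are invented for this example:

```python
from collections import defaultdict

def solve(problem, generate, run, score, n_samples=1000):
    """Sample-filter-cluster-rerank sketch in the spirit of AlphaCode 2.

    generate(problem) -> candidate program   (coding model)
    run(program, inp) -> output string       (sandboxed execution)
    score(program)    -> float               (reward model)
    """
    # 1. Massive search: sample many candidate programs.
    candidates = [generate(problem) for _ in range(n_samples)]
    # 2. Filter: keep only programs that pass the public example tests.
    passing = [c for c in candidates
               if all(run(c, i) == o for i, o in problem["examples"])]
    # 3. Cluster: group survivors by their behaviour on extra inputs,
    #    so semantically identical programs collapse into one group.
    clusters = defaultdict(list)
    for c in passing:
        signature = tuple(run(c, i) for i in problem["extra_inputs"])
        clusters[signature].append(c)
    # 4. Rerank: take the best-scoring program from each cluster and
    #    submit the highest-scoring one overall.
    best_per_cluster = [max(group, key=score) for group in clusters.values()]
    return max(best_per_cluster, key=score, default=None)
```

Clustering before reranking is the key design choice: it prevents many near-duplicate samples of the same (possibly wrong) program from dominating the final selection.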
AlphaCode 2 is evaluated on Codeforces,⁵ the same platform as AlphaCode, on 12 contests from
divisions 1 and 2, for a total of 77 problems. AlphaCode 2 solved 43% of these competition problems, a
1.7x improvement over the prior record-setting AlphaCode system, which solved 25%. Mapping this to
competition rankings, AlphaCode 2 built on top of Gemini Pro sits at an estimated 85th percentile on
average – i.e. it performs better than 85% of entrants. This is a significant advance over AlphaCode,
which only outperformed 50% of competitors.
The composition of powerful pretrained models with search and reasoning mechanisms is an
exciting direction towards more general agents; another key ingredient is deep understanding across
a range of modalities which we discuss in the next section.
⁵ http://codeforces.com/