Gemini: A Family of Highly Capable Multimodal Models
5.1.6. Human Preference Evaluations
Human preference for model outputs provides an important indication of quality that complements
automated evaluations. We evaluated the Gemini models in side-by-side blind evaluations in which
human raters judge the responses of two models to the same prompt. We instruction-tune (Ouyang et al.,
2022) the pretrained model using the techniques discussed in Section 6.4.2. The instruction-tuned
version of the model is evaluated on a range of specific capabilities, such as following instructions,
creative writing, multimodal understanding, long-context understanding, and safety. These capabili-
ties encompass a range of use cases inspired by current user needs and research-inspired potential
future use cases.
Instruction-tuned Gemini Pro models show large improvements across a range of capabilities: as
shown in Table 6, the Gemini Pro model is preferred over the PaLM 2 model API 65.0% of the time
in creative writing, 59.2% of the time in following instructions, and 68.5% of the time for safer
responses. These improvements translate directly into a more helpful and safer user experience.
                     Creativity        Instruction Following   Safety
Win-rate             65.0%             59.2%                   68.5%
95% Conf. Interval   [62.9%, 67.1%]    [57.6%, 60.8%]          [66.0%, 70.8%]
Table 6 | Win rate of Gemini Pro over PaLM 2 (text-bison@001) with 95% confidence intervals.
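Intervals of this form can be computed with a standard binomial confidence interval over the rated comparisons. A minimal sketch using the Wilson score interval; the sample size below is a hypothetical placeholder, not the study's actual rater count:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial win rate."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# Hypothetical: 1300 wins out of 2000 rated comparisons.
lo, hi = wilson_interval(1300, 2000)
print(f"win rate 65.0%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The interval narrows as the number of rated comparisons grows, which is why the instruction-following column (likely rated on more prompts) shows a tighter interval than the others would at the same rate.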
5.1.7. Complex Reasoning Systems
Gemini can also be combined with additional techniques such as search and tool-use to create
powerful reasoning systems that can tackle more complex multi-step problems. One example of such
a system is AlphaCode 2, a new state-of-the-art agent that excels at solving competitive programming
problems (Leblond et al., 2023). AlphaCode 2 uses a specialized version of Gemini Pro – tuned on
competitive programming data similar to the data used in Li et al. (2022) – to conduct a massive
search over the space of possible programs. This is followed by a tailored filtering, clustering and
reranking mechanism. Gemini Pro is fine-tuned both to be a coding model to generate proposal
solution candidates, and to be a reward model that is leveraged to recognize and extract the most
promising code candidates.
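The generate-filter-cluster-rerank loop described above can be sketched as follows. This is a hedged illustration, not the AlphaCode 2 implementation: `generate`, `run`, and `score` are hypothetical stand-ins for the coding model, a sandboxed program runner, and the reward model, and `solve`, `n_samples`, and the problem dictionary keys are invented for this example:

```python
from collections import defaultdict

def solve(problem, generate, run, score, n_samples=1000):
    """Sample-filter-cluster-rerank sketch in the spirit of AlphaCode 2.

    generate(problem) -> candidate program   (coding model)
    run(program, inp) -> output string       (sandboxed execution)
    score(program)    -> float               (reward model)
    """
    # 1. Massive search: sample many candidate programs.
    candidates = [generate(problem) for _ in range(n_samples)]
    # 2. Filter: keep only programs that pass the public example tests.
    passing = [c for c in candidates
               if all(run(c, i) == o for i, o in problem["examples"])]
    # 3. Cluster: group survivors by their behaviour on extra inputs,
    #    so semantically identical programs collapse into one group.
    clusters = defaultdict(list)
    for c in passing:
        signature = tuple(run(c, i) for i in problem["extra_inputs"])
        clusters[signature].append(c)
    # 4. Rerank: take the best-scoring program from each cluster and
    #    submit the highest-scoring one overall.
    best_per_cluster = [max(group, key=score) for group in clusters.values()]
    return max(best_per_cluster, key=score, default=None)
```

Clustering before reranking is the key design choice: it prevents many near-duplicate samples of the same (possibly wrong) program from dominating the final selection.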
AlphaCode 2 is evaluated on Codeforces,⁵ the same platform as AlphaCode, on 12 contests from
divisions 1 and 2, for a total of 77 problems. AlphaCode 2 solved 43% of these competition problems, a
1.7x improvement over the prior record-setting AlphaCode system, which solved 25%. Mapping this to
competition rankings, AlphaCode 2 built on top of Gemini Pro sits at an estimated 85th percentile on
average – i.e. it performs better than 85% of entrants. This is a significant advance over AlphaCode,
which only outperformed 50% of competitors.
The composition of powerful pretrained models with search and reasoning mechanisms is an
exciting direction towards more general agents; another key ingredient is deep understanding across
a range of modalities which we discuss in the next section.
⁵ http://codeforces.com/