Table 10: Results on the FRMT (Few-shot Region-aware Machine Translation) benchmark of dialect-specific translation.
Inputs are 5-shot exemplars and scores are computed with BLEURT.
                     Portuguese   Portuguese   Chinese      Chinese
                     (Brazil)     (Portugal)   (Mainland)   (Taiwan)
PaLM                 78.5         76.1         70.3         68.6
Google Translate     80.2         75.3         72.3         68.5
PaLM 2               81.1         78.3         74.4         72.0
Regional translation experimental setup
We also report results on the FRMT benchmark (Riley et al., 2023) for
Few-shot Region-aware Machine Translation. By focusing on region-specific dialects, FRMT allows us to measure PaLM
2’s ability to produce translations that are most appropriate for each locale—translations that will feel natural to each
community. We show the results in Table 10. We observe that PaLM 2 improves not only over PaLM but also over
Google Translate in all locales.
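BLEURT is a learned metric; for illustration, translations can be scored with the open-source bleurt package (github.com/google-research/bleurt). The sketch below is ours: the BLEURT-20 checkpoint and the corpus-level average are assumptions, as the report specifies neither.

    # Sketch: scoring candidate translations against references with BLEURT.
    # The BLEURT-20 checkpoint and the plain average are assumptions; the
    # report does not state which checkpoint or aggregation was used.
    from bleurt import score

    scorer = score.BleurtScorer("BLEURT-20")  # path to a downloaded checkpoint

    references = ["O comboio chega às nove."]  # pt-PT reference translation
    candidates = ["O trem chega às nove."]     # model output (pt-BR phrasing)

    # Returns one float per (reference, candidate) pair; higher is better.
    scores = scorer.score(references=references, candidates=candidates)
    print(sum(scores) / len(scores))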
Potential misgendering harms
We measure PaLM 2 on failures that can lead to potential misgendering harms in
zero-shot translation. When translating into English, we find that PaLM 2 performs on par with PaLM, with
small improvements on worst-case disaggregated performance across 26 languages. When translating out of English into
13 languages, we evaluate gender agreement and translation quality with human raters. Surprisingly, we find that even in
the zero-shot setting PaLM 2 outperforms PaLM and Google Translate on gender agreement in three high-resource
languages: Spanish, Polish and Portuguese. We observe lower gender agreement scores when translating into Telugu,
Hindi and Arabic with PaLM 2 as compared to PaLM. See Appendix E.5 for results and analysis.
4.6 Natural language generation
Due to their generative pre-training, natural language generation (NLG), rather than classification or regression, has
become the primary interface for large language models. Despite this, models’ generation quality is rarely
evaluated, and NLG evaluations typically focus on English news summarization. Evaluating the potential harms or
bias in natural language generation also requires a broader approach, including considering dialog uses and adversarial
prompting. We evaluate PaLM 2’s natural language generation ability on representative datasets covering a typologically
diverse set of languages¹⁰:
• XLSum (Hasan et al., 2021), which asks a model to summarize a news article in the same language in a single
sentence, in Arabic, Bengali, English, Japanese, Indonesian, Swahili, Korean, Russian, Telugu, Thai, and Turkish.
• WikiLingua (Ladhak et al., 2020), which focuses on generating section headers for step-by-step instructions
from WikiHow, in Arabic, English, Japanese, Korean, Russian, Thai, and Turkish.
• XSum (Narayan et al., 2018), which tasks a model with generating a news article’s first sentence, in English.
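All three benchmarks are publicly available; as a point of reference, the sketch below loads one split of each via the Hugging Face datasets library. The dataset IDs, configuration names, and splits are our assumptions and are not taken from the paper.

    # Sketch: loading the three benchmarks from the Hugging Face Hub. The
    # dataset IDs/configs below are community mirrors and an assumption on
    # our part; the paper does not state how the data was obtained.
    from datasets import load_dataset

    # XL-Sum: news article -> single-sentence summary, per language.
    xlsum_ar = load_dataset("csebuetnlp/xlsum", "arabic", split="test")

    # WikiLingua: WikiHow instructions -> section header/summary; the
    # original mirror ships a single "train" split per language.
    wikilingua_ar = load_dataset("wiki_lingua", "arabic", split="train")

    # XSum: English news article -> its first sentence.
    xsum = load_dataset("xsum", split="test")

    print(xlsum_ar.column_names, len(xlsum_ar))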
We compare PaLM 2 to PaLM using a common setup and re-compute PaLM results for this work. We use a custom
1-shot prompt for each dataset, which consists of an instruction, a source document, and its generated summary,
sentence, or header. As evaluation metrics, we use ROUGE-2 for English, and SentencePiece-ROUGE-2, an extension
of ROUGE that handles non-Latin characters using a SentencePiece tokenizer—in our case, the mT5 (Xue et al., 2021)
tokenizer—for all other languages.
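Because SentencePiece-ROUGE-2 is less standard than ROUGE-2, a minimal sketch of our reading of it follows: ROUGE-2 F1 computed over mT5 SentencePiece pieces rather than whitespace words, which is what lets it handle non-Latin scripts. The paper’s exact implementation may differ.

    # Sketch of SentencePiece-ROUGE-2: compute ROUGE-2 F1 over mT5
    # SentencePiece pieces instead of whitespace words, so non-Latin text
    # is handled. Our reading of the metric; the paper's implementation
    # may differ.
    from collections import Counter
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

    def sp_rouge2_f1(target: str, prediction: str) -> float:
        def bigrams(text: str) -> Counter:
            pieces = tokenizer.tokenize(text)
            return Counter(zip(pieces, pieces[1:]))

        ref, hyp = bigrams(target), bigrams(prediction)
        overlap = sum((ref & hyp).values())  # clipped bigram matches
        if overlap == 0:
            return 0.0
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)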
We focus on the 1-shot-learning setting, as inputs can be long. We truncate extremely long inputs to about half the max
input length, so that instructions and targets can always fit within the model’s input. We decode a single output greedily
and stop at an exemplar separator (double newline), or continue decoding until the maximum decode length, which is
set to the 99th-percentile target length.
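For illustration, the helpers below sketch this decoding recipe. model_generate is a hypothetical stand-in for a greedy decoding call, and the token handling is schematic rather than taken from the paper.

    # Sketch of the 1-shot decoding setup described above. `model_generate`
    # is a hypothetical stand-in for a greedy decoding call with a token
    # limit; token handling is schematic.
    import numpy as np

    SEPARATOR = "\n\n"  # exemplar separator, also used as the stop sequence

    def max_decode_len(target_token_lengths):
        # Cap generation at the 99th-percentile target length.
        return int(np.percentile(target_token_lengths, 99))

    def truncate_input(doc_tokens, input_budget):
        # Truncate extremely long documents to about half the input budget
        # so the instruction and the 1-shot exemplar always fit.
        return doc_tokens[: input_budget // 2]

    def decode_one(model_generate, prompt, max_new_tokens):
        output = model_generate(prompt, max_new_tokens)  # greedy, one sample
        # Stop at the exemplar separator if the model emits one.
        return output.split(SEPARATOR)[0].strip()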
¹⁰ We focus on the set of typologically diverse languages also used in TyDi QA (Clark et al., 2020).