Fig. 9: The BLOOM architecture example sourced from [13].
1.14 BLOOM [13]: A causal decoder model trained
on the ROOTS corpus with the aim of open-sourcing an LLM.
The architecture of BLOOM is shown in Figure 9, with differences such as ALiBi positional embeddings and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes library (https://github.com/TimDettmers/bitsandbytes). These changes stabilize training and improve downstream performance.
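To make these two changes concrete, the PyTorch sketch below is an illustrative approximation, not BLOOM's actual implementation: it builds the per-head ALiBi biases added to attention scores and applies a LayerNorm directly after the token embedding; the vocabulary size, hidden size, and head count are placeholder values.
```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper (assumes num_heads is a power of two).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = i - j, i.e., how far key j lies in the past of query i.
    distance = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    # Linear penalty on attention to distant tokens; future positions are masked separately.
    return -slopes[:, None, None] * distance.clamp(min=0)  # shape: (heads, seq, seq)

# Extra normalization layer right after the embedding, as described for BLOOM.
vocab_size, d_model = 250_000, 1024          # placeholder sizes
embedding = torch.nn.Embedding(vocab_size, d_model)
embed_norm = torch.nn.LayerNorm(d_model)
hidden = embed_norm(embedding(torch.randint(0, vocab_size, (1, 16))))
bias = alibi_bias(num_heads=16, seq_len=16)  # added to the QK^T scores before softmax
```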
1.15 GLaM [116]: Generalist Language Model (GLaM)
represents a family of language models using a sparsely acti-
vated decoder-only mixture-of-experts (MoE) structure [117],
[118]. To gain more model capacity while reducing computation, the experts are sparsely activated: only the two best-scoring experts are used to process each input token. The largest
GLaM model, GLaM (64B/64E), is about 7× larger than GPT-
3 [6], while only a part of the parameters is activated per input
token. The largest GLaM (64B/64E) model achieves better
overall results as compared to GPT-3 while consuming only
one-third of GPT-3’s training energy.
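The top-2 routing idea can be sketched as follows; this is a minimal, dense-looped PyTorch illustration rather than GLaM's actual sharded MoE implementation, and the gate and expert modules are placeholders.
```python
import torch
import torch.nn.functional as F

class Top2MoE(torch.nn.Module):
    """Minimal top-2 mixture-of-experts layer: each token is routed to its two best experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, num_experts, bias=False)
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(d_model, d_ff), torch.nn.GELU(),
                                torch.nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.gate(x)                             # (tokens, experts)
        weights, indices = logits.topk(k=2, dim=-1)       # best two experts per token
        weights = F.softmax(weights, dim=-1)              # renormalize the two gate values
        out = torch.zeros_like(x)
        for slot in range(2):                             # only the selected experts run
            for e in range(len(self.experts)):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```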
1.16 MT-NLG [112]: A 530B-parameter causal decoder based on the GPT-2 architecture, with roughly 3× the parameters of GPT-3. MT-NLG is trained on filtered high-quality data collected from various public datasets and blends various types of datasets in a single batch. It beats GPT-3 on a number of evaluations.
1.17 Chinchilla [119]: A causal decoder trained on the same dataset as Gopher [111] but with a slightly different data sampling distribution (sampled from MassiveText). The model architecture is similar to the one used for Gopher, except that it uses the AdamW optimizer instead of Adam.
Chinchilla identifies the relationship that model size should
be doubled for every doubling of training tokens. Over 400 language models, ranging from 70 million to over 16 billion parameters and trained on 5 to 500 billion tokens, are used to estimate compute-optimal training under a given budget.
The authors train a 70B model with the same compute budget
as Gopher (280B) but with 4 times more data. It outperforms
Gopher [111], GPT-3 [6], and others on various downstream
tasks, after fine-tuning.
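As a back-of-the-envelope illustration of this trade-off, the sketch below uses the common C ≈ 6ND approximation for training compute and the roughly 20-tokens-per-parameter rule of thumb associated with the Chinchilla analysis; both are approximations rather than the paper's fitted constants.
```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough compute-optimal split under C ~= 6 * N * D with D ~= tokens_per_param * N."""
    # Solve 6 * N * (tokens_per_param * N) = C  ->  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a Gopher-scale budget (~5.76e23 FLOPs) lands near a 70B model on ~1.4T tokens.
params, tokens = chinchilla_optimal(5.76e23)
print(f"params ~= {params:.2e}, tokens ~= {tokens:.2e}")
```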
1.18 AlexaTM [120]: An encoder-decoder model, where
encoder weights and decoder embeddings are initialized with
a pre-trained encoder to speed up training. The encoder stays frozen for the initial 100k steps and is later unfrozen for end-to-end training. The model is trained on a combination of denoising
and causal language modeling (CLM) objectives, prepending a [CLM] token at the beginning for mode switching. During training, the CLM task is applied 20% of the time, which improves in-context learning performance.
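A minimal sketch of this objective mixing, assuming a simple per-example sampling scheme; the corruption details are illustrative placeholders and do not follow AlexaTM's exact denoising recipe.
```python
import random

def make_training_example(tokens: list[str], clm_prob: float = 0.2):
    """Mix causal LM (20% of the time) with denoising; a [CLM] prefix marks the mode."""
    if random.random() < clm_prob:
        # CLM mode: prepend the mode token; the target is the sequence itself, left to right.
        return ["[CLM]"] + tokens, "clm"
    # Denoising mode: drop a short contiguous span that the decoder must reconstruct.
    span_len = 3  # placeholder span length
    start = random.randrange(max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    return corrupted, "denoise"
```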
1.19 PaLM [15]: A causal decoder with parallel attention and feed-forward layers similar to Eq. 4, which speeds up training by roughly 15%. Additional changes to the conven-
tional transformer model include SwiGLU activation, RoPE
embeddings, multi-query attention that saves computation cost
during decoding, and shared input-output embeddings. During
training, loss spikes were observed; to fix them, training was restarted from a checkpoint roughly 100 steps before the spike, skipping 200-500 batches around it. Moreover, the
model was found to memorize around 2.4% of the training
data at the 540B model scale, whereas this number was lower
for smaller models.
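The parallel attention/feed-forward formulation referenced above (Eq. 4) can be sketched as follows; this uses stock PyTorch modules (a GELU MLP and standard multi-head attention, with no causal mask shown) purely to illustrate the parallel residual structure, whereas PaLM itself uses SwiGLU and multi-query attention.
```python
import torch

class ParallelBlock(torch.nn.Module):
    """Parallel block: attention and MLP both read the same normalized input,
    and their outputs are summed, instead of the usual sequential residual stacking."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d_model)
        self.attn = torch.nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(torch.nn.Linear(d_model, d_ff), torch.nn.GELU(),
                                       torch.nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Serial form would be: x = x + attn(norm(x)); x = x + mlp(norm(x))
        return x + attn_out + self.mlp(h)
```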
PaLM-2 [121]: A smaller multi-lingual variant of PaLM, trained for more iterations on a higher-quality dataset. PaLM-2 shows significant improvements over PaLM while reducing training and inference costs due to its smaller size. To reduce toxicity and memorization, it adds special tokens to a fraction of the pre-training data, which reduces the generation of harmful responses.
1.20 U-PaLM [122]: This method continues training PaLM for 0.1% additional compute with the UL2 objective [123] (also named UL2Restore), using the same dataset, and significantly outperforms the baseline on various NLP tasks, including zero-shot, few-shot, commonsense reasoning, and CoT. Training with UL2R
involves converting a causal decoder PaLM to a non-causal
decoder PaLM and employing 50% sequential denoising, 25%
regular denoising, and 25% extreme denoising loss functions.
1.21 UL2 [123]: An encoder-decoder architecture
trained using a mixture-of-denoisers (MoD) objective. The denoisers include 1) R-Denoiser: regular span masking, 2) S-Denoiser: corrupting consecutive tokens of a long sequence, and 3) X-Denoiser: corrupting a large number of tokens randomly. During pre-training, UL2 includes a denoiser token from {[R], [S], [X]} to indicate the denoising setup, which helps improve fine-tuning performance on downstream tasks that bind the task to one of the upstream training modes. This
MoD style of training outperforms the T5 model on many
benchmarks.
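A rough sketch of the mixture-of-denoisers sampling with paradigm tokens follows; the span lengths below are placeholders rather than UL2's actual configuration, and each denoiser family is reduced to a single setting for brevity.
```python
import random

# Placeholder settings; the real UL2 recipe mixes several configurations per denoiser family.
DENOISERS = {
    "[R]": {"span": 3},    # regular T5-style span corruption
    "[S]": {"span": None}, # sequential: corrupt the suffix (prefix-LM-like)
    "[X]": {"span": 32},   # extreme: very long spans / heavy corruption
}

def sample_denoising_example(tokens: list[str]) -> list[str]:
    """Pick a denoiser, corrupt the input accordingly, and prepend its paradigm token."""
    mode = random.choice(list(DENOISERS))
    span = DENOISERS[mode]["span"]
    if span is None:                   # S-denoiser: everything after a cut point is the target
        cut = random.randrange(1, len(tokens))
        corrupted = tokens[:cut] + ["<extra_id_0>"]
    else:                              # R/X-denoisers: mask one span here for simplicity
        start = random.randrange(max(1, len(tokens) - span))
        corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
    return [mode] + corrupted
```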
1.22 GLM-130B [33]: GLM-130B is a bilingual (English and Chinese) model trained using an auto-regressive mask-infilling pre-training objective similar to GLM [124]. This training style makes the model bidirectional, in contrast to GPT-3, which is unidirectional. Unlike GLM, the training of GLM-130B includes a small amount of multi-task instruction pre-training data (5% of the total data) along with the self-supervised mask infilling. To stabilize training, it applies embedding-layer gradient shrinking.
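The gradient shrink trick can be written in a couple of lines; the sketch below assumes the formulation reported for GLM-130B, where the forward value is unchanged and only the gradient flowing into the embedding is scaled by α (0.1 is the reported value).
```python
import torch

def embedding_gradient_shrink(embeddings: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Scale the gradient into the embedding layer by `alpha` while leaving the
    forward value unchanged (the detached copy contributes no gradient)."""
    return embeddings * alpha + embeddings.detach() * (1.0 - alpha)

# Usage inside a model's forward pass (illustrative):
# hidden = embedding_gradient_shrink(self.word_embeddings(input_ids))
```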
1.23 LLaMA [125], [21]: A set of decoder-only lan-
guage models ranging from 7B to 70B parameters. The LLaMA model series is the most popular in the community for parameter-efficient fine-tuning and instruction tuning.
LLaMA-1 [125]: Implements efficient causal attention [126]
by not storing or computing masked attention weights and key/query scores. Another optimization is reducing the number of