Figure 3: Formulation of the sentiment classification task as blank infilling with GLM. (The figure's example input is "Coronet has the best lines of all day cruisers.", labeled Positive.)
reconstructing them. This is an important difference from other models. For example, XLNet (Yang et al., 2019) encodes the original positions so that it can perceive the number of missing tokens, and SpanBERT (Joshi et al., 2020) replaces the span with multiple [MASK] tokens, keeping the length unchanged. Our design better fits downstream tasks, where the length of the generated text is usually unknown beforehand.
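To make the contrast concrete, here is a minimal sketch (not the authors' code) of the two corruption styles; the tokenization and the sampled span are illustrative assumptions.

```python
def corrupt_glm(tokens, span):
    """GLM-style: replace the sampled span [start, end) with a single [MASK] token."""
    start, end = span
    return tokens[:start] + ["[MASK]"] + tokens[end:]

def corrupt_spanbert(tokens, span):
    """SpanBERT-style: one [MASK] per original token, so the span length is exposed."""
    start, end = span
    return tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]

tokens = ["the", "movie", "was", "really", "good", "."]
print(corrupt_glm(tokens, (3, 5)))       # ['the', 'movie', 'was', '[MASK]', '.']
print(corrupt_spanbert(tokens, (3, 5)))  # ['the', 'movie', 'was', '[MASK]', '[MASK]', '.']
```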
2.3 Finetuning GLM
Typically, for downstream NLU tasks, a linear clas-
sifier takes the representations of sequences or to-
kens produced by pretrained models as input and
predicts the correct labels. This practice differs from the generative pretraining task, leading to an inconsistency between pretraining and finetuning.
Instead, we reformulate NLU classification tasks
as generation tasks of blank infilling, following
PET (Schick and Schütze, 2020a). Specifically,
given a labeled example $(x, y)$, we convert the input text $x$ to a cloze question $c(x)$ via a pattern containing a single mask token. The pattern is written in natural language to represent the semantics of the task. For example, a sentiment classification task can be formulated as "{SENTENCE}. It's really [MASK]". The candidate labels $y \in \mathcal{Y}$ are also mapped to answers to the cloze, called the verbalizer $v(y)$. In sentiment classification, the labels "positive" and "negative" are mapped to the words "good" and "bad". The conditional probability of predicting $y$ given $x$ is

$$p(y \mid x) = \frac{p(v(y) \mid c(x))}{\sum_{y' \in \mathcal{Y}} p(v(y') \mid c(x))} \tag{3}$$
where $\mathcal{Y}$ is the label set. Therefore the probability
of the sentence being positive or negative is propor-
tional to predicting “good” or “bad” in the blank.
Then we finetune GLM with a cross-entropy loss
(see Figure 3).
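As an illustration, the following is a minimal sketch of this cloze reformulation and of Eq. (3); the model_log_prob callable, which stands in for the pretrained model's log-probability of a verbalizer token filling the blank, is an assumed interface rather than the released GLM API.

```python
import math

# Verbalizer for sentiment classification, as in the paper's example.
VERBALIZER = {"positive": "good", "negative": "bad"}

def make_cloze(sentence: str) -> str:
    # Convert the input text x into a cloze question c(x) with a single mask token.
    return f"{sentence} It's really [MASK]"

def classify(sentence: str, model_log_prob) -> dict:
    """Return p(y|x) for each label, normalized over verbalizer tokens (Eq. 3)."""
    cloze = make_cloze(sentence)
    scores = {y: math.exp(model_log_prob(cloze, v)) for y, v in VERBALIZER.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Toy stand-in for a pretrained model, for demonstration only.
def toy_log_prob(cloze: str, token: str) -> float:
    return 0.0 if token == "good" else -2.0

print(classify("Coronet has the best lines of all day cruisers.", toy_log_prob))
```

Finetuning then minimizes the cross-entropy of the gold label under these normalized probabilities.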
For text generation tasks, the given context constitutes Part A of the input, with a mask token
appended at the end. The model generates the text
of Part B autoregressively. We can directly apply
the pretrained GLM for unconditional generation,
or finetune it on downstream conditional generation
tasks.
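A minimal sketch of this decoding setup follows; the next_token callable stands in for one autoregressive step of the pretrained model, and the [EOS] stop token is an assumption, not necessarily GLM's actual special token.

```python
def generate(context_tokens, next_token, max_len=32, stop_token="[EOS]"):
    """Part A is the context plus a trailing mask; Part B is decoded token by token."""
    part_a = list(context_tokens) + ["[MASK]"]
    part_b = []
    for _ in range(max_len):
        tok = next_token(part_a, part_b)  # one autoregressive step of the model
        if tok == stop_token:
            break
        part_b.append(tok)
    return part_b

# Toy stand-in that emits a canned continuation, for demonstration only.
canned = iter(["a", "fine", "day", "[EOS]"])
print(generate(["today", "is"], lambda part_a, part_b: next(canned)))
```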
2.4 Discussion and Analysis
In this section, we discuss the differences between
GLM and other pretraining models. We are mainly
concerned with how they can be adapted to down-
stream blank infilling tasks.
Comparison with BERT (Devlin et al., 2019).
As pointed out by Yang et al. (2019), BERT fails
to capture the interdependencies of masked tokens
due to the independence assumption of MLM. An-
other disadvantage of BERT is that it cannot fill in
the blanks of multiple tokens properly. To infer the
probability of an answer of length $l$, BERT needs to perform $l$ consecutive predictions. If the length $l$ is unknown, we may need to enumerate all possible lengths, since BERT needs to change the number of [MASK] tokens according to the length.
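The following sketch (with whitespace tokenization and a small maximum length as illustrative assumptions) makes the contrast concrete: BERT-style infilling needs one corrupted input per candidate length, while GLM needs a single input.

```python
def bert_style_inputs(prefix, suffix, max_len):
    """One corrupted input per candidate answer length l, each with l [MASK] tokens."""
    return [prefix + ["[MASK]"] * l + suffix for l in range(1, max_len + 1)]

def glm_style_input(prefix, suffix):
    """A single input; the model itself decides how many tokens to generate."""
    return prefix + ["[MASK]"] + suffix

prefix, suffix = ["the", "food", "was"], ["."]
for candidate in bert_style_inputs(prefix, suffix, 3):
    print(candidate)
print(glm_style_input(prefix, suffix))
```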
Comparison with XLNet (Yang et al., 2019).
Both GLM and XLNet are pretrained with autore-
gressive objectives, but there are two differences
between them. First, XLNet uses the original posi-
tion encodings before corruption. During inference,
we need to either know or enumerate the length of
the answer, the same problem as BERT. Second,
XLNet uses a two-stream self-attention mechanism,
instead of the right-shift, to avoid the information
leak within the Transformer. This mechanism doubles the time cost of pretraining.
Comparison with T5 (Raffel et al., 2020).
T5
proposes a similar blank infilling objective to pre-
train an encoder-decoder Transformer. T5 uses
independent positional encodings for the encoder
and decoder, and relies on multiple sentinel tokens
to differentiate the masked spans. In downstream
tasks, only one of the sentinel tokens is used, lead-
ing to a waste of model capacity and inconsistency
between pretraining and finetuning. Moreover, T5
always predicts spans in a fixed left-to-right order.
As a result, GLM can significantly outperform T5
on NLU and seq2seq tasks with fewer parameters
and data, as stated in Sections 3.2 and 3.3.
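To illustrate the input-format difference, here is a minimal sketch (not taken from either codebase); the sentinel names <extra_id_n> follow the common T5 vocabulary, and the token list and spans are illustrative.

```python
def t5_corrupt(tokens, spans):
    """Replace each masked span with a distinct sentinel token, T5-style."""
    out, prev = [], 0
    for i, (start, end) in enumerate(spans):
        out += tokens[prev:start] + [f"<extra_id_{i}>"]
        prev = end
    return out + tokens[prev:]

def glm_corrupt(tokens, spans):
    """Replace every masked span with the same single [MASK] token, GLM-style."""
    out, prev = [], 0
    for start, end in spans:
        out += tokens[prev:start] + ["[MASK]"]
        prev = end
    return out + tokens[prev:]

tokens = ["a", "b", "c", "d", "e", "f"]
print(t5_corrupt(tokens, [(1, 2), (4, 5)]))   # ['a', '<extra_id_0>', 'c', 'd', '<extra_id_1>', 'f']
print(glm_corrupt(tokens, [(1, 2), (4, 5)]))  # ['a', '[MASK]', 'c', 'd', '[MASK]', 'f']
```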
Comparison with UniLM (Dong et al., 2019).
UniLM combines different pretraining objectives
under the autoencoding framework by changing the