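GPT-3: OpenAI Ushers in a New Era of Few-Shot Learning

OpenAI has released GPT-3, a language model that acts as a few-shot learner. With 175 billion parameters, it was among the most capable language models at the time of its release. The paper, written by more than 30 authors and running to 72 pages, describes how large-scale pretraining alone, with no task-specific fine-tuning, lets the model perform new language tasks from a few examples or a simple instruction rather than from a large task-specific dataset.

In the paper "Language Models are Few-Shot Learners", the OpenAI team shows how much potential few-shot learning has in natural language processing (NLP). The conventional approach to NLP tasks is to pretrain on a large text corpus and then fine-tune on each specific task, a process that requires thousands or even tens of thousands of examples. GPT-3 demonstrates an alternative: without large amounts of task-specific data, it adapts to and performs new language tasks from only a handful of examples, which is closer to the way humans learn.

Few-shot learning is the ability of a machine learning model to understand and perform a new task quickly when given only a few examples. In GPT-3's case, this ability markedly improves generalization and lets the model handle a variety of tasks it has not seen before. This matters because it reduces the dependence on large labeled datasets and makes the model more flexible and practical.

GPT-3 is unprecedented in scale: its 175 billion parameters far exceed the 1.5 billion of its predecessor, GPT-2. The larger model can capture more complex linguistic structures and patterns, which is essential for understanding and generating natural language. The parameter count also means much higher compute requirements, but this does not weaken its performance in the few-shot setting.

In the experiments, GPT-3 performs well on many NLP tasks, including text generation, question answering, translation, and code writing. The results indicate that task generalization improves markedly as model size grows. The paper also discusses the model's limitations, such as difficulty with complex reasoning and long-range dependencies, and the possibility of inaccurate or harmful output in some cases.

GPT-3 shows how effective few-shot learning can be in language models and suggests that future NLP work will focus more on generalization and adaptability than on optimizing for individual tasks. This helps reduce the cost of collecting and labeling data and lays groundwork for more capable and autonomous AI systems. Larger models, however, also bring training and deployment challenges that call for more efficient and more economical solutions.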

Setting          LAMBADA (acc)  LAMBADA (ppl)  StoryCloze (acc)  HellaSwag (acc)
SOTA             68.0 [Tur20]   8.63 [RWC]     91.8 [LDL19]      85.6 [LCH+20]
GPT-3 Zero-Shot  76.2           3.00           83.2              78.9
GPT-3 One-Shot   72.5           3.35           84.7              78.1
GPT-3 Few-Shot   86.4           1.92           87.7              79.3

Table 3.2: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while achieving respectable performance on two difficult completion prediction datasets.

Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3 2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.

and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters [RWC] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format:

Alice was friends with Bob. Alice went to visit her friend ____. → Bob

George bought some baseball equipment, a ball, a glove, and a ____. →

When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern.
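The fill-in-the-blank framing described above can be sketched as a small prompt builder. This is a minimal illustration, not the authors' code: the function names, the `____` blank marker, and the newline separator between examples are assumptions; only the "context → answer" layout and the two example sentences come from the text.

```python
from typing import List, Tuple

ARROW = "→"  # separator shown in the text between the cloze context and its answer


def format_cloze_example(context: str, answer: str = "") -> str:
    """Render one line as '<context with blank> → <answer>'.

    An empty answer leaves the line open so the model supplies the word.
    (Hypothetical helper; the name is not from the paper.)
    """
    line = f"{context} {ARROW}"
    return f"{line} {answer}" if answer else line


def build_cloze_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Join K solved demonstrations and one unsolved query into a single prompt."""
    lines = [format_cloze_example(ctx, ans) for ctx, ans in demos]
    lines.append(format_cloze_example(query))
    # Assumption: examples are separated by newlines.
    return "\n".join(lines)


demos = [("Alice was friends with Bob. Alice went to visit her friend ____.", "Bob")]
query = "George bought some baseball equipment, a ball, a glove, and a ____."
prompt = build_cloze_prompt(demos, query)
print(prompt)
```

With several demonstrations in `demos`, every solved line ends in a single word, which is the cue the text says the model uses to infer that a one-word completion is wanted.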

Setting                                       NaturalQS  WebQS  TriviaQA
RAG (Fine-tuned, Open-Domain) [LPP+20]        44.5       45.5   68.0
T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20]  36.6       44.7   60.5
T5-11B (Fine-tuned, Closed-Book)              34.5       37.4   50.1
GPT-3 Zero-Shot                               14.6       14.4   64.3
GPT-3 One-Shot                                23.0       25.3   68.0
GPT-3 Few-Shot                                29.9       41.5   71.2

Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server.

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data; however, analysis performed in Section 4 suggests negligible impact on performance.

3.1.3 HellaSwag

The HellaSwag dataset [ZHB] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR] but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.

3.1.4 StoryCloze

We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH], which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but improves over previous zero-shot results by roughly 10%.
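The margins quoted for LAMBADA, HellaSwag, and StoryCloze can be recomputed from the accuracy columns of Table 3.2. The sketch below is only an illustrative check; the dictionary layout and the helper name `margin` are assumptions. It also makes explicit that the quoted "%" differences are absolute percentage points, not relative changes.

```python
# Accuracy values (%) copied from Table 3.2; the perplexity column is omitted.
SOTA = {"LAMBADA": 68.0, "StoryCloze": 91.8, "HellaSwag": 85.6}
GPT3 = {
    "zero-shot": {"LAMBADA": 76.2, "StoryCloze": 83.2, "HellaSwag": 78.9},
    "one-shot": {"LAMBADA": 72.5, "StoryCloze": 84.7, "HellaSwag": 78.1},
    "few-shot": {"LAMBADA": 86.4, "StoryCloze": 87.7, "HellaSwag": 79.3},
}


def margin(setting: str, task: str) -> float:
    """GPT-3 accuracy minus prior SOTA, in percentage points (hypothetical helper)."""
    return round(GPT3[setting][task] - SOTA[task], 1)


for task in SOTA:
    for setting in GPT3:
        print(f"{task:10s} {setting:9s} {margin(setting, task):+.1f}")

# Effect of the few-shot cloze framing on the full model, relative to zero-shot.
few_vs_zero_lambada = round(GPT3["few-shot"]["LAMBADA"] - GPT3["zero-shot"]["LAMBADA"], 1)
print("LAMBADA few-shot minus zero-shot:", few_vs_zero_lambada)
```

Here `margin("few-shot", "LAMBADA")` is 18.4 (the "over 18%" gain), `margin("zero-shot", "LAMBADA")` is 8.2 (the "gain of 8%"), and `margin("few-shot", "StoryCloze")` is -4.1, matching the figures in the text.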

3.2 Closed Book Question Answering

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense number of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer, it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better, and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20]. GPT-3’s few-shot result further improves performance another 3.2% beyond this.

On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQS shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.

Figure 3.3: On TriviaQA GPT-3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG [LPP+20].

On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5-11B+SSM. Similar to WebQS, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQS. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.

Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.
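The comparisons in this section all follow from Table 3.3. As a quick illustrative check (not part of the paper; the row labels and the helper `diff` are assumptions made for this sketch), the TriviaQA margins and the zero-shot to few-shot gains can be recomputed as follows. All differences are absolute percentage points.

```python
# Accuracy values (%) copied from Table 3.3.
TABLE = {
    "RAG": {"NaturalQS": 44.5, "WebQS": 45.5, "TriviaQA": 68.0},
    "T5-11B+SSM": {"NaturalQS": 36.6, "WebQS": 44.7, "TriviaQA": 60.5},
    "T5-11B": {"NaturalQS": 34.5, "WebQS": 37.4, "TriviaQA": 50.1},
    "GPT-3 zero-shot": {"NaturalQS": 14.6, "WebQS": 14.4, "TriviaQA": 64.3},
    "GPT-3 one-shot": {"NaturalQS": 23.0, "WebQS": 25.3, "TriviaQA": 68.0},
    "GPT-3 few-shot": {"NaturalQS": 29.9, "WebQS": 41.5, "TriviaQA": 71.2},
}


def diff(row_a: str, row_b: str, dataset: str) -> float:
    """Accuracy of row_a minus row_b on one dataset (hypothetical helper)."""
    return round(TABLE[row_a][dataset] - TABLE[row_b][dataset], 1)


# TriviaQA chain quoted in the text.
print("zero-shot vs T5-11B:      ", diff("GPT-3 zero-shot", "T5-11B", "TriviaQA"))
print("zero-shot vs T5-11B+SSM:  ", diff("GPT-3 zero-shot", "T5-11B+SSM", "TriviaQA"))
print("one-shot vs zero-shot:    ", diff("GPT-3 one-shot", "GPT-3 zero-shot", "TriviaQA"))
print("few-shot vs one-shot:     ", diff("GPT-3 few-shot", "GPT-3 one-shot", "TriviaQA"))

# Zero-shot to few-shot gain per dataset.
gains = {d: diff("GPT-3 few-shot", "GPT-3 zero-shot", d) for d in ("NaturalQS", "WebQS", "TriviaQA")}
print(gains)
```

The gains come out to 6.9 points on TriviaQA, 15.3 on NaturalQS, and 27.1 on WebQS, which is the "much larger gain" on WebQS that the text attributes to a possible distribution shift.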

3.3 Translation

In collecting training data for GPT-3, we made no effort to select either for or against foreign languages, and instead simply accepted the natural distribution of languages reflected in internet text datasets (primarily Common Crawl). As a result, although GPT-3’s training data primarily consists of English (93% by word count), it also includes 7% foreign language content. GPT-2 [RWC] previously found that a dataset with a very small fraction of foreign language content could lead to promising initial results on translation in the few-shot setting, but was not competitive with existing unsupervised machine translation approaches. Here we test the extent to which GPT-3 can improve these abilities via greater capacity and a larger fraction of foreign languages in its training data.

Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.
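To make the zero-, one-, and few-shot translation settings concrete, the sketch below assembles an in-context prompt from a task description and K paired examples. It is a hedged illustration only: the "Translate X to Y:" header, the `=>` separator, the function name, and the sample word pairs are all assumptions for this sketch and are not specified in this section; only the values of K (0, 1, or up to 64) follow the text.

```python
from typing import List, Tuple


def build_translation_prompt(
    pairs: List[Tuple[str, str]],
    source_text: str,
    src_lang: str = "English",
    tgt_lang: str = "French",
    k: int = 64,
) -> str:
    """Build a prompt with a task description, up to k paired demonstrations, and the query.

    k=0 gives the zero-shot setting (description only), k=1 one-shot,
    and larger k the few-shot setting. No model weights are updated;
    the paired examples exist only in the prompt text.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    lines = [f"Translate {src_lang} to {tgt_lang}:"]       # assumed task description
    lines += [f"{src} => {tgt}" for src, tgt in pairs[:k]]  # assumed separator
    lines.append(f"{source_text} =>")                       # left open for the model
    return "\n".join(lines)


# Illustrative pairs, not data from the paper's evaluation sets.
pairs = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_translation_prompt(pairs, "cheese", k=0)
one_shot = build_translation_prompt(pairs, "cheese", k=1)
few_shot = build_translation_prompt(pairs, "cheese", k=64)
print(few_shot)
```

Because `pairs[:k]` simply truncates, asking for 64 demonstrations when fewer are available uses all of them, so the same function covers every setting discussed above.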

Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior
