Setting                                        NaturalQS   WebQS   TriviaQA
RAG (Fine-tuned, Open-Domain) [LPP+20]              44.5    45.5       68.0
T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20]        36.6    44.7       60.5
T5-11B (Fine-tuned, Closed-Book)                    34.5    37.4       50.1
GPT-3 Zero-Shot                                     14.6    14.4       64.3
GPT-3 One-Shot                                      23.0    25.3       68.0
GPT-3 Few-Shot                                      29.9    41.5       71.2
Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as
compared to prior SOTA results for the closed-book and open-domain settings. The TriviaQA few-shot result is evaluated
on the wiki split test server.
One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA
dataset appears to be present in our training data; however, the analysis performed in Section 4 suggests negligible impact
on performance.
3.1.3 HellaSwag
The HellaSwag dataset [ZHB+19] involves picking the best ending to a story or set of instructions. The examples were
adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy).
GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the
75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR+19] but still a fair amount lower than the overall
SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.
3.1.4 StoryCloze
We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH+16], which involves selecting the correct ending
sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot
setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but
improves over previous zero-shot results by roughly 10%.
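For completion-choice tasks such as HellaSwag and StoryCloze, a language model can be evaluated without any fine-tuning by comparing the likelihood it assigns to each candidate ending. The following is a minimal, illustrative sketch of that scoring scheme; it uses the publicly available GPT-2 weights from the transformers library as a stand-in for GPT-3, and the length normalization is one common design choice rather than the paper's exact evaluation code.

```python
# Illustrative sketch: choose a story ending by comparing language-model
# likelihoods of each candidate completion. GPT-2 stands in for GPT-3 here,
# since GPT-3's weights are not publicly available.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def completion_logprob(context: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The logits at position i predict the token at position i + 1,
    # so each completion token is scored from the preceding position.
    for pos in range(ctx_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def pick_ending(context: str, endings: list[str]) -> int:
    """Return the index of the ending with the highest per-token log-likelihood."""
    scores = []
    for ending in endings:
        n_tokens = len(tokenizer(" " + ending).input_ids)
        scores.append(completion_logprob(context, " " + ending) / max(n_tokens, 1))
    return max(range(len(endings)), key=scores.__getitem__)
```

In the one- and few-shot settings, K completed examples would simply be prepended to the context before the candidate endings are scored.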
3.2 Closed Book Question Answering
In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense
number of possible queries, this task has normally been approached by using an information retrieval system to find
relevant text in combination with a model which learns to generate an answer given the question and the retrieved
text. Since this setting allows a system to search for and condition on text which potentially contains the answer, it
is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well
directly answering the questions without conditioning on auxiliary information. They denote this more restrictive
evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better,
and we test this hypothesis with GPT-3. We evaluate GPT-3 on the three datasets in [RRS20]: Natural Questions [KPR+19],
WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in
the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than
previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself
is also not permitted.
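To make this evaluation setting concrete, here is a minimal sketch of how a few-shot, closed-book prompt can be assembled: K question-answer demonstrations from the training split are concatenated in front of the test question, no retrieved passages are included, and the model's completion is scored against the reference answers. The prompt template, the default value of k, and the exact_match normalization below are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of few-shot, closed-book QA evaluation (illustrative only).
import random

def build_prompt(demonstrations: list[tuple[str, str]],
                 test_question: str, k: int = 64) -> str:
    """Concatenate k Q/A demonstrations ahead of the test question.

    No retrieved passages are included (closed-book), and the model is never
    fine-tuned on the QA data; all task information is provided in-context.
    The zero-shot setting corresponds to k = 0 and one-shot to k = 1.
    """
    shots = random.sample(demonstrations, k) if k > 0 else []
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

def exact_match(prediction: str, references: list[str]) -> bool:
    """Simple normalized exact-match scoring against the reference answers."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return any(norm(prediction) == norm(ref) for ref in references)
```

The text the model generates after the final "A:" is then compared against each gold answer with a metric such as exact match.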
The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the
one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by
14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot
result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also
makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20].
GPT-3’s few-shot result improves performance by a further 3.2% beyond this.
On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5%
in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B and 44.7% for fine-tuned T5-11B+SSM,
which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of
state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to
few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions