[Figure 2 (bar charts): WebGPT preferred (%), on a 0–70 scale, for overall usefulness, coherence, and factual accuracy, shown for the 760M best-of-4, 13B best-of-16, and 175B best-of-64 models. (a) WebGPT vs. human demonstrations. (b) WebGPT vs. ELI5 reference answers.]
Figure 2: Human evaluations on ELI5 comparing against (a) demonstrations collected using our web browser, (b) the highest-voted answer for each question. The amount of rejection sampling (the $n$ in best-of-$n$) was chosen to be compute-efficient (see Figure 8). Error bars represent $\pm 1$ standard error.
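To make the best-of-$n$ procedure concrete, the following is a minimal Python sketch of rejection sampling against a reward model. The policy and reward_model callables are hypothetical stand-ins for the trained answer-generating model and reward model; this is an illustration, not code from the paper.

    def best_of_n(question, policy, reward_model, n=16):
        # Rejection sampling (best-of-n): draw n candidate answers from
        # the policy, score each with the reward model, and return the
        # highest-scoring candidate. policy and reward_model are
        # hypothetical callables, not APIs from the paper.
        candidates = [policy(question) for _ in range(n)]
        scores = [reward_model(question, answer) for answer in candidates]
        best_score, best_answer = max(zip(scores, candidates), key=lambda pair: pair[0])
        return best_answer

Larger $n$ trades additional inference-time compute for answer quality, which is why the caption notes that $n$ was chosen to be compute-efficient.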
Although the evaluations against the ELI5 reference answers are useful for comparing to prior work,
we believe that the evaluations against human demonstrations are more meaningful, for several
reasons:
• Fact-checking. It is difficult to assess the factual accuracy of answers without references: even with the help of a search engine, expertise is often required. However, WebGPT and human demonstrators provide answers with references.
• Objectivity. The use of minimal instructions makes it harder to know what criteria are being used to choose one answer over another. Our more detailed instructions enable more interpretable and consistent comparisons.
• Blinding. Even with citations and references stripped, WebGPT composes answers that are different in style from Reddit answers, making the comparisons less blinded. In contrast, WebGPT and human demonstrators compose answers in similar styles. Additionally, some ELI5 answers contained links, which we instructed labelers not to follow, and this could have biased labelers against those answers.
• Answer intent. People ask questions on ELI5 to obtain original, simplified explanations rather than answers that can already be found on the web, but these were not criteria we wanted answers to be judged on. Moreover, many ELI5 questions only ever receive a small number of low-effort answers. With human demonstrations, it is easier to ensure that the desired intent and level of effort are applied consistently.
4.2 TruthfulQA
To further probe the abilities of WebGPT, we evaluated it on TruthfulQA [Lin et al., 2021], an adversarially constructed dataset of short-form questions. TruthfulQA questions are crafted such that some humans would answer them falsely due to a false belief or misconception. Answers are
scored on both truthfulness and informativeness, which trade off against one another (for example, “I
have no comment” is considered truthful but not informative).
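As a concrete illustration of how these two scores combine, here is a small Python sketch. It assumes each answer has been given boolean truthfulness and informativeness labels; this label format and the helper below are assumptions made for illustration, not the paper's evaluation pipeline.

    def truthfulqa_summary(labels):
        # labels: hypothetical list of (truthful, informative) boolean
        # pairs, one per answer. A refusal such as "I have no comment"
        # would be labeled (True, False): truthful but uninformative.
        n = len(labels)
        pct_truthful = 100.0 * sum(t for t, _ in labels) / n
        pct_truthful_and_informative = 100.0 * sum(t and i for t, i in labels) / n
        return pct_truthful, pct_truthful_and_informative

Reporting both percentages makes the trade-off explicit: a model can raise its truthfulness score by refusing to answer, but only at the cost of the combined truthful-and-informative score.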
We evaluated both the base GPT-3 models used by WebGPT and the WebGPT models themselves
on TruthfulQA. For GPT-3, we used both the “QA prompt” and the “helpful prompt” from Lin
et al. [2021], and used the automated metric, since this closely tracks human evaluation on answers
produced by the GPT-3 model family. For WebGPT, we used human evaluation, since WebGPT’s
answers are out-of-distribution for the automated metric. TruthfulQA is a short-form dataset, so