
Figure 3: Allowing the policy π to move further from the initial policy ρ, as measured by KL(π, ρ), achieves higher reward at the cost of less natural samples. Here we show the optimal KL vs. reward for 124M-parameter mock sentiment (as estimated by sampling), together with results using PPO. Runs used 2M episodes, except for the top series.
We release code² for reward modeling and fine-tuning in the offline data case. Our public version of the code only works with a smaller 124M-parameter model with 12 layers, 12 heads, and embedding size 768. We include fine-tuned versions of this smaller model, as well as some of the human labels we collected for our main experiments (note that these labels were collected from runs using the larger model).
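For reference, this configuration matches GPT-2 small. The minimal sketch below instantiates it with the Hugging Face transformers library (an illustrative stand-in, not the interface of the released code) and checks the parameter count.

# Sketch (assumption: Hugging Face transformers as a stand-in for the released code):
# the 124M configuration is GPT-2 small.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=12, n_head=12, n_embd=768)  # 12 layers, 12 heads, embedding size 768
model = GPT2LMHeadModel(config)                         # randomly initialized
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")              # ~124M with tied embeddings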
3.1. Stylistic continuation tasks
We first apply our method to stylistic text continuation tasks, where the policy is presented with an excerpt from the BookCorpus dataset (Zhu et al., 2015) and generates a continuation of the text. The reward function evaluates the style of the concatenated text, either automatically or based on human judgments. We sample excerpts with lengths of 32 to 64 tokens, and the policy generates 24 additional tokens. We set the temperature of the pretrained model to T = 0.7 as described in section 2.1.
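As a concrete illustration of this sampling setup, the following minimal sketch draws a 24-token continuation at T = 0.7 using the Hugging Face transformers GPT-2 (124M) as a stand-in for the pretrained policy; the excerpt string and library choice are illustrative and not taken from the released code.

# Sketch: sample a 24-token continuation at temperature T = 0.7 from a
# pretrained 124M GPT-2 standing in for the initial policy rho.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Stand-in for a 32-64 token BookCorpus excerpt.
excerpt = "She had been walking for hours when the lights of the town finally appeared,"
inputs = tokenizer(excerpt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,          # ancestral sampling
        temperature=0.7,         # T = 0.7
        max_new_tokens=24,       # the policy generates 24 additional tokens
        top_k=0,                 # no truncation; pure temperature sampling
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))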
3.1.1. MOCK SENTIMENT TASK
To study our method in a controlled setting, we first apply it to optimize a known reward function r_s designed to reflect some of the complexity of human judgments. We construct r_s by training a classifier³ on a binarized, balanced subsample of the Amazon review dataset of McAuley et al. (2015). The classifier predicts whether a review is positive or negative, and we define r_s(x, y) as the classifier's log odds that a review is positive (the input to the final sigmoid layer).

Optimizing r_s without constraints would lead the policy to produce incoherent continuations, but as described in section 2.2 we include a KL constraint that forces it to stay close to a language model ρ trained on BookCorpus.
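To make these two ingredients concrete, the sketch below pairs a placeholder classifier, whose pre-sigmoid logit plays the role of r_s, with a per-sample KL-penalized reward of the form r_s(x, y) − β log(π(y|x)/ρ(y|x)). The classifier here is randomly initialized, and the helper names and β value are illustrative rather than taken from the released code.

import torch
import torch.nn as nn

# Placeholder classifier: mean-pooled embeddings with a single linear head.
# Only the interface matters here; the real r_s comes from a trained Transformer.
vocab_size, d_model = 50257, 512
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, 1)       # its output would feed a sigmoid during training

def r_s(tokens):
    """Mock sentiment reward: the classifier's pre-sigmoid logit (log odds of 'positive')."""
    h = embed(tokens).mean(dim=0)  # crude pooling over the concatenated text
    return head(h).squeeze()       # note: no sigmoid applied

def penalized_reward(reward, logp_pi, logp_rho, beta=0.1):
    """Per-sample KL-penalized reward: r_s(x, y) - beta * (log pi(y|x) - log rho(y|x))."""
    return reward - beta * (logp_pi - logp_rho)

x = torch.randint(vocab_size, (48,))   # excerpt of 32-64 tokens
y = torch.randint(vocab_size, (24,))   # 24-token continuation
print(float(r_s(torch.cat([x, y]))))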
The goal of our method is to optimize a reward function using only a small number of queries to a human. In this mock sentiment experiment, we simulate human judgments by assuming that the "human" always selects the continuation with the higher reward according to r_s, and ask how many queries we need to optimize r_s.
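A minimal sketch of this simulated labeler follows; the function name and list-of-candidates interface are illustrative, and reward_fn stands in for r_s.

import torch

def synthetic_preference(reward_fn, x, candidates):
    """Simulated 'human': prefer the candidate continuation with the highest reward.

    reward_fn plays the role of r_s; x is the excerpt, candidates a list of
    sampled continuations (token tensors). Returns the index of the winner.
    """
    rewards = [float(reward_fn(torch.cat([x, y]))) for y in candidates]
    return max(range(len(rewards)), key=rewards.__getitem__)

# Each call stands in for one human query; the reward model is then trained on
# the resulting (query, preferred continuation) pairs just as it would be on real labels.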
Figure 2 shows how r_s evolves during training, using either direct RL access to r_s or a limited number of queries to train a reward model. 20k to 60k queries allow us to optimize r_s nearly as well as using RL to directly optimize r_s.
Because we know the reward function, we can also analytically compute the optimal policy and compare it to our learned policies. With a constraint on the KL divergence KL(π, ρ), this optimal policy can be written in closed form; a sketch of the calculation follows.
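One way to see the closed form (a sketch under the KL-penalized objective of section 2.2): maximizing the expected reward minus β times the KL to ρ, independently for each context x, gives

\[
\max_{\pi}\;\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[r_s(x, y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\rho(\cdot \mid x)\right)
\;\;\Longrightarrow\;\;
\pi_{\mathrm{opt}}(y \mid x) \;=\; \frac{1}{Z(x)}\,\rho(y \mid x)\,\exp\!\left(\frac{r_s(x, y)}{\beta}\right),
\]
with \(Z(x) = \sum_{y}\rho(y \mid x)\exp\!\left(r_s(x, y)/\beta\right)\). The optimal policy thus reweights ρ by the exponentiated reward, and β controls how far it moves from ρ, tracing out a KL vs. reward trade-off of the kind shown in Figure 3.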
²Code at https://github.com/openai/lm-human-preferences.
³The model is a Transformer with 6 layers, 8 attention heads, and embedding size 512.