Improving Extractive Question Answering with Reinforcement Learning from User Feedback


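Simulating Bandit Learning from User Feedback for Extractive Question Answering

Ge Gao (1), Eunsol Choi (2), and Yoav Artzi (1)
(1) Department of Computer Science and Cornell Tech, Cornell University
(2) Department of Computer Science, The University of Texas at Austin
ggao@cs.cornell.edu eunsol@utexas.edu yoav@cs.cornell.edu

arXiv:2203.10079v1 [cs.CL] 18 Mar 2022

Abstract

We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning and analyze the characteristics of several learning scenarios, with a focus on reducing data annotation. We show that systems initially trained on a small number of examples can improve dramatically given user feedback on model-predicted answers, and that existing datasets can be used to deploy systems in new domains without any annotation, improving the system on the fly through user feedback instead.

1 Introduction

Explicit feedback from users of NLP systems can be used to continually improve system performance. For example, a user posing a question to a question answering (QA) system can mark whether a predicted phrase is a valid answer given the context from which it was extracted. However, the dominant paradigm in NLP separates model training from deployment, so models stay static after learning and throughout their interaction with users. This approach misses opportunities for learning during system usage, which, apart from a few exceptions we discuss in Section 8, is understudied in NLP. In this paper, we study the potential of learning from explicit user feedback for extractive QA through simulation studies.

Extractive QA is a popular testbed for language reasoning, with rich prior work on datasets (e.g., Rajpurkar et al., 2016), task design (Yang et al., 2018; Choi et al., 2018), and model architecture development (Seo et al., 2017; Yu et al., 2018). Learning from interaction with users remains relatively understudied, even though QA is well suited to eliciting user feedback. Extracted answers can be clearly shown within their supporting context, and a language-proficient user can easily verify whether the answer is supported or not.[1] This allows for simple binary feedback and creates a contextual bandit learning scenario (Auer et al., 2002; Langford and Zhang, 2007). Figure 1 illustrates this learning signal and its potential.

Figure 1: The interaction setup for learning QA from user feedback, and its potential. Given a user question, the system outputs an answer and highlights it in its context. The user validates the answer against the context and gives binary feedback. We show the performance progression of our online learning experiment on SQuAD, with hand-crafted examples at two time steps.

We simulate user feedback using several widely used QA datasets and use it as a bandit signal for learning. We study the empirical characteristics of the learning process, including its performance, its sensitivity to initial system performance, and the trade-offs between online and offline learning. We also simulate zero-annotation domain adaptation, where we deploy a QA system trained with supervised data collected in one domain and adapt it to a new domain solely through user feedback, without any annotation.

[1] The answer could also come from erroneous or deceitful context. This important problem is not studied in most work on extractive QA, including ours. We leave it for future work.

This learning scenario can mitigate fundamental problems in extractive QA. It reduces data collection costs by delegating much of the learning to interaction with users. It can avoid data collection artifacts, because the data comes from actual system deployment rather than from an annotation effort, which often involves design decisions unrelated to the system's use case. One example is sharing the question- and answer-annotator roles (Rajpurkar et al., 2016), which is detrimental to emulating information-seeking behavior (Choi et al., 2018). Finally, it gives systems the potential to evolve as the world changes (Lazaridou et al., 2021; Zhang and Choi, 2021).

Our simulation experiments show that user feedback is an effective signal for continually improving QA systems across multiple benchmarks. For example, a system initially trained with a small amount of SQuAD (Rajpurkar et al., 2016) annotations (64 examples) improves from 18 to 81.6 F1, and adapting a SearchQA (Dunn et al., 2017) system to SQuAD through user feedback improves it from 45 to 84 F1. Our study shows the impact of initial system performance, the trade-offs between online and offline learning, and the impact of the source domain on adaptation. These results lay the groundwork for future work that goes beyond simulation and uses feedback from human users to improve extractive QA systems. Our code is publicly available at https://github.com/lil-lab/bandit-qa.

2 Learning and Interaction Scenario

We study a scenario where a QA model learns from explicit user feedback. We formulate learning as a contextual bandit problem. The input to the learner is a question-context pair, where the context paragraph contains the answer to the question. The output is a single span in the context paragraph that answers the question. Given a question-context pair, the model predicts an answer span. The user then provides feedback on the predicted answer, which is used to update the model parameters. We intentionally experiment with simple binary feedback and basic learning algorithms, to provide a baseline with as few assumptions as possible.

Background: Contextual Bandit Learning. In a stochastic (i.i.d.) contextual bandit learning problem, at each time step $t$, the learner independently observes a context[2] sampled from the data distribution $\mathcal{D}$,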
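$x^{(t)} \sim \mathcal{D}$, chooses an action $y^{(t)}$ according to a policy $\pi$, and observes a reward $r^{(t)} \in \mathbb{R}$. The learner only observes the reward $r^{(t)}$ corresponding to the chosen action $y^{(t)}$. The learner aims to minimize the cumulative regret. Intuitively, regret is the deficit the learner suffers relative to the optimal policy at a given time step. Formally, the cumulative regret at time $T$ is computed with respect to the optimal policy $\pi^* \in \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y, r) \sim (\mathcal{D}, \pi)}[r]$:

$$R_T := \sum_{t=1}^{T} r^{*(t)} - \sum_{t=1}^{T} r^{(t)} \tag{1}$$

where $\Pi$ is the set of all policies, $r^{(t)}$ is the reward observed at time $t$, and $r^{*(t)}$ is the reward observed by the optimal policy $\pi^*$. Minimizing the cumulative regret is equivalent to maximizing the total reward.[3] A key challenge in contextual bandit learning is balancing exploration and exploitation to minimize overall regret.

Scenario Formulation. Let a question $\bar{q}$ be a sequence of $m$ tokens $\langle q_1, \ldots, q_m \rangle$ and a context paragraph $\bar{c}$ be a sequence of $n$ tokens $\langle c_1, \ldots, c_n \rangle$. An extractive QA model[4] $\pi$ predicts an answer span $\hat{y} = \langle c_i, \ldots, c_j \rangle$ in the context $\bar{c}$, where $i, j \in [1, n]$ and $i \le j$. When relevant, we denote by $\pi_\theta$ a QA model parameterized by $\theta$. We formalize learning as a contextual bandit process: at each time step $t$, the model receives a question-context pair $(\bar{q}^{(t)}, \bar{c}^{(t)})$, predicts an answer span $\hat{y}$, and receives a reward $r^{(t)} \in \mathbb{R}$. The learner's goal is to maximize the total reward $\sum_{t=1}^{T} r^{(t)}$. This formulation reflects a setup where, given a question-context pair, the QA system interacts with a user, who validates the predicted answer in context and provides feedback that is mapped to a numerical reward.

[2] The term context here refers to the input to the learner's policy. It differs from the term context as used later for extractive QA, where it refers to the evidence document given as model input.
[3] Equivalently, the problem is often formulated as loss minimization (Bietti et al., 2018).
[4] The term policy is more common in the bandit literature. We use the term model from here on to align with the QA literature.

We study online and offline learning, also referred to as on- and off-policy. In online learning (Algorithm 1), the model identity is maintained between prediction and update: the parameter values that are updated are the same ones that were used to generate the output receiving the reward. This entails that a reward is only used once, to update the model after observing it. In offline learning (Algorithm 2), this relation between update and prediction does not hold. The learner observes reward, often across many examples, and may use it to update the model many times, even after the parameters have drifted arbitrarily far from those that generated the prediction. In practice, we observe reward for the entire length of the simulation ($T$ steps) and then update for $E$ epochs. The reward is re-weighted to provide an unbiased estimate using the inverse propensity score (IPS; Horvitz and Thompson, 1952).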
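The re-weighting can be illustrated with a minimal sketch, assuming a toy softmax policy over three candidate spans. The helper names (`softmax`, `clipped_ips_reward`) and the logit values are hypothetical and are not taken from the authors' code; the clipping range of [0, 1] follows line 10 of Algorithm 2.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_ips_reward(current_prob, logged_prob, reward, low=0.0, high=1.0):
    """Compute r' = clip(pi_theta(y_hat) / p, low, high) * r."""
    ratio = current_prob / logged_prob
    return min(max(ratio, low), high) * reward


# Logging time: the deployed model predicts the arg max span and stores
# the probability it assigned to that span (toy logits, assumed values).
logged_probs = softmax([2.0, 0.5, -1.0])
y_hat = max(range(len(logged_probs)), key=logged_probs.__getitem__)
p_logged = logged_probs[y_hat]

# Update time: the parameters have drifted, so the same span now has a
# different probability under the current model.
current_probs = softmax([1.0, 1.5, -1.0])
r_prime = clipped_ips_reward(current_probs[y_hat], p_logged, reward=1.0)
print(f"importance-weighted reward: {r_prime:.3f}")
```

In this sketch the weight shrinks the reward when the current model considers the logged prediction less likely than the logging model did, and the upper clip keeps an example from being amplified when the current model considers it more likely.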
We clip the debiasing coefficient to avoid amplifying examples with large coefficients (line 10, Algorithm 2). In general, offline learning is easier to implement because updating the model is not integrated with its deployment. Offline learning also uses a training loop that is similar to optimization practices in supervised learning. This allows iterating over the data multiple times, albeit with the same feedback signal on each example. However, online learning often has lower regret because the model is updated after each interaction. It may also lead to higher overall performance: as the model improves early on, it may observe more positive feedback overall, which is generally more informative. We empirically study these trade-offs in Sections 5 and 6.

[5] Early experiments showed that sampling is not as beneficial as arg max, potentially because of the relatively large output space of extractive QA. Yao et al. (2020) made a similar observation for semantic parsing, and Lawrence et al. (2017) used arg max predictions for bandit learning in statistical machine translation. Table 4 in Appendix A provides our experimental results with sampling.

Algorithm 2 Offline learning.
 1: for $t = 1 \cdots T$ do
 2:     Receive a question $\bar{q}^{(t)}$ and context $\bar{c}^{(t)}$
 3:     Predict an answer $\hat{y}^{(t)} \leftarrow \arg\max_y \pi_\theta(y \mid \bar{q}^{(t)}, \bar{c}^{(t)})$
 4:     $p^{(t)} \leftarrow \pi_\theta(\hat{y}^{(t)} \mid \bar{q}^{(t)}, \bar{c}^{(t)})$
 5:     Observe a reward $r^{(t)}$
 6: end for
 7: for $E$ epochs do
 8:     for $t = 1 \cdots T$ do
 9:         Compute the clipped importance-weighted reward according to the current model parameters:
10:         $r' \leftarrow \mathrm{clip}\left(\frac{\pi_\theta(\hat{y}^{(t)} \mid \bar{q}^{(t)}, \bar{c}^{(t)})}{p^{(t)}}, 0, 1\right) r^{(t)}$
11:         Update the model parameters $\theta$ using the gradient $r' \nabla_\theta \log \pi_\theta(\hat{y}^{(t)} \mid \bar{q}^{(t)}, \bar{c}^{(t)})$
12:     end for
13: end for

Evaluating Performance. We evaluate model performance using token-level F1 on a held-out test set, as commonly done in the QA literature (Rajpurkar et al., 2016). We also estimate the learner regret (Equation 1). Computing regret requires access to an oracle $\pi^*$.
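The regret estimate of Equation 1 can be sketched as follows. This is a minimal illustration under the assumption that an oracle matching the annotation would earn the positive simulation reward (1.0) at every step; the function names and the default oracle reward are hypothetical, not the authors' implementation.

```python
def cumulative_regret(rewards, oracle_rewards):
    """R_T = sum_t r*(t) - sum_t r(t), following Equation 1."""
    if len(rewards) != len(oracle_rewards):
        raise ValueError("both reward sequences must cover the same time steps")
    return sum(oracle_rewards) - sum(rewards)


def average_regret(rewards, oracle_reward=1.0):
    """Cumulative regret divided by the number of feedback observations.

    Assumes the oracle earns `oracle_reward` at every time step.
    """
    if not rewards:
        raise ValueError("at least one observed reward is required")
    oracle_rewards = [oracle_reward] * len(rewards)
    return cumulative_regret(rewards, oracle_rewards) / len(rewards)


# Four simulated interactions: two exact matches (+1.0), two misses (-0.1).
observed = [1.0, -0.1, -0.1, 1.0]
print(f"cumulative regret: {cumulative_regret(observed, [1.0] * 4):.2f}")
print(f"average regret:    {average_regret(observed):.2f}")
```

Under these assumptions a miss costs 1.1 per step, so an average regret above 1.0 means the learner was wrong on almost every interaction.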
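We use human annotation as an estimate (Section 3).[6]

Comparison to Supervised Learning. In supervised learning, the data distribution does not depend on the model, but on a fixed training set $\{(\bar{q}^{(t)}, \bar{c}^{(t)}, y^{(t)})\}_{t=1}^{T}$. In contrast, bandit learners are provided with reward data that depends on the model itself: $\{(\bar{q}^{(t)}, \bar{c}^{(t)}, \hat{y}^{(t)}, r^{(t)})\}_{t=1}^{T}$, where $r^{(t)}$ is the reward for the model prediction $\hat{y}^{(t)} = \arg\max_y \pi_\theta(y \mid \bar{q}^{(t)}, \bar{c}^{(t)})$ at time step $t$. Such feedback can be freely gathered from users interacting with the model, while building supervised datasets requires costly annotation. This learning signal can also reflect changing task properties (e.g., world changes) to allow systems to adapt, and its origin in the use of the deployed system makes it more robust to biases introduced during annotation.

Algorithm 1 Online learning.
1: for $t = 1 \cdots$ do
2:     Receive a question $\bar{q}^{(t)}$ and context $\bar{c}^{(t)}$
3:     Predict an answer $\hat{y}^{(t)} \leftarrow \arg\max_y \pi_\theta(y \mid \bar{q}^{(t)}, \bar{c}^{(t)})$
4:     Observe a reward $r^{(t)}$
5:     Update the model parameters $\theta$ using the gradient $r^{(t)} \nabla_\theta \log \pi_\theta(\hat{y}^{(t)} \mid \bar{q}^{(t)}, \bar{c}^{(t)})$
6: end for

Learning Algorithm. We learn using policy gradient. Our learner is similar to REINFORCE (Sutton and Barto, 1998; Williams, 2004), but we use arg max to predict answers instead of Monte Carlo sampling from the model's output distribution.[5]

3 Simulation Setup

We initialize the model with supervised data and then simulate bandit feedback using supervised data annotations. Initialization is critical so that the model does not return random answers, which are likely to all be wrong because of the large output space. For in-domain experiments (Sections 5 and 6), we use relatively little supervised data from the same domain, to focus on the potential of user feedback to reduce data annotation. For domain adaptation, we assume a large amount of training data in the source domain and no annotated data in the target domain (Section 7).

[6] Our oracle is an estimate because of annotation noise and ambiguity in exact span selection.

Reward. We simulate the reward using supervised data annotations. If the predicted answer span is an exact index-wise match to the annotated span, the learner observes a positive reward of 1.0; otherwise it observes a negative reward of -0.1.[7] This reward signal is stricter than QA evaluation metrics (token-level F1 or exact match after normalization).[8]

Noise Simulation. We simulate noisy feedback by perturbing the reward: the binary reward is flipped at random with a fixed probability (8% or 20%).[9]

4 Experimental Setup

Data. We use six English QA datasets that provide large amounts of annotated training data, taken from the training portion of MRQA (Fisch et al., 2019): SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and NaturalQuestions (NQ; Kwiatkowski et al., 2019). The MRQA benchmark simplifies all datasets so that each example has a single-span answer and an evidence document of limited length (truncated at 800 tokens). Table 7 in Appendix B provides dataset details. Following prior work (Rajpurkar et al., 2016; Ram et al., 2021), we compute performance measures and learning curves on the development sets.

Model. We conduct experiments with a pre-trained SpanBERT model (Joshi et al., 2020). We fine-tune the pre-trained SpanBERT-base model during both initial learning and our simulations.

Implementation Details. We use Hugging Face Transformers (Wolf et al., 2020). When training initial models with relatively little in-domain supervised data (Sections 5 and 6), we use a learning rate of 3e-5 with a linear schedule, batch size 10, and 10 epochs. We take the sets of 64, 256, or 1,024 examples from prior work (Ram et al., 2021).[10] For models initially trained on complete datasets (Section 7), we use a learning rate of 2e-5 with a linear schedule, batch size 40, and 4 epochs. In simulation experiments, we use batch size 40. We turn off dropout to simulate interaction with users in deployment. For single-pass online learning experiments (Sections 5 and 7), we use a constant learning rate of 1e-5. For offline learning experiments (Section 6), we use a learning rate of 3e-5 with a linear schedule and train the model on the collected feedback for 3 epochs. Each online experiment with SQuAD, HotpotQA, NQ, or NewsQA takes 2 to 4 hours on one NVIDIA GeForce RTX 2080 Ti; offline experiments take 2.5 to 6 hours. For TriviaQA and SearchQA, each online simulation experiment takes 4 to 9.5 hours on one NVIDIA TITAN RTX; offline experiments take 9 to 20 hours.

[7] We experimented with other reward values, but did not observe a significant difference in performance (Appendix A).
[8] Normalization includes lowercasing, modifying spacing, and removing articles and punctuation. NaturalQuestions (NQ; Kwiatkowski et al., 2019) is an exception, with an exact index match measure of similar strictness.
[9] Even without our noise simulation, the simulated feedback inherits the noise of the annotations, whether from crowdsourcing or distant supervision.
[10] We use the publicly available seed 46 sets at https://github.com/oriram/splinter.

5 Online Learning

We simulate a scenario where only a limited amount of supervised data is available and the model learns mainly from explicit user feedback on predicted answers. We use 64, 256, or 1,024 in-domain annotated examples to train initial models. This section focuses on online learning, where the learner updates the model parameters after each feedback observation (Algorithm 1).

Figure 2 presents the performance of in-domain simulation with online learning. Performance patterns vary across datasets. Bandit learning consistently improves performance on SQuAD, HotpotQA, and NQ, regardless of the amount of supervised data used to train the initial model. Performance gains are larger with weaker initial models (i.e., those trained on 64 supervised examples): 63.6 on SQuAD, 42.7 on HotpotQA, and 40.0 on NQ. Bandit learning is not always effective on NewsQA, TriviaQA, and SearchQA, especially with weaker initial models. This may be attributed to the quality of the training set annotations, which determines the accuracy of the reward in our setup. SearchQA and TriviaQA use distant supervision to match questions with relevant contexts, which may lower reward quality in our setup. NewsQA is crowdsourced, but Trischler et al. (2017) report relatively low human performance (69.4 F1), possibly indicating data challenges that also lower our reward quality.

Learning Progression. Figure 3 shows the learning curves of the online in-domain simulations.

Figure 2: Online in-domain simulation dev F1 performance. Horizontal gray lines indicate supervised training performance on each full dataset. Red data labels indicate the performance of initial models trained on 64, 256, or 1,024 examples (the lighter bars). Darker bars and black data labels indicate simulation performance. A darker bar below its lighter counterpart (e.g., NewsQA 64+sim) indicates that performance dropped after simulation. Data labels recovered from the figure:

Setup             SQuAD   HotpotQA   NQ     NewsQA   TriviaQA   SearchQA
64                18.0    24.8       21.8   20.6     16.2       34.4
64+sim            81.6    67.5       61.8   1.1      17.5       3.1
256               63.7    52.4       47.8   42.7     26.7       57.1
256+sim           82.0    67.6       64.5   53.1     20.6       68.4
1024              78.0    66.2       61.8   55.1     34.2       65.0
1024+sim          85.2    70.5       67.9   56.3     62.1       70.3
Full supervised   92.3    80.0       81.2   71.0     78.6       83.8

Figure 3: Online in-domain simulation dev F1 learning curves, one panel per dataset (SQuAD, HotpotQA, NQ, NewsQA, TriviaQA, SearchQA). The x-axis shows the number of examples for which feedback is observed (0 to $1 \cdot 10^5$); the y-axis shows F1. "x w y" denotes initial training with x supervised in-domain examples and simulation with feedback noise rate y. Curves: 64, 64 w .08, 64 w .2, 256, 256 w .08, 256 w .2, 1024, 1024 w .08, 1024 w .2.

Setup      SQuAD        HotpotQA     NQ            NewsQA       TriviaQA      SearchQA
64+sim     78.2 (-3.4)  66.3 (-1.2)  51.3 (-10.5)  3.1 (+2.0)   0.4 (-17.1)   1.3 (-1.8)
256+sim    86.2 (+4.2)  70.9 (+3.3)  65.2 (+0.7)   54.3 (+1.2)  12.3 (-8.3)   0.3 (-68.1)
1024+sim   86.5 (+1.3)  73.2 (+2.7)  71.8 (+3.9)   55.7 (-0.6)  7.5 (-54.6)   4.1 (-66.2)

Table 1: Offline in-domain simulation dev F1 performance. Numbers in parentheses show the performance gain (+) or decrease (-) of offline learning relative to online learning (Figure 2).

Setup      SQuAD         HotpotQA      NQ            NewsQA        TriviaQA      SearchQA
64+sim     0.63 / 1.04   0.51 / 0.94   0.74 / 0.91   1.07 / 0.86   0.77 / 0.77   1.09 / 0.77
256+sim    0.56 / 0.75   0.36 / 0.58   0.71 / 0.83   0.84 / 0.85   0.76 / 0.72   0.73 / 0.69
1024+sim   0.48 / 0.55   0.27 / 0.33   0.65 / 0.67   0.73 / 0.71   0.71 / 0.64   0.69 / 0.65

Table 2: Regret averaged over the number of feedback observations in online / offline in-domain simulation.

Sensitivity Analysis. Training Transformer-based models has been shown to have stability issues, especially when training with a limited amount of data (Zhang et al., 2021). Our non-standard training procedure (i.e., one epoch with a fixed learning rate) may further increase instability. We study the stability of the learning process using initial models trained on only 64 in-domain supervised examples on HotpotQA and TriviaQA: the former shows a significant performance gain while the latter shows the opposite. We experiment with five initial models trained on different sets of 64 supervised examples, each used to initiate a separate simulation experiment. Four of the five experiments on HotpotQA show performance gains similar to those observed so far; the exception is one experiment that starts with very low initial performance. In contrast, nearly all experiments on TriviaQA collapse (mean F1 of 7.3).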
We also conduct sensitivity analysis with stronger initial models trained with 1,024 examples, and observe that the final performance is stable across runs on both HotpotQA and TriviaQA (standard deviations are 0.5 and 2.6). Table 5 in Appendix B provides detailed performance numbers.

6 Offline Learning

We simulate offline bandit learning (Algorithm 2), where feedback is collected all at once with the initial model. The learning scenario follows the previous section: only a limited amount of supervised data is available (64, 256, or 1,024 in-domain examples) to train initial models.

Table 1 shows the performance of offline simulation experiments compared to online simulations. We observe mixed results. On SQuAD, HotpotQA, NQ, and NewsQA, offline learning outperforms online learning when using stronger initial models (i.e., models trained on 256 and 1,024 examples). This illustrates the benefit of the more standard training loop, especially with our Transformer-based model, which is better optimized with a linear learning rate schedule and multiple epochs, both incompatible with the online setup. On TriviaQA and SearchQA, offline simulation is ineffective regardless of the performance of initial models. This result echoes the learning challenges in the online counterparts on these two datasets.

Online vs. Offline Regret. Table 2 compares online and offline regret. Regret numbers are averaged over the number of feedback observations.[11] Online learning generally displays lower regret for similar initial models on SQuAD, HotpotQA, and NQ. This is expected because later interactions in the simulation can benefit from early feedback in online learning. In contrast, in our offline scenario, we only update after seeing all examples, so regret numbers depend on the initial model only. Regret results on NewsQA, TriviaQA, and SearchQA are counterintuitive, generally showing that online learning has similar or higher regret.
The cases showing significantly higher online regret (64+sim on NewsQA and SearchQA) can be explained by the learning failing, which impacts online regret but not our offline regret. The others are more complex, and we hypothesize that they may be due to a combination of (a) inherent noise in the data; and (b) in cases where online learning is effective, the gap between the strictly defined reward that is used to compute regret and the relaxed F1 evaluation metric. Further analysis is required to reach a firmer conclusion.

[11] Table 8 in Appendix B lists the percentage of positive feedback in online and offline in-domain simulation.

Learning Progression (Section 5, continued). Comparing across datasets (Figure 3), we find that when 1,024 examples are used to train the initial model, peak performance is reached with only a third or even a quarter of the feedback.

Feedback Noise Simulation (Section 5, continued). Figure 3 shows learning curves for simulations with different levels of feedback perturbation (0%, 8%, or 20%). When the unperturbed simulation is effective, the model remains robust to noise: 8% noise causes small fluctuations in the learning curve, but final performance hardly degrades. Learning with a weaker initial model and a higher noise ratio can cause learning to fail (e.g., the SQuAD simulation with 64 initial examples and 20% noise). When online simulation without perturbation fails, online learning with noisy feedback fails as well.

7 Domain Adaptation

Learning from user feedback creates a compelling avenue to deploy systems that target new domains not addressed by existing datasets. The scenario we simulate in this section starts with training a QA model on a complete existing annotated dataset, and deploying it to interact with users and learn from their feedback in a new domain. We do not assume access to any annotated training data in the target domain.

Figure 4: Online domain adaptation simulation dev F1 performance. Horizontal gray lines indicate supervised training performance on each full dataset (SQuAD 92.3, HotpotQA 80.0, NQ 81.2, NewsQA 71.0, TriviaQA 78.6, SearchQA 83.8). Bar colors indicate the source domain. Red labels indicate the performance of the initial model on the target domain (x-axis). Solid bars and black labels indicate simulation performance on the target domain.

Figure 5: Online domain adaptation simulation dev F1 learning curves, one panel per target dataset (SQuAD, HotpotQA, NQ, NewsQA, TriviaQA, SearchQA). The x-axis is the number of feedback examples observed. Colors indicate the source domain.
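The simulated reward of Section 3, together with the feedback perturbation discussed above, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the encoding of a span as a `(start, end)` token-index pair, and the `rng` parameter are assumptions made for the example; the reward values (1.0 and -0.1) and the noise rates (8% or 20%) come from the text.

```python
import random

POSITIVE_REWARD = 1.0   # predicted span matches the annotated span exactly
NEGATIVE_REWARD = -0.1  # any other prediction


def simulate_reward(pred_span, gold_span, noise_rate=0.0, rng=random):
    """Simulate binary user feedback from an annotated answer span.

    Spans are (start, end) token-index pairs. With probability
    `noise_rate`, the binary reward is flipped to mimic noisy feedback.
    """
    correct = pred_span == gold_span
    reward = POSITIVE_REWARD if correct else NEGATIVE_REWARD
    if rng.random() < noise_rate:
        reward = NEGATIVE_REWARD if correct else POSITIVE_REWARD
    return reward


# Clean feedback: an exact index match earns the positive reward, and a
# prediction that is off by one token earns the negative reward.
print(simulate_reward((3, 5), (3, 5)))
print(simulate_reward((3, 4), (3, 5)))

# Noisy feedback: with a 20% flip rate, roughly one in five correct
# predictions is reported as wrong.
rng = random.Random(0)
flipped = sum(simulate_reward((3, 5), (3, 5), 0.2, rng) < 0 for _ in range(10_000))
print(f"flipped {flipped} of 10000 correct predictions")
```

Passing a seeded `random.Random` instance keeps a simulation run reproducible, which matters when comparing learning curves across noise rates.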
