WebQA: Multi-hop and Multimodal Question Answering

It has also beens the Tanabataapan and thetival in Korea.mainlyalls ande deco-colorfulstreamers.The most fa-mous Tanabata festival isheld in Sendai from 6 to 8August.ribute.tivals are held in manyJapan, mainlyping malls andfestival isfrom 6 to 8August.In the summer, the SendaiTanabataFestival,theabata festivalis held.Inwinter,thetreesaredecorated with thousandsOktoberfest,festival site, the "Wiesn").fesofstrength37transformers to our dataset, whose failures indicate promising directions for future research.3839434445464748t4950515253i545556575859t60i61Matsuri".Large-scale Tanabata fes-ping malls andich are deco-large, colorfulstreamers.The most fa-mous Tanabata festival isheld in Sendai from 6 to 8In the summer, the SendaiTanabataFestival,thelargest Tanabata festivalin Japan,is held.Intreesareth thousandse Pageant ofting throughmost of December.Fussa Tanabata Festival-Tokyoofstrength,theNaziregime transported peoplefrom Sudetenland to theulti-modalrch.ple, VQAand vision.ios. This iscabulary offormat. Ined answers.part of theA datasetsgenerationh the multi-roliferation2].iple modal-oning overbility withrated fromg in blankss can lie inn addressesur focus is164950WebQA：多跳和多模态问答0Yingshan Chang 1 Mridu Narang 2 Hisami Suzuki 20Guihong Cao 2 Jianfeng Gao 3 Yonatan Bisk 1 , 301 卡内基梅隆大学 2 微软，必应搜索 3 微软研0摘0将视觉问答（VQA在视觉表示学习、知识聚合和语言生成方面进行基本的进展。在这项工作中，我们介绍了一个具有挑战性的新基准 WEBQA，对于最先进模型来说非常困难，但对于人类来说却很简单。WEBQA模拟了人类使用的来源，3）生成流畅的语言回答。这是我们和数字助手的期望行为。现有的工作倾向于假设模型可以在图像或文本EBQA包括一个次要的仅文本问答任觉性能不会以语言理解的代价为代价。我们对社区的挑战是创建统一的多模态推理模型，无论来源模态如何，都能回答问题，使我们更接近不仅查询语言知识，而且查询更丰富的视觉在线世界的数字助01. 
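1. Introduction

Web search is a multimodal experience: will I find the answer under the image search tab, or in a text snippet? In contrast, most deployed question answering (QA) systems treat the web as a text-only landscape of facts to be extracted, ignoring the knowledge present in images. This has two fundamental limitations: 1. the text-based web is impoverished [3, 4], and 2. this form of information extraction is inefficient. For example, when searching for whether a park has picnic tables, an image of the picnic area answers the question immediately, with no need to page through reviews in the hope that someone happened to mention the fact. QA engines need to treat the internet as a multimodal trove of information, but this requires multi-hop reasoning over images or text.

[Figure 1 example]
Q: At which festival can you see a castle in the background: Oktoberfest in Domplatz, Austria, or the Tanabata festival in Hiratsuka, Japan?
Sources shown: "J24 029 Dom, Oktoberfest"; "Hiratsuka Tanabata Festival"
A: You can see a castle in the background at Oktoberfest in Domplatz, Austria.

Figure 1. Example WebQA dataset pipeline, in which the question requires finding and reasoning over two relevant sources and discarding distractors to produce a correct natural-language answer.

To this end, datasets are emerging rapidly [10, 24, 28]. But they either use predefined templates to curate multi-hop multimodal QA pairs [28], or encourage a "question decomposition + rerouting to a unimodal model" approach that solves the problem only superficially [10]. Yet when humans absorb knowledge, they need not distinguish whether it was learned from a book or from an image, nor whether a piece of knowledge is a composite of scattered fragments or is carried by a single one. We argue that real progress on reasoning over linguistic concepts and visually grounded concepts under the same representation framework depends on developing a unified system that treats snippets and images indiscriminately as knowledge carriers. The goal further includes better extraction, integration, and summarization abilities in a heterogeneous information landscape. To foster this research intersection, we propose a new benchmark, WebQA, for multi-hop, multimodal, open-domain question answering, in which all questions are knowledge-seeking and resemble real-world use cases. Success on WebQA requires a system to a) incorporate both text and images, b) retrieve relevant knowledge in either modality, c) aggregate information from multiple sources via logical or numerical reasoning, and d) generate answers in natural language. We experiment with state-of-the-art multimodal reasoning and text generation models, whose failures indicate promising directions for future work.

Dataset | #Train | #Dev | #Test | #Img | Len Q | Len A
VQA v2 [9] | 443K | 214K | 453K | 200K | 6.1 | 1.2
OK-VQA [18] | 9.0K | 0 | 5.0K | 14.0K | 8.1 | 1.3
MultiModalQA [28] | 23.8K | 2.4K | 3.6K | 57.7K | 18.2 | 2.1
ManyModalQA [10] | 2.0K | 3.0K | 5.1K | 2.9K | - | 1.0
MIMOQA [24] | 52.4K | 0.7K | 3.5K | 400.0K | - | -
WebQA (ours) | 34.2K | 5K | 7.5K | 390.0K | 17.5 | 12.5

Table 1. Comparison of multimodal knowledge-seeking benchmarks by size and average question/answer length. (1)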
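2. Related Work

Many datasets and tasks can be broadly considered "question answering." For example, VQA [2, 9, 11, 18] is one of the most widely studied tasks at the intersection of language and vision. Nevertheless, it is unclear how VQA models should be adapted to open-domain scenarios. This is largely due to VQA tasks' simplification of answers into classification over a fixed vocabulary of frequent answers. Recent work on video [15, 30, 32] has also adopted a multiple-choice format. In contrast, OK-VQA [18] extends the task to knowledge-seeking questions. OK-VQA and our task differ in the role of images: images in OK-VQA are regarded as part of the query, rather than as part of a knowledge source that can only be processed after retrieval.

Within the natural language community, a similar transition has been occurring, as QA datasets move from multiple-choice and span prediction to the harder paradigm of free-form answer generation. Multi-hop QA has recently taken the spotlight, as it aligns with the multi-hop nature of how humans reason during knowledge acquisition, leading to a proliferation of benchmarks [27, 31, 34].

There are several recent benchmarks for reasoning over input and contexts in multiple modalities [26]. MultiModalQA [28] made the first foray into complex questions that require reasoning over snippets, tables, and images. It focuses on heterogeneous knowledge extraction across modalities. However, its questions are generated from templates: once a template is detected, the task reduces to filling in blanks with modality-specific answering mechanisms. ManyModalQA [10] also covers snippets, images, and tables. However, the primary challenge its design addresses is the choice of answer modality, rather than knowledge aggregation or extraction. Our focus is more on representing world knowledge in a unified space than on distinguishing answer modalities, since mastering the former naturally removes the need to classify questions by answer modality. Finally, MIMOQA [24] introduces the new concept of "multimodal input, multimodal output," emphasizing attaching an image to the textual answer to enhance cognitive understanding. MIMOQA requires selecting a text span and an image from the context as an output pair. Their approach is highly complementary to ours. The difference is that our task additionally requires aggregation and summarization before producing the final natural-language answer, whereas the output required by MIMOQA is not fully digested by the model. Here, by "digest" we mean the ability to produce a plausible output that cannot be copied directly from the input.

(1) Note that MultiModalQA and ManyModalQA also contain tables. ManyModalQA contains 3.5K tables, while MultiModalQA used 70K tables in its dataset generation process, but it is unclear how many tables are in the final dataset.

Dataset | Evaluation metric | Answer format
OK-VQA | min{(# humans who gave the answer) / 3, 1} | Most frequent training answers
MultiModalQA | Exact match, F1 | Text: span or Y/N; image: fixed vocabulary; table: Y/N, cell, or operation
ManyModalQA | Classification accuracy | Context word or vocabulary
MIMOQA | Text: ROUGE-1/-2/-L or BLEU; image: Precision@1/@2/@3 | Span prediction + image retrieval
WebQA (ours) | Fluency: BARTScore; keyword accuracy: recall/F1 | Complete natural-language sentence

Table 2. Comparison of evaluation metrics and answer formats across knowledge-seeking, multimodal benchmarks.

Table 1, Table 2, and Appendix E provide comparisons between WebQA and related datasets. No existing multimodal or knowledge-seeking benchmark requires answers to be complete, free-form natural-language sentences rather than spans or elements drawn from a finite set. Moreover, no prior work supports both natural language generation (NLG) evaluation and accuracy evaluation as ours does. To this end, we highlight the following: a) in WebQA it matters more to digest, aggregate, and summarize information, because answers cannot simply be copied from existing text snippets or image patches; b) beyond VQA, WebQA requires a source retrieval stage, which better models the full reasoning pipeline of a web search; c) answering in natural-language sentences transitions more easily to downstream applications such as conversational agents and voice assistants.

3. Task Formulation

As shown in Figure 1, an example consists of a question Q, a set of positive sources s_1, ..., s_m (green), a set of distractor sources s_{m+1}, ..., s_n (red), and an answer A. Each source is either a snippet or an (image, caption) pair. Every image comes with a caption that supplies names or geographic information which is not present in the image itself but serves as a crucial link to the question. We include both a restricted (n ≈ 40) and a full (n ≈ 900K) setting.

We decompose the task into two stages. First, given the question Q and the sources s_1, s_2, ..., s_n, a model determines which sources to derive the answer from. The second stage is question answering: the model takes Q and the selected sources as context C and generates the answer A. Ideally, a single-stage system would process Q, s_1, s_2, ..., s_n jointly to produce A and C, but we are not aware of any modeling approach that can consume a sufficiently large multimodal context to achieve this, so we leave it to future work.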
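The two-stage formulation in Section 3 can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the names `Source`, `Example`, `two_stage_qa`, `overlap_select`, and `template_generate` are hypothetical, the word-overlap selector is only a toy stand-in for a learned source retriever, and the generator is a placeholder for a real answer-generation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Source:
    """A knowledge source: a text snippet, or an (image, caption) pair."""
    text: str                          # snippet text, or the caption of an image
    image_path: Optional[str] = None   # hypothetical field, set only for image sources

    @property
    def is_image(self) -> bool:
        return self.image_path is not None


@dataclass
class Example:
    question: str            # Q
    sources: List[Source]    # s_1, ..., s_n (positives and distractors mixed)
    answer: str = ""         # A, a full natural-language sentence


def two_stage_qa(
    example: Example,
    select: Callable[[str, List[Source]], List[Source]],
    generate: Callable[[str, List[Source]], str],
) -> str:
    """Stage 1 picks the context C from the sources; stage 2 generates A from (Q, C)."""
    context = select(example.question, example.sources)
    return generate(example.question, context)


def overlap_select(question: str, sources: List[Source], k: int = 2) -> List[Source]:
    """Toy stage-1 selector (assumption): keep the k sources sharing the most words with Q."""
    q_words = set(question.lower().split())
    ranked = sorted(
        sources,
        key=lambda s: len(q_words & set(s.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def template_generate(question: str, context: List[Source]) -> str:
    """Toy stage-2 stand-in (assumption): reports what it received instead of reasoning."""
    kinds = ["image" if s.is_image else "snippet" for s in context]
    return f"Answer drawn from {len(context)} source(s): {', '.join(kinds)}."


if __name__ == "__main__":
    example = Example(
        question="At which festival can you see a castle in the background?",
        sources=[
            Source("Oktoberfest festival in Domplatz, Austria", image_path="oktoberfest.jpg"),
            Source("Tanabata festival in Hiratsuka, Japan", image_path="tanabata.jpg"),
            Source("Ghost train ride"),
        ],
    )
    print(two_stage_qa(example, overlap_select, template_generate))
```

Passing the selector and generator in as callables keeps the two stages swappable, which matches the decomposition described in Section 3 while leaving open how each stage is actually modeled.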
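4. WebQA

Following the paradigm popularized by search engines, we structure our data so that answers can be found either via image search or via general web (text) search. Note that WebQA does not contain questions that require both an image and an (independent) snippet as knowledge sources. However, all image-based questions require processing both images and text, because the image captions provide necessary information. Below we outline how the two types of questions were collected, structured, and filtered for quality.

4.1. Answers from Images

We collect both multi-image questions, which require piecing together two images, and complex single-image questions. Rich multi-image questions rarely appear at scale in user search logs, probably because users assume search engines cannot handle them, so we turn to crowdsourcing. (2) We show annotators a set of six related images and ask them to produce three QA pairs, selecting the one or two images that each pair requires. We require that at least one of the three pairs uses two distinct images. In addition, we instruct annotators to avoid questions that a) are simple facts (e.g., "How many wheels does a car have?"), b) are easily answered by a text-only search, or c) are bound to a specific image, and d) to make sure each question is meaningful without paired context. This reveals a key difference between our task and the well-known VQA task. In most VQA-style tasks, each question is tied to a paired image, whereas in our task images serve as knowledge sources to reason over and do not augment the question. To assist annotators, each image comes with a caption extracted from Wikipedia. The caption may only be used to confirm the name or location of the depicted object; answers must be derived from visual cues. Images are crawled from Wikimedia Commons via the Bing Visual Search API. Wikimedia's topic list cannot be used directly, because most categories are not (visually) interesting. We seed with natural scenes and iteratively refine the image pool by removing categories flagged as (visually) uninteresting. This results in categories such as animals, plants, attractions, and architecture (Figure 3).

(2) Although details are omitted here, we requested basic statistics on query logs from a search company to confirm this.

Hard negative mining. We provide models with a set of text- and image-based hard negatives to sift through for each question. Text sources are extracted from relevant Wikipedia passages based on noun phrases in the question, while limiting overlap to avoid mislabeling. For images, we use the Bing API to find images that are similar in caption (via Bing Image Search) and in visual content (via Bing Image Insights). In total, we collect 25K image-based questions, each requiring 1.4 visual sources on average and paired with 15.3 text and 15.9 visual distractors. Question prefixes are visualized in Figure 2.

Categorization. We divide questions into open and closed classes. Closed-class questions cover color, shape, number (i.e., "how many"), yes/no (Y/N), and "multiple choice" (MC). The rest are open-class questions.

Adversarial split. We construct the test set to be as out-of-distribution as possible, to reward models with better generalization and reasoning abilities. For color, shape, and number questions, we partition the answer set and ensure that the majority classes seen during training do not carry over to test time. For the Y/N and MC categories, we train models on 10 random train-test splits and place samples that are consistently difficult across splits in the test set. Finally, we randomly split the questions from the open class "Others."

4.2. Answers from Text

We collect multi-hop QA pairs that involve combining knowledge from ≥ 2 snippets. To generate diverse yet consistent topics for mining hard multi-hop reasoning questions, we construct clusters of similar entities whose text snippets nonetheless have low overall n-gram overlap or semantic similarity (yielding 8K clusters). We provide annotators with four snippets, to prevent them from contributing facts from their own research to help answer the question.

Hard negative mining. For text distractors, we mine Wikipedia passages that contain noun phrases from the question and select those with the highest lexical overlap that lack any reference to the answer. For image distractors, we use images and captions from Wikipedia pages, again filtering for those with high lexical overlap. In total, we collect 24K text-based questions, each requiring 2.0 text sources and paired with 14.6 text and 11.6 visual distractors. Lacking clear criteria for categorizing questions, we do not construct an adversarial test set and simply sample at random.

Question type | Question | Answer | Caption (correct) | Caption (distractor) | Snippet (correct) | Snippet (distractor)
Image | 16.4 ± 6 | 14.4 ± 6 | 13.3 ± 11 | 12.6 ± 11 | - | 36.4 ± 10
Text | 18.6 ± 8 | 10.7 ± 10 | - | 14.1 ± 13 | 45.3 ± 12 | 38.3 ± 10

Table 3. Length distributions of the different text components.

4.3. Quality Control

We ensure data quality through crowd-worker training and an expert feedback loop, both of which have been found to be effective factors in crowdsourcing [19].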
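The text-distractor rule from Section 4.2 (prefer passages with high lexical overlap with the question, but drop any that refer to the answer) can be illustrated with a short Python sketch. It rests on assumptions that are not in the paper: "lexical overlap" is approximated by the number of shared lowercase word tokens, "reference to the answer" is approximated by a list of answer keywords, and noun-phrase extraction is skipped entirely. The function names `tokens` and `mine_text_distractors` are hypothetical.

```python
import re
from typing import Iterable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word tokens of a passage (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mine_text_distractors(
    question: str,
    answer_keywords: Iterable[str],
    candidates: List[str],
    k: int = 3,
) -> List[str]:
    """Return up to k candidate passages ranked by word overlap with the question,
    after discarding every passage that mentions an answer keyword."""
    q_tokens = tokens(question)
    banned = {w.lower() for w in answer_keywords}
    # Drop passages that refer to the answer: they would not be true negatives.
    kept = [p for p in candidates if not (tokens(p) & banned)]
    # Hard negatives are the remaining passages that look most like the question.
    kept.sort(key=lambda p: len(tokens(p) & q_tokens), reverse=True)
    return kept[:k]


if __name__ == "__main__":
    passages = [
        "The Tanabata festival in Sendai is held in August.",
        "Oktoberfest has a castle behind the square.",
        "Bananas are yellow.",
        "Festival streets in Japan have a gate.",
    ]
    hard_negatives = mine_text_distractors(
        question="Which festival in Japan has a castle in the background?",
        answer_keywords=["Oktoberfest"],
        candidates=passages,
        k=2,
    )
    for passage in hard_negatives:
        print(passage)
```

Filtering before ranking is the point of the sketch: a passage that overlaps heavily with the question but also mentions the answer would be a mislabeled distractor rather than a hard negative.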
