Application of Multi-modal Bilinear Pooling to Textbook Question Answering


ESSAY-ANCHOR ATTENTIVE MULTI-MODAL BILINEAR POOLING FOR TEXTBOOK QUESTION ANSWERING

Juzheng Li, Hang Su, Jun Zhu, Bo Zhang∗

Department of Computer Science and Technology, Tsinghua Lab of Brain and Intelligence

Beijing National Research Center for Information Science and Technology, BNRist Lab

Tsinghua University, 100084, China

lijuzheng09@gmail.com; {suhangss, dcszj, dcszb}@tsinghua.edu.cn

ABSTRACT

Textbook Question Answering (TQA) [1] is a newly proposed task to answer arbitrary questions in middle school curricula. It poses the particular challenge of understanding long essays in addition to images. Bilinear models [2, 3] are effective at learning high-level associations between questions and images, but are inefficient at handling long essays. In this paper, we propose Essay-anchor Attentive Multi-modal Bilinear pooling (EAMB), a novel method to encode the long essays into the joint space of the questions and images. The essay-anchors, embedded from the keywords, represent the essay information in a latent space. We propose a novel network architecture that pays special attention to the keywords in the questions, thereby encoding the essay information into the question features and thus into the joint space with the images. We then use bilinear models to extract the multi-modal interactions and obtain the answers. EAMB successfully utilizes the redundancy of the pre-trained word embedding space to represent the essay-anchors, which avoids the extra learning difficulty of exploiting large network structures. Quantitative and qualitative experiments demonstrate the superior performance of EAMB on the TQA dataset.

Index Terms: Textbook Question Answering, Word Embedding, Multi-Modal Bilinear Pooling, Attention Mechanisms
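To make the idea of attending to keywords in the question more concrete, the following is a minimal sketch and not the EAMB architecture itself, whose details are not given in this excerpt. It assumes that each question word and each essay-anchor is a vector in the same pre-trained embedding space, scores every question word by its highest cosine similarity to any anchor, and pools the word vectors with softmax weights. The function name `anchor_attentive_pool`, the `temperature` parameter, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_attentive_pool(word_vecs, anchor_vecs, temperature=0.1):
    """Hypothetical keyword attention: weight each question word by its best
    match to any essay-anchor, then return (weights, weighted sum of words)."""
    scores = [max(cosine(w, a) for a in anchor_vecs) / temperature
              for w in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    pooled = [sum(wt * w[d] for wt, w in zip(weights, word_vecs))
              for d in range(dim)]
    return weights, pooled


# Toy embeddings, invented for illustration only.
question = {
    "how": [1.0, 0.0, 0.0],
    "does": [0.9, 0.1, 0.0],
    "erosion": [0.0, 1.0, 0.1],
    "occur": [0.7, 0.0, 0.3],
}
anchors = [[0.0, 1.0, 0.0], [0.0, 0.2, 1.0]]  # assumed essay keywords

weights, pooled = anchor_attentive_pool(list(question.values()), anchors)
for word, weight in zip(question, weights):
    print(f"{word}: {weight:.3f}")
```

In this toy setting the word closest to an anchor ("erosion") receives most of the attention mass, so the pooled question feature is dominated by the essay-related keyword.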

1. INTRODUCTION

The computer vision community has witnessed great progress on Visual Question Answering (VQA) tasks in recent years. With large multi-modal datasets [4, 5] and methods [4, 6] available, machines are able to answer short questions about given images. However, VQA tasks are far from real-world situations. Humans answer a question not only from the current scene, but also with abundant background knowledge. Textbook Question Answering (TQA) is a newly proposed task that aims to bring QA situations closer to the real world [1]. The TQA dataset is drawn from middle school curricula. A TQA question consists of a long essay, a short question stem, an image and several candidate answers (Fig. 1).

[Figure 1. Panels: Question (Question Stem: "How many actions are depicted in the diagram?"; Options: a. 6, b. 4, c. 8, d. 7; Right Answer), Visual Context, and Textual Context (Long Essay: Title "Erosion and Deposition by Flowing Water", Subheads, Contents; Supplementary Materials).]

Fig. 1: An example question of the TQA task. It consists of a question stem and several candidate answers. A textual context is always given to explain the background, including a long essay and some supplementary materials. We combine the materials into the long essay in this paper. A visual context usually includes an image.

∗ The work is supported by the National NSF of China (Nos. 61571261, 61620106010, 61621136008, 61332007, and U1611461), Beijing Natural Science Foundation (No. L172037), Tsinghua Tiangong Institute for Intelligent Computing and the NVIDIA NVAIL Program, and partially funded by Microsoft Research Asia and Tsinghua-Intel Joint Research Institute.

The TQA task is challenging because the multi-modal context includes long essays. Recent multi-modal methods usually encode the visual and textual data into a joint space to learn their interactions. For the TQA task, however, recurrent neural networks (RNNs) are not capable of encoding such long essays. Moreover, recent progress in attention or memory mechanisms [6, 1] usually relies on large add-on network structures, which reduce learning efficiency.
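As an illustration of encoding two modalities into a joint space, the sketch below shows one common low-rank form of bilinear pooling, z = Pᵀ(tanh(Uᵀq) ∘ tanh(Vᵀv)), where ∘ is the element-wise product. It is a generic example under stated assumptions and is not claimed to be the exact bilinear model used in this paper. The matrices `U`, `V`, `P`, the toy dimensions, and the random initialization are all hypothetical; in practice the projections would be learned.

```python
import math
import random


def matvec_t(matrix, vector):
    """Return matrix^T @ vector for a row-major list-of-lists matrix."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vector[i] for i in range(len(vector)))
            for j in range(cols)]


def low_rank_bilinear_pool(q, v, U, V, P):
    """Fuse a question feature q and an image feature v into one joint vector:
    z = P^T (tanh(U^T q) * tanh(V^T v)), with * taken element-wise."""
    q_proj = [math.tanh(x) for x in matvec_t(U, q)]
    v_proj = [math.tanh(x) for x in matvec_t(V, v)]
    joint = [a * b for a, b in zip(q_proj, v_proj)]
    return matvec_t(P, joint)


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q_dim, v_dim, rank, out_dim = 8, 12, 5, 4  # toy sizes, chosen arbitrarily
U = random_matrix(q_dim, rank, rng)
V = random_matrix(v_dim, rank, rng)
P = random_matrix(rank, out_dim, rng)

q = [rng.uniform(-1, 1) for _ in range(q_dim)]  # stand-in question feature
v = [rng.uniform(-1, 1) for _ in range(v_dim)]  # stand-in image feature
z = low_rank_bilinear_pool(q, v, U, V, P)
print(len(z))
```

The element-wise product lets every projected question dimension gate the matching projected image dimension, which approximates a full bilinear interaction with far fewer parameters than a q_dim × v_dim × out_dim tensor.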

In this paper, we propose Essay-anchor Attentive Multi-modal Bilinear pooling (EAMB) to address the long-essay issue of the TQA task. EAMB embeds the long essays into a continuous space represented collectively by the essay-anchors. Each essay-anchor corresponds to a keyword

978-1-5386-1737-3/18/$31.00 ©2018 IEEE

[The preview ends here; the remaining 5 pages are not included.]
