A Face-to-Face Neural Conversation Model
Hang Chu¹,²  Daiqing Li¹  Sanja Fidler¹,²
¹University of Toronto  ²Vector Institute
{chuhang1122, daiqing, fidler}@cs.toronto.edu
Abstract
Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read and generate facial gestures alongside text. This allows our model to adapt its response based on the “mood” of the conversation. In particular, we introduce an RNN encoder-decoder that exploits the movement of facial muscles as well as the verbal conversation. The decoder consists of two layers: the lower layer generates the verbal response and coarse facial expressions, while the second layer fills in the subtle gestures, making the generated output smoother and more natural. We train our neural network by having it “watch” 250 movies. We show through automatic metrics and a human study that our joint face-text model generates more natural conversations. We demonstrate an example application with a face-to-face chatting avatar.
1. Introduction
We make conversation every day. We talk to our family, friends, colleagues, and sometimes we also chat with robots. Several online services employ robot agents to direct customers to the service they are looking for. Question-answering systems like Apple Siri and Amazon Alexa have also become popular accessories. However, while most of these automatic systems feature a human voice, they are far from acting like human beings. They lack expressivity and are typically emotionless.
Language alone can often be ambiguous with respect to the person’s mood, unless indicative sentiment words are used. In real life, people make gestures and read other people’s gestures when they communicate. Whether someone is smiling, crying, shouting, or frowning when saying “thank you” can indicate feelings ranging from gratitude to irony. People also form their response depending on such context, not only in what they say but also in how they say it. We aim to develop a conversation model that jointly models text and gestures, in order to act and converse in a more natural way.

demo/data: http://www.cs.toronto.edu/face2face

Figure 1: Facial gestures convey sentiment information. Words have different meanings with different facial gestures: saying “Thank you” with different gestures could express either gratitude or irony, and therefore a different response should be triggered.
Recently, neural networks have been shown to be good conversationalists [33, 15]. These models typically use an RNN encoder that represents the history of the verbal conversation and an RNN decoder that generates a response. The work in [16] built on this idea, aiming to personalize the model by adapting the conversation to a particular user. However, all these approaches are based solely on text, lacking the richness of a real face-to-face conversation.
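For concreteness, this standard text-only recipe can be sketched in a few lines of PyTorch. The snippet below is a generic illustration of the encoder-decoder setup rather than the exact architecture of [33, 15] or of our model; the layer sizes, vocabulary size, and teacher-forcing setup are placeholder assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqChat(nn.Module):
    """Generic text-only encoder-decoder conversation model."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, history, response):
        # Encode the conversation history into a single state vector.
        _, state = self.encoder(self.embed(history))
        # Decode the response conditioned on that state, predicting
        # the next token at every step (teacher forcing).
        dec_out, _ = self.decoder(self.embed(response), state)
        return self.out(dec_out)  # (batch, time, vocab) logits

# Toy usage: a batch of 2 histories (10 tokens) and responses (7 tokens).
model = Seq2SeqChat(vocab_size=5000)
history = torch.randint(0, 5000, (2, 10))
response = torch.randint(0, 5000, (2, 7))
logits = model(history, response)  # shape: (2, 7, 5000)
```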
In this paper, we introduce a neural conversation model that reads and generates both a verbal response (text) and facial gestures. We exploit movies as a rich resource of such information. Movies show a variety of social situations with diverse emotions, reactions, and topics of conversation, making them well suited for our task. Movies are also multi-modal, allowing us to exploit both visual and dialogue information. However, the data itself is also extremely challenging due to the many characters that appear on screen at any given time, as well as the large variance in pose, scale, and recording style.
Our model adopts the encoder-decoder architecture and adds gesture information in both the encoder and the decoder. We exploit the FACS representation [8] of gestures, which encodes facial expressions as activations of facial muscle movements known as action units.
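A minimal sketch of one way such gesture information can enter the encoder is to concatenate per-time-step action-unit (AU) activations with word embeddings before the recurrent layer. The fixed AU count and the concatenation-based fusion below are illustrative assumptions, not necessarily the paper’s exact design.

```python
import torch
import torch.nn as nn

NUM_AUS = 17  # illustrative; FACS defines several dozen action units

class GestureTextEncoder(nn.Module):
    """Sketch of a conversation encoder that fuses word embeddings
    with per-time-step FACS action-unit (AU) activations by simple
    concatenation (an illustrative fusion choice)."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + NUM_AUS, hid_dim, batch_first=True)

    def forward(self, tokens, aus):
        # tokens: (batch, T) word ids; aus: (batch, T, NUM_AUS)
        # AU activations aligned to the same time steps.
        x = torch.cat([self.embed(tokens), aus], dim=-1)
        _, state = self.rnn(x)
        return state  # conversation state carrying text + gesture cues

# Toy usage with random tokens and AU activations in [0, 1].
enc = GestureTextEncoder(vocab_size=5000)
state = enc(torch.randint(0, 5000, (2, 10)), torch.rand(2, 10, NUM_AUS))
```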