direction. Due to the computational cost of processing two streams simultaneously, the image of each eye is often captured at a low resolution. This makes such systems less accurate, although increasing computational power and falling costs mean that more computationally intensive algorithms can be run
in real time. As an alternative, in [181], the authors propose using a single high-resolution image of
one eye to improve accuracy. Infrared-based systems, on the other hand, usually use only one camera, although the use of two cameras has been proposed to further increase accuracy [152].
Although most research on non-wearable systems has focused on desktop users, the ubiquity of
computing devices has allowed for applications in other domains in which the user is stationary (e.g., [168, 152]). For example, the authors of [168] monitor driver visual attention using a single non-wearable camera placed on a car’s dashboard to track facial features and detect gaze.
Wearable eye trackers have also been investigated mostly for desktop applications (or for users who do not walk while wearing the device). However, advances in hardware (e.g., reductions in size and weight) and lower costs have allowed researchers to investigate novel applications.
For example, in [193], eye-tracking data are combined with video from the user’s perspective, head directions, and hand motions to learn words from natural interactions with users; the authors of [137] use a wearable eye tracker to understand hand–eye coordination in natural tasks; and the authors of [38] use one to detect eye contact and record video for blogging.
The main issues in developing gaze tracking systems are intrusiveness, speed, robustness, and
accuracy. The types of hardware and algorithms required, however, depend heavily on the desired level of analysis. Gaze analysis can be performed at three different levels [23]: (a) highly detailed
low-level micro-events, (b) low-level intentional events, and (c) coarse-level goal-based events.
Micro-events include micro-saccades, jitter, nystagmus, and brief fixations, which are studied for their physiological and psychological relevance by vision scientists and psychologists. Low-level intentional events are the smallest coherent units of movement of which the user is aware during visual activity; these include sustained fixations and revisits. Although most work in HCI has focused on coarse-level goal-based events (e.g., using gaze as a pointer [165]), it is easy to foresee the importance of analysis at lower levels, particularly to infer the user’s cognitive state in affective interfaces (e.g., [62]).
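To make these event types concrete, the sketch below shows one common way to extract fixations, the basic low-level event, from a raw gaze-sample stream: a dispersion-threshold detector in the spirit of the classic I-DT algorithm. This is a minimal illustration rather than the method of any work cited here; the data layout, function name, and the 25-pixel/100-ms thresholds are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Fixation:
    start_ms: float  # timestamp of the first sample in the fixation
    end_ms: float    # timestamp of the last sample
    x: float         # centroid of the gaze samples, in pixels
    y: float

def detect_fixations(samples: List[Tuple[float, float, float]],
                     max_dispersion: float = 25.0,  # pixels (assumed threshold)
                     min_duration: float = 100.0    # ms (assumed threshold)
                     ) -> List[Fixation]:
    """Dispersion-threshold (I-DT-style) fixation detection.

    `samples` is a time-ordered list of (timestamp_ms, x, y) gaze points.
    A run of consecutive samples counts as a fixation if its dispersion,
    (max(x) - min(x)) + (max(y) - min(y)), stays below `max_dispersion`
    for at least `min_duration` milliseconds.
    """
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # Grow the window one sample at a time while dispersion stays small.
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        if samples[j][0] - samples[i][0] >= min_duration:
            window = samples[i:j + 1]
            fixations.append(Fixation(
                start_ms=samples[i][0], end_ms=samples[j][0],
                x=sum(s[1] for s in window) / len(window),
                y=sum(s[2] for s in window) / len(window)))
            i = j + 1  # continue after the detected fixation
        else:
            i += 1  # window too brief to be a fixation; slide forward
    return fixations
```

Events finer than this (micro-saccades, jitter) would require higher sampling rates and much tighter thresholds than such a sketch assumes.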
Within this context, an important issue that is often overlooked is how to interpret eye-tracking data. In other words, as the user moves their eyes during interaction, the system must decide what the movements mean in order to react accordingly. We move our eyes 2–3 times per second, so a system may have to process large amounts of data within a short time, a task that is not trivial even when processing does not occur in real time. One way to interpret eye-tracking data is to cluster fixation points and assume, for instance, that clusters correspond to areas of interest.
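As a minimal sketch of this idea, the function below (reusing the `Fixation` type and imports from the previous sketch) groups fixation points into candidate areas of interest with a simple greedy, distance-based clustering. The 80-pixel threshold and the function name are illustrative assumptions, not the clustering approach discussed in [154].

```python
def cluster_fixations(fixations: List[Fixation],
                      max_distance: float = 80.0  # pixels (assumed AOI radius)
                      ) -> List[List[Fixation]]:
    """Greedy distance-based clustering of fixation points.

    Each fixation joins the nearest existing cluster whose centroid lies
    within `max_distance`; otherwise it starts a new cluster. Each cluster
    is a candidate area of interest.
    """
    clusters: List[List[Fixation]] = []
    centroids: List[Tuple[float, float]] = []
    for f in fixations:
        best, best_d = -1, float("inf")
        for k, (cx, cy) in enumerate(centroids):
            d = ((f.x - cx) ** 2 + (f.y - cy) ** 2) ** 0.5
            if d < best_d:
                best, best_d = k, d
        if best >= 0 and best_d <= max_distance:
            clusters[best].append(f)
            c = clusters[best]
            # Update the running centroid of the enlarged cluster.
            centroids[best] = (sum(p.x for p in c) / len(c),
                               sum(p.y for p in c) / len(c))
        else:
            clusters.append([f])  # no cluster close enough: start a new one
            centroids.append((f.x, f.y))
    return clusters
```

Summing fixation durations within a cluster then yields a dwell time per region, a common proxy for how much attention that region received.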
Clustering of fixation points is only one option, however, and as the authors of [154] discuss, it can be