Visual Speech Recognition with Loosely Synchronized Feature Streams
Kate Saenko, Karen Livescu, Michael Siracusa, Kevin Wilson, James Glass, and Trevor Darrell
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
32 Vassar Street, Cambridge, MA, 02139, USA
{saenko,klivescu,siracusa,kwilson,jrg,trevor}@csail.mit.edu
Abstract
We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting coarticulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of coarticulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/nonspeech classification, as well as recognition performance against several baseline systems.
1. Introduction
The focus of most audio-visual speech recognition (AVSR) research is to find effective ways of combining video with existing audio-only ASR systems [15]. However, in some cases it is difficult to extract useful information from the audio. Take, for example, a simple voice-controlled car stereo system. One would like the user to be able to play, pause, or switch tracks or stations with simple commands, keeping their hands on the wheel and their attention on the road. In this situation, the audio is corrupted not only by the car's engine and traffic noise, but also by the music coming from the stereo, so almost all useful speech information is in the video. Nevertheless, few authors have focused on visual-only speech recognition as a stand-alone problem, and those systems that do perform visual-only recognition are usually limited to digit tasks. In these systems, speech is typically detected by relying on the audio signal to provide the segmentation of the video stream into speech and nonspeech [13].
A key issue is that the articulators (e.g., the tongue and lips) can evolve asynchronously from each other, especially in spontaneous speech, producing varying degrees of coarticulation. Since existing systems treat speech as a sequence of atomic viseme units, they require many context-dependent visemes to deal with coarticulation [17]. An alternative is to model the multiple underlying physical components of human speech production, or articulatory features (AFs) [10]. The varying degrees of asynchrony between AF trajectories can be naturally represented using a multi-stream model (see Section 3.2).
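As a purely schematic illustration of this point, consider the toy frame labelings below. The stream names and values are hypothetical and are not the AF inventory used in this paper; they only show how two streams can change value at different frames (e.g., rounding beginning before the opening gesture for an upcoming rounded vowel), forcing a viseme-based system to allocate a distinct atomic unit for every combination that occurs.

# Schematic Python example with a hypothetical two-feature AF inventory.
lip_opening  = ["closed", "closed", "narrow", "wide", "wide", "closed"]
lip_rounding = ["neutral", "rounded", "rounded", "rounded", "neutral", "neutral"]

# The viseme view labels each frame with a single atomic unit, so every
# distinct combination of the two streams becomes its own unit.
visemes = sorted(set(zip(lip_opening, lip_rounding)))
print(len(visemes))   # 5 combined units, versus 3 + 2 per-stream values

With more features, longer contexts, and real coarticulation patterns, the number of such combined (or context-dependent) units grows quickly, while the per-stream inventories stay small; this is the motivation for modeling the streams separately and coupling them only loosely.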
In this paper, we describe an end-to-end vision-only approach to detecting and recognizing spoken phrases, including visual detection of speech activity. We use articulatory features as an alternative to visemes, and a dynamic Bayesian network (DBN) for recognition with multiple loosely synchronized streams. The observations of the DBN are the outputs of discriminative AF classifiers. We evaluate our approach on a set of commands that can be used to control a car stereo system.
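The DBN itself is defined in Section 3.2. As a rough, self-contained intuition for what decoding with loosely synchronized streams involves, the sketch below runs a joint Viterbi search over two streams of per-frame AF classifier scores, allowing the two streams' unit indices to drift apart by at most max_async positions. The function name, interface, hard asynchrony limit, and optional soft penalty are illustrative assumptions, not the model used in this paper.

import numpy as np

def decode_two_streams(scores_a, scores_b, max_async=1, async_penalty=0.0):
    """Joint Viterbi search over two loosely synchronized streams of units.

    scores_a: (T, Na) per-frame log-scores for the Na units of stream A
              (e.g., one articulatory feature's classifier outputs).
    scores_b: (T, Nb) per-frame log-scores for the Nb units of stream B.
    A state is a pair (i, j) of unit indices; at each frame either index may
    stay or advance by one, and |i - j| <= max_async enforces loose synchrony.
    Illustrative sketch only, not the paper's DBN implementation.
    """
    T, Na = scores_a.shape
    _, Nb = scores_b.shape
    NEG = -np.inf
    delta = np.full((Na, Nb), NEG)           # best score of any path ending in (i, j)
    delta[0, 0] = scores_a[0, 0] + scores_b[0, 0]
    back = np.zeros((T, Na, Nb, 2), dtype=int)
    for t in range(1, T):
        new = np.full((Na, Nb), NEG)
        for i in range(Na):
            for j in range(Nb):
                if abs(i - j) > max_async:
                    continue                  # hard limit on stream desynchronization
                best, arg = NEG, (i, j)
                # Predecessors: stay, advance stream A, advance stream B, advance both.
                for pi, pj in [(i, j), (i - 1, j), (i, j - 1), (i - 1, j - 1)]:
                    if pi >= 0 and pj >= 0 and delta[pi, pj] > best:
                        best, arg = delta[pi, pj], (pi, pj)
                if best == NEG:
                    continue                  # state not yet reachable
                new[i, j] = (best + scores_a[t, i] + scores_b[t, j]
                             - async_penalty * abs(i - j))
                back[t, i, j] = arg
        delta = new
    # Backtrace from the final state, where both streams have reached their last unit.
    path = [(Na - 1, Nb - 1)]
    for t in range(T - 1, 0, -1):
        path.append(tuple(back[t][path[-1]]))
    return delta[Na - 1, Nb - 1], path[::-1]

In a phrase recognizer built along these lines, one such search would be run against the per-stream unit sequences of each vocabulary entry, and the highest-scoring phrase would be reported.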
2. Related work
A comprehensive review of AVSR research can be found in [17]. Here, we will briefly mention work related to the use of discriminative classifiers for visual speech recognition (VSR), as well as work on multi-stream and feature-based modeling of speech.
In [6], an approach using discriminative classifiers was proposed for visual-only speech recognition. One Support Vector Machine (SVM) was trained to recognize each viseme, and its output was converted to a posterior probability using a sigmoidal mapping. These probabilities were