Visual Speech Recognition with Loosely Synchronized Feature Streams
Kate Saenko, Karen Livescu, Michael Siracusa, Kevin Wilson, James Glass, and Trevor Darrell
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
32 Vassar Street, Cambridge, MA, 02139, USA
{saenko,klivescu,siracusa,kwilson,jrg,trevor}@csail.mit.edu
Abstract
We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting coarticulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of coarticulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/nonspeech classification, as well as recognition performance against several baseline systems.
1. Introduction
The focus of most audio-visual speech recognition (AVSR) research is to find effective ways of combining video with existing audio-only ASR systems [15]. However, in some cases it is difficult to extract useful information from the audio. Take, for example, a simple voice-controlled car stereo system. One would like the user to be able to play, pause, or switch tracks or stations with simple commands, keeping their hands on the wheel and their attention on the road. In this situation, the audio is corrupted not only by the car's engine and traffic noise, but also by the music coming from the stereo, so almost all useful speech information is in the video. Nevertheless, few authors have focused on visual-only speech recognition as a stand-alone problem, and those systems that do perform visual-only recognition are usually limited to digit tasks. In these systems, speech is typically detected by relying on the audio signal to provide the segmentation of the video stream into speech and nonspeech [13].
A key issue is that the articulators (e.g., the tongue and lips) can evolve asynchronously from each other, especially in spontaneous speech, producing varying degrees of coarticulation. Since existing systems treat speech as a sequence of atomic viseme units, they require many context-dependent visemes to deal with coarticulation [17]. An alternative is to model the multiple underlying physical components of human speech production, or articulatory features (AFs) [10]. The varying degrees of asynchrony between AF trajectories can be naturally represented using a multi-stream model (see Section 3.2).
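As a purely schematic illustration of this point, consider the toy frame labelings below. The stream names and values are hypothetical and are not the AF inventory used in this paper; they only show how two streams can change value at different frames (e.g., rounding beginning before the opening gesture for an upcoming rounded vowel), forcing a viseme-based system to allocate a distinct atomic unit for every combination that occurs.

# Schematic Python example with a hypothetical two-feature AF inventory.
lip_opening  = ["closed", "closed", "narrow", "wide", "wide", "closed"]
lip_rounding = ["neutral", "rounded", "rounded", "rounded", "neutral", "neutral"]

# The viseme view labels each frame with a single atomic unit, so every
# distinct combination of the two streams becomes its own unit.
visemes = sorted(set(zip(lip_opening, lip_rounding)))
print(len(visemes))   # 5 combined units, versus 3 + 2 per-stream values

With more features, longer contexts, and real coarticulation patterns, the number of such combined (or context-dependent) units grows quickly, while the per-stream inventories stay small; this is the motivation for modeling the streams separately and coupling them only loosely.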
In this paper, we describe an end-to-end vision-only approach to detecting and recognizing spoken phrases, including visual detection of speech activity. We use articulatory features as an alternative to visemes, and a dynamic Bayesian network (DBN) for recognition with multiple loosely synchronized streams. The observations of the DBN are the outputs of discriminative AF classifiers. We evaluate our approach on a set of commands that can be used to control a car stereo system.
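The DBN itself is defined in Section 3.2. As a rough, self-contained intuition for what decoding with loosely synchronized streams involves, the sketch below runs a joint Viterbi search over two streams of per-frame AF classifier scores, allowing the two streams' unit indices to drift apart by at most max_async positions. The function name, interface, hard asynchrony limit, and optional soft penalty are illustrative assumptions, not the model used in this paper.

import numpy as np

def decode_two_streams(scores_a, scores_b, max_async=1, async_penalty=0.0):
    """Joint Viterbi search over two loosely synchronized streams of units.

    scores_a: (T, Na) per-frame log-scores for the Na units of stream A
              (e.g., one articulatory feature's classifier outputs).
    scores_b: (T, Nb) per-frame log-scores for the Nb units of stream B.
    A state is a pair (i, j) of unit indices; at each frame either index may
    stay or advance by one, and |i - j| <= max_async enforces loose synchrony.
    Illustrative sketch only, not the paper's DBN implementation.
    """
    T, Na = scores_a.shape
    _, Nb = scores_b.shape
    NEG = -np.inf
    delta = np.full((Na, Nb), NEG)           # best score of any path ending in (i, j)
    delta[0, 0] = scores_a[0, 0] + scores_b[0, 0]
    back = np.zeros((T, Na, Nb, 2), dtype=int)
    for t in range(1, T):
        new = np.full((Na, Nb), NEG)
        for i in range(Na):
            for j in range(Nb):
                if abs(i - j) > max_async:
                    continue                  # hard limit on stream desynchronization
                best, arg = NEG, (i, j)
                # Predecessors: stay, advance stream A, advance stream B, advance both.
                for pi, pj in [(i, j), (i - 1, j), (i, j - 1), (i - 1, j - 1)]:
                    if pi >= 0 and pj >= 0 and delta[pi, pj] > best:
                        best, arg = delta[pi, pj], (pi, pj)
                if best == NEG:
                    continue                  # state not yet reachable
                new[i, j] = (best + scores_a[t, i] + scores_b[t, j]
                             - async_penalty * abs(i - j))
                back[t, i, j] = arg
        delta = new
    # Backtrace from the final state, where both streams have reached their last unit.
    path = [(Na - 1, Nb - 1)]
    for t in range(T - 1, 0, -1):
        path.append(tuple(back[t][path[-1]]))
    return delta[Na - 1, Nb - 1], path[::-1]

In a phrase recognizer built along these lines, one such search would be run against the per-stream unit sequences of each vocabulary entry, and the highest-scoring phrase would be reported.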
2. Related work
A comprehensive review of AVSR research can be found in [17]. Here, we will briefly mention work related to the use of discriminative classifiers for visual speech recognition (VSR), as well as work on multi-stream and feature-based modeling of speech.
In [6], an approach using discriminative classifiers was proposed for visual-only speech recognition. One Support Vector Machine (SVM) was trained to recognize each viseme, and its output was converted to a posterior probability using a sigmoidal mapping. These probabilities were