Foreword
strained task domain. Two of the systems, Harpy and Hearsay-II, both developed at Carnegie Mellon University, achieved the original goals and in some instances surpassed them.
During the last three decades that I have been at Carnegie Mellon, I have been very fortunate to work with many brilliant students and researchers. Xuedong Huang, Alex Acero and Hsiao-Wuen Hon were arguably among the most outstanding researchers in the speech group at CMU. Since then they have moved to Microsoft, where they have put together a world-class team at Microsoft Research. Over the years, they have contributed to standards for building spoken language understanding systems through Microsoft’s SAPI/SDK family of products, and have pushed the technologies forward together with the rest of the community. Today, they continue to play a premier leadership role in both the research community and industry.
The new book “Spoken Language Processing” by Huang, Acero and Hon represents a
welcome addition to the technical literature on this increasingly important emerging area of
Information Technology. As we move from desktop PCs to personal digital assistants (PDAs), wearable computers, and Internet cell phones, speech becomes a central, if not the only, means of communication between human and machine! Huang, Acero, and Hon have undertaken the commendable task of creating a comprehensive reference covering the theoretical, algorithmic, and systems aspects of the spoken language tasks of recognition, synthesis, and understanding.
The task of spoken language communication requires a system to recognize, interpret,
execute and respond to a spoken query. This task is complicated by the fact that the speech
signal is corrupted by many sources: noise in the background, characteristics of the micro-
phone, vocal tract characteristics of the speakers, and differences in pronunciation. In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language. To solve the problem, an effective system must strive to utilize all the available sources of knowledge, i.e., acoustics, phonetics and phonology, the lexical, syntactic, and semantic structure of language, and task-specific, context-dependent information.
Speech is based on a sequence of discrete sound segments that are linked in time.
These segments, called phonemes, are assumed to have unique articulatory and acoustic
characteristics. While the human vocal apparatus can produce an almost infinite number of
articulatory gestures, the number of phonemes is limited. English as spoken in the United
States, for example, contains 16 vowel and 24 consonant sounds. Each phoneme has distin-
guishable acoustic characteristics and, in combination with other phonemes, forms larger
units such as syllables and words. Knowledge about the acoustic differences among these
sound units is essential to distinguish one word from another, say “bit” from “pit.”
When speech sounds are connected to form larger linguistic units, the acoustic charac-
teristics of a given phoneme will change as a function of its immediate phonetic environment
because of the interaction among various anatomical structures (such as the tongue, lips, and
vocal cords) and their different degrees of sluggishness. The result is an overlap of phone-
mic information in the acoustic signal from one segment to the other. For example, the same
underlying phoneme “t” can have drastically different acoustic characteristics in different
words, say, in “tea,” “tree,” “city,” “beaten,” and “steep.” This effect, known as coarticula-
tion, can occur within a given word or across a word boundary. Thus, the word “this” will
have very different acoustic properties in phrases such as “this car” and “this ship.”