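Speech Recognition Technology: From Basic Theory to System Implementation

"Speech recognition is the technology that converts human speech into readable text. It is widely used in human-computer interaction, speech-to-text transcription, speech translation, and related areas. This book explores the theory and practice of speech recognition and its neighboring fields in depth, and is suitable for professionals and enthusiasts alike."

Speech recognition is an important area of information technology that draws on speech signal processing, natural language processing, and artificial intelligence. It lets computers understand and transcribe spoken language, which makes hands-free human-computer interaction possible in applications such as intelligent assistants, in-car navigation systems, and automated telephone customer service.

1. **Motivation and application scenarios**
   - **Spoken language interface**: a convenient way to interact with devices, especially for users who cannot easily use a keyboard or touch screen (for example, drivers and people with physical disabilities).
   - **Speech-to-speech translation**: converts speech in one language into speech in another in real time, supporting cross-language communication.
   - **Knowledge partner**: speech recognition can power personal assistants such as smart speakers that help users find information and carry out tasks.
2. **Spoken language system architecture**
   - **Automatic speech recognition (ASR)**: the core recognition component, which converts the input audio signal into text.
   - **Text-to-speech (TTS)**: converts text into audible speech and, together with ASR, forms a complete spoken interaction system.
   - **Spoken language understanding (SLU)**: parses the recognized text and extracts its meaning so that the system can respond appropriately.
3. **Book organization**
   - **Fundamental theory**: the basic concepts and techniques of speech recognition.
   - **Speech processing**: feature extraction and processing of speech signals.
   - **Speech recognition**: ASR algorithms and models in depth.
   - **Text-to-speech systems**: the principles behind TTS.
   - **Spoken language systems**: the design of complete spoken interaction systems.
4. **Target readers and historical perspective**
   - The book is aimed at researchers, engineers, and readers with an interest in the field.
   - The historical perspective reviews the development of speech recognition technology and provides references for further study.
5. **Structure of spoken language**
   - **Sound and the human speech system**: the basics of acoustics and how humans produce and perceive speech.
   - **Phonetics and phonology**: the phoneme as the smallest unit of speech, how phonemes vary with context, and the effect of speaking rate.
   - **Syllables and words**: syllable structure and word formation, which underpin the recognition process.
   - **Syntax and semantics**: the structural rules of language (syntax) and the representation of meaning (semantics), both essential for understanding and generating natural language.

Through these chapters, readers can learn the basic principles of speech recognition, the physical properties of speech signals, how to process and analyze speech data, and how to build an effective speech recognition system. The treatment of syntactic and semantic analysis also helps improve a system's ability to understand language.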

Foreword

Recognition and understanding of spontaneous unrehearsed speech remains an elusive goal. To understand speech, a human considers not only the specific information conveyed to the ear, but also the context in which the information is being discussed. For this reason, people can understand spoken language even when the speech signal is corrupted by noise. However, understanding the context of speech is, in turn, based on a broad knowledge of the world. And this has been the source of the difficulty and over forty years of research.

It is difficult to develop computer programs that are sufficiently sophisticated to understand continuous speech by a random speaker. Only when programmers simplify the problem, by isolating words, limiting the vocabulary or number of speakers, or constraining the way in which sentences may be formed, is speech recognition by computer possible.

Since the early 1970s, researchers at AT&T, BBN, CMU, IBM, Lincoln Labs, MIT, and SRI have made major contributions in Spoken Language Understanding Research. In 1971, the Defense Advanced Research Projects Agency (DARPA) initiated an ambitious five-year, $15 million, multisite effort to develop speech-understanding systems. The goals were to develop systems that would accept continuous speech from many speakers, with minimal speaker adaptation, and operate on a 1000-word vocabulary, artificial syntax, and a constrained task domain. Two of the systems, Harpy and Hearsay-II, both developed at Carnegie-Mellon University, achieved the original goals and in some instances surpassed them.

During the last three decades I have been at Carnegie Mellon, I have been very fortunate to be able to work with many brilliant students and researchers. Xuedong Huang, Alex Acero and Hsiao-Wuen Hon were arguably among the outstanding researchers in the speech group at CMU. Since then they have moved to Microsoft and have put together a world-class team at Microsoft Research. Over the years, they have contributed standards for building spoken language understanding systems through Microsoft’s SAPI/SDK family of products, and pushed the technologies forward with the rest of the community. Today, they continue to play a premier leadership role in both the research community and in industry.

The new book “Spoken Language Processing” by Huang, Acero and Hon represents a welcome addition to the technical literature on this increasingly important emerging area of Information Technology. As we move from desktop PCs to personal digital assistants (PDAs), wearable computers, and Internet cell phones, speech becomes a central, if not the only, means of communication between the human and machine! Huang, Acero, and Hon have undertaken a commendable task of creating a comprehensive reference manuscript covering theoretical, algorithmic and systems aspects of spoken language tasks of recognition, synthesis and understanding.

The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query. This task is complicated by the fact that the speech signal is corrupted by many sources: noise in the background, characteristics of the microphone, vocal tract characteristics of the speakers, and differences in pronunciation. In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language. To solve the problem, an effective system must strive to utilize all the available sources of knowledge, i.e., acoustics, phonetics and phonology, the lexical, syntactic and semantic structure of language, and task-specific, context-dependent information.

Speech is based on a sequence of discrete sound segments that are linked in time. These segments, called phonemes, are assumed to have unique articulatory and acoustic characteristics. While the human vocal apparatus can produce an almost infinite number of articulatory gestures, the number of phonemes is limited. English as spoken in the United States, for example, contains 16 vowel and 24 consonant sounds. Each phoneme has distinguishable acoustic characteristics and, in combination with other phonemes, forms larger units such as syllables and words. Knowledge about the acoustic differences among these sound units is essential to distinguish one word from another, say “bit” from “pit.”

When speech sounds are connected to form larger linguistic units, the acoustic characteristics of a given phoneme will change as a function of its immediate phonetic environment because of the interaction among various anatomical structures (such as the tongue, lips, and vocal cords) and their different degrees of sluggishness. The result is an overlap of phonemic information in the acoustic signal from one segment to the other. For example, the same underlying phoneme “t” can have drastically different acoustic characteristics in different words, say, in “tea,” “tree,” “city,” “beaten,” and “steep.” This effect, known as coarticulation, can occur within a given word or across a word boundary. Thus, the word “this” will have very different acoustic properties in phrases such as “this car” and “this ship.”
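
To make the coarticulation effect concrete, here is a minimal Python sketch (not taken from the foreword) of one common way to make phonetic context explicit: each phone is relabeled with its left and right neighbors, a so-called triphone. The `left-center+right` label notation, the function name `to_triphones`, the `sil` boundary symbol, and the ARPAbet-style transcriptions of the two phrases are all assumptions chosen for illustration.

```python
from typing import List


def to_triphones(phones: List[str], boundary: str = "sil") -> List[str]:
    """Relabel each phone as left-center+right so its context is explicit."""
    # Pad both ends so the first and last phones also have a neighbor.
    padded = [boundary, *phones, boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Assumed ARPAbet-style transcriptions of the foreword's two phrases.
this_car = to_triphones(["DH", "IH", "S", "K", "AA", "R"])
this_ship = to_triphones(["DH", "IH", "S", "SH", "IH", "P"])

# The final "S" of "this" receives a different label in each phrase.
print(this_car[2])   # IH-S+K
print(this_ship[2])  # IH-S+SH
```

Because the two labels differ, a model keyed on them can in principle learn separate acoustics for the same phoneme on either side of a word boundary.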

PREFACE

Our primary motivation in writing this book is to share our working experience to bridge the gap between the knowledge of industry gurus and newcomers to the spoken language processing community. Many powerful techniques hide in conference proceedings and academic papers for years before becoming widely recognized by the research community or the industry. We spent many years pursuing spoken language technology research at Carnegie Mellon University before we started spoken language R&D at Microsoft. We fully understand that it is by no means a small undertaking to transfer a state-of-the-art spoken language research system into a commercially viable product that can truly help people improve their productivity. Our experience in both industry and academia is reflected in the context of this book, which presents a contemporary and comprehensive description of both theoretical and practical issues in spoken language processing. This book is intended for people of diverse academic and practical backgrounds. Speech scientists, computer scientists, linguists, engineers, physicists and psychologists all have a unique perspective on spoken language processing. This book will be useful to all of these special interest groups.

Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology. There are a number of excellent books on the sub-fields of spoken language processing, including speech recognition, text-to-speech conversion, and spoken language understanding, but there is no single book that covers both theoretical and practical aspects of these sub-fields and spoken language interface design. We devote many chapters to systematically introducing the fundamental theories needed to understand how speech recognition, text-to-speech synthesis, and spoken language understanding work. Even more important is the fact that the book highlights what works well in practice, which is invaluable if you want to build a practical speech recognizer, a practical text-to-speech synthesizer, or a practical spoken language system. Using numerous real examples from developing Microsoft’s spoken language systems, we concentrate on showing how the fundamental theories can be applied to solve real problems in spoken language processing.

We would like to thank many people who helped us during our spoken language processing R&D careers. We are particularly indebted to Professor Raj Reddy at the School of Computer Science, Carnegie Mellon University. Under his leadership, Carnegie Mellon University has become a center of research excellence on spoken language processing. Today’s computer industry and academia have benefited tremendously from his leadership and contributions.

Special thanks are due to Microsoft for its encouragement of spoken language R&D. The management team at Microsoft has been extremely generous to us. We are particularly grateful to Bill Gates, Nathan Myhrvold, Rick Rashid, Dan Ling, and Jack Breese for the great environment they created for us at Microsoft Research.

Scott Meredith helped us write a number of chapters in this book and deserves to be a co-author. His insight and experience in text-to-speech synthesis enriched this book a great deal. We also owe gratitude to many colleagues we worked with in the speech technology group of Microsoft Research. In alphabetic order, Bruno Alabiso, Fil Alleva, Ciprian Chelba, James Droppo, Doug Duchene, Li Deng, Joshua Goodman, Mei-Yuh Hwang, Derek Jacoby, Y.C. Ju, Li Jiang, Ricky Loynd, Milind Mahajan, Peter Mau, Salman Mughal, Mike Plumpe, Scott Quinn, Mike Rozak, Gina Venolia, Kuansan Wang, and Ye-Yi Wang not only developed many algorithms and systems described in this book, but also helped to shape our thoughts from the very beginning.

In addition to those people, we want to thank Les Atlas, Alan Black, Jeff Bilmes, David Caulton, Eric Chang, Phil Chou, Dinei Florencio, Allen Gersho, Francisco Gimenez-Galanes, Hynek Hermansky, Kai-Fu Lee, Henrique Malvar, Mari Ostendorf, Joseph Pentheroudakis, Tandy Trower, Wayne Ward, and Charles Wayne. They provided us with many wonderful comments to refine this book. Tim Moore and Russ Hall at Prentice Hall helped us finish this book in a finite amount of time.

Finally, writing this book was a marathon that could not have been finished without the support of our spouses Yingzhi, Donna, and Phen, during the many evenings and weekends we spent on this project.

Redmond, WA
October 2000

Xuedong Huang
Alejandro Acero
Hsiao-Wuen Hon
