Foreword
strained task domain. Two of the systems, Harpy and Hearsay-II, both developed at Carnegie Mellon University, achieved the original goals and in some instances surpassed them.
During the last three decades that I have been at Carnegie Mellon, I have been very fortunate to work with many brilliant students and researchers. Xuedong Huang, Alex Acero and Hsiao-Wuen Hon were arguably among the most outstanding researchers in the speech group at CMU. Since then they have moved to Microsoft, where they have put together a world-class team at Microsoft Research. Over the years, they have contributed to standards for building spoken language understanding systems through Microsoft’s SAPI/SDK family of products, and have pushed the technologies forward together with the rest of the community. Today, they continue to play a premier leadership role in both the research community and industry.
The new book “Spoken Language Processing” by Huang, Acero and Hon represents a
welcome addition to the technical literature on this increasingly important emerging area of
Information Technology. As we move from desktop PCs to personal digital assistants (PDAs), wearable computers, and Internet cell phones, speech becomes a central, if not the only, means of communication between human and machine! Huang, Acero, and Hon have undertaken the commendable task of creating a comprehensive reference covering the theoretical, algorithmic, and systems aspects of the spoken language tasks of recognition, synthesis, and understanding.
The task of spoken language communication requires a system to recognize, interpret,
execute and respond to a spoken query. This task is complicated by the fact that the speech
signal is corrupted by many sources: noise in the background, characteristics of the micro-
phone, vocal tract characteristics of the speakers, and differences in pronunciation. In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language. To solve the problem, an effective system must strive to utilize all the available sources of knowledge, i.e., acoustics, phonetics and phonology, the lexical, syntactic, and semantic structure of language, and task-specific, context-dependent information.
Speech is based on a sequence of discrete sound segments that are linked in time.
These segments, called phonemes, are assumed to have unique articulatory and acoustic
characteristics. While the human vocal apparatus can produce an almost infinite number of
articulatory gestures, the number of phonemes is limited. English as spoken in the United
States, for example, contains 16 vowel and 24 consonant sounds. Each phoneme has distin-
guishable acoustic characteristics and, in combination with other phonemes, forms larger
units such as syllables and words. Knowledge about the acoustic differences among these
sound units is essential to distinguish one word from another, say “bit” from “pit.”
When speech sounds are connected to form larger linguistic units, the acoustic charac-
teristics of a given phoneme will change as a function of its immediate phonetic environment
because of the interaction among various anatomical structures (such as the tongue, lips, and
vocal cords) and their different degrees of sluggishness. The result is an overlap of phone-
mic information in the acoustic signal from one segment to the other. For example, the same
underlying phoneme “t” can have drastically different acoustic characteristics in different
words, say, in “tea,” “tree,” “city,” “beaten,” and “steep.” This effect, known as coarticula-
tion, can occur within a given word or across a word boundary. Thus, the word “this” will
have very different acoustic properties in phrases such as “this car” and “this ship.”