Dr. Xuedong Huang on the History of Speech Processing and Pre-DNN Techniques: Understanding Speech Recognition Systems in Depth


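Spoken Language Processing, written by Dr. Xuedong Huang with Alex Acero and Hsiao-Wuen Hon, is a specialist text on the fundamentals and techniques of speech processing, with a focus on the conventional speech recognition methods that predate deep neural networks (DNNs). It suits readers who want a deeper understanding of speech technology, whether for academic research, development work, or practical everyday use.

Chapter 1 opens with the motivation for the field: the importance of spoken language interfaces as a natural mode of interaction, the potential of speech-to-speech translation, and integration with knowledge-partner systems. These applications have driven speech processing forward and made it a key bridge for human-machine communication. The chapter's second section then describes the architecture of a spoken language system, which comprises automatic speech recognition (ASR), text-to-speech conversion (TTS), and spoken language understanding (SLU). ASR converts the speech signal into text and is the core of the system; it involves feature extraction, model training, and decoding. TTS converts text into audible speech output using speech synthesis techniques. SLU interprets the semantic content of speech, typically through lexical analysis and syntactic parsing.

The book is organized into five parts. Part I presents the fundamental theory: the physical properties of sound and the mechanism of human speech production, phones and phonemes, the structure of syllables and words, and the basic principles of syntax and semantics. It provides the theoretical foundation for the techniques that follow. Part II examines speech processing techniques, covering acoustic models, language models, and how the two are combined for recognition. This material helps in understanding the role of modern deep learning in speech recognition, for example the evolution from HMM-GMM (hidden Markov models with Gaussian mixture models) to DNN-HMM systems. Part III is devoted to speech recognition, covering classical template matching, hidden Markov models (HMMs), and more advanced statistical modeling methods, which were the main approaches before the rise of DNNs; it traces how the technology evolved. Parts IV and V address text-to-speech systems and the design of spoken language systems respectively, including speech synthesis techniques and how to build a complete spoken interaction system, with coverage of synthesis engines, synthesis strategies, and synthesis evaluation.

The intended readership is broad: students, researchers, engineers, and hobbyists with an interest in speech technology. The authors also offer a historical perspective on the development of the field and recommend further reading to help readers follow later research. Overall, Spoken Language Processing is a practical and detailed guide that outlines the fundamentals of speech processing, works through the technical details, and explains both the scientific principles and the real-world applications behind speech technology. It is a book worth keeping on the shelf.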
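The summary above mentions combining an acoustic model with a language model during recognition. As a minimal sketch of that idea (not code from the book), the snippet below picks the transcript with the highest sum of an acoustic log-likelihood and a weighted language-model log-probability. The candidate transcripts, their scores, and the names `candidates`, `combined_score`, `decode`, and `lm_weight` are all illustrative assumptions.

```python
# Hypothetical candidate transcripts with made-up log-domain scores.
# "acoustic" stands in for log P(X | W) from an acoustic model,
# "lm" stands in for log P(W) from a language model.
candidates = {
    "recognize speech": {"acoustic": -120.0, "lm": -8.0},
    "wreck a nice beach": {"acoustic": -118.5, "lm": -14.0},
}


def combined_score(scores, lm_weight=1.0):
    """Return log P(X | W) + lm_weight * log P(W) for one candidate."""
    return scores["acoustic"] + lm_weight * scores["lm"]


def decode(candidates, lm_weight=1.0):
    """Return the candidate transcript with the highest combined score."""
    return max(candidates, key=lambda w: combined_score(candidates[w], lm_weight))


if __name__ == "__main__":
    # With the language model included, the more plausible word sequence wins.
    print(decode(candidates))
    # With the language model switched off, the acoustically closer one wins.
    print(decode(candidates, lm_weight=0.0))
```

In this toy setup the language model overrides a slightly better acoustic match, which is the reason the two knowledge sources are used together. A real decoder searches a very large hypothesis space instead of scoring a fixed list.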

Foreword

strained task domain. Two of the systems, Harpy and Hearsay-II, both developed at Carnegie-Mellon University, achieved the original goals and in some instances surpassed them.

During the last three decades I have been at Carnegie Mellon, I have been very fortunate to be able to work with many brilliant students and researchers. Xuedong Huang, Alex Acero and Hsiao-Wuen Hon were arguably among the outstanding researchers in the speech group at CMU. Since then they have moved to Microsoft and have put together a world-class team at Microsoft Research. Over the years, they have contributed with standards for building spoken language understanding systems with Microsoft’s SAPI/SDK family of products, and pushed the technologies forward with the rest of the community. Today, they continue to play a premier leadership role in both the research community and in industry.

The new book “Spoken Language Processing” by Huang, Acero and Hon represents a welcome addition to the technical literature on this increasingly important emerging area of Information Technology. As we move from desktop PCs to personal digital assistants (PDAs), wearable computers, and Internet cell phones, speech becomes a central, if not the only, means of communication between the human and machine! Huang, Acero, and Hon have undertaken a commendable task of creating a comprehensive reference manuscript covering theoretical, algorithmic and systems aspects of spoken language tasks of recognition, synthesis and understanding.

The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query. This task is complicated by the fact that the speech signal is corrupted by many sources: noise in the background, characteristics of the microphone, vocal tract characteristics of the speakers, and differences in pronunciation. In addition, the system has to cope with non-grammaticality of spoken communication and ambiguity of language. To solve the problem, an effective system must strive to utilize all the available sources of knowledge, i.e., acoustics, phonetics and phonology, lexical, syntactic and semantic structure of language, and task specific context dependent information.

Speech is based on a sequence of discrete sound segments that are linked in time. These segments, called phonemes, are assumed to have unique articulatory and acoustic characteristics. While the human vocal apparatus can produce an almost infinite number of articulatory gestures, the number of phonemes is limited. English as spoken in the United States, for example, contains 16 vowel and 24 consonant sounds. Each phoneme has distinguishable acoustic characteristics and, in combination with other phonemes, forms larger units such as syllables and words. Knowledge about the acoustic differences among these sound units is essential to distinguish one word from another, say “bit” from “pit.”

When speech sounds are connected to form larger linguistic units, the acoustic characteristics of a given phoneme will change as a function of its immediate phonetic environment because of the interaction among various anatomical structures (such as the tongue, lips, and vocal cords) and their different degrees of sluggishness. The result is an overlap of phonemic information in the acoustic signal from one segment to the other. For example, the same underlying phoneme “t” can have drastically different acoustic characteristics in different words, say, in “tea,” “tree,” “city,” “beaten,” and “steep.” This effect, known as coarticulation, can occur within a given word or across a word boundary. Thus, the word “this” will have very different acoustic properties in phrases such as “this car” and “this ship.”
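One common way recognizers account for this context dependence, though it is not named in this excerpt, is to label each phoneme together with its left and right neighbors (a "triphone"). The sketch below is only an illustration under that assumption: the function name `to_triphones`, the `left-center+right` label format, the `sil` boundary symbol, and the ARPAbet-style phoneme spellings are choices made here, not taken from the book.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone as left-center+right, padding the ends with a boundary symbol."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Assumed ARPAbet-style pronunciations for the words in the foreword's example.
THIS = ["dh", "ih", "s"]
CAR = ["k", "aa", "r"]
SHIP = ["sh", "ih", "p"]

this_car = to_triphones(THIS + CAR)
this_ship = to_triphones(THIS + SHIP)

if __name__ == "__main__":
    # The final "s" of "this" receives a different label in each phrase,
    # because its right-hand neighbor crosses the word boundary.
    print(this_car[2])   # label for "s" before "car"
    print(this_ship[2])  # label for "s" before "ship"
```

The first two labels are identical in both phrases, while the third differs, which mirrors the point that the same word can have different acoustic properties depending on what follows it.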

PREFACE

Our primary motivation in writing this book is to share our working experience to bridge the gap between the knowledge of industry gurus and newcomers to the spoken language processing community. Many powerful techniques hide in conference proceedings and academic papers for years before becoming widely recognized by the research community or the industry. We spent many years pursuing spoken language technology research at Carnegie Mellon University before we started spoken language R&D at Microsoft. We fully understand that it is by no means a small undertaking to transfer a state of the art spoken language research system into a commercially viable product that can truly help people improve their productivity. Our experience in both industry and academia is reflected in the context of this book, which presents a contemporary and comprehensive description of both theoretic and practical issues in spoken language processing. This book is intended for people of diverse academic and practical backgrounds. Speech scientists, computer scientists, linguists, engineers, physicists and psychologists all have a unique perspective to spoken language processing. This book will be useful to all of these special interest groups.

Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology. There are a number of excellent books on the sub-fields of spoken language processing, including speech recognition, text to speech conversion, and spoken language understanding, but there is no single book that covers both theoretical and practical aspects of these sub-fields and spoken language interface design. We devote many chapters to systematically introducing fundamental theories needed to understand how speech recognition, text to speech synthesis, and spoken language understanding work. Even more important is the fact that the book highlights what works well in practice, which is invaluable if you want to build a practical speech recognizer, a practical text to speech synthesizer, or a practical spoken language system. Using numerous real examples in developing Microsoft’s spoken language systems, we concentrate on showing how the fundamental theories can be applied to solve real problems in spoken language processing.

We would like to thank many people who helped us during our spoken language processing R&D careers. We are particularly indebted to Professor Raj Reddy at the School of Computer Science, Carnegie Mellon University. Under his leadership, Carnegie Mellon University has become a center of research excellence on spoken language processing. Today’s computer industry and academia benefited tremendously from his leadership and contributions.

Special thanks are due to Microsoft for its encouragement of spoken language R&D. The management team at Microsoft has been extremely generous to us. We are particularly grateful to Bill Gates, Nathan Myhrvold, Rick Rashid, Dan Ling, and Jack Breese for the great environment they created for us at Microsoft Research.

Scott Meredith helped us write a number of chapters in this book and deserves to be a co-author. His insight and experience in text to speech synthesis enriched this book a great deal. We also owe gratitude to many colleagues we worked with in the speech technology group of Microsoft Research. In alphabetic order, Bruno Alabiso, Fil Alleva, Ciprian Chelba, James Droppo, Doug Duchene, Li Deng, Joshua Goodman, Mei-Yuh Hwang, Derek Jacoby, Y.C. Ju, Li Jiang, Ricky Loynd, Milind Mahajan, Peter Mau, Salman Mughal, Mike Plumpe, Scott Quinn, Mike Rozak, Gina Venolia, Kuansan Wang, and Ye-Yi Wang, not only developed many algorithms and systems described in this book, but also helped to shape our thoughts from the very beginning.

In addition to those people, we want to thank Les Atlas, Alan Black, Jeff Bilmes, David Caulton, Eric Chang, Phil Chou, Dinei Florencio, Allen Gersho, Francisco Gimenez-Galanes, Hynek Hermansky, Kai-Fu Lee, Henrique Malvar, Mari Ostendorf, Joseph Pentheroudakis, Tandy Trower, Wayne Ward, and Charles Wayne. They provided us with many wonderful comments to refine this book. Tim Moore and Russ Hall at Prentice Hall helped us finish this book in a finite amount of time.

Finally, writing this book was a marathon that could not have been finished without the support of our spouses Yingzhi, Donna, and Phen, during the many evenings and weekends we spent on this project.

Xuedong Huang
Alejandro Acero
Hsiao-Wuen Hon

Redmond, WA
October 2000
