1.1. SPEECH RECOGNITION AND COMPUTATION 3
the number of hypotheses to be compared is |V|^L (e.g., |V|^L = 10^15 for |V| = 1000 and L = 5).
If L is unknown, i.e., in the general case, the number of hypotheses becomes Σ_{l=1}^{L̂} |V|^l, where L̂ is an assumed upper
bound of L. If we enumerate all the hypotheses and compare each of them with the input signal,
the complexity becomes O(|V|^L̂ × |R̄| × |S|). Thus, dealing with a large vocabulary in continuous
speech recognition potentially has a large impact on the amount of computation required in the
decoding process.
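As a quick sanity check of the counts above, a short Python sketch (using the values from the text's example) computes both the fixed-length and bounded-length hypothesis counts:

```python
# Illustrative arithmetic only; the values of |V|, L, and L_hat are taken
# from (or assumed consistent with) the example in the text.
V = 1000   # vocabulary size |V|
L = 5      # known sentence length

# Fixed length: |V|^L hypotheses.
fixed = V ** L
print(fixed)     # 1000000000000000, i.e., 10^15

# Unknown length up to an assumed bound L_hat: sum over l = 1..L_hat of |V|^l.
L_hat = 5
general = sum(V ** l for l in range(1, L_hat + 1))
print(general)   # 1001001001001000
```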
In fact, we do not have to enumerate all the hypotheses in LVCSR. Instead, we can use
one-pass DP matching [BBC82, Ney84], which is a basic but efficient approach to continuous
speech recognition. This method is called the one-pass Viterbi algorithm when probabilistic models
such as HMMs and an n-gram language model are used. If the HMMs are of the typical left-to-right
type, i.e., each state has only one self-loop and one exiting transition, the computational complexity
is O(|V|^{n−1} × |M̄| × |S|), where n is typically 2 or 3 and |M̄| is the number of HMM states per
word. Thus, the total computation is much less than that required for the full enumeration.
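The time-synchronous update for such a left-to-right HMM can be sketched as follows; the function name and the log-probability inputs are illustrative assumptions, not from the text. Because each state only receives probability from itself and its left neighbor, the per-frame cost is linear in the number of states:

```python
def viterbi_left_to_right(log_obs, log_self, log_next):
    """Time-synchronous Viterbi sketch for a left-to-right HMM in which
    each state has one self-loop and one exiting transition.

    log_obs[t][j] : log-likelihood of frame t in state j
    log_self[j]   : log-probability of the self-loop of state j
    log_next[j]   : log-probability of the transition from state j to j+1
    Returns the best log-score of ending in the last state.
    """
    T, N = len(log_obs), len(log_obs[0])
    NEG = float("-inf")
    # score[j]: best log-score of being in state j after the current frame
    score = [NEG] * N
    score[0] = log_obs[0][0]          # decoding must start in state 0
    for t in range(1, T):
        new = [NEG] * N
        for j in range(N):
            stay = score[j] + log_self[j]
            enter = score[j - 1] + log_next[j - 1] if j > 0 else NEG
            new[j] = max(stay, enter) + log_obs[t][j]
        score = new
    return score[-1]
```

A real decoder would also keep back-pointers to recover the best state (and word) sequence; this sketch tracks only the score.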
Most current speech recognition decoders are based on the one-pass Viterbi algorithm (for details,
see Chapter 2). The algorithm ensures that the best hypothesis for a speech signal is found with the
given acoustic and language models. However, it is expensive to search for the best sequence of
words in a large vocabulary of over ten thousand words. In DARPA projects, search strategies for
efficient decoding, which do not necessarily ensure the best hypothesis, were intensively investigated
together with highly accurate acoustic and language models.
The most practical approach to reducing the computation for LVCSR decoding is to abandon
the verification of all possible hypotheses. Beam search, originally introduced in 1976 [Low76],
is the most popular method for reducing the number of hypotheses verified during the decoding
process. With the Viterbi algorithm, partial sentence hypotheses are extended synchronously
with time from the beginning of the speech. With the beam search, relatively unpromising partial
hypotheses are selected and pruned out at each time frame. Those pruned hypotheses are no longer
extended. As a result, the amount of computation can be reduced significantly since only some of
the hypotheses are evaluated until the end of the speech signal. However, there is a potential risk
that a correct partial hypothesis that would have become the best sentence hypothesis is lost by
pruning. To reduce such pruning errors, the beam search has been improved with various methods
such as look-ahead techniques.
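Frame-level beam pruning can be sketched as follows; the score-margin criterion and all names here are illustrative assumptions (practical decoders often combine such a score margin with a cap on the number of surviving hypotheses):

```python
def prune_beam(hypotheses, beam_width):
    """Beam-pruning sketch: keep only hypotheses whose log-score is within
    `beam_width` of the best score at the current frame; the rest are
    discarded and never extended again.

    hypotheses : dict mapping a partial hypothesis to its log-score
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}
```

For example, with a beam width of 5.0, a hypothesis scored 10.0 below the frame-best hypothesis is dropped, which is exactly where the pruning-error risk described above arises.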
On the other hand, efficient representations of the search space were also investigated to reduce
redundancy. A tree-organized lexicon was successfully introduced to represent the LVCSR search
space. It is a data structure that shares the pronunciation prefixes of the words in the vocabulary
as a prefix tree (also called a trie). This structure is also effective in alleviating the rapid growth
of hypotheses at word boundaries, which would otherwise increase pruning errors. Without a
tree-organized lexicon, there are |V | possible branches from the end of each word when extending
partial hypotheses. By using this tree structure, the number of branches decreases to at most the
number of phones.
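A minimal sketch of such a tree-organized lexicon follows; the toy words and phone symbols are illustrative assumptions, not from the text. All words sharing a pronunciation prefix share a single branch, so the fan-out at a word boundary is bounded by the number of distinct phones rather than by |V|:

```python
def build_prefix_tree(lexicon):
    """Build a prefix tree (trie) over word pronunciations, sharing common
    phone prefixes.  `lexicon` maps words to phone sequences."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node.setdefault("#", set()).add(word)   # "#" marks a word end
    return root

# Assumed toy lexicon with phone-like pronunciations.
lexicon = {
    "speed":  ["s", "p", "iy", "d"],
    "speech": ["s", "p", "iy", "ch"],
    "spin":   ["s", "p", "ih", "n"],
}
tree = build_prefix_tree(lexicon)
# All three words share the prefix s -> p, so only one arc leaves the root.
print(len(tree))   # 1
```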