HTK Toolkit: An Introduction to Speech Recognition and HMM-GMM Acoustic Models


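"HTK (Hidden Markov Model Toolkit) is a toolkit for building hidden Markov models (HMMs) and can be used to model any time series. Its core functionality is general-purpose, and it is most widely used in speech recognition."

The HTK Book is a guide to the HTK toolkit written jointly by several experts, covering topics from the basics to advanced use. This edition was revised for HTK version 3.3 and incorporates multiple updates made between 1995 and 2005. It gives an in-depth treatment of HMMs and speech recognition.

1. Basic principles of hidden Markov models (HMMs): An HMM is a statistical model commonly used to represent and analyse time-varying processes such as speech signals. The basic idea is to treat the observation sequence as the output of a hidden state sequence. Each state generates observations according to its own distribution, and states are linked by probabilistic transitions.

2. Isolated word recognition: In speech recognition, isolated word recognition means recognising words spoken on their own, without using context. It is usually the foundation of speech recognition and is useful for simple command-and-control applications.

3. Output probability specification: An HMM is defined by the emission probabilities from states to observations and by the transition probabilities between states. The output probability is the probability that the model generates a given observation from a given state.

4. Baum-Welch re-estimation: This is the main method for learning HMM parameters. It iteratively optimises the model parameters with the EM (expectation-maximisation) algorithm so that they fit the distribution of the actual data more closely.

5. Recognition and Viterbi decoding: The Viterbi algorithm is the optimal path search algorithm for HMMs. It finds the state sequence most likely to have produced a given observation sequence and is a key step in speech recognition.

6. Continuous speech recognition: In contrast to isolated word recognition, continuous speech recognition processes a continuous stream of speech. It has to handle context between words as well as pauses, and usually involves more complex models and decoding strategies.

7. Speaker adaptation: Speaker adaptation techniques let the model adjust to the pronunciation characteristics of a particular speaker, improving recognition performance across different speakers.

8. HTK software architecture and tools: The toolkit contains a set of tools for data preparation, model training, building recognition systems and so on. The architecture is modular, which makes it easy to use and extend. The data preparation tools process raw audio so that it is suitable for model training, and the training tools build and optimise the HMMs.

By studying the HTK Book in depth, readers will be able to understand and use HTK to build and optimise their own speech recognition systems, covering the whole process from data preparation and model training to the implementation of the recogniser. The book also discusses advanced topics such as speaker adaptation, error analysis and system evaluation, providing a solid foundation for further research.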


1.4 Baum-Welch Re-Estimation

independent. Furthermore, mixture components can be considered to be a special form of sub-state

in which the transition probabilities are the mixture weights (see Fig. 1.5).

Thus, the essential problem is to estimate the means and variances of a HMM in which each

state output distribution is a single component Gaussian, that is

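$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \exp\left(-\frac{1}{2}(o_t - \mu_j)^{\top} \Sigma_j^{-1} (o_t - \mu_j)\right) \tag{1.10}$$

where $n$ is the dimensionality of the observation vectors.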

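If there were just one state $j$ in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of $\mu_j$ and $\Sigma_j$ would be just the simple averages, that is

$$\hat{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t \tag{1.11}$$

and

$$\hat{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_j)(o_t - \mu_j)^{\top} \tag{1.12}$$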

In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HInit. HInit first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.

Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector $o_t$ contributes to the computation of the maximum likelihood parameter values for each state $j$. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if $L_j(t)$ denotes the probability of being in state $j$ at time $t$ then the equations 1.11 and 1.12 given above become the following weighted averages

$$\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)} \tag{1.13}$$

and

$$\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t)\, (o_t - \mu_j)(o_t - \mu_j)^{\top}}{\sum_{t=1}^{T} L_j(t)} \tag{1.14}$$

where the summations in the denominators are included to give the required normalisation.

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).
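As an illustration only, the weighted averages of equations 1.13 and 1.14 can be sketched in a few lines of plain Python. The function name `reestimate_gaussian`, the list-of-lists data layout and the toy numbers are assumptions of this sketch, not part of HTK. The sketch also centres the covariance on the newly estimated mean, which is one common choice rather than something the text specifies.

```python
def reestimate_gaussian(observations, occupancy):
    """Weighted mean and covariance of one state (equations 1.13 and 1.14).

    observations: list of T vectors, each a list of n floats.
    occupancy:    list of T state-occupation probabilities L_j(t).
    """
    n = len(observations[0])
    total = sum(occupancy)  # shared denominator of both formulae

    # Equation 1.13: occupancy-weighted average of the observations.
    mean = [sum(w * o[d] for w, o in zip(occupancy, observations)) / total
            for d in range(n)]

    # Equation 1.14: occupancy-weighted average of the outer products,
    # centred here on the new mean (an assumption of this sketch).
    cov = [[sum(w * (o[r] - mean[r]) * (o[c] - mean[c])
                for w, o in zip(occupancy, observations)) / total
            for c in range(n)]
           for r in range(n)]
    return mean, cov


obs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

# With equal weights the result reduces to the simple averages of 1.11 and 1.12.
mean, cov = reestimate_gaussian(obs, [1.0, 1.0, 1.0])
print(mean)

# A frame with zero occupancy has no influence on this state's estimates.
mean_w, cov_w = reestimate_gaussian(obs, [0.5, 0.5, 0.0])
print(mean_w, cov_w)
```

Keeping the numerator sums and the denominator `total` separate mirrors the way the formulae are written, so the same sums could be accumulated over several training sequences before dividing.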

Of course, to apply equations 1.13 and 1.14, the probability of state occupation $L_j(t)$ must be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the forward probability $\alpha_j(t)$ for some model $M$ with $N$ states be defined as

$$\alpha_j(t) = P(o_1, \ldots, o_t, x(t) = j \mid M). \tag{1.15}$$

That is, $\alpha_j(t)$ is the joint probability of observing the first $t$ speech vectors and being in state $j$ at time $t$. (Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.) This forward probability can be efficiently calculated by the following recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t). \tag{1.16}$$


This recursion depends on the fact that the probability of being in state $j$ at time $t$ and seeing observation $o_t$ can be deduced by summing the forward probabilities for all possible predecessor states $i$ weighted by the transition probability $a_{ij}$. The slightly odd limits are caused by the fact that states 1 and $N$ are non-emitting. The initial conditions for the above recursion are

$$\alpha_1(1) = 1 \tag{1.17}$$

$$\alpha_j(1) = a_{1j}\, b_j(o_1) \tag{1.18}$$

for $1 < j < N$ and the final condition is given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}. \tag{1.19}$$

Notice here that from the definition of $\alpha_j(t)$,

$$P(O \mid M) = \alpha_N(T). \tag{1.20}$$

Hence, the calculation of the forward probability also yields the total likelihood P (O|M).
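The forward recursion and its boundary conditions (equations 1.16 to 1.20) might be sketched as follows. This is a minimal illustration in linear arithmetic, not HTK code: the function name `forward`, the representation of the output likelihoods as a precomputed table `b[t][j]`, the 0-based indexing (index 0 is the entry state, index N-1 the exit state) and the toy model are all assumptions of the sketch.

```python
def forward(a, b):
    """Forward probabilities for an HMM with non-emitting entry/exit states.

    a: N x N transition matrix; index 0 is the entry state and index N-1
       the exit state (states 1 and N in the text).
    b: T x N table with b[t][j] = output likelihood of frame t in state j
       (entries for the two non-emitting states are ignored).
    Returns (alpha, total), where total is P(O|M).
    """
    N, T = len(a), len(b)
    emitting = range(1, N - 1)
    alpha = [[0.0] * N for _ in range(T)]

    # Equation 1.18: leave the entry state and emit the first frame.
    for j in emitting:
        alpha[0][j] = a[0][j] * b[0][j]

    # Equation 1.16: sum over all emitting predecessors, then emit frame t.
    for t in range(1, T):
        for j in emitting:
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in emitting) * b[t][j]

    # Equations 1.19 and 1.20: enter the exit state after the last frame.
    total = sum(alpha[T - 1][i] * a[i][N - 1] for i in emitting)
    return alpha, total


# Toy 4-state model (2 emitting states) and 3 frames; all numbers are made up.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8, 0.0]]
alpha, likelihood = forward(a, b)
print(likelihood)
```

For this toy model only two state sequences have non-zero probability, so the returned likelihood can be checked by hand as the sum of their two path probabilities.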

The backward probability $\beta_j(t)$ is defined as

$$\beta_j(t) = P(o_{t+1}, \ldots, o_T \mid x(t) = j, M). \tag{1.21}$$

As in the forward case, this backward probability can be computed efficiently using the following recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1) \tag{1.22}$$

with initial condition given by

$$\beta_i(T) = a_{iN} \tag{1.23}$$

for $1 < i < N$ and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1). \tag{1.24}$$

Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,

$$\alpha_j(t)\, \beta_j(t) = P(O, x(t) = j \mid M). \tag{1.25}$$

Hence,

$$L_j(t) = P(x(t) = j \mid O, M) = \frac{P(O, x(t) = j \mid M)}{P(O \mid M)} = \frac{1}{P}\, \alpha_j(t)\, \beta_j(t) \tag{1.26}$$

where $P = P(O \mid M)$.
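To make the role of the two passes concrete, the following sketch combines a forward pass, a backward pass (equations 1.22 and 1.23) and the occupation probability of equation 1.26. It is a hedged illustration in linear arithmetic: the name `forward_backward`, the table `b[t][j]` of precomputed output likelihoods, the 0-based state indices and the toy numbers are assumptions, not HTK's interface.

```python
def forward_backward(a, b):
    """Return (L, P): state occupation probabilities and total likelihood.

    a: N x N transition matrix, index 0 = entry state, N-1 = exit state.
    b: T x N table of output likelihoods b[t][j] for frame t in state j.
    """
    N, T = len(a), len(b)
    em = range(1, N - 1)  # indices of the emitting states
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]

    # Forward pass (equations 1.18 and 1.16).
    for j in em:
        alpha[0][j] = a[0][j] * b[0][j]
    for t in range(1, T):
        for j in em:
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in em) * b[t][j]

    # Backward pass (equations 1.23 and 1.22), from the last frame down.
    for i in em:
        beta[T - 1][i] = a[i][N - 1]
    for t in range(T - 2, -1, -1):
        for i in em:
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in em)

    # Total likelihood (equations 1.19 and 1.20).
    P = sum(alpha[T - 1][i] * a[i][N - 1] for i in em)

    # Equation 1.26: L_j(t) = alpha_j(t) * beta_j(t) / P.
    L = [[alpha[t][j] * beta[t][j] / P for j in range(N)] for t in range(T)]
    return L, P


# Toy 4-state model (2 emitting states) and 3 frames; all numbers are made up.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8, 0.0]]
L, P = forward_backward(a, b)
for row in L:
    print(row)
```

A useful sanity check is that the occupation probabilities of the emitting states sum to one at every frame, which follows directly from equation 1.25.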

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch

algorithm is now in place. The steps in this algorithm may be summarised as follows

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as accumulators.

2. Calculate the forward and backward probabilities for all states $j$ and times $t$.

3. For each state $j$ and time $t$, use the probability $L_j(t)$ and the current observation vector $o_t$ to update the accumulators for that state.

4. Use the final accumulator values to calculate new parameter values.

5. If the value of $P = P(O \mid M)$ for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.

Note on non-emitting states: To understand equations involving a non-emitting state at time $t$, the time should be thought of as being $t - \delta t$ if it is an entry state, and $t + \delta t$ if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place between frames.

Note on accumulators: Normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can result in the denominator summations being different. Hence, in HTK the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.

All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is carried out in HTK using log arithmetic.
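The effect of log arithmetic can be seen in a small experiment. The sketch below is not HTK's routine: the helper name `log_add`, the use of negative infinity as the log of zero and the numbers are assumptions chosen for illustration. It shows that a long product underflows in linear arithmetic but stays representable as a sum of logs, and how the additions required by the forward and backward recursions can be done without leaving the log domain.

```python
import math

LOG_ZERO = float("-inf")  # stands for log(0); a choice made for this sketch


def log_add(x, y):
    """Return log(exp(x) + exp(y)) without leaving the log domain."""
    if x == LOG_ZERO:
        return y
    if y == LOG_ZERO:
        return x
    hi, lo = max(x, y), min(x, y)
    # Factor out the larger term so the exponential cannot overflow; at
    # worst the smaller term underflows harmlessly to zero.
    return hi + math.log1p(math.exp(lo - hi))


# Multiplying 400 probabilities of 0.01 underflows in linear arithmetic ...
linear = 1.0
for _ in range(400):
    linear *= 0.01
print(linear)

# ... but is an ordinary number when kept as a sum of logs.
log_p = 400 * math.log(0.01)
print(log_p)

# Adding two equal tiny probabilities doubles them: log(2p) = log(p) + log(2).
doubled = log_add(log_p, log_p)
print(doubled - log_p)
```

Products become sums in the log domain, so only the summations in equations 1.16 and 1.22 need a helper of this kind.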

The HTK program which implements the above algorithm is called HRest. In combination

with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word

HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.

1.5 Recognition and Viterbi Decoding

The previous section has described the basic ideas underlying HMM parameter re-estimation using the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for computing the forward probability also yielded as a by-product the total likelihood $P(O \mid M)$. Thus, this algorithm could also be used to find the model which yields the maximum value of $P(O \mid M_i)$ and hence, it could be used for recognition.

In practice, however, it is preferable to base recognition on the maximum likelihood state sequence since this generalises easily to the continuous speech case whereas the use of the total probability does not. This likelihood is computed using essentially the same algorithm as the forward probability calculation except that the summation is replaced by a maximum operation. For a given model $M$, let $\phi_j(t)$ represent the maximum likelihood of observing speech vectors $o_1$ to $o_t$ and being in state $j$ at time $t$. This partial likelihood can be computed efficiently using the following recursion (cf. equation 1.16)

$$\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\, b_j(o_t) \tag{1.27}$$

where

$$\phi_1(1) = 1 \tag{1.28}$$

$$\phi_j(1) = a_{1j}\, b_j(o_1) \tag{1.29}$$

for $1 < j < N$. The maximum likelihood $\hat{P}(O \mid M)$ is then given by

$$\phi_N(T) = \max_i \{ \phi_i(T)\, a_{iN} \} \tag{1.30}$$

As for the re-estimation case, the direct computation of likelihoods leads to underflow, hence, log likelihoods are used instead. The recursion of equation 1.27 then becomes

$$\psi_j(t) = \max_i \{ \psi_i(t-1) + \log(a_{ij}) \} + \log(b_j(o_t)). \tag{1.31}$$

This recursion forms the basis of the so-called Viterbi algorithm. As shown in Fig. 1.6, this algorithm can be visualised as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time). Each large dot in the picture represents the log probability of observing that frame at that time and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left-to-right column-by-column. At time $t$, each partial path $\psi_i(t-1)$ is known for all states $i$, hence equation 1.31 can be used to compute $\psi_j(t)$ thereby extending the partial paths by one time frame.
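The column-by-column path extension of equation 1.31 can be sketched as below. This is an illustrative implementation under stated assumptions rather than anything taken from HTK: the names `viterbi` and `safe_log`, the back-pointer table used to recover the path, the precomputed table `b[t][j]` and the toy model are all choices of this sketch. State indices are 0-based, with index 0 the entry state and N-1 the exit state.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(a, b):
    """Best state sequence and its log likelihood (equation 1.31)."""
    N, T = len(a), len(b)
    em = range(1, N - 1)
    psi = [[NEG_INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]  # best predecessor of each state

    # Log form of equation 1.29.
    for j in em:
        psi[0][j] = safe_log(a[0][j]) + safe_log(b[0][j])

    # Equation 1.31: extend every partial path by one frame.
    for t in range(1, T):
        for j in em:
            best = max(em, key=lambda i: psi[t - 1][i] + safe_log(a[i][j]))
            back[t][j] = best
            psi[t][j] = psi[t - 1][best] + safe_log(a[best][j]) + safe_log(b[t][j])

    # Log form of equation 1.30: best path into the exit state.
    last = max(em, key=lambda i: psi[T - 1][i] + safe_log(a[i][N - 1]))
    log_like = psi[T - 1][last] + safe_log(a[last][N - 1])

    # Follow the back-pointers from the final frame to recover the path.
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, log_like


# Toy 4-state model (2 emitting states) and 3 frames; all numbers are made up.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8, 0.0]]
path, log_like = viterbi(a, b)
print(path, math.exp(log_like))
```

The only structural change from a forward pass is that `max` replaces the sum, which is why a single best predecessor can be stored per state and frame.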

Fig. 1.6 The Viterbi Algorithm for Isolated Word Recognition (figure not reproduced: vertical axis State, horizontal axis Speech Frame (Time), frames 1 to 6)

This concept of a path is extremely important and it is generalised below to deal with the

continuous speech case.

This completes the discussion of isolated word recognition using HMMs. There is no HTK tool

which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided

which along with its supporting libraries, HNet and HRec, is designed to handle continuous

speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a

special case. This is discussed in more detail below.

1.6 Continuous Speech Recognition

Returning now to the conceptual model of speech production and recognition exemplified by Fig. 1.1, it should be clear that the extension to continuous speech simply involves connecting HMMs together in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol. These could be either whole words for so-called connected speech recognition or sub-words such as phonemes for continuous speech recognition. The reason for including the non-emitting entry and exit states should now be evident: these states provide the glue needed to join models together.

There are, however, some practical difficulties to overcome. The training data for continuous speech must consist of continuous utterances and, in general, the boundaries dividing the segments of speech corresponding to each underlying sub-word model in the sequence will not be known. In practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of the segments corresponding to a given model can then be extracted and the isolated word style of training described above can be used. However, the amount of data obtainable in this way is usually very limited and the resultant models will be poor estimates. Furthermore, even if there was a large amount of data, the boundaries imposed by hand-marking may not be optimal as far as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word models is regarded as a bootstrap operation. (These bootstrap operations can even be avoided altogether by using a flat start as described in section 8.3.) The main training phase involves the use of a tool called HERest which does embedded training.

Embedded training uses the same Baum-Welch procedure as for the isolated case but rather than training each model individually all models are trained in parallel. It works in the following steps:

1. Allocate and zero accumulators for all parameters of all HMMs.

2. Get the next training utterance.

3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol

transcription of the training utterance.

4. Calculate the forward and backward probabilities for the composite HMM. The inclusion

of intermediate non-emitting states in the composite model requires some changes to the

computation of the forward and backward probabilities but these are only minor. The details

are given in chapter 8.

5. Use the forward and backward probabilities to compute the probabilities of state occupation

at each time frame and update the accumulators in the usual way.

6. Repeat from 2 until all training utterances have been processed.

7. Use the accumulators to calculate new parameter estimates for all of the HMMs.

These steps can then all be repeated as many times as is necessary to achieve the required convergence. Notice that although the location of symbol boundaries in the training data is not required (or wanted) for this procedure, the symbolic transcription of each training utterance is needed.
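Step 3 of the procedure above, building a composite HMM from the transcription, can be illustrated for transition matrices alone. The sketch below is a simplification and not how HERest works: it assumes that no model has a direct entry-to-exit transition, and under that assumption it folds each intermediate pair of exit and entry states into direct transitions between emitting states instead of keeping them as explicit non-emitting states. The function name `join_models` and the toy model are hypothetical.

```python
def join_models(models):
    """Join HMM transition matrices in sequence into one composite matrix.

    Each model is an n x n matrix whose index 0 is the entry state and
    index n-1 the exit state. Assumes no model has an entry-to-exit
    transition. The result has one entry state, all emitting states in
    order, and one exit state.
    """
    sizes = [len(m) - 2 for m in models]  # emitting states per model
    offsets = [1 + sum(sizes[:k]) for k in range(len(models))]
    total = sum(sizes) + 2
    comp = [[0.0] * total for _ in range(total)]

    # The composite entry state behaves like the first model's entry state.
    first = models[0]
    for j in range(1, len(first) - 1):
        comp[0][offsets[0] + j - 1] = first[0][j]

    for k, m in enumerate(models):
        n, off = len(m), offsets[k]
        for i in range(1, n - 1):
            row = off + i - 1
            # Transitions between emitting states inside model k.
            for j in range(1, n - 1):
                comp[row][off + j - 1] = m[i][j]
            exit_prob = m[i][n - 1]
            if k + 1 < len(models):
                # Leave model k and immediately enter model k+1.
                nxt, noff = models[k + 1], offsets[k + 1]
                for j in range(1, len(nxt) - 1):
                    comp[row][noff + j - 1] = exit_prob * nxt[0][j]
            else:
                # The last model exits into the composite exit state.
                comp[row][total - 1] = exit_prob
    return comp


# Toy 4-state model (2 emitting states); the numbers are made up.
phone = [[0.0, 1.0, 0.0, 0.0],
         [0.0, 0.6, 0.4, 0.0],
         [0.0, 0.0, 0.7, 0.3],
         [0.0, 0.0, 0.0, 0.0]]
comp = join_models([phone, phone])
for row in comp:
    print(row)
```

Because the composite matrix has the same entry/exit layout as a single model, the forward and backward recursions of section 1.4 apply to it unchanged in this simplified setting.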

Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are relatively minor, the corresponding extensions to the Viterbi algorithm are more substantial.

In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing Model. In brief, the token passing model makes the concept of a state alignment path explicit. Imagine each state $j$ of a HMM at time $t$ holds a single moveable token which contains, amongst other information, the partial log probability $\psi_j(t)$. This token then represents a partial match between the observation sequence $o_1$ to $o_t$ and the model subject to the constraint that the model is in state $j$ at time $t$. The path extension algorithm represented by the recursion of equation 1.31 is then replaced by the equivalent token passing algorithm which is executed at each time frame $t$. The key steps in this algorithm are as follows

1. Pass a copy of every token in state $i$ to all connecting states $j$, incrementing the log probability of the copy by $\log[a_{ij}] + \log[b_j(o_t)]$.

2. Examine the tokens in every state and discard all but the token with the highest probability.

In practice, some modifications are needed to deal with the non-emitting states but these are straightforward if the tokens in entry states are assumed to represent paths extended to time $t - \delta t$ and tokens in exit states are assumed to represent paths extended to time $t + \delta t$.
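The two token passing steps can be sketched for a single model as follows. This is a hedged illustration, not the implementation in HVite or HRec: the representation of a token as a `(log_prob, history)` pair, the names `token_passing` and `safe_log`, the precomputed table `b[t][j]` and the toy numbers are assumptions of the sketch. Entry and exit states are handled by seeding the tokens before the first frame and collecting them after the last.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def token_passing(a, b):
    """Viterbi decoding phrased as token passing.

    A token is a (log_prob, history) pair, where history is the tuple of
    emitting-state indices visited so far.
    """
    N, T = len(a), len(b)
    em = range(1, N - 1)

    # Tokens leave the entry state and absorb the first frame.
    tokens = {j: (safe_log(a[0][j]) + safe_log(b[0][j]), (j,)) for j in em}

    for t in range(1, T):
        new_tokens = {}
        for j in em:
            # Step 1: every state i passes a copy of its token to state j,
            # incremented by log a_ij + log b_j(o_t).
            copies = [(lp + safe_log(a[i][j]) + safe_log(b[t][j]), hist + (j,))
                      for i, (lp, hist) in tokens.items()]
            # Step 2: keep only the most probable token in state j.
            new_tokens[j] = max(copies, key=lambda tok: tok[0])
        tokens = new_tokens

    # Pass the surviving tokens into the exit state and keep the best one.
    final = [(lp + safe_log(a[i][N - 1]), hist)
             for i, (lp, hist) in tokens.items()]
    return max(final, key=lambda tok: tok[0])


# Toy 4-state model (2 emitting states) and 3 frames; all numbers are made up.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8, 0.0]]
best = token_passing(a, b)
print(best[1], math.exp(best[0]))
```

Carrying the history inside the token removes the need for a separate back-pointer table, which is the property that makes the formulation convenient once models are joined into larger networks.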

The point of using the Token Passing Model is that it extends very simply to the continuous speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM instances and the square boxes denote word-end nodes. This composite network is essentially just a single large HMM and the above Token Passing algorithm applies. The only difference now is that more information is needed beyond the log probability of the best token. When the best token reaches the end of the speech, the route it took through the network must be known in order to recover the recognised sequence of models.

Note on the Baum-Welch extensions: In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel operation on a network of machines.

Note on the Token Passing Model: See "Token Passing: a Conceptual Model for Connected Speech Recognition Systems", SJ Young, NH Russell and JHS Thornton, CUED Technical Report F INFENG/TR38, Cambridge University, 1989. Available by anonymous ftp from svr-ftp.eng.cam.ac.uk.
