HTK技术手册：语音识别与HMM基础

需积分: 10 29 浏览量更新于2024-07-24 收藏 2.35MB PDF 举报

"HTK帮助文档是一份官方的技术文档，用于支持语音识别的研究和论文写作。这份文档涵盖了HTK工具包的基础知识和应用，适合需要理解HTK工作原理和使用方法的读者。" HTK（ Hidden Markov Model Toolkit）是用于构建和训练隐马尔可夫模型（HMMs）的开源工具包，主要用于语音识别领域。HTK Book是这份文档的核心，由多个专家共同编著，并在多次修订中不断更新以适应HTK的不同版本，包括从最初的2.1版到3.4版的改进。文档的主要内容分为两大部分：第一部分是教程概述，深入浅出地介绍了HTK的基本概念和原理： 1. 隐马尔可夫模型（HMMs）的一般原则：这部分讲解了HMMs如何被用来建模语音信号，以及它们在语音识别中的核心作用。 2. 孤立词识别：讨论了如何使用HTK处理孤立的、不连续的单词识别任务。 3. 输出概率规范：解释了HTK如何处理和表示模型的输出概率。 4. Baum-Welch重估计：这是HMM参数学习的一种方法，用于优化模型性能。 5. 识别与维特比解码：介绍维特比算法，它是HMMs进行序列最优化和识别的关键算法。 6. 连续语音识别：讨论如何处理连续的、非孤立的语音流。 7. 发音者适应：探讨了如何根据不同的说话人调整模型以提高识别准确性。第二部分是对HTK工具包的总体概览： 1. HTK软件架构：阐述了HTK工具包的整体结构和组成部分。 2. HTK工具的通用属性：描述了HTK工具共有的特征和使用方式。 3. HTK工具的简要介绍：列举并解释了一些主要的HTK工具，如HHEd用于编辑HMM模型，HVite用于训练和识别等。通过这份文档，读者不仅可以理解HTK的工作机制，还能学习如何使用HTK进行语音识别系统的设计、训练和评估。对于从事语音识别研究或开发的人员来说，HTK帮助文档是一份极其宝贵的参考资料。

1.4 Baum-Welch Re-Estimation 7

If there was just one state j in the HMM, this parameter estimation would be easy. The maximum

likelihood estimates of µ

and Σ

would be just the simple averages, that is

t=1

(1.11)

and

t=1

− µ

)(o

− µ

)

(1.12)

In practice, of course, there are multiple states and there is no direct assignment of observation

vectors to individual states because the underlying state sequence is unknown. Note, however, that

if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12

could be used to give the required initial values for the parameters. Indeed, this is exactly what

is done in the HTK tool called HInit. HInit ﬁrst divides the training observation vectors equally

amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean

and variance of each state. It then ﬁnds the maximum likelihood state sequence using the Viterbi

algorithm described below, reassigns the observation vectors to states and then uses equations 1.11

and 1.12 again to get b etter initial values. This process is repeated until the estimates do not

change.

Since the full likelihood of each observation sequence is based on the summation of all possi-

ble state sequences, each observation vector o

contributes to the computation of the maximum

likelihood parameter values for each state j. In other words, instead of assigning each observation

vector to a speciﬁc state as in the above approximation, each observation is assigned to every state

in proportion to the probability of the model being in that state when the vector was observed.

Thus, if L

(t) denotes the probability of being in state j at time t then the equations 1.11 and 1.12

given above become the following weighted averages

t=1

(t)o

t=1

(t)

(1.13)

and

t=1

(t)(o

− µ

)(o

− µ

)

t=1

(t)

(1.14)

where the summations in the denominators are included to give the required normalisation.

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covari-

ances of a HMM. A similar but slightly more complex formula can be derived for the transition

probabilities (see chapter 8).

Of course, to apply equations 1.13 and 1.14, the probability of state occupation L

(t) must

be calculated. This is done eﬃciently using the so-called Forward-Backward algorithm. Let the

forward probability

(t) for some model M with N states be deﬁned as

(t) = P (o

, . . . , o

, x(t) = j|M ). (1.15)

That is, α

(t) is the joint probability of observing the ﬁrst t speech vectors and being in state j at

time t. This forward probability can be eﬃciently calculated by the following recursion

(t) =

N−1

i=2

(t − 1)a

). (1.16)

This recursion depends on the fact that the probability of being in state j at time t and seeing

observation o

can be deduced by summing the forward probabilities for all possible predecessor

states i weighted by the transition probability a

. The slightly odd limits are caused by the fact

that states 1 and N are non-emitting

. The initial conditions for the above recursion are

(1) = 1 (1.17)

Since the output distributions are densities, these are not really probabilities but it is a convenient ﬁction.

To understand equations involving a non-emitting state at time t, the time should be thought of as being t − δt

if it is an entry state, and t + δ t if it is an exit state. This becomes important when HMMs are connected together

in sequence so that transitions across non-emitting states take place between frames.

1.4 Baum-Welch Re-Estimation 8

(1) = a

) (1.18)

for 1 < j < N and the ﬁnal condition is given by

(T ) =

N−1

i=2

(T )a

. (1.19)

Notice here that from the deﬁnition of α

(t),

P (O|M) = α

(T ). (1.20)

Hence, the calculation of the forward probability also yields the total likelihood P (O|M).

The backward probability β

(t) is deﬁned as

(t) = P (o

t+1

, . . . , o

|x(t) = j, M). (1.21)

As in the forward case, this backward probability can b e computed eﬃciently using the following

recursion

(t) =

N−1

j=2

t+1

)β

(t + 1) (1.22)

with initial condition given by

(T ) = a

(1.23)

for 1 < i < N and ﬁnal condition given by

(1) =

N−1

j=2

)β

(1). (1.24)

Notice that in the deﬁnitions above, the forward probability is a joint probability whereas the

backward probability is a conditional probability. This somewhat asymmetric deﬁnition is deliberate

since it allows the probability of state occupation to be determined by taking the pro duct of the

two probabilities. From the deﬁnitions,

(t)β

(t) = P (O, x(t) = j|M ). (1.25)

Hence,

(t) = P (x(t) = j|O, M) (1.26)

P (O, x(t) = j|M)

P (O|M )

(t)β

(t)

where P = P (O|M).

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch

algorithm is now in place. The steps in this algorithm may be summarised as follows

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator

and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage

locations are referred to as accumulators

2. Calculate the forward and backward probabilities for all states j and times t.

3. For each state j and time t, use the probability L

(t) and the current observation vector o

to update the accumulators for that state.

Note that normally the summations in the denominators of the re-estimation formulae are identical across the

parameter sets of a given state and therefore only a single common storage location for the denominators is required

and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can

result in the denominator summations being diﬀerent. Hence, in HTK the denominator summations are always

stored and calculated individually for each distinct parameter vector or matrix.

1.5 Recognition and Viterbi Decoding 9

4. Use the ﬁnal accumulator values to calculate new parameter values.

5. If the value of P = P (O|M ) for this iteration is not higher than the value at the previous

iteration then stop, otherwise repeat the above steps using the new re-estimated parameter

values.

All of the above assumes that the parameters for a HMM are re-estimated from a single ob-

servation sequence, that is a single example of the spoken word. In practice, many examples are

needed to get good parameter estimates. However, the use of multiple observation sequences adds

no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct

training sequence.

One ﬁnal point that should b e mentioned is that the computation of the forward and backward

probabilities involves taking the product of a large number of probabilities. In practice, this means

that the actual numbers involved become very small. Hence, to avoid numerical problems, the

forward-backward computation is computed in HTK using log arithmetic.

The HTK program which implements the above algorithm is called HRest. In combination

with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word

HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.

1.5 Recognition and Viterbi Decoding

The previous section has described the basic ideas underlying HMM parameter re-estimation using

the Baum-Welch algorithm. In passing, it was noted that the eﬃcient recursive algorithm for

computing the forward probability also yielded as a by-product the total likelihood P (O|M). Thus,

this algorithm could also be used to ﬁnd the model which yields the maximum value of P (O|M

and hence, it could be used for recognition.

In practice, however, it is preferable to base recognition on the maximum likelihood state se-

quence since this generalises easily to the continuous speech case whereas the use of the total

probability does not. This likelihood is computed using essentially the same algorithm as the for-

ward probability calculation except that the summation is replaced by a maximum operation. For

a given model M, let φ

(t) represent the maximum likelihood of observing speech vectors o

and being in state j at time t. This partial likelihood can be computed eﬃciently using the

following recursion (cf. equation 1.16)

(t) = max

{φ

(t − 1)a

). (1.27)

where

(1) = 1 (1.28)

(1) = a

) (1.29)

for 1 < j < N. The maximum likelihood

P (O|M) is then given by

(T ) = max

{φ

(T )a

} (1.30)

As for the re-estimation case, the direct computation of likelihoods leads to underﬂow, hence,

log likelihoods are used instead. The recursion of equation 1.27 then becomes

(t) = max

{ψ

(t − 1) + log(a

)} + log(b

)). (1.31)

This recursion forms the basis of the so-called Viterbi algorithm. As shown in Fig. 1.6, this algorithm

can be visualised as ﬁnding the best path through a matrix where the vertical dimension represents

the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time).

Each large dot in the picture represents the log probability of observing that frame at that time and

each arc between dots corresponds to a log transition probability. The log probability of any path

is computed simply by summing the log transition probabilities and the log output probabilities

along that path. The paths are grown from left-to-right column-by-column. At time t, each partial

path ψ

(t − 1) is known for all states i, hence equation 1.31 can be used to compute ψ

(t) thereby

extending the partial paths by one time frame.

1.6 Continuous Speech Recognition 10

State

Speech

Frame

(Time)

1 2 3 4 5 6

( )

Fig. 1.6 The Viterbi Algorithm for Isolated Word

Recognition

This concept of a path is extremely important and it is generalised below to deal with the

continuous speech case.

This completes the discussion of isolated word recognition using HMMs. There is no HTK tool

which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided

which along with its supp orting libraries, HNet and HRec, is designed to handle continuous

speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a

special case. This is discussed in more detail below.

1.6 Continuous Speech Recognition

Returning now to the conceptual model of speech production and recognition exempliﬁed by Fig. 1.1,

it should be clear that the extension to continuous speech simply involves connecting HMMs together

in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol.

These could be either whole words for so-called connected speech recognition or sub-words such as

phonemes for continuous speech recognition. The reason for including the non-emitting entry and

exit states should now be evident, these states provide the glue needed to join models together.

There are, however, some practical diﬃculties to overcome. The training data for continuous

speech must consist of continuous utterances and, in general, the boundaries dividing the segments

of speech corresponding to each underlying sub-word model in the sequence will not be known. In

practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of

the segments corresponding to a given model can then be extracted and the isolated word style

of training described above can be used. However, the amount of data obtainable in this way is

usually very limited and the resultant models will be poor estimates. Furthermore, even if there

was a large amount of data, the boundaries imposed by hand-marking may not be optimal as far

as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word

models is regarded as a bootstrap operation

. The main training phase involves the use of a tool

called HERest which does embedded training.

Embedded training uses the same Baum-Welch procedure as for the isolated case but rather

than training each model individually all models are trained in parallel. It works in the following

steps:

1. Allocate and zero accumulators for all parameters of all HMMs.

2. Get the next training utterance.

3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol

transcription of the training utterance.

They can even be avoided altogether by using a ﬂat start as described in section 8.3.

1.6 Continuous Speech Recognition 11

4. Calculate the forward and backward probabilities for the composite HMM. The inclusion

of intermediate non-emitting states in the composite model requires some changes to the

computation of the forward and backward probabilities but these are only minor. The details

are given in chapter 8.

5. Use the forward and backward probabilities to compute the probabilities of state occupation

at each time frame and update the accumulators in the usual way.

6. Repeat from 2 until all training utterances have been processed.

7. Use the accumulators to calculate new parameter estimates for all of the HMMs.

These steps can then all be repeated as many times as is necessary to achieve the required conver-

gence. Notice that although the location of symbol boundaries in the training data is not required

(or wanted) for this procedure, the symbolic transcription of each training utterance is needed.

Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are

relatively minor

, the corresponding extensions to the Viterbi algorithm are more substantial.

In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing

Model

. In brief, the token passing model makes the concept of a state alignment path explicit.

Imagine each state j of a HMM at time t holds a single moveable token which contains, amongst

other information, the partial log probability ψ

(t). This token then represents a partial match

between the observation sequence o

to o

and the model subject to the constraint that the model

is in state j at time t. The path extension algorithm represented by the recursion of equation 1.31

is then replaced by the equivalent token passing algorithm which is executed at each time frame t.

The key steps in this algorithm are as follows

1. Pass a copy of every token in state i to all connecting states j, incrementing the log probability

of the copy by log[a

] + log [ b

(o(t)].

2. Examine the tokens in every state and discard all but the token with the highest probability.

In practice, some modiﬁcations are needed to deal with the non-emitting states but these are

straightforward if the tokens in entry states are assumed to represent paths extended to time t −δt

and tokens in exit states are assumed to represent paths extended to time t + δt.

The point of using the Token Passing Model is that it extends very simply to the continuous

speech case. Suppose that the allowed sequence of HMMs is deﬁned by a ﬁnite state network. For

example, Fig. 1.7 shows a simple network in which each word is deﬁned as a sequence of phoneme-

based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM

instances and the square boxes denote word-end nodes. This composite network is essentially just

a single large HMM and the above Token Passing algorithm applies. The only diﬀerence now is

that more information is needed b eyond the log probability of the best token. When the best token

reaches the end of the speech, the route it took through the network must be known in order to

recover the recognised sequence of models.

In practice, a good deal of extra work is needed to achieve eﬃcient operation on large training databases. For

example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel

operation on a network of machines.

See “Token Passing: a Conceptual Model for Connected Speech Recognition Systems”, SJ Young, NH Russell and

JHS Thornton, CUED Technical Report F INFENG/TR38, Cambridge University, 1989. Available by anonymous

ftp from svr-ftp.eng.cam.ac.uk.

剩余367页未读，继续阅读

illool

粉丝: 1
资源: 11

HTK技术手册：语音识别与HMM基础

htk book、htk的说明文档

HTK.rar_HTK_HTK工具包_htk-3.3_htk.e

HTKbook--chinese.zip_HTK 语音处理_HTK 中文_HTKbook--chinese_HTKbook中文

HTK.rar_ HTK _HTK

语音识别HTK安装说明文档

HTK.rar_HTK_htk book

HTK-samples-3.4.zip_HTK_HTK语音_htk pudn_htk-3.3_mfc HTK

HTK.rar_HTK

htk.zip_HTK

HTK.zip_DEMO_HTK_htk cambridge_htk3 ba

最新资源