1.6 Continuous Speech Recognition 11
1. Allocate and zero accumulators for all parameters of all HMMs.
2. Get the next training utterance.
3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol
transcription of the training utterance.
4. Calculate the forward and backward probabilities for the composite HMM. The inclusion
of intermediate non-emitting states in the composite model requires some changes to the
computation of the forward and backward probabilities but these are only minor. The details
are given in chapter 8.
5. Use the forward and backward probabilities to compute the probabilities of state occupation
at each time frame and update the accumulators in the usual way.
6. Repeat from 2 until all training utterances have been processed.
7. Use the accumulators to calculate new parameter estimates for all of the HMMs.
These steps can then all be repeated as many times as is necessary to achieve the required conver-
gence. Notice that although the location of symbol boundaries in the training data is not required
(or wanted) for this procedure, the symbolic transcription of each training utterance is needed.
Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are
relatively minor
6
, the corresponding extensions to the Viterbi algorithm are more substantial.
In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing
Model
7
. In brief, the token passing model makes the concept of a state alignment path explicit.
Imagine each state j of a HMM at time t holds a single moveable token which contains, amongst
other information, the partial log probability ψ
j
(t). This token then represents a partial match
between the observation sequence o
1
to o
t
and the model subject to the constraint that the model
is in state j at time t. The path extension algorithm represented by the recursion of equation 1.31
is then replaced by the equivalent token passing algorithm which is executed at each time frame t.
The key steps in this algorithm are as follows
1. Pass a copy of every token in state i to all connecting states j, incrementing the log probability
of the copy by log[a
ij
] + log [ b
j
(o(t)].
2. Examine the tokens in every state and discard all but the token with the highest probability.
In practice, some modifications are needed to deal with the non-emitting states but these are
straightforward if the tokens in entry states are assumed to represent paths extended to time t −δt
and tokens in exit states are assumed to represent paths extended to time t + δt.
The point of using the Token Passing Model is that it extends very simply to the continuous
speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For
example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-
based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM
instances and the square boxes denote word-end nodes. This composite network is essentially just
a single large HMM and the above Token Passing algorithm applies. The only difference now is
that more information is needed b eyond the log probability of the best token. When the best token
reaches the end of the speech, the route it took through the network must be known in order to
recover the recognised sequence of models.
6
In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For
example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel
operation on a network of machines.
7
See “Token Passing: a Conceptual Model for Connected Speech Recognition Systems”, SJ Young, NH Russell and
JHS Thornton, CUED Technical Report F INFENG/TR38, Cambridge University, 1989. Available by anonymous
ftp from svr-ftp.eng.cam.ac.uk.