1.1. SPEECH RECOGNITION AND COMPUTATION 3
the number of hypotheses to be compared is |V|^L (e.g., |V|^L = 10^15 for |V| = 1000 and L = 5).
If L is unknown, i.e., in the general case, the number of hypotheses becomes Σ_{l=1}^{L̂} |V|^l, where L̂ is an assumed upper
bound of L. If we enumerate all the hypotheses and compare each of them with the input signal,
the complexity becomes O(|V|^L̂ × |R̄| × |S|). Thus, dealing with a large vocabulary in continuous
speech recognition potentially has a large impact on the amount of computation required in the
decoding process.
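As a quick sanity check of the counts above, a short Python sketch (using the values from the text's example) computes both the fixed-length and bounded-length hypothesis counts:

```python
# Illustrative arithmetic only; the values of |V|, L, and L_hat are taken
# from (or assumed consistent with) the example in the text.
V = 1000   # vocabulary size |V|
L = 5      # known sentence length

# Fixed length: |V|^L hypotheses.
fixed = V ** L
print(fixed)     # 1000000000000000, i.e., 10^15

# Unknown length up to an assumed bound L_hat: sum over l = 1..L_hat of |V|^l.
L_hat = 5
general = sum(V ** l for l in range(1, L_hat + 1))
print(general)   # 1001001001001000
```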
In fact, we do not have to enumerate all the hypotheses in LVCSR. Instead, we can use
one-pass DP matching [BBC82, Ney84], which is a basic but efficient approach to continuous
speech recognition. This method is called the one-pass Viterbi algorithm when probabilistic models
such as HMMs and an n-gram language model are used. If the HMMs are of the typical left-to-right
type, i.e., each state has only one self-loop and one exiting transition, the computational complexity
is O(|V|^{n−1} × |M̄| × |S|), where n is typically 2 or 3 and |M̄| is the number of HMM states per
word. Thus, the total computation is much less than that required for the full enumeration.
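The time-synchronous update for such a left-to-right HMM can be sketched as follows; the function name and the log-probability inputs are illustrative assumptions, not from the text. Because each state only receives probability from itself and its left neighbor, the per-frame cost is linear in the number of states:

```python
def viterbi_left_to_right(log_obs, log_self, log_next):
    """Time-synchronous Viterbi sketch for a left-to-right HMM in which
    each state has one self-loop and one exiting transition.

    log_obs[t][j] : log-likelihood of frame t in state j
    log_self[j]   : log-probability of the self-loop of state j
    log_next[j]   : log-probability of the transition from state j to j+1
    Returns the best log-score of ending in the last state.
    """
    T, N = len(log_obs), len(log_obs[0])
    NEG = float("-inf")
    # score[j]: best log-score of being in state j after the current frame
    score = [NEG] * N
    score[0] = log_obs[0][0]          # decoding must start in state 0
    for t in range(1, T):
        new = [NEG] * N
        for j in range(N):
            stay = score[j] + log_self[j]
            enter = score[j - 1] + log_next[j - 1] if j > 0 else NEG
            new[j] = max(stay, enter) + log_obs[t][j]
        score = new
    return score[-1]
```

A real decoder would also keep back-pointers to recover the best state (and word) sequence; this sketch tracks only the score.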
Most current speech recognition decoders are based on the one-pass Viterbi algorithm (for details,
see Chapter 2). The algorithm ensures that the best hypothesis for a speech signal is found with the
given acoustic and language models. However, it is expensive to search for the best sequence of
words in a large vocabulary of over ten thousand words. In DARPA projects, search strategies for
efficient decoding, which do not necessarily ensure the best hypothesis, were intensively investigated
together with highly accurate acoustic and language models.
The most practical approach to reducing the computation for LVCSR decoding is to abandon
the verification of all possible hypotheses. Beam search, originally introduced in 1976 [Low76],
is the most popular method for reducing the number of hypotheses verified during the decoding
process. With the Viterbi algorithm, partial sentence hypotheses are extended synchronously
with time from the beginning of the speech. With the beam search, relatively unpromising partial
hypotheses are selected and pruned out at each time frame. Those pruned hypotheses are no longer
extended. As a result, the amount of computation can be reduced significantly since only some of
the hypotheses are evaluated until the end of the speech signal. However, there is a potential risk
that a correct partial hypothesis that would have become the best sentence hypothesis is lost by
pruning. To reduce such pruning errors, the beam search has been improved with various methods
such as look-ahead techniques.
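Frame-level beam pruning can be sketched as follows; the score-margin criterion and all names here are illustrative assumptions (practical decoders often combine such a score margin with a cap on the number of surviving hypotheses):

```python
def prune_beam(hypotheses, beam_width):
    """Beam-pruning sketch: keep only hypotheses whose log-score is within
    `beam_width` of the best score at the current frame; the rest are
    discarded and never extended again.

    hypotheses : dict mapping a partial hypothesis to its log-score
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}
```

For example, with a beam width of 5.0, a hypothesis scored 10.0 below the frame-best hypothesis is dropped, which is exactly where the pruning-error risk described above arises.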
On the other hand, efficient representations of the search space were also investigated to reduce
redundancy. A tree-organized lexicon was successfully introduced to represent the LVCSR search
space. It is a data structure that shares the pronunciation prefixes of the words in the vocabulary
as a prefix tree (also called a trie). This structure is also effective in alleviating the rapid growth
of hypotheses at word boundaries, which would otherwise increase pruning errors. Without a
tree-organized lexicon, there are |V | possible branches from the end of each word when extending
partial hypotheses. By using this tree structure, the number of branches decreases to at most the
number of phones.
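A minimal sketch of such a tree-organized lexicon follows; the toy words and phone symbols are illustrative assumptions, not from the text. All words sharing a pronunciation prefix share a single branch, so the fan-out at a word boundary is bounded by the number of distinct phones rather than by |V|:

```python
def build_prefix_tree(lexicon):
    """Build a prefix tree (trie) over word pronunciations, sharing common
    phone prefixes.  `lexicon` maps words to phone sequences."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node.setdefault("#", set()).add(word)   # "#" marks a word end
    return root

# Assumed toy lexicon with phone-like pronunciations.
lexicon = {
    "speed":  ["s", "p", "iy", "d"],
    "speech": ["s", "p", "iy", "ch"],
    "spin":   ["s", "p", "ih", "n"],
}
tree = build_prefix_tree(lexicon)
# All three words share the prefix s -> p, so only one arc leaves the root.
print(len(tree))   # 1
```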