Speech Recognition Using Stochastic Phonemic Segment Model
Based on Phoneme Segmentation
Chieko Furuichi, Katsura Aizawa, and Kazuhiko Inoue
Faculty of Engineering, Toin University of Yokohama, 1614 Kurogane, Midori, Yokohama, Japan 225-8502
SUMMARY
This paper discusses speech recognition based on a
new statistical phoneme segment model which is trained by
phoneme parameters derived from automatically extracted
phoneme segments. The proposed system operates as follows. In preprocessing before recognition, the phoneme
boundaries are detected by segmentation. The phonemes
are discriminated using a stochastic phoneme segment
model, and a phoneme segment lattice with scores is constructed. Next, speech recognition is performed by matching the symbol sequences against dictionary items. The
segmentation system that is employed can infer phoneme
boundaries with high accuracy. This helps to eliminate
unnecessary parameters, leaving the feature parameters
which are effective in separating phonemes. In other words,
the phoneme recognition problem in continuous speech can
be reduced to a discrimination problem, and thus a speaker-independent model can be constructed from a relatively small amount of training data. The stochastic phoneme
segment model is trained with training samples extracted
from a phoneme-balanced word set of 4920 words uttered
by 10 speakers. In a recognition experiment with 6709
words uttered by 63 nontraining speakers, a recognition rate
of 92.6% was obtained as the average for all speakers, using
a word dictionary of 212 words. © 2000 Scripta Technica,
Syst Comp Jpn, 31(10): 89–98, 2000
Key words: Segment model; mixed distribution;
phoneme segmentation; speech recognition.
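The recognition flow summarized above (boundary detection, per-segment phoneme scoring, lattice construction, and dictionary matching) can be sketched in outline. The sketch below is a hypothetical simplification, not the authors' implementation: the names `recognize`, `segmenter`, and `segment_model.score`, as well as the equal-length matching of phoneme sequences, are all illustrative assumptions.

```python
def recognize(utterance, segmenter, segment_model, dictionary):
    """Hypothetical sketch of the recognition flow described in the summary."""
    # 1. Preprocessing: segmentation detects candidate phoneme segments.
    segments = segmenter(utterance)
    # 2. Score each segment against every phoneme's stochastic segment
    #    model, yielding a phoneme lattice with scores.
    lattice = [
        {ph: segment_model.score(seg, ph) for ph in segment_model.phonemes}
        for seg in segments
    ]
    # 3. Match phoneme symbol sequences against dictionary entries,
    #    accumulating per-segment scores along each candidate word.
    best_word, best_score = None, float("-inf")
    for word, phoneme_seq in dictionary.items():
        if len(phoneme_seq) != len(lattice):
            continue  # simplification: require one segment per phoneme
        score = sum(frame[ph] for frame, ph in zip(lattice, phoneme_seq))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

Because segmentation has already fixed the phoneme boundaries, step 3 reduces to scoring symbol sequences rather than searching over all possible alignments.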
1. Introduction
In continuous speech recognition systems, it is desir-
able to improve the accuracy of the acoustic model in order
to improve the recognition rate for speech units such as
phonemes and syllables. In recent years, many studies of
segment models have attempted to include the temporal
changes of the speech feature parameters in order to im-
prove the accuracy of the acoustic model [1–4]. When a
segment model is applied to recognition, the dimension of
the parameters is usually increased. If the amount of train-
ing data is insufficient, the estimation accuracy of the model
may be degraded, or a large amount of computation may be
needed for recognition. Approaches to dealing with this
problem have included compression of the parameter di-
mension by K-L expansion [5], and use of the output from
a neural network into which several consecutive frames are
simultaneously input [6].
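For reference, the K-L expansion mentioned above for compressing the parameter dimension [5] amounts to projecting feature vectors onto the leading eigenvectors of their covariance matrix. A minimal NumPy sketch follows; the function name and the toy 26-dimensional features are illustrative assumptions, not details from [5].

```python
import numpy as np

def kl_expansion(frames, k):
    """Compress feature vectors to k dimensions via the
    Karhunen-Loeve expansion (eigenvectors of the sample covariance)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order; take the top-k axes.
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1][:, :k]
    return centered @ basis, basis, mean

# Illustrative: 200 "frames" of 26-dimensional cepstral-like features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 26))
compressed, basis, mean = kl_expansion(frames, k=8)
```

The compressed vectors retain the directions of greatest variance, which is why this reduces the estimation burden when training data are scarce.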
In the recognition of continuous speech by the seg-
ment model, there can be two approaches. One is to perform
recognition without applying preliminary segmentation.
The other is to detect the boundaries between phonemes or
syllables by segmentation, and then to perform recognition
using the segment model. The former method has been used more often, because segmentation is very difficult and it is hard to build a system accurate enough to serve as a preprocessing stage before recognition.
If the boundaries between phonemes or syllables can
be estimated with high accuracy by the latter method,
however, the problem of recognizing phonemes or syllables
in continuous speech can be reduced to a discrimination
problem, unnecessary searching can be minimized, and the
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-II, No. 7, July 1999, pp. 1111–1119