Speech Recognition Using Stochastic Phonemic Segment Model
Based on Phoneme Segmentation
Chieko Furuichi, Katsura Aizawa, and Kazuhiko Inoue
Faculty of Engineering, Toin University of Yokohama, 1614 Kurogane, Midori, Yokohama, Japan 225-8502
SUMMARY
This paper discusses speech recognition based on a
new statistical phoneme segment model which is trained by
phoneme parameters derived from automatically extracted
phoneme segments. The proposed system operates as follows. In preprocessing before recognition, the phoneme
boundaries are detected by segmentation. The phonemes
are discriminated using a stochastic phoneme segment
model, and a phoneme segment lattice with scores is constructed. Next, speech recognition is performed by matching the symbol sequences against dictionary items. The
segmentation system that is employed can infer phoneme
boundaries with high accuracy. This helps to eliminate
unnecessary parameters, leaving the feature parameters
which are effective in separating phonemes. In other words,
the phoneme recognition problem in continuous speech can
be reduced to a discrimination problem, and thus a speaker-independent model can be constructed from a relatively small amount of training data. The stochastic phoneme
segment model is trained with training samples extracted
from a phoneme-balanced word set of 4920 words uttered
by 10 speakers. In a recognition experiment with 6709
words uttered by 63 nontraining speakers, a recognition rate
of 92.6% was obtained as the average for all speakers, using
a word dictionary of 212 words. © 2000 Scripta Technica,
Syst Comp Jpn, 31(10): 89–98, 2000
Key words: Segment model; mixed distribution;
phoneme segmentation; speech recognition.
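The recognition flow summarized above (boundary detection, per-segment phoneme scoring, lattice construction, and dictionary matching) can be sketched in outline. The sketch below is a hypothetical simplification, not the authors' implementation: the names `recognize`, `segmenter`, and `segment_model.score`, as well as the equal-length matching of phoneme sequences, are all illustrative assumptions.

```python
def recognize(utterance, segmenter, segment_model, dictionary):
    """Hypothetical sketch of the recognition flow described in the summary."""
    # 1. Preprocessing: segmentation detects candidate phoneme segments.
    segments = segmenter(utterance)
    # 2. Score each segment against every phoneme's stochastic segment
    #    model, yielding a phoneme lattice with scores.
    lattice = [
        {ph: segment_model.score(seg, ph) for ph in segment_model.phonemes}
        for seg in segments
    ]
    # 3. Match phoneme symbol sequences against dictionary entries,
    #    accumulating per-segment scores along each candidate word.
    best_word, best_score = None, float("-inf")
    for word, phoneme_seq in dictionary.items():
        if len(phoneme_seq) != len(lattice):
            continue  # simplification: require one segment per phoneme
        score = sum(frame[ph] for frame, ph in zip(lattice, phoneme_seq))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

Because segmentation has already fixed the phoneme boundaries, step 3 reduces to scoring symbol sequences rather than searching over all possible alignments.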
1. Introduction
In continuous speech recognition systems, it is desir-
able to improve the accuracy of the acoustic model in order
to improve the recognition rate for speech units such as
phonemes and syllables. In recent years, many studies of
segment models have attempted to include the temporal
changes of the speech feature parameters in order to im-
prove the accuracy of the acoustic model [1–4]. When a
segment model is applied to recognition, the dimension of
the parameters is usually increased. If the amount of train-
ing data is insufficient, the estimation accuracy of the model
may be degraded, or a large amount of computation may be
needed for recognition. Approaches to dealing with this
problem have included compression of the parameter di-
mension by K-L expansion [5], and use of the output from
a neural network into which several consecutive frames are
simultaneously input [6].
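For reference, the K-L expansion mentioned above for compressing the parameter dimension [5] amounts to projecting feature vectors onto the leading eigenvectors of their covariance matrix. A minimal NumPy sketch follows; the function name and the toy 26-dimensional features are illustrative assumptions, not details from [5].

```python
import numpy as np

def kl_expansion(frames, k):
    """Compress feature vectors to k dimensions via the
    Karhunen-Loeve expansion (eigenvectors of the sample covariance)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order; take the top-k axes.
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1][:, :k]
    return centered @ basis, basis, mean

# Illustrative: 200 "frames" of 26-dimensional cepstral-like features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 26))
compressed, basis, mean = kl_expansion(frames, k=8)
```

The compressed vectors retain the directions of greatest variance, which is why this reduces the estimation burden when training data are scarce.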
In the recognition of continuous speech by the seg-
ment model, there can be two approaches. One is to perform
recognition without applying preliminary segmentation.
The other is to detect the boundaries between phonemes or
syllables by segmentation, and then to perform recognition
using the segment model. The former method has been used more often, because segmentation is very difficult and it is hard to build a system accurate enough to serve as a preprocessing stage before recognition.
If the boundaries between phonemes or syllables can
be estimated with high accuracy by the latter method,
however, the problem of recognizing phonemes or syllables
in continuous speech can be reduced to a discrimination
problem, unnecessary searching can be minimized, and the
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-II, No. 7, July 1999, pp. 1111–1119