DNN-HMM Multilingual Telephone Speech Recognizer: Performance Analysis and Kaldi Practice


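This work addresses multilingual telephone speech recognition based on deep neural networks (DNN) and hidden Markov models (HMM) for five Eastern European languages: Czech, Russian, Hungarian, Slovak and Polish, for which speech data sets are available in SpeechDat-E. Because the SAMPA (Speech Assessment Methods Phonetic Alphabet) notation used there is not uniform and different symbols represent the same phonemes, a mapping of the language-specific phonemes to the common X-SAMPA alphabet is proposed first. The main focus is on analyzing how multilingual acoustic modeling affects the continuous speech recognition task, both for systems based on Gaussian mixture models and hidden Markov models (GMM-HMM) and for the DNN-HMM approach. The experiments were carried out while keeping the language-specific acoustic model of each language unchanged, and the recognizers were implemented with the Kaldi toolkit. One of the goals is to provide a tutorial-style description of the Kaldi tools and a guide to using the SpeechDat databases, in order to support further research in this field. The best monolingual HMM recognizers reached word error rates (WER) between 18% and 28%, depending on the language. Introducing DNN-HMM improved the WER by about 4% on average. The multilingual HMM systems reached WER between 25% and 37%. The multilingual DNN models improved recognition accuracy markedly, reducing the WER by about 9% on average. The work also covers phoneme recognition and large vocabulary continuous speech recognition in order to evaluate the DNN-HMM architecture for multilingual telephone speech recognition from several angles. Besides showing the benefits of deep learning in speech recognition, it offers practical guidance on tools and techniques for later research.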

2 Continuous Speech Recognition

The first attempts at speech recognition aimed to recognize isolated words and expressions. The first recognizers worked on the principle of template matching. To evaluate and compare two utterances, dynamic programming was used to model the nonlinear variations in the speech rate of one of the utterances. This approach is called dynamic time warping, and it was the most widely used classification method in the 1970s and early 1980s.
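The idea behind dynamic time warping can be illustrated with a minimal sketch. It assumes one-dimensional feature sequences and an absolute-difference local cost, neither of which is specified in the text; the function name `dtw_distance` and the example sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: stay on a[i-1], stay on b[j-1], or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]  # same shape, half speed
other = [3.0, 1.0, 3.0, 1.0, 3.0]

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down copy aligns with the template at zero cost, while the differently shaped sequence does not, which is what makes the measure usable for template matching.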

During the 1980s, statistical classification methods were introduced and laid the foundation for continuous speech recognition. The statistical approach to continuous speech recognition is described in this chapter.

Figure 1 The principle of a statistical large vocabulary speech recognition approach.

The acoustic analysis of the input speech signal performs two main subtasks. The first is the signal processing itself. It can include denoising, echo cancellation, pre-emphasis and other modifications that clean and normalize the input speech signal. The second, and main, function of the acoustic analysis is to extract the sequence of features that is processed and recognized by a decoder. The elements of this sequence are the feature vectors at the individual time steps $t$. Let's denote this sequence as $O = (x_1, x_2, \ldots, x_T)$.

Let's assume a sequence $W = (w_1, \ldots, w_n)$ of $n$ words. Given the sequence of acoustic observations $O$, the word sequence $W$ was spoken with the probability $P(W|O)$. The key task of the decoder is to find the sequence $W^{*}$ which maximizes the probability $P(W|O)$, written as

$$W^{*} = \arg\max_{W} P(W|O). \tag{1}$$


Thus, it is decoding with the maximum a posteriori (MAP) probability. Eq. 1 can be rewritten with Bayes' rule in the form

$$W^{*} = \arg\max_{W} \frac{P(W)\,P(O|W)}{P(O)}. \tag{2}$$

The a priori probability $P(W)$ is the probability that a speaker will say the sequence of words $W$. $P(O|W)$ is the probability that the feature vector sequence $O$ is produced when the sequence $W$ is pronounced. The a priori probability $P(O)$ of the observed feature sequence can be omitted, since it does not depend on $W$ and is therefore constant under the max operation, which results in

$$W^{*} = \arg\max_{W} P(W)\,P(O|W). \tag{3}$$

As can be seen, the decoding problem decomposes into the evaluation of two probabilities. These probabilities are modeled independently, which means that the two models can be trained separately. The probability $P(W)$ is represented by the language model (LM), which reflects the semantic and/or syntactic constraints of the given language. $P(O|W)$ is determined by the acoustic model. It needs to be stated that evaluating Eq. 3, that is, obtaining $W^{*}$ for the observed $O$ over all possible sequences $W$, involves an enormous number of operations and is computationally very expensive. Sophisticated decoding techniques have to be applied to obtain the desired sequence $W^{*}$.
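The decomposition into a language model score and an acoustic model score can be illustrated with a toy example. The sketch below is an exhaustive search over three candidate word sequences, not a practical decoder; all probabilities, the candidate sequences and the function name `map_decode` are hypothetical values chosen for illustration.

```python
import math

# Hypothetical language model probabilities P(W) of three candidate sequences.
lm = {
    ("recognize", "speech"): 0.6,
    ("wreck", "a", "nice", "beach"): 0.1,
    ("recognize", "peach"): 0.3,
}

# Hypothetical acoustic likelihoods P(O|W) of one fixed observation sequence O.
am = {
    ("recognize", "speech"): 0.020,
    ("wreck", "a", "nice", "beach"): 0.050,
    ("recognize", "peach"): 0.010,
}


def map_decode(lm, am):
    """Return the W maximizing P(W) * P(O|W), evaluated in the log domain."""
    def score(w):
        return math.log(lm[w]) + math.log(am[w])
    return max(lm, key=score)


best = map_decode(lm, am)
acoustic_only = max(am, key=am.get)
print(best)
print(acoustic_only)
```

In this toy setting the acoustic model alone prefers a different hypothesis than the combined score, which shows how the prior $P(W)$ influences the decision. Enumerating all candidates is feasible only because there are three of them.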

To conclude, the statistical continuous speech recognition task can be formulated as the following problems:

• The acoustic processing problem. The signal is processed in the time domain to remove misleading information or to reconstruct missing information. Then suitable features have to be extracted from the speech signal, which means finding feature vectors of as low a dimension as possible that still carry a sufficient amount of information.

• Training an appropriate model to evaluate the probability $P(O|W)$. This means deciding which acoustic units are to be modeled and which evaluation mechanism should be used (HMM, ANN, ...).

• Training the language model and evaluating the probability $P(W)$.

• Obtaining the sequence $W^{*}$ in acceptable time by using suitable methods.

2.1 Acoustic Analysis

Depending on various conditions, such as the environment, the quality of the communication channel or the nature of human speech production, the speech signal often suffers from a loss of information or from an abundance of misleading information, which makes it unsuitable for further processing. Common preprocessing methods include pre-emphasis, which compensates for the loss of energy with increasing frequency. With regard to the human speech production system, the speech signal is considered stationary over short intervals of around 10-30 ms, during which the state of the production system does not change. This state corresponds to the sound unit that is then recognized. Further processing of the speech signal therefore requires short-time analysis in both the time and the spectral domain. So, the next step is signal segmentation. A segment width of 25 ms with a 10 ms shift has proved to work well in many experiments. Several windows with different characteristics are used for this purpose, namely the rectangular window, the Hann (Hanning) window and the most widely used Hamming window. A complete description of short-time analysis methods and their principles can be found, for example, in [14].
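The preprocessing and segmentation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pre-emphasis coefficient 0.97 and the 8 kHz sampling rate are common choices but are not given in the text, and the function names are hypothetical. The 25 ms width and 10 ms shift follow the values mentioned above.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha assumed)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frames(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Cut the signal into overlapping segments and apply a Hamming window."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = hamming(frame_len)
    out = []
    # Only complete segments are kept; trailing samples are dropped.
    for start in range(0, len(signal) - frame_len + 1, shift):
        segment = signal[start:start + frame_len]
        out.append([s * w for s, w in zip(segment, window)])
    return out


sample_rate = 8000  # assumed sampling rate for telephone speech
# One second of a 440 Hz tone as a stand-in for a speech signal.
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
windowed = frames(pre_emphasis(signal), sample_rate)
print(len(windowed), len(windowed[0]))
```

At 8 kHz a 25 ms segment holds 200 samples and a 10 ms shift corresponds to 80 samples, so consecutive segments overlap by 120 samples.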

Since the time-domain signal reflects every aspect of signal production and of the transmission channel, the time-domain samples themselves are not suitable for direct classification and modeling. The key function of the acoustic analysis is to provide suitable and robust features. The most widely used are the Mel-Frequency Cepstral Coefficients (MFCC); features based on Perceptual Linear Predictive (PLP) analysis [16] have also provided promising results in many speech applications. These feature computation methods are briefly described below.

2.1.1 MFCC

The MFCC features are designed with respect to human auditory perception. The human ear does not perceive frequencies on a linear scale but on an approximately logarithmic one. This property is simulated by applying a filter bank in the frequency domain. The filter bank consists of triangular filters designed on the Mel scale and is illustrated in Fig. 3a. The conversion from the linear frequency $f$ in Hz to the Mel scale is given by

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right).$$
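As a quick numerical check of the Mel conversion, the formula and its inverse can be written in a few lines. The function names `hz_to_mel` and `mel_to_hz` are hypothetical, and the inverse is derived here by solving the formula for $f$; it does not appear in the text.

```python
import math


def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


for f in (0.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

The mapping is close to linear at low frequencies and compresses higher ones: 1000 Hz maps to roughly 1000 mel, while doubling the frequency from 2000 Hz to 4000 Hz adds far less than a doubling on the Mel scale.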

The procedure of MFCC computation is as follows. The signal is pre-emphasized and short-time analysis is then performed. This means that the magnitude spectrum is computed and filtered by the Mel filter bank, which is designed with respect to the requirements and the signal properties. Then the logarithm of the filter outputs is computed, which turns convolutional channel distortion into an additive component that can be separated. The Discrete Cosine Transform (DCT) decorrelates the output coefficients, which is desirable for the subsequent statistical classifier. The MFCC features try to emulate the human perception of

Figure 2 The principle of MFCC computation.
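The chain from a single signal segment to cepstral coefficients (magnitude spectrum, Mel filter bank, logarithm, DCT) can be sketched in plain Python. This is an illustrative sketch, not the Kaldi implementation: the 23 filters, 13 coefficients, 8 kHz sampling rate, the unnormalized DCT-II and all function names are assumptions, and pre-emphasis and windowing are left out for brevity.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for a sketch)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(1, num_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def mfcc(frame, sample_rate, num_filters=23, num_coeffs=13):
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    energies = [sum(w * s for w, s in zip(filt, spectrum)) for filt in bank]
    # Small floor avoids log(0) for silent input.
    log_energies = [math.log(max(e, 1e-10)) for e in energies]
    return dct2(log_energies, num_coeffs)


sample_rate = 8000  # assumed telephone-band sampling rate
frame = [math.sin(2 * math.pi * 450 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 1730 * n / sample_rate)
         for n in range(200)]
coeffs = mfcc(frame, sample_rate)
louder = mfcc([2.0 * x for x in frame], sample_rate)
print(len(coeffs))
```

Doubling the amplitude shifts every log filter output by the same constant, so in this sketch only the zeroth coefficient changes. This is a simple instance of the logarithm turning a multiplicative effect into an additive one.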
