Log-Spectral Linear Regression Based on Voicing Cut-Off Frequency Improves Robust Speech Recognition Performance


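This article discusses the application of log-spectral linear regression based on voicing cut-off frequency to robust speech recognition, proposed by Yong Lü and Lin Zhou. The two authors are affiliated, respectively, with the College of Computer and Information Engineering at Hohai University and the School of Information Science and Engineering at Southeast University, both in Nanjing, China. Their goal is to make speech recognition systems more robust, particularly under heavy noise or changing speaker conditions, where conventional recognition models are easily degraded by environmental noise and speaker variation. To this end, the authors propose a maximum likelihood log-spectral linear regression method. The method first converts the pre-trained acoustic model to the log-spectral domain by the inverse discrete cosine transform (IDCT) and then ignores the high-frequency part of the training mean and variance, because the high-frequency components are the ones most easily corrupted by noise; this is the key step that reduces the influence of noise on recognition performance. In the testing stage, the testing mean and variance are obtained by log-spectral linear regression, and the regression parameters are estimated with the expectation-maximization (EM) algorithm. Only a small amount of adaptation data from the given noise condition is needed to fit the model to the testing speech features. According to the reported experiments, the proposed algorithm yields more accurate testing acoustic models and outperforms traditional linear regression, which suggests it is useful for noise-robust model adaptation in speech recognition. Keywords: voicing cut-off frequency; log spectrum; maximum likelihood; linear regression; robust speech recognition; expectation-maximization algorithm.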

Log-Spectral Linear Regression Based on Voicing Cut-Off Frequency for Robust Speech Recognition

Yong Lü, Lin Zhou

1. College of Computer and Information Engineering, Hohai University, Nanjing, China

2. School of Information Science and Engineering, Southeast University, Nanjing, China

E-mail: lynetwork@gmail.com, linzhou@seu.edu.cn

Abstract—This paper proposes a maximum likelihood log-spectral linear regression algorithm based on voicing cut-off frequency for robust speech recognition, which converts the pre-trained acoustic model to the log-spectral domain by the inverse discrete cosine transform and ignores the high-frequency part of the training mean and variance. Then the testing mean and variance are obtained by the log-spectral linear regression and the linear regression parameters are estimated from small amounts of adaptive data using the expectation–maximization algorithm under the maximum likelihood criterion. The experimental results show that the proposed algorithm can obtain more accurate testing acoustic models and outperforms the traditional linear regression method.

Keywords-voicing cut-off frequency; log-spectral linear regression; robust speech recognition; model adaptation

I. INTRODUCTION

In real-world applications, pre-trained acoustic models generally do not match the feature vectors extracted from the testing speech because of background noise and other variability in speech signals, which often degrades the performance of speech recognition systems [1]. Therefore, it is necessary to apply algorithms that reduce the impact of the environmental mismatch and improve recognition performance [2].

In speech signal processing, speech signals are divided into a series of overlapping frames. For most speech frames, the harmonic structure is most pronounced in the lower part of the spectrum, which has motivated researchers to split the spectrum into two distinct parts: a low-frequency harmonic part and an aperiodic high-frequency part [3]. The boundary between the two parts is referred to as the voicing cut-off frequency (VCO) [4]. In general, the high-frequency part is easily affected by background noise and cannot provide effective information for speech recognition in noisy environments. We therefore consider only the low-frequency part in the feature extraction of noisy speech, which can further improve the performance of model adaptation.

The VCO of each frame usually differs from that of the other frames in a speech signal. However, speech recognition systems often employ the hidden Markov model (HMM) as the acoustic model and consider the relationship among the different frames of each utterance during the recognition process. In addition, the HMMs are trained on speech features that include the whole spectrum of each frame. To adapt the pre-trained acoustic model to the noisy testing speech, a single average VCO is employed for the feature extraction of each utterance in noisy testing conditions. In estimating the average VCO, we consider only the frames whose energy is close to the maximum value of the utterance and ignore the other frames.
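The frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `select_high_energy_frames` and the threshold `ratio` are hypothetical, since the text does not state how close to the maximum a frame's energy must be.

```python
import numpy as np

def select_high_energy_frames(frames, ratio=0.5):
    """Return indices of frames whose energy is at least `ratio` times
    the maximum frame energy of the utterance.

    `ratio` is an assumed threshold; the paper only says "close to the
    maximum value".
    """
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)  # per-frame energy
    return np.flatnonzero(energy >= ratio * energy.max())

# Three toy frames: the middle one is much weaker and is dropped.
toy_frames = [[1.0, 1.0], [0.1, 0.1], [0.9, 0.9]]
kept = select_high_energy_frames(toy_frames, ratio=0.5)
```

Only the frames returned here would then contribute to the average VCO of the utterance.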

This paper proposes a log-spectral linear regression (LLR) algorithm based on voicing cut-off frequency for robust speech recognition. In the algorithm, training features that include the whole speech spectrum are used to train the acoustic model of each speech unit, and the noisy testing speech spectrum is classified as either voiced or unvoiced based on the average VCO. For the sake of feature extraction, the VCO is approximated at the upper frequency of every channel of the Mel filter bank, and the noisy testing features only include the Mel channels below the average VCO of the speech utterance. To obtain the testing acoustic model, we first convert the pre-trained HMM to the log-spectral domain by the inverse discrete cosine transform (IDCT) and ignore the high-frequency part of the training mean and variance. Then it is assumed that the testing mean and variance can be obtained by the log-spectral linear regression. The linear regression parameters are estimated from small amounts of adaptive data using the expectation–maximization (EM) algorithm under the maximum likelihood criterion.
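The conversion of a cepstral-domain Gaussian to the log-spectral domain, followed by truncation to the low-frequency channels, might look like the sketch below. It assumes that the cepstral features were produced by an orthonormal DCT-II of the log Mel spectrum and that the cepstral covariance is diagonal; neither is stated in this excerpt, and the function names and the `n_keep` parameter are hypothetical.

```python
import numpy as np

def dct_matrix(n_cep, n_mel):
    """Orthonormal DCT-II matrix C (n_cep x n_mel): cepstrum = C @ log_mel."""
    k = np.arange(n_cep)[:, None]
    m = np.arange(n_mel)[None, :]
    C = np.sqrt(2.0 / n_mel) * np.cos(np.pi * k * (m + 0.5) / n_mel)
    C[0, :] /= np.sqrt(2.0)
    return C

def cepstral_to_log_spectral(mu_c, var_c, n_mel, n_keep):
    """Map a cepstral Gaussian (mean, diagonal variance) to the
    log-spectral domain and keep only the first n_keep Mel channels.

    n_keep stands for the number of channels below the average VCO
    (an assumed interface, not the paper's notation).
    """
    mu_c = np.asarray(mu_c, dtype=float)
    var_c = np.asarray(var_c, dtype=float)
    C = dct_matrix(len(mu_c), n_mel)
    C_inv = np.linalg.pinv(C)                   # inverse DCT (pseudo-inverse if truncated)
    mu_l = C_inv @ mu_c                         # log-spectral mean
    cov_l = C_inv @ np.diag(var_c) @ C_inv.T    # log-spectral covariance
    return mu_l[:n_keep], cov_l[:n_keep, :n_keep]

# Round trip on a toy 8-channel log Mel vector, keeping 5 channels.
C8 = dct_matrix(8, 8)
log_mel = np.arange(8, dtype=float)
mu_low, cov_low = cepstral_to_log_spectral(C8 @ log_mel, np.ones(8), n_mel=8, n_keep=5)
```

Note that the log-spectral covariance is generally full even when the cepstral covariance is diagonal, so an implementation has to decide whether to keep the full matrix or only its diagonal.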

II. LOG-SPECTRAL LINEAR REGRESSION BASED ON VCO

In the testing environment, the noisy speech spectrum is classified as either voiced or unvoiced at the upper frequency of every Mel channel, based on the voicing cut-off frequency estimation method [4]. For each voiced speech frame with a pitch frequency, we estimate the number of voiced pitch harmonics and obtain the VCO as the product of the pitch frequency and harmonic number. Then the VCO is averaged over all frames of the voiced speech segment and is further approximated at the upper frequency of every channel of the Mel filter bank.
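A rough sketch of this VCO computation is given below. Pitch tracking and harmonic counting are taken as given inputs, because the estimation method of [4] is not reproduced in this excerpt. The common Mel formula 2595·log10(1 + f/700), the equally spaced triangular filter layout, and the rule of snapping the average VCO to the nearest upper channel edge are all assumptions, as are the function names.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_upper_edges(n_mel, f_low, f_high):
    """Upper edge frequency (Hz) of each of n_mel triangular Mel filters.

    The filters are defined by n_mel + 2 points equally spaced on the Mel
    scale; filter i spans points i..i+2, so its upper edge is point i+2.
    """
    pts = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_mel + 2))
    return pts[2:]

def average_vco(pitch_hz, n_harmonics):
    """Average over voiced frames of (pitch frequency x harmonic number).

    Frames with pitch 0 are treated as unvoiced and skipped.
    """
    pitch = np.asarray(pitch_hz, dtype=float)
    harm = np.asarray(n_harmonics, dtype=float)
    voiced = pitch > 0
    if not voiced.any():
        raise ValueError("no voiced frames in segment")
    return float(np.mean(pitch[voiced] * harm[voiced]))

def channels_below_vco(vco_hz, upper_edges):
    """Number of Mel channels kept after snapping the VCO to the nearest
    upper channel edge (assumed rounding rule)."""
    return int(np.argmin(np.abs(np.asarray(upper_edges) - vco_hz))) + 1

# Toy segment: two voiced frames and one unvoiced frame.
vco = average_vco([100.0, 0.0, 200.0], [10, 0, 5])
n_keep = channels_below_vco(vco, np.array([500.0, 900.0, 1500.0]))
```

The resulting channel count is what determines how many log-spectral dimensions of the testing features, and of the converted training mean and variance, are retained.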

This paper only considers the Mel channels below the average VCO of the speech segment in the feature extraction of the testing speech. Therefore, the number of channels of the testing speech is smaller than that of the training speech. However, the acoustic model of every speech unit is trained on clean training speech and thus must be modified and adapted to match the testing features.
