Log-Spectral Linear Regression Based on Voicing Cut-Off Frequency for Robust
Speech Recognition
Yong Lü 1, Lin Zhou 2
1. College of Computer and Information Engineering, Hohai University, Nanjing, China
2. School of Information Science and Engineering, Southeast University, Nanjing, China
E-mail: lynetwork@gmail.com, linzhou@seu.edu.cn
Abstract—This paper proposes a maximum likelihood log-
spectral linear regression algorithm based on voicing cut-off
frequency for robust speech recognition, which converts the
pre-trained acoustic model to the log-spectral domain by the
inverse discrete cosine transform and ignores the high-
frequency part of the training mean and variance. The
testing mean and variance are then obtained by log-spectral
linear regression, and the regression parameters are
estimated from a small amount of adaptation data using the
expectation–maximization algorithm under the maximum
likelihood criterion. The experimental results show that the
proposed algorithm can obtain more accurate testing acoustic
models and outperforms the traditional linear regression
method.
Keywords-voicing cut-off frequency; log-spectral linear
regression; robust speech recognition; model adaptation
I. INTRODUCTION
In real-world applications, the pre-trained acoustic
models generally do not match the feature vectors extracted
from the testing speech due to the background noise and
other variability in speech signals, which often causes the
performance degradation of speech recognition systems [1].
Therefore, it is necessary to apply algorithms that reduce
the impact of the environmental mismatch and improve the
recognition performance [2].
In speech signal processing, the speech signal is divided
into a series of overlapping frames. For most
speech frames, the harmonic structure is most pronounced in
the lower part of the spectrum, which has motivated
researchers to split the spectrum in two distinct parts: a low-
frequency harmonic part and an aperiodic high-frequency
part [3]. The boundary between the two parts is referred to as
the voicing cut-off frequency (VCO) [4]. In general, the high-
frequency part is easily corrupted by background noise
and cannot provide effective information for speech
recognition in noisy environments. Therefore, we only
consider the low-frequency part in the feature extraction of
noisy speech, which can further improve the performance of
model adaptation.
The VCO usually varies from frame to frame within a
speech signal. However, the speech
recognition system often employs the hidden Markov model
(HMM) as the acoustic model and considers the relationship
among the different frames of each utterance during the
recognition process. In addition, the HMMs are produced
using the training speech features which include the whole
spectrum of each frame. To adapt the pre-trained acoustic
model to match the noisy testing speech, an average VCO is
employed for feature extraction of each utterance in noisy
testing conditions. In the estimation of the average VCO, we
only consider the frames whose energy is close to the
maximum value of the utterance and ignore other frames.
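
The energy-based frame selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name average_vco, the parameter energy_ratio, and its default value of 0.5 are assumptions, since the text only states that the retained frames have energy "close to" the utterance maximum. Frame energies are assumed to be non-negative linear energies.

```python
def average_vco(frame_vcos, frame_energies, energy_ratio=0.5):
    """Average the per-frame VCO over the high-energy frames of one utterance.

    frame_vcos     -- per-frame VCO estimates in Hz
    frame_energies -- per-frame energies (assumed non-negative, linear scale)
    energy_ratio   -- hypothetical threshold: a frame is kept when its energy
                      is at least energy_ratio times the utterance maximum
    """
    if not frame_vcos or len(frame_vcos) != len(frame_energies):
        raise ValueError("need one VCO and one energy value per frame")
    max_energy = max(frame_energies)
    # Keep only frames whose energy is close to the maximum; ignore the rest.
    kept = [vco for vco, energy in zip(frame_vcos, frame_energies)
            if energy >= energy_ratio * max_energy]
    # The maximum-energy frame always passes the test, so kept is non-empty.
    return sum(kept) / len(kept)


# Three frames: the low-energy middle frame is ignored.
utterance_vco = average_vco([3000.0, 1000.0, 4000.0], [10.0, 1.0, 8.0])
```

A single threshold relative to the maximum keeps the sketch simple; any other rule for deciding which frames are close to the maximum energy would fit the same interface.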
This paper proposes a log-spectral linear regression
(LLR) algorithm based on voicing cut-off frequency for
robust speech recognition. In the algorithm, the training
features including the whole speech spectrum are used to
train the acoustic model of each speech unit, and the noisy
testing speech spectrum is classified as either voiced or
unvoiced based on the average VCO. For the sake of feature
extraction, the VCO is approximated at the upper frequency
of every channel of the Mel filter bank and the noisy testing
features only include the Mel channels below the average
VCO of the speech utterance. To obtain the testing acoustic
model, we first convert the pre-trained HMM to the log-
spectral domain by the inverse discrete cosine transform
(DCT) and ignore the high-frequency part of the training
mean and variance. Then it is assumed that the testing mean
and variance can be obtained by the log-spectral linear
regression. The linear regression parameters are estimated
from a small amount of adaptation data using the expectation–
maximization (EM) algorithm under the maximum
likelihood criterion.
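
The conversion of a cepstral Gaussian to the log-spectral domain, followed by discarding the high-frequency channels, can be sketched as below. This is an illustration under stated assumptions, not the authors' exact procedure: it assumes an orthonormal DCT-II (so the inverse is the transpose), a diagonal cepstral covariance, and that only the diagonal of the log-spectral covariance is kept. The function names and the parameter kept_channels are hypothetical.

```python
import math


def dct_matrix(num_ceps, num_channels):
    """Orthonormal DCT-II matrix C with num_ceps rows and num_channels columns."""
    rows = []
    for i in range(num_ceps):
        scale = math.sqrt((1.0 if i == 0 else 2.0) / num_channels)
        rows.append([scale * math.cos(math.pi * i * (j + 0.5) / num_channels)
                     for j in range(num_channels)])
    return rows


def cepstral_to_log_spectral(mean_c, var_c, num_channels, kept_channels):
    """Map a diagonal cepstral Gaussian to the log-spectral domain.

    Only the first kept_channels Mel channels (those below the average VCO)
    are returned; the high-frequency channels are ignored.
    """
    c = dct_matrix(len(mean_c), num_channels)
    n = len(mean_c)
    # Inverse DCT of the mean: mu_l = C^T mu_c (C has orthonormal rows).
    mean_l = [sum(c[i][j] * mean_c[i] for i in range(n))
              for j in range(kept_channels)]
    # Diagonal of C^T diag(var_c) C (assumption: off-diagonal terms dropped).
    var_l = [sum(c[i][j] ** 2 * var_c[i] for i in range(n))
             for j in range(kept_channels)]
    return mean_l, var_l


# Round trip on a 4-channel toy example, keeping the 3 lowest channels.
log_spec = [1.0, 2.0, 3.0, 4.0]
C = dct_matrix(4, 4)
ceps = [sum(C[i][j] * log_spec[j] for j in range(4)) for i in range(4)]
mean_l, var_l = cepstral_to_log_spectral(ceps, [1.0] * 4, 4, 3)
```

When fewer cepstral coefficients than Mel channels are used, the transpose acts as a pseudo-inverse, which is equivalent to zero-padding the cepstral vector before the inverse transform.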
II. LOG-SPECTRAL LINEAR REGRESSION BASED ON VCO
In the testing environment, the noisy speech spectrum is
classified as either voiced or unvoiced at the upper
frequency of every Mel channel, based on the voicing cut-
off frequency estimation method [4]. For each voiced
speech frame with a pitch frequency, we estimate the
number of voiced pitch harmonics and obtain the VCO as
the product of the pitch frequency and harmonic number.
Then the VCO is averaged over all frames of the voiced
speech segment, and the average is further approximated by the
upper edge frequency of a channel of the Mel filter bank.
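
The VCO computation of this paragraph can be sketched as follows. The sketch makes several assumptions that the text does not fix: a triangular Mel filter bank spanning 0 Hz to half the sampling rate, the common Mel formula 2595 log10(1 + f/700), and snapping the average VCO to the nearest channel upper edge. All function names, the 23-channel bank, and the 8 kHz sampling rate are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_upper_edges(num_channels, sample_rate):
    """Upper edge frequency in Hz of each triangular Mel channel."""
    step = hz_to_mel(sample_rate / 2.0) / (num_channels + 1)
    # Channel k (0-based) spans boundary points k, k+1, k+2 on the Mel axis,
    # so its upper edge is boundary point k+2.
    return [mel_to_hz(step * (k + 2)) for k in range(num_channels)]


def frame_vco(pitch_hz, num_voiced_harmonics):
    """VCO of one voiced frame: pitch frequency times voiced harmonic count."""
    return pitch_hz * num_voiced_harmonics


def snap_vco_to_channel(avg_vco, upper_edges):
    """Return (number of channels kept, snapped VCO in Hz).

    Assumption: the average VCO is replaced by the nearest upper edge.
    """
    idx = min(range(len(upper_edges)),
              key=lambda k: abs(upper_edges[k] - avg_vco))
    return idx + 1, upper_edges[idx]


edges = mel_upper_edges(23, 8000)
vcos = [frame_vco(200.0, 15), frame_vco(210.0, 14)]   # 3000 Hz and 2940 Hz
avg = sum(vcos) / len(vcos)                           # 2970 Hz
kept_channels, snapped_vco = snap_vco_to_channel(avg, edges)
```

Here kept_channels is the number of Mel channels at or below the snapped VCO, which is the quantity needed to truncate the testing features.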
This paper only considers the Mel channels below the
average VCO of the speech segment in the feature
extraction of the testing speech. Therefore, the number of
channels used for the testing speech is smaller than that used
for the training speech. However, the acoustic model of every
speech unit is trained on clean training speech and thus must be
modified and adapted to match the testing features.