Chinese Journal of Electronics
Vol.20, No.1, Jan. 2011
Robust Feature Extraction for Speech Recognition Based on Perceptually Motivated MUSIC and CCBC∗
HAN Zhiyan¹, WANG Jian¹, WANG Xu² and LUN Shuxian¹
(1. College of Information Science and Engineering, Bohai University, Jinzhou 121000, China)
(2. College of Information Science and Engineering, Northeastern University, Shenyang 110004, China)
Abstract — A novel feature extraction algorithm is proposed to improve the robustness of speech recognition. The core technique incorporates perceptual information into the Multiple signal classification (MUSIC) spectrum, which provides improved robustness and computational efficiency compared with the Mel frequency cepstral coefficient (MFCC) technique; the cepstrum coefficients are then extracted as the feature parameters. The effectiveness of the parameters is discussed in terms of class separability and speaker variability. To further improve robustness, we incorporate Canonical correlation based compensation (CCBC) to cope with the mismatch between the training and test sets. We evaluate the technique using improved Back-propagation neural networks (BPNN) in three tasks: different speakers, different recording channels and different noisy environments. The experimental results show that the novel feature is robust and effective relative to MFCC, and that the CCBC algorithm makes the speech recognition system robust to all three kinds of mismatch.
Key words — Speech recognition, Multiple signal classification (MUSIC), Canonical correlation based compensation (CCBC), Feature extraction
I. Introduction
Robust speech recognition remains a challenging task, especially in the development of core speech processing algorithms. For example, almost all current speech recognition systems use MFCC [1] as the acoustic front-end. Many researchers would agree that formulating an efficient acoustic front-end that eliminates irrelevant information, especially in noise, is a significant issue [2].
Estimating the time-varying spectrum is a key first step in the acoustic front-end. Perceptual considerations, such as the Mel and Bark scales, are often incorporated into the spectrum estimate to improve accuracy; MFCC is one such feature set.
MFCC is an effective feature for automatic speech recognition (ASR). It is computed by applying a Mel-scaled filter bank either to the short-time Fast Fourier transform (FFT) magnitude spectrum or to the short-term LPC-based spectrum. However, both FFT and LPC-based spectra are very sensitive to noise contamination. Eigenvector-based methods such as MUSIC are popular in sinusoidal frequency estimation because of their high resolution and the little prior information they require. Moreover, MUSIC suppresses noise well. We therefore adopt MUSIC and incorporate perceptual information directly into the spectrum estimation to improve the cepstral representation in noise. Recognition tests demonstrate the robustness of this method [3,4].
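As a rough illustration of why an eigenvector-based estimate resolves sinusoids sharply, the following minimal sketch computes a plain (unwarped) MUSIC pseudospectrum for one frame. It is not the perceptually warped estimator of this paper. The function name `music_pseudospectrum` and the parameters `model_order`, `corr_size` and `n_freqs` are hypothetical names chosen for the example, and the autocorrelation matrix is estimated from overlapping snapshots, which is only one of several possible choices.

```python
import numpy as np

def music_pseudospectrum(frame, model_order, corr_size, n_freqs=256):
    """Return (omega, P): a MUSIC pseudospectrum on [0, pi] rad/sample.

    model_order: assumed number of complex exponentials
                 (2 per real sinusoid).
    corr_size:   dimension M of the autocorrelation matrix (M > model_order).
    """
    x = np.asarray(frame, dtype=float)
    # Overlapping length-M snapshots of the frame, one per row.
    snapshots = np.array([x[i:i + corr_size]
                          for i in range(len(x) - corr_size + 1)])
    # Sample estimate of the M x M autocorrelation matrix.
    R = snapshots.T @ snapshots / snapshots.shape[0]
    # eigh returns eigenvalues in ascending order, so the first
    # M - model_order eigenvectors span the estimated noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    noise_vecs = eigvecs[:, :corr_size - model_order]
    omega = np.linspace(0.0, np.pi, n_freqs)
    # Steering vectors e(w) = [1, e^{jw}, ..., e^{j(M-1)w}]^T as columns.
    steering = np.exp(1j * np.outer(np.arange(corr_size), omega))
    # Squared norm of the projection of e(w) onto the noise subspace.
    denom = np.sum(np.abs(noise_vecs.T @ steering) ** 2, axis=0)
    # Peaks appear where e(w) is nearly orthogonal to the noise subspace.
    return omega, 1.0 / np.maximum(denom, 1e-12)
```

For a frame containing a single noisy cosine, calling the function with `model_order=2` should place the largest peak of `P` close to the cosine's angular frequency.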
Another significant issue is that the performance of an ASR system degrades severely under a serious mismatch between training and test conditions. The mismatch can be grouped into three classes: speaker differences, recording channel changes and noisy environment effects. In this paper, we use CCBC to compensate for these three kinds of distortion, because its calculation procedure is specific and short and it reconstructs the correct correlation between training vectors and test vectors [5].
II. Algorithm Description
1. Description of perceptual warping
(1) Direct warping of the FFT spectrum
Using a non-linearly spaced filter bank to incorporate perceptual traits into the acoustic front-end is a well-established technique. The main aim of the filter bank is to average out the harmonic information in the FFT spectrum and to track the spectral envelope. However, the filter bank produces a gross spectrum that still carries substantial pitch information, which is undesirable. MUSIC has been shown to be an appropriate spectral envelope model, so it is useful and safe to remove the filter bank structure and incorporate perceptual considerations directly into the FFT spectrum.
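The conventional filter bank approach described above can be sketched as follows, for comparison with direct warping. This is a minimal sketch under stated assumptions: the text does not give a Mel formula, so the common convention mel = 2595 log10(1 + f/700) is assumed, and the helper names `hz_to_mel`, `mel_to_hz`, `mel_filterbank` and `filterbank_energies` are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Assumed Mel-scale convention; other variants exist.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    # Centre frequency in Hz of each one-sided FFT bin.
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, bin_freqs.size))
    for m in range(n_filters):
        lo, centre, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        # Triangle: the lower of the two slopes, clipped at zero.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def filterbank_energies(frame, bank, n_fft):
    # Weighted sums of the FFT magnitude spectrum: each filter averages
    # neighbouring bins, smoothing harmonic detail within its band.
    magnitude = np.abs(np.fft.rfft(frame, n=n_fft))
    return bank @ magnitude
```

Because the filters are narrow at low frequencies, individual pitch harmonics can still pass through them largely unsmoothed, which is consistent with the drawback noted above.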
One way of incorporating perceptual considerations is to
implement the perceptual scale through a first order all-pass
∗ Manuscript Received Feb. 2009; Accepted Oct. 2010. This work is supported by the National Natural Science Foundation of China (No.60974071).