
VTS feature compensation based on two-layer GMM
structure for robust speech recognition
Lin Zhou, Haijing Li, Ying Chen, Zhenyang Wu
Key Laboratory of Underwater Acoustic Signal Processing
of Ministry of Education
School of Information Science and Engineering, SEU
Nanjing, China
Linzhou@seu.edu.cn, 1025784430@qq.com, 476141905@qq.com, zhenyang@seu.edu.cn
Yong Lu
College of Computer and Information Engineering
Hohai University
Nanjing, China
yonglu@hhu.edu.cn
Abstract—In this paper, a two-layer Gaussian Mixture Model (GMM) structure for Vector Taylor Series (VTS) feature compensation is proposed for robust speech recognition. Since the GMM used in VTS typically has a large number of mixture components, the computational complexity of VTS is very high. To deal with this issue, we propose a two-layer GMM structure for VTS. Specifically, a GMM with fewer mixture components is used to estimate the mean and variance of the noise. With the estimated noise parameters, a second GMM with more mixture components is employed to map noisy features to clean features. Simulation results show that the proposed algorithm significantly reduces the computational complexity of VTS while achieving recognition performance comparable to that of the traditional system.
Keywords—GMM model; Vector Taylor Series; feature
compensation; speech recognition
I. INTRODUCTION
In real applications, the performance of a speech recognition system degrades rapidly in the presence of environmental noise and speech variability. To address this problem, feature compensation and model adaptation algorithms have been a focus of research in robust speech recognition. For example, Stereo-based Piecewise Linear Compensation for Environments (SPLICE) [1] is a model-based feature compensation method that estimates clean speech features from noisy speech. Other methods, e.g., maximum likelihood linear regression (MLLR) [2, 3], maximum a posteriori (MAP) [4] and maximum a posteriori linear regression (MAPLR) [5], handle degraded speech by adapting the acoustic model. Although the aforementioned methods perform well, it has been shown in [6] that parallel model combination (PMC) [7] and vector Taylor series (VTS) [8, 9] can outperform them. In the VTS algorithm, noisy speech features are represented through a first-order linear approximation, and clean speech features are then estimated by the expectation-maximization (EM) approach.
However, the above feature compensation and model adaptation algorithms [10] focus on improving performance and seldom take computational complexity into account, which limits their practical application. To deal with this problem, a two-layer GMM structure is proposed to optimize the traditional VTS structure. Two GMMs with different numbers of mixture components are built. A GMM with fewer mixtures is first used to estimate the mean and variance of the noise based on the maximum likelihood (ML) criterion. A second GMM with more mixtures is then employed to estimate clean features from the noisy speech. As a result, the proposed algorithm significantly reduces the computation of VTS, while the speech recognition accuracy remains comparable to that of the traditional VTS algorithm.
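As a concrete illustration of this structure, the sketch below (our own illustration, not code from the paper) trains the two layers with scikit-learn's GaussianMixture; the component counts and the placeholder `clean_feats` array are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder clean-speech training features: (n_frames, n_cepstral_dims).
# In practice these would be MFCCs extracted from a clean training corpus.
clean_feats = np.random.randn(5000, 13)

# Layer 1: a small GMM used only for noise mean/variance estimation,
# so each EM iteration over the noise parameters stays cheap.
gmm_noise_layer = GaussianMixture(n_components=8, covariance_type='diag',
                                  random_state=0).fit(clean_feats)

# Layer 2: a large GMM used once, with the fixed noise estimate,
# to map noisy features back to clean features.
gmm_mapping_layer = GaussianMixture(n_components=128, covariance_type='diag',
                                    random_state=0).fit(clean_feats)
```

Because only the small GMM participates in the iterative noise estimation, the dominant per-iteration cost scales with its 8 components rather than with the 128 components of the mapping layer.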
The rest of this paper is organized as follows: Section II
analyzes the traditional VTS algorithm in detail. Section III
presents the two-layer GMM based VTS feature compensation.
Experimental results are given in Section IV, followed by conclusions in Section V.
II. PERFORMANCE ANALYSIS OF THE TRADITIONAL VTS
In the cepstral domain, the relationship between the noisy speech, clean speech and additive noise can be expressed as:

$$ y = x + C \log\left(1 + \exp\left(C^{-1}(n - x)\right)\right) \tag{1} $$

where $y$, $x$ and $n$ denote the cepstral features of the noisy speech, clean speech and noise, respectively; $C$ and $C^{-1}$ denote the discrete cosine transform (DCT) matrix and its inverse, respectively.
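As a quick numerical illustration of Eq. (1), the following sketch builds an orthonormal DCT matrix for $C$ (an assumption made for simplicity; a real MFCC front end uses a truncated, non-square DCT, with a pseudo-inverse in place of $C^{-1}$) and evaluates the mismatch function:

```python
import numpy as np
from scipy.fftpack import dct

n_dims = 13
# Orthonormal type-II DCT matrix: dct() applied to the identity transforms
# each identity column, so the result is the DCT matrix itself.
C = dct(np.eye(n_dims), type=2, norm='ortho', axis=0)
C_inv = np.linalg.inv(C)  # equals C.T for the orthonormal DCT

def noisy_cepstrum(x, n):
    """Eq. (1): y = x + C log(1 + exp(C^{-1} (n - x)))."""
    return x + C @ np.log1p(np.exp(C_inv @ (n - x)))
```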
In VTS [11], through a first-order Taylor expansion at the point $(\mu_x, \mu_{n_0})$, $y$ can be expressed as:

$$ y = (I - U)(x - \mu_x) + U(n - \mu_{n_0}) + \varphi \tag{2} $$

where $I$ is the identity matrix, $\mu_x$ is the mean of $x$, and $\mu_{n_0}$ is the initial mean of $n$.
The matrix $U$ and the vector $\varphi$ are defined as:

$$ \varphi = C \log\left[\exp\left(C^{-1}\mu_x\right) + \exp\left(C^{-1}\mu_{n_0}\right)\right] $$
$$ U = C \,\mathrm{diag}\left(\frac{\exp\left(C^{-1}(\mu_{n_0} - \mu_x)\right)}{1 + \exp\left(C^{-1}(\mu_{n_0} - \mu_x)\right)}\right) C^{-1} \tag{3} $$
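A direct transcription of Eq. (3) could look as follows (hypothetical function and argument names; `C` and `C_inv` are the DCT matrix and its inverse from the sketch above):

```python
import numpy as np

def vts_expansion_terms(C, C_inv, mu_x, mu_n0):
    """Eq. (3): offset phi and Jacobian U at the expansion point (mu_x, mu_n0)."""
    phi = C @ np.log(np.exp(C_inv @ mu_x) + np.exp(C_inv @ mu_n0))
    g = np.exp(C_inv @ (mu_n0 - mu_x))        # exp(C^{-1}(mu_n0 - mu_x))
    U = C @ np.diag(g / (1.0 + g)) @ C_inv    # C diag(g / (1 + g)) C^{-1}
    return phi, U
```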
The mean $\mu_y$ and covariance matrix $\Sigma_y$ of $y$ are then given by:

$$ \mu_y = U\left(\mu_n - \mu_{n_0}\right) + \varphi $$
$$ \Sigma_y = (I - U)\,\Sigma_x\,(I - U)^T + U\,\Sigma_n\,U^T \tag{4} $$