Technical Note
A cepstrum-based preprocessing and postprocessing for
speech enhancement in adverse environments
Xiaohu Hu, Shiwei Wang, Chengshi Zheng ⇑, Xiaodong Li
Communication Acoustics Laboratory, Institute of Acoustics, Chinese Academy of Sciences, 100190 Beijing, China
article info
Article history:
Received 10 April 2013
Received in revised form 23 May 2013
Accepted 5 June 2013
Keywords:
Cepstral analysis
Speech enhancement
Noise estimation
abstract
This paper proposes a cepstrum-based preprocessing and postprocessing algorithm for single-channel
speech enhancement. The cepstrum-based preprocessing scheme reduces the impact of voiced speech on
the estimation of the noise power spectral density (NPSD): by eliminating the harmonic components
of voiced speech, it avoids overestimating the NPSD when tracking non-stationary noise components.
The cepstrum-based postprocessing scheme suppresses both some non-stationary noise components and
the annoying musical noise without introducing audible speech distortion. Experimental results
show that the proposed algorithm can track non-stationary noise effectively without overestimating
the NPSD. Moreover, the proposed algorithm achieves better performance in terms of both the
segmental signal-to-noise-ratio improvement and the PESQ improvement.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction
In single-channel speech enhancement systems, it is well known
that spectral subtraction faces two open problems [1,2]. One is
how to estimate the noise power spectral density (NPSD) in adverse
environments; the other is how to suppress the non-stationary
noise components effectively even when the NPSD is severely
underestimated. Researchers have made great efforts to solve
these two problems during the last four decades [3–9,11–19].
Estimating the NPSD from noisy speech is a non-trivial task,
especially when the noise is extremely non-stationary. Generally,
NPSD estimation algorithms fall into two categories. The first
updates the NPSD only in noise-only segments and therefore needs
an accurate voice activity detection (VAD) algorithm [2]. The
second updates the NPSD in both non-speech and speech segments;
this non-VAD approach is more attractive and popular because it
can track the NPSD during speech segments [4–9,11]. Recently,
many algorithms have been proposed to track non-stationary noise.
Martin proposed the well-known minimum statistics (MS) method,
which tracks decreasing noise levels immediately but has a large
delay in tracking increasing noise levels [4]. Cohen proposed the
minima controlled recursive averaging (MCRA) method to improve
the tracking capability of the MS method [5]. Both the MS method
and the MCRA method were further improved by Rangachari and
Loizou [6]. In [8], Hendriks et al. proposed a low-complexity
MMSE estimator of the NPSD. A relatively complete evaluation of
these NPSD estimation methods can be found in [11].
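The asymmetric tracking behaviour attributed to minimum-based estimators above can be illustrated with a deliberately simplified sketch. It is not Martin's MS algorithm [4], which also includes optimal smoothing and bias compensation. It only takes the minimum of a recursively smoothed periodogram over a sliding window in a single frequency bin. The function name, the smoothing factor `alpha` and the window length `window` are assumptions chosen for illustration.

```python
from collections import deque


def track_noise_floor(power, alpha=0.85, window=8):
    """Simplified minimum-tracking noise estimate for one frequency bin.

    power  : sequence of noisy periodogram values |Y|^2, one per frame
    alpha  : recursive smoothing factor (assumed value)
    window : number of recent frames searched for the minimum (assumed value)
    """
    smoothed = power[0]
    history = deque(maxlen=window)  # keeps only the last `window` smoothed values
    estimates = []
    for p in power:
        # First-order recursive smoothing of the periodogram.
        smoothed = alpha * smoothed + (1.0 - alpha) * p
        history.append(smoothed)
        # The noise estimate is the minimum inside the sliding window.
        estimates.append(min(history))
    return estimates


# Noise level steps up at frame 20 and back down at frame 40.
power = [1.0] * 20 + [4.0] * 20 + [1.0] * 20
est = track_noise_floor(power)
```

In this toy input the estimate stays at the old level until the pre-step frames leave the window, whereas it starts to fall in the very first frame after the level drops. This mirrors the delay for increasing noise and the immediate response to decreasing noise described above.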
Underestimation of the NPSD is often unavoidable in adverse
environments for single-channel speech enhancement systems. The
residual noise may become more unpleasant to the ear when only
the stationary noise components are fully suppressed, because the
dynamic range of the noise becomes larger than before [20]. To
make the residual noise sound natural, numerous algorithms have
been proposed in the last two decades. Some researchers suggested
preserving a certain amount of background noise, which simply
reduces the dynamic range of the residual noise. In [13,14],
auditory masking properties were applied to suppress more
non-stationary noise components. Breithaupt et al. used the
cepstral smoothing technique to suppress both the musical noise
and some non-stationary noise components [15–17]. Wang et al.
proposed the modified cepstrum thresholding (MCT) technique to
achieve the same objective [18].
In this paper, we propose a new scheme to improve the tracking
capability of existing NPSD estimation methods. The scheme is
based on the fact that voiced speech often lasts a long time. A
cepstrum-based preprocessing scheme is proposed to suppress the
harmonic components of voiced speech before estimating the NPSD;
it is partly motivated by recent work on the theoretical
properties of cepstral coefficients [21–24]. Experimental results
verify that the proposed cepstrum-based preprocessing scheme can
track non-stationary noise while avoiding overestimation of the
NPSD.
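To make the idea of removing harmonic components in the cepstral domain concrete, the following is a minimal sketch and not the algorithm proposed in this paper, whose details are given in later sections. It assumes a two-sided N-point periodogram of one frame, searches an assumed pitch range (`f0_min` to `f0_max`) for the largest cepstral peak, and zeroes a few coefficients around it before transforming back. The function name and all parameter values are hypothetical.

```python
import numpy as np


def suppress_pitch_peak(power_spectrum, fs, f0_min=70.0, f0_max=400.0, halfwidth=2):
    """Zero the dominant cepstral peak in an assumed pitch range.

    power_spectrum : two-sided N-point periodogram of one frame
    fs             : sampling rate in Hz
    f0_min, f0_max : assumed pitch search range in Hz
    halfwidth      : number of cepstral bins zeroed on each side of the peak
    """
    n = len(power_spectrum)
    # Log power spectrum, floored to avoid log(0).
    log_spec = np.log(np.maximum(power_spectrum, 1e-12))
    # Real cepstrum: the log spectrum of a real frame is symmetric.
    cep = np.fft.ifft(log_spec).real

    # Quefrency range (in samples) corresponding to the pitch range.
    q_lo = int(fs / f0_max)
    q_hi = min(int(fs / f0_min), n // 2 - 1)
    peak = q_lo + int(np.argmax(cep[q_lo:q_hi + 1]))

    # Zero the peak and its mirror image so the spectrum stays real.
    lo = max(peak - halfwidth, 1)
    hi = min(peak + halfwidth, n // 2 - 1)
    cep[lo:hi + 1] = 0.0
    cep[n - hi:n - lo + 1] = 0.0

    # Back to the power spectral domain.
    return np.exp(np.fft.fft(cep).real)


# Toy input: a flat spectrum with a pure harmonic ripple (period of 40 samples).
n, fs = 512, 8000
k = np.arange(n)
ripple = np.exp(2.0 * np.cos(2 * np.pi * k * 40 / n))
flattened = suppress_pitch_peak(ripple, fs)
```

In the toy input the ripple occupies a single cepstral coefficient pair, so removing it leaves an essentially flat spectrum. A spectrum flattened in this way could then be passed to an NPSD estimator in place of the raw periodogram. A practical system would also need a voicing decision, since this sketch always removes the largest peak, even in unvoiced or noise-only frames.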
0003-682X/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.apacoust.2013.06.001
⇑ Corresponding author. Tel.: +86 10 82547945; fax: +86 10 62553898.
E-mail address: cszheng@mail.ioa.ac.cn (C. Zheng).
Applied Acoustics 74 (2013) 1458–1462