2406 I. Cohen, B. Berdugo / Signal Processing 81 (2001) 2403–2418
uncertainty. In Section 4, an expression for the a
priori speech absence probability is formulated,
based on the time–frequency distribution of the a
priori SNR. In Section 5, we present the MCRA
noise estimation approach and propose an appro-
priate speech presence probability function for
controlling the adaptation of the noise spectrum.
Finally, an objective and subjective evaluation of
the OM-LSA and MCRA estimators is performed
in Section 6.
2. Optimal gain modification
Let x(n) and d(n) denote speech and uncorre-
lated additive noise signals, respectively, where n
is a discrete-time index. The observed signal y(n),
given by y(n)=x(n)+d(n), is divided into overlap-
ping frames by the application of a window function
and analyzed using the short-time Fourier transform
(STFT). Specifically,
$$Y(k,\ell) = \sum_{n=0}^{N-1} y(n+\ell M)\, h(n)\, e^{-j(2\pi/N)nk}, \tag{1}$$
where $k$ is the frequency bin index, $\ell$ is the time frame index, $h$ is an analysis window of size $N$ (e.g., a Hanning window), and $M$ is the framing step (the number of samples separating two successive frames).
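As a minimal sketch of Eq. (1), the analysis step can be written with NumPy as follows. The function name `stft` and the default values of `N` and `M` are assumptions made for illustration and are not taken from the text; `np.fft.fft` uses the same sign convention as the exponent in Eq. (1).

```python
import numpy as np

def stft(y, N=512, M=128):
    """Sketch of Eq. (1): Y[k, l] = sum_n y[n + l*M] * h[n] * exp(-j*2*pi*n*k/N).

    N (window size) and M (framing step) are illustrative defaults.
    """
    h = np.hanning(N)                      # analysis window h(n)
    num_frames = 1 + (len(y) - N) // M     # only frames that fit entirely in y
    # Column l holds the windowed segment y(n + l*M) h(n), n = 0..N-1.
    frames = np.stack(
        [y[l * M : l * M + N] * h for l in range(num_frames)], axis=1
    )
    # FFT along the sample axis gives rows = frequency bins k, columns = frames l.
    return np.fft.fft(frames, axis=0)
```

The returned array is indexed as `Y[k, l]`, matching the ordering of the arguments in $Y(k,\ell)$.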
Let $X(k,\ell)$ denote the STFT of the clean speech. Its estimate is obtained by applying a specific gain function to each spectral component of the noisy speech signal:
$$\hat{X}(k,\ell) = G(k,\ell)\, Y(k,\ell). \tag{2}$$
Using the inverse STFT, with a synthesis window $\tilde{h}$ that is biorthogonal to the analysis window $h$ [28], the estimate for the clean speech signal is given by
$$\hat{x}(n) = \sum_{\ell} \sum_{k=0}^{N-1} \hat{X}(k,\ell)\, \tilde{h}(n-\ell M)\, e^{j(2\pi/N)k(n-\ell M)}, \tag{3}$$
where the inverse STFT is efficiently implemented using the weighted overlap-add method [5].
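The analysis, gain, and synthesis chain of Eqs. (1)-(3) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the synthesis window is built by a simple normalization that satisfies $\sum_\ell h(n-\ell M)\tilde{h}(n-\ell M) = 1$, which may differ from the construction in [28]; the constant gain is a placeholder; and the $1/N$ factor applied by `np.fft.ifft` is assumed to be folded into the synthesis window scaling.

```python
import numpy as np

def synthesis_window(h, M):
    """One simple window h_syn with sum_l h(n - l*M) * h_syn(n - l*M) = 1.

    Assumption: illustrative normalization, not necessarily the window of [28].
    """
    N = len(h)
    if N % M != 0:
        raise ValueError("this sketch assumes N is a multiple of M")
    # denom[n] = sum_m h^2(n + m*M) for n = 0..M-1
    denom = (h ** 2).reshape(N // M, M).sum(axis=0)
    return h / np.tile(denom, N // M)

def stft(y, h, M):
    """Analysis as in Eq. (1); rows are bins k, columns are frames l."""
    N = len(h)
    num_frames = 1 + (len(y) - N) // M
    frames = np.stack(
        [y[l * M : l * M + N] * h for l in range(num_frames)], axis=1
    )
    return np.fft.fft(frames, axis=0)

def istft_wola(X_hat, h_syn, M):
    """Weighted overlap-add synthesis in the spirit of Eq. (3)."""
    N, num_frames = X_hat.shape
    x_hat = np.zeros((num_frames - 1) * M + N)
    for l in range(num_frames):
        # np.fft.ifft includes a 1/N factor that Eq. (3) does not write explicitly.
        frame = np.fft.ifft(X_hat[:, l]).real
        x_hat[l * M : l * M + N] += h_syn * frame
    return x_hat

# Usage: a constant placeholder gain (not the OM-LSA gain) should scale the signal.
N, M = 256, 64                      # illustrative sizes
h = np.hanning(N)
h_syn = synthesis_window(h, M)
rng = np.random.default_rng(0)
y = rng.standard_normal(4096)
Y = stft(y, h, M)
G = np.full(Y.shape, 0.5)           # Eq. (2) with G(k, l) = 0.5 everywhere
x_hat = istft_wola(G * Y, h_syn, M)
interior = slice(N, len(y) - N)     # edges are covered by fewer frames
max_err = np.max(np.abs(x_hat[interior] - 0.5 * y[interior]))
```

Away from the signal edges, where every sample is covered by all $N/M$ overlapping frames, `x_hat` should match `0.5 * y` to within floating-point error, which is a quick check that the analysis and synthesis windows are consistent.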
Among various existing speech enhancement methods, which can be represented by different spectral gain functions, we choose the LSA estimator [8] due to its superiority in reducing musical noise phenomena. The LSA estimator minimizes
$$E\{(\log A(k,\ell) - \log \hat{A}(k,\ell))^2\},$$
where $A(k,\ell) = |X(k,\ell)|$ denotes the spectral speech amplitude, and $\hat{A}(k,\ell)$ its optimal estimate. Assuming statistically independent spectral components [8], the LSA estimator is defined by
$$\hat{A}(k,\ell) = \exp\{E[\log A(k,\ell) \mid Y(k,\ell)]\}. \tag{4}$$
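The form of Eq. (4) can be checked numerically: for any set of amplitude samples, the value that minimizes the mean squared log error is the exponential of the mean log, that is, the geometric mean. The sketch below uses Rayleigh-distributed draws purely as a hypothetical stand-in for samples of $A(k,\ell)$ given $Y(k,\ell)$; it does not implement the closed-form LSA gain of [8].

```python
import numpy as np

def log_mse(a_hat, samples):
    """Empirical version of E{(log A - log A_hat)^2} for a fixed estimate a_hat."""
    return np.mean((np.log(samples) - np.log(a_hat)) ** 2)

rng = np.random.default_rng(0)
# Hypothetical stand-in for draws of A given Y (the distribution is an assumption).
samples = rng.rayleigh(scale=1.0, size=100_000)

a_lsa = np.exp(np.mean(np.log(samples)))   # sample analogue of Eq. (4)
a_mean = np.mean(samples)                  # arithmetic mean, for comparison

err_lsa = log_mse(a_lsa, samples)
err_mean = log_mse(a_mean, samples)
err_perturbed = log_mse(1.05 * a_lsa, samples)
```

Here `err_lsa` is no larger than `err_mean` or `err_perturbed`, and `a_lsa` falls below `a_mean`, as expected from Jensen's inequality.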
Given two hypotheses, $H_0(k,\ell)$ and $H_1(k,\ell)$, which indicate, respectively, speech absence and presence in the $k$th frequency bin of the $\ell$th frame, we have
$$\begin{aligned}
H_0(k,\ell) &: \; Y(k,\ell) = D(k,\ell), \\
H_1(k,\ell) &: \; Y(k,\ell) = X(k,\ell) + D(k,\ell),
\end{aligned} \tag{5}$$
where $D(k,\ell)$ represents the STFT of the noise signal. We assume that the STFT coefficients, for both speech and noise, are complex Gaussian variables [7]. Accordingly, the conditional PDFs of the observed signal are given by
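$$\begin{aligned}
p(Y(k,\ell) \mid H_0(k,\ell)) &= \frac{1}{\pi \lambda_d(k,\ell)} \exp\left\{ -\frac{|Y(k,\ell)|^2}{\lambda_d(k,\ell)} \right\}, \\
p(Y(k,\ell) \mid H_1(k,\ell)) &= \frac{1}{\pi (\lambda_x(k,\ell) + \lambda_d(k,\ell))} \exp\left\{ -\frac{|Y(k,\ell)|^2}{\lambda_x(k,\ell) + \lambda_d(k,\ell)} \right\},
\end{aligned} \tag{6}$$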
where $\lambda_x(k,\ell) = E[|X(k,\ell)|^2 \mid H_1(k,\ell)]$ and $\lambda_d(k,\ell) = E[|D(k,\ell)|^2]$ denote, respectively, the variances of speech and noise. Applying Bayes' rule for the conditional speech presence probability, one obtains
$$P(H_1(k,\ell) \mid Y(k,\ell)) = \frac{\Lambda(k,\ell)}{1 + \Lambda(k,\ell)} \triangleq p(k,\ell), \tag{7}$$
where $\Lambda(k,\ell)$ is the generalized likelihood ratio defined by
$$\Lambda(k,\ell) = \frac{1 - q(k,\ell)}{q(k,\ell)} \, \frac{p(Y(k,\ell) \mid H_1(k,\ell))}{p(Y(k,\ell) \mid H_0(k,\ell))} \tag{8}$$
and $q(k,\ell) \triangleq P(H_0(k,\ell))$ is the a priori probability for speech absence. Substituting (6) and (8) into