represented as
$$
y_m(t) = \sum_{k} \sum_{\tau} h_m^{(k)}(\tau)\, s^{(k)}(t - \tau) + n_m(t), \tag{1}
$$
where $s^{(k)}(t)$ and $n_m(t)$ denote the $k$-th source signal and the noise signal recorded at the $m$-th microphone, respectively, and $h_m^{(k)}(\tau)$ denotes the impulse response between the $k$-th source and the $m$-th microphone.
By applying a short-time Fourier transform (STFT), (1) can be expressed in the time-frequency domain as
$$
y_m(f,t) = \sum_{k} h_m^{(k)}(f)\, s^{(k)}(f,t) + n_m(f,t), \tag{2}
$$
where $y_m(f,t)$, $h_m^{(k)}(f)$, $s^{(k)}(f,t)$, and $n_m(f,t)$ denote the time-frequency domain representations of $y_m(t)$, $h_m^{(k)}(\tau)$, $s^{(k)}(t)$, and $n_m(t)$, respectively. Here we assume that the length of an impulse response is much shorter than that of an STFT window. Hence, a convolution between the impulse response and the source signal in the time domain can be represented as the product of a time-invariant frequency response and the time-variant source signal in the time-frequency domain. This STFT-domain expression leads to a computationally efficient algorithm for the source separation problem [16]. Introducing vector notation, (2) can be rewritten as
$$
\mathbf{y}(f,t) = \sum_{k} \mathbf{r}^{(k)}(f)\, s^{(k)}(f,t) + \mathbf{n}(f,t), \tag{3}
$$
where
$$
\mathbf{y}(f,t) = \left[ y_1(f,t), \ldots, y_M(f,t) \right]^T, \tag{4}
$$
$$
\mathbf{r}^{(k)}(f) = \left[ h_1^{(k)}(f), \ldots, h_M^{(k)}(f) \right]^T, \tag{5}
$$
$$
\mathbf{n}(f,t) = \left[ n_1(f,t), \ldots, n_M(f,t) \right]^T. \tag{6}
$$
Superscript $T$ denotes non-conjugate transposition. $\mathbf{r}^{(k)}(f)$ denotes the vector of frequency responses between the $k$-th source and the microphones, which is often called a steering vector. The goal of the source separation (or speech enhancement) problem is to recover each target source signal $s^{(k)}(f,t)$ from the observed signal $\mathbf{y}(f,t)$, in which the source signals are mixed and corrupted by noise. In the following, we write $f$ and $t$ as subscripts to simplify the notation.
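For illustration, below is a minimal NumPy sketch of the narrowband mixing model (3)-(6); the numbers of microphones, sources, frequency bins, and frames are arbitrary values chosen for this example, not values from the paper.

```python
import numpy as np

# Arbitrary example dimensions: M microphones, K sources,
# F frequency bins, T time frames.
M, K, F, T = 4, 2, 257, 100
rng = np.random.default_rng(0)

def crandn(*shape):
    """Complex Gaussian samples, used here as stand-ins for real STFT data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

r = crandn(K, F, M)        # steering vectors r^(k)(f), cf. (5)
s = crandn(K, F, T)        # source STFT coefficients s^(k)(f, t)
n = 0.1 * crandn(F, T, M)  # sensor noise n(f, t), cf. (6)

# Narrowband mixing model (3): y(f,t) = sum_k r^(k)(f) s^(k)(f,t) + n(f,t).
y = np.einsum('kfm,kft->ftm', r, s) + n
print(y.shape)  # (F, T, M): one M-dim observation per time-frequency point, cf. (4)
```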
IV. OVERVIEW OF OUR MICROPHONE ARRAY SYSTEM
Fig. 1 shows a diagram of our microphone array system archi-
tecture. The system inputs consist of noise-corrupted and mixed
speech signals that are captured by the microphone array. The
system comprises a beamformer, a steering vector estimator,
and a time-frequency mask estimator. These three components
combine to generate an enhanced speech signal.
This section briefly reviews beamforming and steering vector
estimation based on time-frequency masks, which is followed in
the next section by a detailed explanation of our time-frequency
masking.
Fig. 1. Schematic diagram of our microphone array system architecture.
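To make the data flow of Fig. 1 concrete, a minimal skeleton of such a pipeline is sketched below; all function names and signatures are hypothetical placeholders for the three components, not the paper's implementation.

```python
def estimate_tf_masks(y_stft):
    """Return one time-frequency mask per source (detailed in the next section)."""
    raise NotImplementedError

def estimate_steering_vector(y_stft, mask):
    """Derive a steering vector from mask-weighted statistics (Section IV-B)."""
    raise NotImplementedError

def mvdr_beamform(y_stft, r):
    """Apply an MVDR filter with steering vector r (Section IV-A)."""
    raise NotImplementedError

def enhance(y_stft):
    """y_stft: observed multichannel STFT, shape (F, T, M)."""
    enhanced = []
    for mask in estimate_tf_masks(y_stft):
        r = estimate_steering_vector(y_stft, mask)
        enhanced.append(mvdr_beamform(y_stft, r))
    return enhanced  # one enhanced STFT per target source
```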
A. Beamforming
The assumed architecture performs MVDR beamforming to enhance a speech signal in the STFT domain. The beamformer applies a linear filter $\mathbf{w}_f^{(k)}$ to the microphone signal vector to produce an enhanced $k$-th speech signal, $\hat{s}_{f,t}^{(k)}$, as
$$
\hat{s}_{f,t}^{(k)} = \mathbf{w}_f^{(k)H} \mathbf{y}_{f,t}, \tag{7}
$$
where superscript $H$ denotes conjugate transposition. By minimizing the beamformer output variance subject to $\mathbf{w}_f^{(k)H} \mathbf{r}_f^{(k)} = 1$, the filter for the $k$-th source, $\mathbf{w}_f^{(k)}$, is determined as [17]:
$$
\mathbf{w}_f^{(k)} = \frac{\mathbf{R}_f^{(y)-1}\, \mathbf{r}_f^{(k)}}{\mathbf{r}_f^{(k)H}\, \mathbf{R}_f^{(y)-1}\, \mathbf{r}_f^{(k)}}, \tag{8}
$$
where $\mathbf{R}_f^{(y)}$ denotes the covariance matrix of the observed signals, calculated as
$$
\mathbf{R}_f^{(y)} = \frac{1}{T} \sum_{t} \mathbf{y}_{f,t}\, \mathbf{y}_{f,t}^H. \tag{9}
$$
It should be noted that our framework can also be used with
other beamformers such as a multichannel Wiener filter.
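As a concrete illustration, the following is a minimal NumPy sketch of (7)-(9) for a single frequency bin. The small diagonal loading term is a common practical addition for numerical stability, not part of the formulation above.

```python
import numpy as np

def mvdr_filter(Y, r, loading=1e-6):
    """MVDR filter for one frequency bin, following (8) and (9).

    Y       : observed STFT frames at this bin, shape (T, M), complex.
    r       : steering vector of the target source, shape (M,), complex.
    loading : small diagonal term for numerical stability (a practical
              addition, not part of (8) itself).
    """
    T, M = Y.shape
    # Covariance of the observed signals, (9): R = (1/T) sum_t y y^H.
    R = np.einsum('ti,tj->ij', Y, Y.conj()) / T
    R += loading * np.eye(M)
    # MVDR solution, (8): w = R^{-1} r / (r^H R^{-1} r).
    Rinv_r = np.linalg.solve(R, r)
    return Rinv_r / (r.conj() @ Rinv_r)

# Beamforming, (7): s_hat(f,t) = w^H y(f,t), applied to all frames at once:
#     w = mvdr_filter(Y, r)
#     s_hat = Y @ w.conj()  # shape (T,)
```

The constraint $\mathbf{w}_f^{(k)H} \mathbf{r}_f^{(k)} = 1$ then holds by construction, so the target is passed with unit gain while the output variance is minimized.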
B. Steering Vector Estimation
The key to successful beamforming lies in accurate estimation of the steering vector. Conventional beamformers often obtain the steering vector from DOA estimates combined with a plane-wave propagation assumption, which holds only in an ideal anechoic space. Relying on DOA estimates can also degrade the noise reduction performance, because their estimation accuracy deteriorates when SNRs are low.
Our approach does not use such error-prone prior knowledge to obtain an accurate estimate of the steering vector. The basic idea is to estimate the steering vector directly using the covariance