A Study of Learning Based Beamforming Methods for Speech Recognition
Xiong Xiao¹, Chenglin Xu¹, Zhaofeng Zhang², Shengkui Zhao³, Sining Sun⁴, Shinji Watanabe⁵, Longbiao Wang⁶, Lei Xie⁴, Douglas L. Jones³, Eng Siong Chng¹, Haizhou Li⁷,¹
¹Nanyang Technological University (NTU), Singapore, ²Nagaoka University of Technology, Japan, ³Advanced Digital Sciences Center, Singapore, ⁴Northwestern Polytechnical University, China, ⁵Mitsubishi Electric Research Laboratories, USA, ⁶Tianjin University, China, ⁷National University of Singapore, Singapore.
{xiaoxiong, xuchenglin}@ntu.edu.sg, s147002@stn.nagaokaut.ac.jp, shengkui.zhao@adsc.com.sg
Abstract
This paper presents a comparative study of three learning based beamforming methods that are specifically designed for robust speech recognition. The three methods are: 1) a neural network that predicts beamforming weights from generalized cross correlation (GCC) features; 2) a neural network that predicts a time-frequency (TF) mask, which is used to estimate minimum variance distortionless response (MVDR) beamforming weights; 3) maximum likelihood estimation of beamforming weights that fits enhanced features to a clean-trained Gaussian mixture model. All three methods operate in the frequency domain. They are evaluated on the CHiME-4 speech recognition benchmark and compared with traditional delay-and-sum and MVDR beamforming methods on the same task. Discussions and future research directions are presented.
1. Introduction
Beamforming is an important approach to improving the performance of automatic speech recognition (ASR) in far-field scenarios. Traditional beamforming methods enhance speech signals to improve signal-level criteria, e.g., the signal-to-noise ratio (SNR) of the output signal. As these criteria are not directly related to ASR performance measures, traditional methods are usually not optimized for the ASR task.
Recently, several learning based beamforming methods have been proposed for the ASR task. By learning based methods, we mean methods that learn from a large amount of training data (single or multi-channel) and apply the learned knowledge at run time to estimate parameters for ASR, e.g., beamforming weights. In one approach [1–3], multi-channel raw waveforms are fed into the neural network acoustic model directly. A temporal convolution layer at the bottom of the network is used to approximate the filter-and-sum beamforming operation. After training, the temporal convolution layer learns a fixed bank of spatial and temporal filters, each with a specific look direction. We call this approach the spatial filter learning approach.
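As a rough illustration of this operation, the sketch below (ours, not the exact architecture of [1–3]; all sizes are assumptions) shows how a single multi-channel temporal convolution implements filter-and-sum over raw waveforms:

```python
import torch
import torch.nn as nn

# A 1-D convolution over M raw input channels filters each channel and sums
# over channels, i.e. y_p[t] = sum_m sum_k h_{p,m}[k] x_m[t-k], which is
# exactly filter-and-sum beamforming with P learned spatial filters.
M, K, P = 6, 400, 128          # mics, filter taps, learned filters (assumed sizes)
filter_and_sum = nn.Conv1d(in_channels=M, out_channels=P, kernel_size=K, bias=False)

x = torch.randn(1, M, 16000)   # one second of 6-channel audio at 16 kHz
y = filter_and_sum(x)          # (1, P, T'): one output per learned look direction
```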
In another approach, beamforming filter weights are predicted by neural networks that are jointly optimized with the acoustic model network. A deep neural network (DNN) is used to predict beamforming weights in the frequency domain from generalized cross correlation (GCC) features [4] or spatial covariance matrix (SCM) features [5]. In [6], long short-term memory (LSTM) networks are used to predict the beamforming weights directly in the time domain, which requires fewer free parameters than the frequency domain. We call this approach the spatial filter prediction approach. While the filter learning approach learns a fixed set of spatial filters, the filter prediction approach predicts spatial filters dynamically from the input data.
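For concreteness, a minimal sketch of GCC-PHAT feature extraction, the kind of input used by the predictor in [4], is given below (our simplification; the lag range is an assumption):

```python
import numpy as np

def gcc_phat(x1, x2, n_lags=30, eps=1e-8):
    """GCC-PHAT between two channels: phase-transform weighted
    cross-power spectrum, converted back to the lag domain."""
    n = len(x1)
    X1 = np.fft.rfft(x1, n=2 * n)
    X2 = np.fft.rfft(x2, n=2 * n)
    cps = X1 * np.conj(X2)                        # cross-power spectrum
    cc = np.fft.irfft(cps / (np.abs(cps) + eps))  # PHAT weighting
    # keep the lags around zero delay as the feature vector
    return np.concatenate([cc[-n_lags:], cc[:n_lags + 1]])
```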
In a third approach, neural networks are used to predict a time-frequency (TF) mask that specifies whether each TF bin is dominated by speech or noise. The TF mask is used to estimate the speech and noise SCMs required by beamforming methods such as the minimum variance distortionless response (MVDR) [7, 8] and generalized eigenvalue (GEV) [9, 10] beamformers. The mask predicting network can be trained by using ideal masks as targets [11–13] or by minimizing the ASR cost function [14]. The filter learning, filter predicting, and mask predicting approaches are called discriminative approaches in this paper, as their models are trained to minimize the ASR error.
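The following sketch illustrates mask-based MVDR beamforming (our simplification of [7, 8]; the eigenvector steering vector and diagonal loading are common choices, not necessarily those of the cited works):

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask):
    """Y: (F, T, M) multi-channel STFT; masks: (F, T) values in [0, 1]."""
    F, T, M = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                                              # (T, M)
        # mask-weighted speech and noise spatial covariance matrices
        Phi_s = (speech_mask[f, :, None] * Yf).T @ Yf.conj() / speech_mask[f].sum()
        Phi_n = (noise_mask[f, :, None] * Yf).T @ Yf.conj() / noise_mask[f].sum()
        Phi_n += 1e-6 * np.eye(M)                              # diagonal loading
        # steering vector: principal eigenvector of the speech SCM
        d = np.linalg.eigh(Phi_s)[1][:, -1]
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)                             # MVDR weights
        out[f] = Yf @ w.conj()                                 # beamformed output
    return out
```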
Besides discriminative methods, there are also learning based beamforming methods built on generative modeling of speech features. In [15, 17], a method called LIMABEAM estimates time or frequency domain filter-and-sum weights that maximize the likelihood of the enhanced feature vectors under a clean-trained HMM/GMM acoustic model. In the unsupervised implementation, multi-pass decoding is required: the first pass decoding provides the hypothesized text used to obtain the HMM state alignment, and the beamforming weights are then estimated iteratively to maximize the likelihood of the enhanced features given the state alignment. LIMABEAM is reported to outperform delay-and-sum beamforming in several ASR tasks.
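A minimal sketch of this iterative ML estimation in the frequency domain follows (our simplification, not the exact LIMABEAM recipe; the log-spectral features, gradient-based optimizer, and given alignment are assumptions):

```python
import torch

def ml_beamform(Y, means, variances, n_iter=50, lr=1e-2):
    """Y: (F, T, M) complex STFT; means, variances: (T, F) parameters of the
    clean-trained Gaussians selected by the state alignment (assumed given)."""
    F_, T, M = Y.shape
    w = torch.zeros(F_, M, dtype=torch.complex64)
    w[:, 0] = 1.0                                   # initialize: pick microphone 0
    w.requires_grad_(True)
    for _ in range(n_iter):
        s = torch.einsum('fm,ftm->ft', w.conj(), Y)            # beamformed STFT
        logspec = torch.log(s.abs() ** 2 + 1e-10).T            # (T, F) features
        nll = 0.5 * (((logspec - means) ** 2) / variances).sum()
        nll.backward()                              # maximize likelihood
        with torch.no_grad():
            w -= lr * w.grad                        # gradient step on the weights
            w.grad.zero_()
    return w.detach()
```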
Although several learning based methods have been proposed in the past, they are usually implemented by different researchers and evaluated on different ASR tasks. As a result, it is difficult to compare their performance. In this paper, we study three learning based beamforming methods comparatively, implementing them in the same toolkit, i.e. SignalGraph [25], and evaluating them on the same task, i.e. the CHiME-4 speech recognition task [16]. The three methods are a maximum likelihood (ML) beamforming method (a variant of LIMABEAM [15]), the spatial filter weight predicting network [4], and the mask predicting network [14].
2. Learning Based Beamforming Methods
2.1. Spatial Filter Weight Predicting Network
The system diagram of the spatial filter weight predicting network [4] is shown in Fig. 1. On the bottom left of the figure, a network is used to predict the beamforming weights in the frequency domain. The weights are then applied to the multi-channel inputs to generate enhanced speech, from which features are extracted for acoustic modeling.
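A minimal sketch of such a predictor is shown below (ours; the layer sizes, feature dimensions, and utterance-level GCC input are assumptions rather than the exact configuration of [4]):

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Predicts complex frequency-domain beamforming weights from GCC
    features and applies them to the multi-channel STFT."""
    def __init__(self, gcc_dim, n_freq, n_mics, hidden=512):
        super().__init__()
        self.n_freq, self.n_mics = n_freq, n_mics
        self.net = nn.Sequential(
            nn.Linear(gcc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # real and imaginary parts of one weight per (frequency, mic)
            nn.Linear(hidden, 2 * n_freq * n_mics),
        )

    def forward(self, gcc, Y):
        # gcc: (B, gcc_dim) GCC features; Y: (B, F, T, M) multi-channel STFT
        out = self.net(gcc).view(-1, self.n_freq, self.n_mics, 2)
        w = torch.complex(out[..., 0], out[..., 1])            # (B, F, M)
        return torch.einsum('bfm,bftm->bft', w.conj(), Y)      # enhanced STFT
```

In the joint training setup described above, the enhanced STFT would feed the feature extraction and acoustic model, so the predictor is optimized through the ASR cost rather than a signal-level criterion.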