A Study of Learning Based Beamforming Methods for Speech Recognition
Xiong Xiao¹, Chenglin Xu¹, Zhaofeng Zhang², Shengkui Zhao³, Sining Sun⁴, Shinji Watanabe⁵, Longbiao Wang⁶, Lei Xie⁴, Douglas L. Jones³, Eng Siong Chng¹, Haizhou Li⁷,¹
¹Nanyang Technological University (NTU), Singapore, ²Nagaoka University of Technology, Japan, ³Advanced Digital Sciences Center, Singapore, ⁴Northwestern Polytechnical University, China, ⁵Mitsubishi Electric Research Laboratories, USA, ⁶Tianjin University, China, ⁷National University of Singapore, Singapore.
{xiaoxiong, xuchenglin}@ntu.edu.sg, s147002@stn.nagaokaut.ac.jp, shengkui.zhao@adsc.com.sg
Abstract
This paper presents a comparative study of three learning based beamforming methods that are specifically designed for robust speech recognition. The three methods are: 1) a neural network that predicts beamforming weights from generalized cross correlation (GCC) features; 2) a neural network that predicts a time-frequency (TF) mask, which is used to estimate minimum variance distortionless response (MVDR) beamforming weights; 3) maximum likelihood estimation of beamforming weights that fits enhanced features to a clean-trained Gaussian mixture model. All three methods operate in the frequency domain. They are evaluated on the CHiME-4 speech recognition benchmark and compared with traditional delay-and-sum and MVDR beamforming methods on the same task. Discussions and future research directions are presented.
1. Introduction
Beamforming is an important approach to improving the performance of automatic speech recognition (ASR) in far-field scenarios. Traditional beamforming methods enhance speech signals to improve signal-level criteria, e.g., the signal-to-noise ratio (SNR) of the output signal. As these criteria are not directly related to ASR performance measures, traditional methods are usually not optimized for the ASR task.
Recently, several learning based beamforming methods have been proposed for the ASR task. By learning based methods, we mean methods that learn from a large amount of training data (single or multi-channel) and apply the learned knowledge at run time to estimate parameters for ASR, e.g., beamforming weights. In one approach [1–3], multi-channel raw waveforms are fed into the neural network acoustic model directly. A temporal convolution layer at the bottom of the network is used to approximate the filter-and-sum beamforming operation. After training, the temporal convolution layer learns a fixed bank of spatial and temporal filters, each with a specific look direction. We call this approach the spatial filter learning approach.
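As a rough illustration of this operation, the sketch below (ours, not the exact architecture of [1–3]; all sizes are assumptions) shows how a single multi-channel temporal convolution implements filter-and-sum over raw waveforms:

```python
import torch
import torch.nn as nn

# A 1-D convolution over M raw input channels filters each channel and sums
# over channels, i.e. y_p[t] = sum_m sum_k h_{p,m}[k] x_m[t-k], which is
# exactly filter-and-sum beamforming with P learned spatial filters.
M, K, P = 6, 400, 128          # mics, filter taps, learned filters (assumed sizes)
filter_and_sum = nn.Conv1d(in_channels=M, out_channels=P, kernel_size=K, bias=False)

x = torch.randn(1, M, 16000)   # one second of 6-channel audio at 16 kHz
y = filter_and_sum(x)          # (1, P, T'): one output per learned look direction
```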
In another approach, beamforming filter weights are predicted by neural networks that are jointly optimized with the acoustic model network. A deep neural network (DNN) is used to predict beamforming weights in the frequency domain from generalized cross correlation (GCC) features [4] or spatial covariance matrix (SCM) features [5]. In [6], long short-term memory (LSTM) networks are used to predict the beamforming weights directly in the time domain, which requires fewer free parameters than the frequency domain. We call this approach the spatial filter prediction approach. While the filter learning approach learns a fixed set of spatial filters, the filter prediction approach predicts spatial filters dynamically from the input data.
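For concreteness, a minimal sketch of GCC-PHAT feature extraction, the kind of input used by the predictor in [4], is given below (our simplification; the lag range is an assumption):

```python
import numpy as np

def gcc_phat(x1, x2, n_lags=30, eps=1e-8):
    """GCC-PHAT between two channels: phase-transform weighted
    cross-power spectrum, converted back to the lag domain."""
    n = len(x1)
    X1 = np.fft.rfft(x1, n=2 * n)
    X2 = np.fft.rfft(x2, n=2 * n)
    cps = X1 * np.conj(X2)                        # cross-power spectrum
    cc = np.fft.irfft(cps / (np.abs(cps) + eps))  # PHAT weighting
    # keep the lags around zero delay as the feature vector
    return np.concatenate([cc[-n_lags:], cc[:n_lags + 1]])
```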
In a third approach, neural networks are used to predict a time-frequency (TF) mask that specifies whether each TF bin is dominated by speech or noise. The TF mask is used to estimate the speech and noise SCMs required by beamforming methods such as the minimum variance distortionless response (MVDR) [7, 8] and generalized eigenvalue (GEV) [9, 10] beamformers. The mask predicting network can be trained by using ideal masks as targets [11–13] or by minimizing the ASR cost function [14]. The filter learning, filter predicting, and mask predicting approaches are called discriminative approaches in this paper, as their models are trained to minimize the ASR error.
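The following sketch illustrates mask-based MVDR beamforming (our simplification of [7, 8]; the eigenvector steering vector and diagonal loading are common choices, not necessarily those of the cited works):

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask):
    """Y: (F, T, M) multi-channel STFT; masks: (F, T) values in [0, 1]."""
    F, T, M = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                                              # (T, M)
        # mask-weighted speech and noise spatial covariance matrices
        Phi_s = (speech_mask[f, :, None] * Yf).T @ Yf.conj() / speech_mask[f].sum()
        Phi_n = (noise_mask[f, :, None] * Yf).T @ Yf.conj() / noise_mask[f].sum()
        Phi_n += 1e-6 * np.eye(M)                              # diagonal loading
        # steering vector: principal eigenvector of the speech SCM
        d = np.linalg.eigh(Phi_s)[1][:, -1]
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)                             # MVDR weights
        out[f] = Yf @ w.conj()                                 # beamformed output
    return out
```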
Besides discriminative methods, there are also learning based beamforming methods built on generative modeling of speech features. In [15, 17], a method called LIMABEAM estimates time or frequency domain filter-and-sum weights that maximize the likelihood of the enhanced feature vectors under a clean-trained HMM/GMM acoustic model. In the unsupervised implementation, multi-pass decoding is required: the first pass decoding provides the hypothesized text used to obtain the HMM state alignment, and the beamforming weights are then estimated iteratively to maximize the likelihood of the enhanced features given the state alignment. LIMABEAM is reported to outperform delay-and-sum beamforming in several ASR tasks.
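A minimal sketch of this iterative ML estimation in the frequency domain follows (our simplification, not the exact LIMABEAM recipe; the log-spectral features, gradient-based optimizer, and given alignment are assumptions):

```python
import torch

def ml_beamform(Y, means, variances, n_iter=50, lr=1e-2):
    """Y: (F, T, M) complex STFT; means, variances: (T, F) parameters of the
    clean-trained Gaussians selected by the state alignment (assumed given)."""
    F_, T, M = Y.shape
    w = torch.zeros(F_, M, dtype=torch.complex64)
    w[:, 0] = 1.0                                   # initialize: pick microphone 0
    w.requires_grad_(True)
    for _ in range(n_iter):
        s = torch.einsum('fm,ftm->ft', w.conj(), Y)            # beamformed STFT
        logspec = torch.log(s.abs() ** 2 + 1e-10).T            # (T, F) features
        nll = 0.5 * (((logspec - means) ** 2) / variances).sum()
        nll.backward()                              # maximize likelihood
        with torch.no_grad():
            w -= lr * w.grad                        # gradient step on the weights
            w.grad.zero_()
    return w.detach()
```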
Although several learning based methods have been proposed in the past, they are usually implemented by different researchers and evaluated on different ASR tasks. As a result, it is difficult to compare their performance. In this paper, we study three learning based beamforming methods comparatively, implementing them in the same toolkit, i.e. SignalGraph [25], and evaluating them on the same task, i.e. the CHiME-4 speech recognition task [16]. The three methods are a maximum likelihood (ML) beamforming method (a variant of LIMABEAM [15]), the spatial filter weight predicting network [4], and the mask predicting network [14].
2. Learning Based Beamforming Methods
2.1. Spatial Filter Weight Predicting Network
The system diagram of the spatial filter weight predicting network [4] is shown in Fig. 1. On the bottom left of the figure, a network is used to predict the beamforming weights in the frequency domain. The weights are then applied to the multi-channel inputs to generate enhanced speech, from which features are extracted for acoustic modeling.
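A minimal sketch of such a predictor is shown below (ours; the layer sizes, feature dimensions, and utterance-level GCC input are assumptions rather than the exact configuration of [4]):

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Predicts complex frequency-domain beamforming weights from GCC
    features and applies them to the multi-channel STFT."""
    def __init__(self, gcc_dim, n_freq, n_mics, hidden=512):
        super().__init__()
        self.n_freq, self.n_mics = n_freq, n_mics
        self.net = nn.Sequential(
            nn.Linear(gcc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # real and imaginary parts of one weight per (frequency, mic)
            nn.Linear(hidden, 2 * n_freq * n_mics),
        )

    def forward(self, gcc, Y):
        # gcc: (B, gcc_dim) GCC features; Y: (B, F, T, M) multi-channel STFT
        out = self.net(gcc).view(-1, self.n_freq, self.n_mics, 2)
        w = torch.complex(out[..., 0], out[..., 1])            # (B, F, M)
        return torch.einsum('bfm,bftm->bft', w.conj(), Y)      # enhanced STFT
```

In the joint training setup described above, the enhanced STFT would feed the feature extraction and acoustic model, so the predictor is optimized through the ASR cost rather than a signal-level criterion.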