represented as
$$
y_m(t) = \sum_{k} \sum_{\tau} h_m^{(k)}(\tau)\, s^{(k)}(t - \tau) + n_m(t), \tag{1}
$$
where $s^{(k)}(t)$ and $n_m(t)$ denote the $k$-th source signal and the noise signal recorded at the $m$-th microphone, respectively, and $h_m^{(k)}(\tau)$ denotes the impulse response between the $k$-th source and the $m$-th microphone.
By applying a short-time Fourier transform (STFT), (1) can be expressed in the time-frequency domain as
$$
y_m(f,t) = \sum_{k} h_m^{(k)}(f)\, s^{(k)}(f,t) + n_m(f,t), \tag{2}
$$
where $y_m(f,t)$, $h_m^{(k)}(f)$, $s^{(k)}(f,t)$, and $n_m(f,t)$ denote the time-frequency domain representations of $y_m(t)$, $h_m^{(k)}(\tau)$, $s^{(k)}(t)$, and $n_m(t)$, respectively. Here we assume that the length of an impulse response is much shorter than that of an STFT window. Hence, a convolution between the impulse response and the source signal in the time domain can be represented as the product of a time-invariant frequency response and the time-variant source signal in the time-frequency domain. This STFT-domain expression leads to a computationally efficient algorithm for the source separation problem [16]. Introducing vector notation, (2) can be rewritten as
$$
\mathbf{y}(f,t) = \sum_{k} \mathbf{r}^{(k)}(f)\, s^{(k)}(f,t) + \mathbf{n}(f,t), \tag{3}
$$
where
$$
\mathbf{y}(f,t) = \left[ y_1(f,t), \ldots, y_M(f,t) \right]^T, \tag{4}
$$
$$
\mathbf{r}^{(k)}(f) = \left[ h_1^{(k)}(f), \ldots, h_M^{(k)}(f) \right]^T, \tag{5}
$$
$$
\mathbf{n}(f,t) = \left[ n_1(f,t), \ldots, n_M(f,t) \right]^T. \tag{6}
$$
Superscript $T$ denotes non-conjugate transposition. $\mathbf{r}^{(k)}(f)$ denotes the vector of frequency responses between the $k$-th source and the microphones, which is often called a steering vector. The goal of the source separation (or speech enhancement) problem is to recover each target source signal $s^{(k)}(f,t)$ from the observed signal $\mathbf{y}(f,t)$, in which the source signals are mixed and corrupted by noise. In the following, we write $f$ and $t$ as subscripts to simplify the notation.
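For illustration, below is a minimal NumPy sketch of the narrowband mixing model (3)-(6); the numbers of microphones, sources, frequency bins, and frames are arbitrary values chosen for this example, not values from the paper.

```python
import numpy as np

# Arbitrary example dimensions: M microphones, K sources,
# F frequency bins, T time frames.
M, K, F, T = 4, 2, 257, 100
rng = np.random.default_rng(0)

def crandn(*shape):
    """Complex Gaussian samples, used here as stand-ins for real STFT data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

r = crandn(K, F, M)        # steering vectors r^(k)(f), cf. (5)
s = crandn(K, F, T)        # source STFT coefficients s^(k)(f, t)
n = 0.1 * crandn(F, T, M)  # sensor noise n(f, t), cf. (6)

# Narrowband mixing model (3): y(f,t) = sum_k r^(k)(f) s^(k)(f,t) + n(f,t).
y = np.einsum('kfm,kft->ftm', r, s) + n
print(y.shape)  # (F, T, M): one M-dim observation per time-frequency point, cf. (4)
```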
IV. OVERVIEW OF OUR MICROPHONE ARRAY SYSTEM
Fig. 1 shows a diagram of our microphone array system archi-
tecture. The system inputs consist of noise-corrupted and mixed
speech signals that are captured by the microphone array. The
system comprises a beamformer, a steering vector estimator,
and a time-frequency mask estimator. These three components
combine to generate an enhanced speech signal.
This section briefly reviews beamforming and steering vector
estimation based on time-frequency masks, which is followed in
the next section by a detailed explanation of our time-frequency
masking.
Fig. 1. Schematic diagram of our microphone array system architecture.
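To make the data flow of Fig. 1 concrete, a minimal skeleton of such a pipeline is sketched below; all function names and signatures are hypothetical placeholders for the three components, not the paper's implementation.

```python
def estimate_tf_masks(y_stft):
    """Return one time-frequency mask per source (detailed in the next section)."""
    raise NotImplementedError

def estimate_steering_vector(y_stft, mask):
    """Derive a steering vector from mask-weighted statistics (Section IV-B)."""
    raise NotImplementedError

def mvdr_beamform(y_stft, r):
    """Apply an MVDR filter with steering vector r (Section IV-A)."""
    raise NotImplementedError

def enhance(y_stft):
    """y_stft: observed multichannel STFT, shape (F, T, M)."""
    enhanced = []
    for mask in estimate_tf_masks(y_stft):
        r = estimate_steering_vector(y_stft, mask)
        enhanced.append(mvdr_beamform(y_stft, r))
    return enhanced  # one enhanced STFT per target source
```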
A. Beamforming
The assumed architecture performs MVDR beamforming to enhance a speech signal in the STFT domain. The beamformer applies a linear filter $\mathbf{w}_f^{(k)}$ to the microphone signal vector to produce an enhanced $k$-th speech signal, $\hat{s}_{f,t}^{(k)}$, as
$$
\hat{s}_{f,t}^{(k)} = \mathbf{w}_f^{(k)H} \mathbf{y}_{f,t}, \tag{7}
$$
where superscript $H$ denotes conjugate transposition. By minimizing the beamformer output variance subject to $\mathbf{w}_f^{(k)H} \mathbf{r}_f^{(k)} = 1$, the filter for the $k$-th source, $\mathbf{w}_f^{(k)}$, is determined as [17]:
$$
\mathbf{w}_f^{(k)} = \frac{\mathbf{R}_f^{(y)-1}\, \mathbf{r}_f^{(k)}}{\mathbf{r}_f^{(k)H}\, \mathbf{R}_f^{(y)-1}\, \mathbf{r}_f^{(k)}}, \tag{8}
$$
where $\mathbf{R}_f^{(y)}$ denotes the covariance matrix of the observed signals, calculated as
$$
\mathbf{R}_f^{(y)} = \frac{1}{T} \sum_{t} \mathbf{y}_{f,t}\, \mathbf{y}_{f,t}^H. \tag{9}
$$
It should be noted that our framework can also be used with
other beamformers such as a multichannel Wiener filter.
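As a concrete illustration, the following is a minimal NumPy sketch of (7)-(9) for a single frequency bin. The small diagonal loading term is a common practical addition for numerical stability, not part of the formulation above.

```python
import numpy as np

def mvdr_filter(Y, r, loading=1e-6):
    """MVDR filter for one frequency bin, following (8) and (9).

    Y       : observed STFT frames at this bin, shape (T, M), complex.
    r       : steering vector of the target source, shape (M,), complex.
    loading : small diagonal term for numerical stability (a practical
              addition, not part of (8) itself).
    """
    T, M = Y.shape
    # Covariance of the observed signals, (9): R = (1/T) sum_t y y^H.
    R = np.einsum('ti,tj->ij', Y, Y.conj()) / T
    R += loading * np.eye(M)
    # MVDR solution, (8): w = R^{-1} r / (r^H R^{-1} r).
    Rinv_r = np.linalg.solve(R, r)
    return Rinv_r / (r.conj() @ Rinv_r)

# Beamforming, (7): s_hat(f,t) = w^H y(f,t), applied to all frames at once:
#     w = mvdr_filter(Y, r)
#     s_hat = Y @ w.conj()  # shape (T,)
```

The constraint $\mathbf{w}_f^{(k)H} \mathbf{r}_f^{(k)} = 1$ then holds by construction, so the target is passed with unit gain while the output variance is minimized.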
B. Steering Vector Estimation
The key to successful beamforming lies in accurate estimation of the steering vector. Conventional beamformers often obtain the steering vector from DOA estimates combined with a plane-wave propagation assumption, which holds only in an ideal anechoic space. Relying on DOA estimates can also degrade the noise reduction performance, because their estimation accuracy deteriorates when SNRs are low.
Our approach does not use such error-prone prior knowledge to obtain an accurate estimate of the steering vector. The basic idea is to estimate the steering vector directly using the covariance