A PAIRWISE ALGORITHM FOR PITCH ESTIMATION AND SPEECH SEPARATION USING
DEEP STACKING NETWORK
Hui Zhang¹, Xueliang Zhang¹, Shuai Nie², Guanglai Gao¹, Wenju Liu²
¹ Computer Science Department, Inner Mongolia University, Hohhot, China, 010021
² National Laboratory of Pattern Recognition (NLPR), Institute of Automation, University of Chinese Academy of Sciences, Beijing, China, 100190
alzhu.san@163.com, cszxl@imu.edu.cn, nss90221@gmail.com, csggl@imu.edu.cn, lwj@nlpr.ia.ac.cn
ABSTRACT
Pitch information is an important cue for speech separation.
However, pitch estimation in noisy conditions is itself as
challenging as speech separation. In this paper, we propose a
supervised learning architecture that combines these two
problems concisely. The proposed algorithm is based on the
deep stacking network (DSN), which builds a deep architecture
by stacking simple processing modules. In the training stage,
an ideal binary mask is used as the target. The input vector
includes the outputs of the lower module and frame-level
features consisting of spectral and pitch-based features. In
the testing stage, each module provides an estimated binary
mask, which is employed to re-estimate pitch. The pitch-based
features are then updated and passed to the next module. This
procedure is embedded iteratively in the DSN, and the final
separation results are obtained from its last module.
Systematic evaluations show that the proposed approach
produces high-quality estimated binary masks and outperforms
recent systems in generalization.
Index Terms— Speech separation, Pitch estimation,
Computational auditory scene analysis, Supervised learning
1. INTRODUCTION
In realistic environments, noise usually degrades speech
intelligibility for hearing-impaired listeners and the
performance of automatic speech recognition (ASR) systems.
Speech separation aims to remove noise by separating target
speech from background interference, and is helpful for both
hearing aid wearers and ASR systems [1, 2]. Computational auditory
scene analysis (CASA) is a promising method to solve the
speech separation problem [3].
CASA defines the goal of speech separation as computing
an ideal binary mask (IBM) [4], which is useful for
improving speech intelligibility [5] and the performance
of speech/speaker recognition [6, 7]. The IBM is a
time-frequency (T-F) mask, which can be computed from premixed
target and interference. Specifically, in a T-F unit, if the
signal-to-noise ratio (SNR) is greater than a local SNR
criterion (LC), the corresponding mask element in the IBM is
set to 1 (target-dominant); otherwise, it is set to 0
(interference-dominant).

This research was supported in part by the China National
Natural Science Foundation (No. 61365006, No. 61263037,
No. 61305027, No. 91120303, No. 61273267, No. 61403370, and
No. 90820011).
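The IBM rule above can be illustrated with a minimal NumPy sketch. The function name, array layout, and default LC value here are illustrative assumptions, not part of the paper's implementation:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Compute the IBM from premixed target and interference energies.

    target_energy, noise_energy: T-F energy matrices (frequency x time).
    lc_db: local SNR criterion (LC) in dB; 0 dB is a common choice.
    """
    eps = 1e-12  # guard against division by zero in silent units
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    # 1 where the unit is target-dominant, 0 where interference-dominant
    return (local_snr_db > lc_db).astype(np.int8)
```

In practice the two energy matrices would come from passing the premixed target and interference through the same T-F analysis (e.g. a gammatone filterbank) used for the mixture.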
When adopting the IBM as the computational goal of CASA, we
can naturally formulate speech separation as a binary
classification problem [5]. From the viewpoint of
classification, feature selection is important. Many features
have been investigated, including pitch-based features [8],
the amplitude modulation spectrum (AMS) [9], relative spectral
transform and perceptual linear prediction (RASTA-PLP),
Mel-frequency cepstral coefficients (MFCC), and Gammatone
frequency cepstral coefficients (GFCC) [10]. Wang et al. [10]
suggest that pitch-based features generalize well in speech
separation.
Pitch-based features are derived from pitch, but extracting
pitch from noisy speech is itself difficult, especially in
low-SNR conditions. On one hand, if the target voice is
separated from the background, the pitch can be obtained
easily; on the other hand, speech separation improves when
pitch estimation is accurate. Since these two tasks can benefit
from each other, speech separation and pitch extraction in
noisy conditions are considered a “chicken-and-egg” problem.
In this paper, we propose a supervised learning system that
deals with this “chicken-and-egg” problem concisely. The main
contributions are as follows:
• Pitch extraction and speech separation are boosted
alternately. (Section 2.1)
• Frame-level features are adopted, consisting of spectral
features and pitch-based features. (Section 2.2)
• A deep stacking network (DSN) is used to implement our
idea of working on the two problems (pitch extraction
and speech separation) alternately. (Section 2.3)
• Systematic evaluations show that the proposed approach
produces high-quality estimated binary masks and
outperforms recent systems in unmatched noisy
conditions. (Section 3)
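The alternation described above can be sketched as a skeleton loop. This is a hypothetical illustration of the control flow only: `modules`, `reestimate_pitch`, and `pitch_features` stand in for the paper's DSN modules, pitch re-estimator, and feature extractor, none of which are specified here:

```python
def separate_with_dsn(spectral_feats, modules, reestimate_pitch,
                      pitch_features, init_pitch_feats):
    """Alternate mask estimation and pitch re-estimation across stacked modules.

    modules: list of callables; each maps (spectral feats, pitch feats,
             previous mask) -> estimated binary mask, mimicking one DSN module.
    """
    pitch_feats = init_pitch_feats
    mask = None
    for module in modules:
        # Each module estimates a binary mask from the current features
        # and the lower module's output.
        mask = module(spectral_feats, pitch_feats, mask)
        # The estimated mask is used to re-estimate pitch, which updates
        # the pitch-based features fed to the next module.
        pitch = reestimate_pitch(mask)
        pitch_feats = pitch_features(pitch)
    return mask  # separation result from the last module
```

The design mirrors the paper's description: each module refines the mask, and the refined mask in turn refines the pitch-based features for the module above it.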