Single-Channel Speech Separation Based on Deep Clustering with Local
Optimization
Taotao Fu^{1,2}, Ge Yu^1, Lili Guo^1, Yan Wang^1, Ji Liang^1
1 Key Laboratory of Space Utilization, Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
Beijing, China
e-mail: futaotao2@foxmail.com
Abstract—Single-channel separation of multi-speaker mixed speech poses many challenges, such as simultaneously modeling the temporal continuity of the speech signal and improving per-frame separation performance. In this paper, a separation method is proposed based on Deep Clustering with local optimization by an improved Non-Negative Matrix Factorization (NMF) combined with Factorial Conditional Random Fields (FCRF). First, separated voices are obtained from a Deep Clustering model that is trained with a Bi-directional Long Short-Term Memory (BLSTM) network and clusters similar features. Then, the separated voices are iteratively optimized locally by the improved NMF with K-means++ and FCRF. The results show that the algorithm improves separation performance, satisfying both the local optimum of the speech signal on each frame and the continuity of the whole speech signal.
Keywords-single-channel speech separation; K-means++;
deep clustering; NMF; FCRF
I. INTRODUCTION
Blind Speech Separation (BSS) is a challenging research subject that has become a popular area of signal processing in recent years. It derives from the "cocktail-party" problem [1]: effectively identifying a particular speaker's voice in a noisy environment. Many algorithms addressing instantaneous BSS problems have been proposed, but current research results show that the problem is far from well resolved, especially single-channel speech separation, in which two or more individual speech signals must be separated from a single-channel signal. Because the problem is underdetermined, single-channel speech separation is much more difficult than separation with multiple input signals. With wide applications in automatic speech recognition, music transcription, etc., single-channel speech separation has become a new research hotspot in speech signal processing.
A variety of methods have been applied to single-channel speech separation, including non-negative matrix factorization (NMF) [2], computational auditory scene analysis (CASA) [3], the factorial hidden Markov model (FHMM) [4], and Deep Clustering (DC) [5].
In NMF, the mixed speech is separated by exploiting the non-negativity of the speech spectrum, but the temporal continuity of the speech signal cannot be well modeled because adjacent frames of the voice signal are assumed to be independent [2]. FHMM is often used to continuously model the speech mixture process [4]. Mysore [6] proposed a non-negative hidden Markov method that models the temporal continuity of speech by combining NMF and HMM, and also proposed a non-negative factorial hidden Markov model to separate the mixed speech signals of two speakers. Li [7] presented an algorithm based on NMF and FCRF that describes both the spectral structure and the temporal continuity of the speech signal. The "cocktail-party" source separation problem has also been addressed in a deep learning framework called deep clustering: Hershey [5] proposed a deep network-based analogue of spectral clustering that achieves a better separation result, in terms of Signal-to-Distortion Ratio (SDR), than NMF combined with FCRF or HMM. Compared with NMF combined with FCRF, however, Deep Clustering may introduce discontinuities because it lacks temporal continuity modeling of the speech signals.
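As a minimal, hypothetical sketch of the NMF approach discussed above (not the exact formulation of [2] or [7]), a non-negative magnitude spectrogram V is factored into spectral bases W and activations H; each frame (each column of H) is estimated with no coupling to its neighbors, which is precisely the frame-independence assumption that limits temporal continuity modeling:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "magnitude spectrogram": 129 frequency bins x 100 frames.
rng = np.random.default_rng(0)
V = rng.random((129, 100))

# Factor V ~= W @ H with a small number of basis spectra.
# The activations in H are estimated per frame, with no coupling between
# adjacent frames -- the independence assumption noted in the text.
model = NMF(n_components=8, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # (129, 8) spectral bases
H = model.components_        # (8, 100) per-frame activations

print(W.shape, H.shape)
print(bool(np.all(W >= 0) and np.all(H >= 0)))  # factors are non-negative
```

The frame count (100), bin count (129), and rank (8) are illustrative choices, not values from the paper.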
In this paper, a new single-channel speech separation method is proposed, based on Deep Clustering with local optimization, to achieve better local separation and reduce speech distortion. The paper is organized as follows. First, the Deep Clustering model for single-channel speech separation is built using BLSTM, and the separated voice signals are obtained by clustering the BLSTM output. Second, the separated voice signals are iteratively optimized locally by the improved NMF with K-means++ clustering and FCRF. Finally, several experiments are performed to validate the proposed method.
II. SPEECH SEPARATION BASED ON DEEP CLUSTERING
A. Deep Clustering
We define the mixed source signal as X_i. Then the Short-Time Fourier Transform (STFT) is applied to obtain the signal spectrum X_{t,f}, where t indexes the
2017 3rd International Conference on Frontiers of Signal Processing
978-1-5386-1038-1/17/$31.00 ©2017 IEEE
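The STFT step that opens Section II.A can be sketched as follows. This is an illustrative example with assumed parameters (an 8 kHz sampling rate, a 256-sample window, and two synthetic tones standing in for the speakers), none of which are specified in the paper:

```python
import numpy as np
from scipy.signal import stft

fs = 8000  # assumed sampling rate (Hz)
t = np.arange(fs) / fs

# Two toy "speakers" and their single-channel mixture.
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 660 * t)
x = s1 + s2

# STFT: X[f, t] is the complex spectrum at frequency bin f, time frame t.
freqs, frames, X = stft(x, fs=fs, nperseg=256)

# Deep clustering operates on time-frequency bins of the (log-)magnitude
# spectrogram, grouping bins that belong to the same speaker.
log_mag = np.log1p(np.abs(X))
print(X.shape)  # (nperseg // 2 + 1 frequency bins, number of frames)
```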