DUAL-PATH RNN: EFFICIENT LONG SEQUENCE MODELING FOR
TIME-DOMAIN SINGLE-CHANNEL SPEECH SEPARATION
Yi Luo†∗, Zhuo Chen‡, Takuya Yoshioka‡
†Department of Electrical Engineering, Columbia University, NY, USA
‡Microsoft, One Microsoft Way, Redmond, WA, USA
∗Work done during an internship at Microsoft Research.
ABSTRACT
Recent studies in deep learning-based speech separation have proven
the superiority of time-domain approaches to conventional time-
frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we
propose dual-path recurrent neural network (DPRNN), a simple yet
effective method for organizing RNN layers in a deep structure to
model extremely long sequences. DPRNN splits the long sequential
input into smaller chunks and applies intra- and inter-chunk oper-
ations iteratively, where the input length can be made proportional
to the square root of the original sequence length in each operation.
Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
Index Terms— Speech separation, deep learning, time domain,
recurrent neural networks
1. INTRODUCTION
Recent progress in deep learning-based speech separation has ignited
the interest of the research community in time-domain approaches
[1–6]. Compared with standard time-frequency domain methods,
time-domain methods are designed to jointly model the magnitude
and phase information and allow direct optimization with respect to
both time- and frequency-domain differentiable criteria [7–9].
Current time-domain separation systems can be mainly cate-
gorized into adaptive front-end and direct regression approaches.
The adaptive front-end approaches aim at replacing the short-time
Fourier transform (STFT) with a differentiable transform to build
a front-end that can be learned jointly with the separation network.
Separation is applied to the front-end output in the same way that conventional time-frequency domain methods apply separation to spectrogram inputs [3–5]. Being independent of the traditional time-frequency analysis paradigm, these systems allow a much more flexible choice of the window size and the number of basis functions for the front-end. On the other hand, the direct regression approaches learn a regression function from an input mixture
to the underlying clean signals without an explicit front-end, typi-
cally by using some form of one-dimensional convolutional neural
networks (1-D CNNs) [2, 7, 10].
A commonality between the two categories is that they both rely
on effective modeling of extremely long input sequences. The direct regression methods perform separation at the waveform sample level, where the number of samples can easily reach tens of thousands or more. The performance of the adaptive front-end methods also depends on the selection of the window size, where a smaller window improves the separation performance at the cost of a significantly longer front-end representation [4, 11]. This poses an additional challenge, as conventional sequence modeling networks, including RNNs and 1-D CNNs, have difficulty learning such long-term temporal dependencies [12]. Moreover, unlike RNNs, which have dynamic receptive fields, 1-D CNNs with fixed receptive fields that are smaller than the sequence length are not able to fully utilize the sequence-level dependency [13].
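To make this limitation concrete, the following Python sketch computes the receptive field of a stack of dilated 1-D convolutions; the configuration values are illustrative assumptions modeled on typical TCN-based separators, not taken from any specific system discussed here:

def tcn_receptive_field(kernel=3, layers_per_repeat=8, repeats=3):
    # Receptive field of stacked dilated 1-D convolutions where the
    # dilation doubles at every layer within a repeat:
    # RF = 1 + sum over all layers of (kernel - 1) * dilation.
    dilations = [2 ** i for i in range(layers_per_repeat)] * repeats
    return 1 + sum((kernel - 1) * d for d in dilations)

print(tcn_receptive_field())  # 1531 time steps for this hypothetical stack

Even this deep hypothetical stack covers only about 1.5k time steps, while a few seconds of waveform-level input can span tens of thousands, so part of the sequence-level dependency is necessarily out of reach.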
In this paper, we propose a simple network architecture, which we refer to as dual-path RNN (DPRNN), that organizes any kind of RNN layer into a deep structure for modeling long sequential inputs. The intuition is to split the input sequence into shorter chunks and interleave two RNNs, an intra-chunk RNN and an inter-chunk RNN, for local and global modeling, respectively. In a DPRNN block, the intra-chunk RNN first processes the local chunks independently, and then the inter-chunk RNN aggregates the information from all the chunks to perform utterance-level processing. For a sequential input of length L, DPRNN with chunk size K and chunk hop size P contains S chunks, where K and S correspond to the input lengths for the intra- and inter-chunk RNNs, respectively. When K ≈ S, the two RNNs have a sublinear input length of O(√L) as opposed to the original input length of O(L), which greatly decreases the optimization difficulty that arises when L is extremely large.
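As a concrete illustration of the chunking step, here is a minimal PyTorch sketch; the helper name segment, the 50% overlap (P = K/2), and the zero-padding scheme are our assumptions for a working example:

import math
import torch
import torch.nn.functional as F

def segment(x, K):
    """Split a (batch, features, length) tensor into overlapping chunks.

    Assumes hop size P = K // 2 (50% overlap). Returns a tensor of shape
    (batch, features, S, K), where S is the number of chunks.
    """
    B, N, L = x.shape
    P = K // 2
    # Zero-pad the end so that (L - K) becomes a multiple of the hop size.
    pad = K - L if L < K else (P - (L - K) % P) % P
    x = F.pad(x, (0, pad))
    return x.unfold(dimension=2, size=K, step=P)  # (B, N, S, K)

# Choosing K ≈ sqrt(2L) makes the number of chunks S roughly equal to K,
# so both RNNs receive inputs of length O(sqrt(L)):
L = 32000                      # e.g., 4 seconds of 8 kHz audio at sample level
K = int(math.sqrt(2 * L))      # K = 252
x = torch.randn(1, 64, L)      # 64 is a hypothetical feature dimension
print(segment(x, K).shape)     # torch.Size([1, 64, 253, 252]), i.e., S ≈ K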
Compared with other approaches for arranging local and global RNN layers, or more generally hierarchical RNNs that perform sequence modeling at multiple time scales [14–19], the stacked DPRNN blocks iteratively and alternately perform the intra- and inter-chunk operations, which can be viewed as interleaved processing of local and global information. Moreover, the first RNN layer in most hierarchical RNNs still receives the entire input sequence, while in stacked DPRNN each intra- or inter-chunk RNN receives the same sublinear input size across all blocks. Compared with CNN-based architectures such as temporal convolutional networks (TCNs), which only perform local modeling due to their fixed receptive fields [4, 5, 20], DPRNN is able to fully utilize global information via the inter-chunk RNNs and achieve superior performance with an even smaller model size. In Section 4 we will show that by simply replacing the TCN with DPRNN in a previously proposed time-domain separation system [4], the model is able to achieve a new state-of-the-art performance on WSJ0-2mix with a 20 times smaller model.
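To illustrate the alternating structure, here is a minimal PyTorch sketch of one DPRNN block; only the intra-/inter-chunk alternation is specified above, so the bidirectional LSTMs, linear projections, normalization, and residual connections are our assumptions for a self-contained example:

import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """One dual-path block: an intra-chunk RNN over each chunk of length K,
    followed by an inter-chunk RNN across the S chunks."""

    def __init__(self, feat, hidden):
        super().__init__()
        # Intra-chunk path (local modeling).
        self.intra_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat)
        self.intra_norm = nn.GroupNorm(1, feat)
        # Inter-chunk path (global modeling).
        self.inter_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat)
        self.inter_norm = nn.GroupNorm(1, feat)

    def forward(self, x):
        # x: (B, N, S, K), as produced by the segmentation step.
        B, N, S, K = x.shape
        # Intra-chunk RNN: each of the S chunks is a sequence of length K.
        h = x.permute(0, 2, 3, 1).reshape(B * S, K, N)
        h, _ = self.intra_rnn(h)
        h = self.intra_proj(h).reshape(B, S, K, N).permute(0, 3, 1, 2)
        x = x + self.intra_norm(h)  # residual connection (an assumption)
        # Inter-chunk RNN: K sequences of length S across aligned positions.
        h = x.permute(0, 3, 2, 1).reshape(B * K, S, N)
        h, _ = self.inter_rnn(h)
        h = self.inter_proj(h).reshape(B, K, S, N).permute(0, 3, 2, 1)
        return x + self.inter_norm(h)

# Stacking blocks alternates local and global processing; shapes match the
# segmentation example above.
blocks = nn.Sequential(*[DPRNNBlock(feat=64, hidden=128) for _ in range(6)])
out = blocks(torch.randn(1, 64, 253, 252))

After the last block, the chunked representation would be transformed back to a full-length sequence, e.g. by overlap-adding the chunks.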