DUAL-PATH RNN: EFFICIENT LONG SEQUENCE MODELING FOR
TIME-DOMAIN SINGLE-CHANNEL SPEECH SEPARATION
Yi Luo†∗, Zhuo Chen‡, Takuya Yoshioka‡
†Department of Electrical Engineering, Columbia University, NY, USA
‡Microsoft, One Microsoft Way, Redmond, WA, USA
∗Work done during an internship at Microsoft Research.
ABSTRACT
Recent studies in deep learning-based speech separation have proven
the superiority of time-domain approaches to conventional time-
frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we
propose dual-path recurrent neural network (DPRNN), a simple yet
effective method for organizing RNN layers in a deep structure to
model extremely long sequences. DPRNN splits the long sequential
input into smaller chunks and applies intra- and inter-chunk oper-
ations iteratively, where the input length can be made proportional
to the square root of the original sequence length in each operation.
Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
Index Terms— Speech separation, deep learning, time domain,
recurrent neural networks
1. INTRODUCTION
Recent progress in deep learning-based speech separation has ignited
the interest of the research community in time-domain approaches
[1–6]. Compared with standard time-frequency domain methods,
time-domain methods are designed to jointly model the magnitude
and phase information and allow direct optimization with respect to
both time- and frequency-domain differentiable criteria [7–9].
Current time-domain separation systems can be mainly cate-
gorized into adaptive front-end and direct regression approaches.
The adaptive front-end approaches aim at replacing the short-time
Fourier transform (STFT) with a differentiable transform to build
a front-end that can be learned jointly with the separation network.
Separation is applied to the front-end output in the same way that conventional time-frequency domain methods apply separation to spectrogram inputs [3–5]. Being independent of the traditional time-frequency analysis paradigm, these systems allow a much more flexible choice of the window size and the number of basis functions for the front-end. On the other hand, the direct regression approaches learn a regression function from an input mixture
to the underlying clean signals without an explicit front-end, typi-
cally by using some form of one-dimensional convolutional neural
networks (1-D CNNs) [2, 7, 10].
A commonality between the two categories is that they both rely
on effective modeling of extremely long input sequences. The direct regression methods perform separation at the waveform sample level, where the number of samples can easily reach tens of thousands or more. The performance of the adaptive front-end methods also depends on the selection of the window size, where a smaller window improves the separation performance at the cost of a significantly longer front-end representation [4, 11]. This poses an additional challenge, as conventional sequence modeling networks, including RNNs and 1-D CNNs, have difficulty learning such long-term temporal dependencies [12]. Moreover, unlike RNNs, which have dynamic receptive fields, 1-D CNNs with fixed receptive fields that are smaller than the sequence length are not able to fully utilize the sequence-level dependency [13].
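To make this limitation concrete, the following Python sketch computes the receptive field of a stack of dilated 1-D convolutions; the configuration values are illustrative assumptions modeled on typical TCN-based separators, not taken from any specific system discussed here:

def tcn_receptive_field(kernel=3, layers_per_repeat=8, repeats=3):
    # Receptive field of stacked dilated 1-D convolutions where the
    # dilation doubles at every layer within a repeat:
    # RF = 1 + sum over all layers of (kernel - 1) * dilation.
    dilations = [2 ** i for i in range(layers_per_repeat)] * repeats
    return 1 + sum((kernel - 1) * d for d in dilations)

print(tcn_receptive_field())  # 1531 time steps for this hypothetical stack

Even this deep hypothetical stack covers only about 1.5k time steps, while a few seconds of waveform-level input can span tens of thousands, so part of the sequence-level dependency is necessarily out of reach.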
In this paper, we propose a simple network architecture, which we refer to as dual-path RNN (DPRNN), that organizes any kind of RNN layer into a deep structure for modeling long sequential inputs. The intuition is to split the input sequence into shorter chunks and interleave two RNNs, an intra-chunk RNN and an inter-chunk RNN, for local and global modeling, respectively. In a DPRNN block, the intra-chunk RNN first processes the local chunks independently, and then the inter-chunk RNN aggregates the information from all the chunks to perform utterance-level processing. For a sequential input of length L, DPRNN with chunk size K and chunk hop size P contains S chunks, where K and S correspond to the input lengths for the intra- and inter-chunk RNNs, respectively. When K ≈ S, the two RNNs have a sublinear input length of O(√L) as opposed to the original input length of O(L), which greatly decreases the optimization difficulty that arises when L is extremely large.
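As a concrete illustration of the chunking step, here is a minimal PyTorch sketch; the helper name segment, the 50% overlap (P = K/2), and the zero-padding scheme are our assumptions for a working example:

import math
import torch
import torch.nn.functional as F

def segment(x, K):
    """Split a (batch, features, length) tensor into overlapping chunks.

    Assumes hop size P = K // 2 (50% overlap). Returns a tensor of shape
    (batch, features, S, K), where S is the number of chunks.
    """
    B, N, L = x.shape
    P = K // 2
    # Zero-pad the end so that (L - K) becomes a multiple of the hop size.
    pad = K - L if L < K else (P - (L - K) % P) % P
    x = F.pad(x, (0, pad))
    return x.unfold(dimension=2, size=K, step=P)  # (B, N, S, K)

# Choosing K ≈ sqrt(2L) makes the number of chunks S roughly equal to K,
# so both RNNs receive inputs of length O(sqrt(L)):
L = 32000                      # e.g., 4 seconds of 8 kHz audio at sample level
K = int(math.sqrt(2 * L))      # K = 252
x = torch.randn(1, 64, L)      # 64 is a hypothetical feature dimension
print(segment(x, K).shape)     # torch.Size([1, 64, 253, 252]), i.e., S ≈ K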
Compared with other approaches for arranging local and global RNN layers, or more generally hierarchical RNNs that perform sequence modeling at multiple time scales [14–19], the stacked DPRNN blocks iteratively and alternately perform the intra- and inter-chunk operations, which can be viewed as interleaved processing of local and global information. Moreover, the first RNN layer in most hierarchical RNNs still receives the entire input sequence, while in stacked DPRNN each intra- or inter-chunk RNN receives the same sublinear input size across all blocks. Compared with CNN-based architectures such as temporal convolutional networks (TCNs), which only perform local modeling due to their fixed receptive fields [4, 5, 20], DPRNN is able to fully utilize global information via the inter-chunk RNNs and achieve superior performance with an even smaller model size. In Section 4 we will show that by simply replacing the TCN with DPRNN in a previously proposed time-domain separation system [4], the model is able to achieve a new state-of-the-art performance on WSJ0-2mix with a 20 times smaller model.
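To illustrate the alternating structure, here is a minimal PyTorch sketch of one DPRNN block; only the intra-/inter-chunk alternation is specified above, so the bidirectional LSTMs, linear projections, normalization, and residual connections are our assumptions for a self-contained example:

import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """One dual-path block: an intra-chunk RNN over each chunk of length K,
    followed by an inter-chunk RNN across the S chunks."""

    def __init__(self, feat, hidden):
        super().__init__()
        # Intra-chunk path (local modeling).
        self.intra_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat)
        self.intra_norm = nn.GroupNorm(1, feat)
        # Inter-chunk path (global modeling).
        self.inter_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat)
        self.inter_norm = nn.GroupNorm(1, feat)

    def forward(self, x):
        # x: (B, N, S, K), as produced by the segmentation step.
        B, N, S, K = x.shape
        # Intra-chunk RNN: each of the S chunks is a sequence of length K.
        h = x.permute(0, 2, 3, 1).reshape(B * S, K, N)
        h, _ = self.intra_rnn(h)
        h = self.intra_proj(h).reshape(B, S, K, N).permute(0, 3, 1, 2)
        x = x + self.intra_norm(h)  # residual connection (an assumption)
        # Inter-chunk RNN: K sequences of length S across aligned positions.
        h = x.permute(0, 3, 2, 1).reshape(B * K, S, N)
        h, _ = self.inter_rnn(h)
        h = self.inter_proj(h).reshape(B, K, S, N).permute(0, 3, 2, 1)
        return x + self.inter_norm(h)

# Stacking blocks alternates local and global processing; shapes match the
# segmentation example above.
blocks = nn.Sequential(*[DPRNNBlock(feat=64, hidden=128) for _ in range(6)])
out = blocks(torch.randn(1, 64, 253, 252))

After the last block, the chunked representation would be transformed back to a full-length sequence, e.g. by overlap-adding the chunks.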