
Applied Sciences
Review
A Review of Time-Scale Modification of Music Signals †
Jonathan Driedger *,‡ and Meinard Müller *,‡
International Audio Laboratories Erlangen, 91058 Erlangen, Germany
* Correspondence: jonathan.driedger@audiolabs-erlangen.de (J.D.); meinard.mueller@audiolabs-erlangen.de (M.M.); Tel.: +49-913-185-20519 (J.D.); +49-913-185-20504 (M.M.); Fax: +49-913-185-20524 (J.D. & M.M.)
† This paper is an extended version of our paper published in the Proceedings of the International Conference on Digital Audio Effects (DAFx), Erlangen, Germany, 1–5 September 2014.
‡ These authors contributed equally to this work.
Academic Editor: Vesa Valimaki
Received: 22 December 2015; Accepted: 25 January 2016; Published: 18 February 2016
Abstract: Time-scale modification (TSM) is the task of speeding up or slowing down an audio
signal’s playback speed without changing its pitch. In digital music production, TSM has become
an indispensable tool, which is nowadays integrated in a wide range of music production software.
Music signals are diverse—they comprise harmonic, percussive, and transient components, among
others. Because of this wide range of acoustic and musical characteristics, there is no single TSM
method that can cope with all kinds of audio signals equally well. Our main objective is to foster a
better understanding of the capabilities and limitations of TSM procedures. To this end, we review
fundamental TSM methods, discuss typical challenges, and indicate potential solutions that combine
different strategies. In particular, we discuss a fusion approach that involves recent techniques for
harmonic-percussive separation along with time-domain and frequency-domain TSM procedures.
Keywords: digital signal processing; overlap-add; WSOLA; phase vocoder; harmonic-percussive
separation; transient preservation; pitch-shifting; music synchronization
1. Introduction
Time-scale modification (TSM) procedures are digital signal processing methods for stretching or
compressing the duration of a given audio signal. Ideally, the time-scale modified signal should sound
as if the original signal’s content was performed at a different tempo while preserving properties like
pitch and timbre. TSM procedures are applied in a wide range of scenarios. For example, they simplify
the process of creating music remixes. Music producers or DJs apply TSM to adjust the durations of
music recordings, enabling synchronous playback [1,2]. Nowadays TSM is built into music production
software as well as hardware devices. A second application scenario is adjusting an audio stream’s
duration to that of a given video clip. For example, when generating a slow motion video, it is often
desirable to also slow down the tempo of the associated audio stream. Here, TSM can be used to
synchronize the audio material with the video’s visual content [3].
A main challenge for TSM procedures is that music signals are complex sound mixtures, consisting
of a wide range of different sounds. As an example, imagine a music recording consisting of a violin
playing together with castanets. When modifying this music signal with a TSM procedure, both the
harmonic sound of the violin as well as the percussive sound of the castanets should be preserved
in the output signal. To keep the violin’s sound intact, it is essential to maintain its pitch as well as
its timbre. On the other hand, the clicking sound of the castanets does not have a pitch—it is much
more important to maintain the crisp sound of the single clicks, as well as their exact relative time
Appl. Sci. 2016, 6, 57; doi:10.3390/app6020057 www.mdpi.com/journal/applsci

positions, in order to preserve the original rhythm. Retaining these contrasting characteristics usually
requires conceptually different TSM approaches. For example, classical TSM procedures based on
waveform similarity overlap-add (WSOLA) [4] or on the phase vocoder (PV-TSM) [5–7] are capable
of preserving the perceptual quality of harmonic signals to a high degree, but introduce noticeable
artifacts when modifying percussive signals. However, it is possible to substantially reduce artifacts by
combining different TSM approaches. For example, in [8], a given audio signal is first separated into
a harmonic and a percussive component. Afterwards, each component is processed with a different
TSM procedure that preserves its respective characteristics. The final output signal is then obtained by
superimposing the two intermediate output signals.
Our goals in this article are two-fold. First, we aim to foster an understanding of fundamental
challenges and algorithmic approaches in the field of TSM by reviewing well-known TSM methods
and discussing their respective advantages and drawbacks in detail. Second, having identified the
core issues of these classical procedures, we show—through an example—how to improve on them
by combining different algorithmic ideas. We begin the article by introducing a fundamental TSM
strategy as used in many TSM procedures (Section 2) and discussing a simple TSM approach based
on overlap-add (Section 3). Afterwards, we review two conceptually different TSM methods: the
time-domain WSOLA (Section 4) as well as the frequency-domain PV-TSM (Section 5). We then review
the state-of-the-art TSM procedure from [8] that improves on the quality of both WSOLA as well as
PV-TSM by incorporating harmonic-percussive separation (Section 6). Finally, we point out different
application scenarios for TSM (such as music synchronization and pitch-shifting), as well as various
freely available TSM implementations (Section 7).
2. Fundamentals of Time-Scale Modification (TSM)
As mentioned above, a key requirement for time-scale modification procedures is that they change
the time-scale of a given audio signal without altering its pitch content. To achieve this goal, many
TSM procedures follow a common fundamental strategy which is sketched in Figure 1. The core idea
is to decompose the input signal into short frames. Having a fixed length, usually in the range of
50 to 100 milliseconds of audio material, each frame captures the local pitch content of the signal. The
frames are then relocated on the time axis to achieve the actual time-scale modification, while, at the
same time, preserving the signal’s pitch.
Figure 1. Generalized processing pipeline of time-scale modification (TSM) procedures. Signal decomposition splits the original signal into analysis frames (spaced by the analysis hopsize); frame relocation and adaption yields the synthesis frames (spaced by the synthesis hopsize); signal reconstruction produces the time-scale modified signal.
More precisely, this process can be described as follows. The input of a TSM procedure is a discrete-time audio signal $x : \mathbb{Z} \to \mathbb{R}$, equidistantly sampled at a sampling rate of $F_s$. Note that although audio signals typically have a finite length of $L \in \mathbb{N}$ samples $x(r)$ for $r \in [0 : L-1] := \{0, 1, \ldots, L-1\}$, for the sake of simplicity, we model them to have an infinite support by defining $x(r) = 0$ for $r \in \mathbb{Z} \setminus [0 : L-1]$. The first step of the TSM procedure is to split $x$ into short analysis frames $x_m$, $m \in \mathbb{Z}$, each of them having a length of $N$ samples (in the literature, the analysis frames are sometimes also referred to as grains, see [9]). The analysis frames are spaced by an analysis hopsize $H_a$:
$$x_m(r) = \begin{cases} x(r + mH_a), & \text{if } r \in [-N/2 : N/2 - 1], \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
In a second step, these frames are relocated on the time axis with regard to a specified synthesis hopsize $H_s$. This relocation accounts for the actual modification of the input signal’s time-scale by a stretching factor $\alpha = H_s/H_a$. Since it is often desirable to have a specific overlap of the relocated frames, the synthesis hopsize $H_s$ is often fixed (common choices are $H_s = N/2$ or $H_s = N/4$) while the analysis hopsize is given by $H_a = H_s/\alpha$. However, simply superimposing the overlapping relocated frames would lead to undesired artifacts such as phase discontinuities at the frame boundaries and amplitude fluctuations. Therefore, prior to signal reconstruction, the analysis frames are suitably adapted to form synthesis frames $y_m$. In the final step, the synthesis frames are superimposed in order to reconstruct the actual time-scale modified output signal $y : \mathbb{Z} \to \mathbb{R}$ of the TSM procedure:
$$y(r) = \sum_{m \in \mathbb{Z}} y_m(r - mH_s). \qquad (2)$$
Although this fundamental strategy seems straightforward at first glance, there are many pitfalls and design choices that may strongly influence the perceptual quality of the time-scale modified output signal. The most obvious question is how to adapt the analysis frames $x_m$ in order to form the synthesis frames $y_m$. There are many ways to approach this task, leading to conceptually different TSM procedures. In the following, we discuss several strategies.
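The decomposition step of Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `analysis_frames`, the even frame length, the rounding of $H_a$ to an integer, and the zero-padding at the signal borders (which mirrors the convention $x(r) = 0$ outside $[0 : L-1]$) are all assumptions made for the example.

```python
import numpy as np

def analysis_frames(x, N, Hs, alpha):
    """Split x into analysis frames x_m of (even) length N, cf. Equation (1).

    Hypothetical helper: returns the frames as rows plus the integer
    analysis hopsize Ha that was actually used.
    """
    Ha = int(round(Hs / alpha))  # H_a = H_s / alpha, rounded to whole samples
    # Frame m covers x(m*Ha - N/2) ... x(m*Ha + N/2 - 1); pad with zeros so
    # that frames reaching beyond the signal borders are still defined.
    padded = np.concatenate([np.zeros(N // 2), x, np.zeros(N)])
    num_frames = len(x) // Ha + 1
    frames = np.stack([padded[m * Ha : m * Ha + N] for m in range(num_frames)])
    return frames, Ha

# Usage: a short ramp, frame length 4, synthesis hopsize 2, no stretching.
x = np.arange(1.0, 11.0)
frames, Ha = analysis_frames(x, N=4, Hs=2, alpha=1.0)
print(Ha, frames.shape)
print(frames[:2])
```

Because $H_a$ must be an integer number of samples, the stretching factor that is effectively realized is $H_s/H_a$ after rounding, which can differ slightly from the requested $\alpha$.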
3. TSM Based on Overlap-Add (OLA)
3.1. The Procedure
In the general scheme described in the previous section, a straightforward approach would be to simply define the synthesis frames $y_m$ to be equal to the unmodified analysis frames $x_m$. This strategy, however, immediately leads to two problems which are visualized in Figure 2. First, when reconstructing the output signal by using Equation (2), the resulting waveform typically shows discontinuities (perceivable as clicking sounds) at the unmodified frames’ boundaries. Second, the synthesis hopsize $H_s$ is usually chosen such that the synthesis frames are overlapping. When superimposing the unmodified frames, each of them having the same amplitude as the input signal, this typically leads to an undesired increase of the output signal’s amplitude.
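The amplitude increase can be checked on a toy case. The sketch below superimposes unmodified frames of a constant signal $x(r) = 1$ according to Equation (2); the values $N = 8$ and $H_s = N/2$ are arbitrary choices for illustration.

```python
import numpy as np

N, Hs = 8, 4                 # frame length and synthesis hopsize (assumed values)
num_frames = 6
frame = np.ones(N)           # every unmodified frame of the constant signal x(r) = 1

y = np.zeros((num_frames - 1) * Hs + N)
for m in range(num_frames):
    # Equation (2) with y_m = x_m: frames are simply added at multiples of Hs.
    y[m * Hs : m * Hs + N] += frame

print(y)
```

Wherever two frames overlap, the output is twice as large as the input, while the first and last $H_s$ samples are covered by a single frame and keep their original amplitude.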

Figure 2. Typical artifacts that occur when choosing the synthesis frames $y_m$ to be equal to the analysis frames $x_m$. The input signal $x$ is stretched by a factor of $\alpha = 1.8$. The output signal $y$ shows discontinuities (blue oval) and amplitude fluctuations (indicated by blue lines). Both waveforms are plotted over time in seconds.
A basic TSM procedure should both enforce a smooth transition between frames as well as compensate for unwanted amplitude fluctuations. The idea of the overlap-add (OLA) TSM procedure is to apply a window function $w$ to the analysis frames, prior to the reconstruction of the output signal $y$. The task of the window function is to remove the abrupt waveform discontinuities at the analysis frames’ boundaries. A typical choice for $w$ is a Hann window function
$$w(r) = \begin{cases} 0.5\left(1 - \cos\left(\dfrac{2\pi(r + N/2)}{N - 1}\right)\right), & \text{if } r \in [-N/2 : N/2 - 1], \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
The Hann window has the nice property that
$$\sum_{n \in \mathbb{Z}} w\left(r - n\frac{N}{2}\right) = 1, \qquad (4)$$
for all $r \in \mathbb{Z}$. The principle of the iterative OLA procedure is visualized in Figure 3. For the frame index $m \in \mathbb{Z}$, we first use Equation (1) to compute the $m$th analysis frame $x_m$ (Figure 3a). Then, we derive the synthesis frame $y_m$ by
$$y_m(r) = \frac{w(r)\, x_m(r)}{\sum_{n \in \mathbb{Z}} w(r - nH_s)}. \qquad (5)$$
The numerator of Equation (5) constitutes the actual windowing of the analysis frame by multiplying it pointwise with the given window function. The denominator normalizes the frame by the sum of the overlapping window functions, which prevents amplitude fluctuations in the output signal. Note that, when choosing $w$ to be a Hann window and $H_s = N/2$, the denominator always reduces to one by Equation (4). This is the case in Figure 3b where the synthesis frame’s amplitude is not scaled before being added to the output signal $y$. Proceeding to the next analysis frame $x_{m+1}$ (Figure 3c), this frame is again windowed, overlapped with the preceding synthesis frame, and added to the output signal (Figure 3d). Note that Figure 3 visualizes the case where the original signal is compressed ($H_a > H_s$). Stretching the signal ($H_a < H_s$) works in exactly the same fashion. In this case, the analysis frames overlap to a larger degree than the synthesis frames.
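The role of the denominator in Equation (5) can be checked numerically. The sketch below evaluates the sum of shifted windows over one hop in the steady state; `overlap_sum` is a hypothetical helper, and $N = 1024$ is an arbitrary choice. Two window variants are compared: the form of Equation (3) with $N - 1$ in the denominator, and the periodic variant with $N$ in the denominator, which is an assumption introduced here for comparison. In this sketch the periodic variant sums to one up to floating-point precision at $H_s = N/2$, whereas the $N - 1$ form deviates from one by a small amount that shrinks as $N$ grows.

```python
import numpy as np

def overlap_sum(w, hop):
    """Sum of copies of w shifted by multiples of hop, over one steady-state hop.

    Hypothetical helper; assumes len(w) is a multiple of hop.
    """
    total = np.zeros(hop)
    for start in range(0, len(w), hop):
        total += w[start : start + hop]
    return total

N = 1024
k = np.arange(N)                                     # k = r + N/2
w_sym = 0.5 * (1 - np.cos(2 * np.pi * k / (N - 1)))  # Equation (3)
w_per = 0.5 * (1 - np.cos(2 * np.pi * k / N))        # periodic variant (assumption)

dev_sym = np.max(np.abs(overlap_sum(w_sym, N // 2) - 1.0))
dev_per = np.max(np.abs(overlap_sum(w_per, N // 2) - 1.0))
sum_quarter = overlap_sum(w_per, N // 4)             # H_s = N/4

print(dev_sym, dev_per, sum_quarter.min(), sum_quarter.max())
```

For $H_s = N/4$ the overlapping windows no longer sum to one, which is why the general normalization in Equation (5) is needed.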

Figure 3. The principle of TSM based on overlap-add (OLA). (a) Input audio signal $x$ with analysis frame $x_m$. The output signal $y$ is constructed iteratively; (b) Application of Hann window function $w$ to the analysis frame $x_m$ resulting in the synthesis frame $y_m$; (c) The next analysis frame $x_{m+1}$ having a specified distance of $H_a$ samples from $x_m$; (d) Overlap-add using the specified synthesis hopsize $H_s$.
OLA is an example of a time-domain TSM procedure where the modifications to the analysis
frames are applied purely in the time-domain. In general, time-domain TSM procedures are not only
efficient but also preserve the timbre of the input signal to a high degree. On the downside, output
signals produced by OLA often suffer from other artifacts, as we explain next.
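Putting Equations (1), (2), and (5) together, the OLA procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `ola_tsm`, the default values $N = 1024$ and $H_s = 512$, the periodic Hann window, the rounding of $H_a$, and the border handling are all choices made for the example. Instead of dividing each frame by the window sum, the sketch accumulates the windowed frames and the windows separately and divides once at the end; since the denominator of Equation (5) is periodic in $H_s$, both variants agree away from the signal borders.

```python
import numpy as np

def ola_tsm(x, alpha, N=1024, Hs=512):
    """Time-scale modify x by roughly the factor alpha using plain OLA (sketch)."""
    Ha = max(1, int(round(Hs / alpha)))            # analysis hopsize H_a = H_s / alpha
    k = np.arange(N)
    w = 0.5 * (1 - np.cos(2 * np.pi * k / N))      # periodic Hann window (assumption)

    # Zero-pad so that frame m covers x(m*Ha - N/2) ... x(m*Ha + N/2 - 1).
    padded = np.concatenate([np.zeros(N // 2), x, np.zeros(N)])
    num_frames = len(x) // Ha + 1
    out_len = (num_frames - 1) * Hs + N
    y = np.zeros(out_len)
    wsum = np.zeros(out_len)

    for m in range(num_frames):
        frame = padded[m * Ha : m * Ha + N]        # analysis frame x_m, Equation (1)
        y[m * Hs : m * Hs + N] += w * frame        # windowed frame, added as in Equation (2)
        wsum[m * Hs : m * Hs + N] += w             # sum of overlapping windows, Equation (5)

    wsum[wsum < 1e-3] = 1.0                        # avoid dividing by ~0 at the borders
    y /= wsum

    # Drop the leading half frame so that output index 0 matches input index 0.
    return y[N // 2 : N // 2 + int(round(len(x) * Hs / Ha))]

# Usage: stretch and compress a constant test signal.
x = np.ones(8000)
y_slow = ola_tsm(x, alpha=2.0)
y_fast = ola_tsm(x, alpha=0.5)
print(len(y_slow), len(y_fast))
```

A constant input is reproduced exactly in the interior of the output, which confirms that the normalization removes the amplitude fluctuations; deviations remain only near the borders, where frames partly cover the zero-padded region.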
3.2. Artifacts
The OLA procedure is in general not capable of preserving local periodic structures that are present in the input signal. This is visualized in Figure 4 where a periodic input signal $x$ is stretched by a factor of $\alpha = 1.8$ using OLA. When relocating the analysis frames, the periodic structures of $x$ may not align any longer in the superimposed synthesis frames. In the resulting output signal $y$, the periodic patterns are distorted. These distortions are also known as phase jump artifacts. Since local periodicities in the waveforms of audio signals correspond to harmonic sounds, OLA is not suited to modify signals that contain harmonic components. When applied to harmonic signals, the output signals of OLA have a characteristic warbling sound, which is a kind of periodic frequency modulation [6]. Since most music signals contain at least some harmonic sources (as for example singing voice, piano, violins, or guitars), OLA is usually not suited to modify music.
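The misalignment can be quantified for a single sinusoid. Two neighbouring frames are taken $H_a$ samples apart in the input but placed $H_s$ samples apart in the output, so within their overlap they contribute copies of the sinusoid that are shifted against each other by $H_s - H_a$ samples. The sketch below computes the resulting phase offset; the function names and the example values (440 Hz, 44.1 kHz sampling rate) are assumptions for illustration, and the gain formula covers only the point where both frames carry equal weight 0.5.

```python
import math

def frame_phase_mismatch(f, Fs, Hs, alpha):
    """Phase offset (radians, wrapped to [-pi, pi]) between a sinusoid of
    frequency f as seen in two neighbouring relocated frames (sketch)."""
    Ha = int(round(Hs / alpha))          # analysis hopsize, rounded to whole samples
    cycles = f * (Hs - Ha) / Fs          # relative shift measured in periods
    frac = cycles - round(cycles)        # whole periods do not matter
    return 2.0 * math.pi * frac

def midpoint_gain(phase):
    """Amplitude of 0.5*sin(t) + 0.5*sin(t + phase) relative to sin(t)."""
    return abs(math.cos(phase / 2.0))

phi = frame_phase_mismatch(f=440.0, Fs=44100.0, Hs=512, alpha=1.8)
print(phi, midpoint_gain(phi))
```

Only when the offset is a multiple of $2\pi$ do the two copies add up coherently; otherwise they partially cancel, and the degree of cancellation depends on the frequency, which illustrates why OLA distorts periodic structures.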