Modeling Temporal Tonal Relations in Polyphonic Music Through Deep Networks
with a Novel Image-Based Representation
Ching-Hua Chuan (1,2)
(1) University of North Florida
(2) University of Miami
c.chuan@miami.edu

Dorien Herremans (3,4)
(3) Singapore University of Technology and Design
(4) Institute of High Performance Computing, A*STAR, Singapore
dorien_herremans@sutd.edu.sg
Abstract
We propose an end-to-end approach for modeling polyphonic
music with a novel graphical representation, based on music
theory, in a deep neural network. Despite the success of deep
learning in various applications, it remains a challenge to in-
corporate existing domain knowledge in a network without
affecting its training routines. In this paper we present a novel
approach for predictive music modeling and music generation
that incorporates domain knowledge in its representation. In
this work, music is transformed into a 2D representation, inspired by the tonnetz from music theory, which graphically encodes musical relationships between pitches. This represen-
tation is incorporated in a deep network structure consist-
ing of multilayered convolutional neural networks (CNN, for
learning an efficient abstract encoding of the representation)
and recurrent neural networks with long short-term memory
cells (LSTM, for capturing temporal dependencies in music
sequences). We empirically evaluate the nature and the effec-
tiveness of the network by using a dataset of classical mu-
sic from various composers. We investigate the effect of pa-
rameters including the number of convolution feature maps,
pooling strategies, and three configurations of the network: LSTM without CNN, and LSTM with CNN (with and without pre-training). Visualizations of the feature maps and filters in
the CNN are explored, and a comparison is made between
the proposed tonnetz-inspired representation and pianoroll,
a commonly used representation of music in computational
systems. Experimental results show that the tonnetz represen-
tation produces musical sequences that are more tonally sta-
ble and contain more repeated patterns than sequences gen-
erated by pianoroll-based models, a finding that is directly
useful for tackling current challenges in music and AI such
as smart music generation.
Introduction
Predictive models of music have been explored by re-
searchers since the very beginning of the field of computer
music (Brooks et al. 1957). Such models are useful for ap-
plications in music analysis (Qi, Paisley, and Carin 2007);
music cognition (Schellenberg 1996); improvement of tran-
scription systems (Sigtia, Benetos, and Dixon 2016); music
generation (Herremans et al. 2015); and others. Applications
such as the latter represent various fundamental challenges
in artificial intelligence for music. In recent years, there has
been a growing interest in deep neural networks for model-
ing music due to their power to capture complex hidden re-
lationships. The launch of recent projects such as Magenta, a deep learning and music project by the Google Brain team with a focus on music generation, testifies to the importance and recent popularity of music and AI. With this project we aim to further advance the capability of deep networks to model music by proposing a novel image-based representation inspired by music theory.
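The details of our representation follow later in the paper; as a first intuition, the sketch below lays pitch classes out on a conventional tonnetz-style grid, in which horizontal neighbors are a perfect fifth apart and vertical neighbors a major third apart, so that diagonal neighbors differ by a minor third. The grid size and coordinate scheme here are illustrative assumptions, not the exact encoding used in our experiments.

    import numpy as np

    def tonnetz_slice(active_pitch_classes, rows=12, cols=12):
        """Binary tonnetz-style image for one time slice.

        Cell (r, c) holds pitch class (7*c + 4*r) mod 12: one step right
        is a perfect fifth (+7 semitones), one step up a major third (+4),
        and the up-left diagonal a minor third (-3).
        """
        grid = np.zeros((rows, cols), dtype=np.float32)
        for r in range(rows):
            for c in range(cols):
                if (7 * c + 4 * r) % 12 in active_pitch_classes:
                    grid[r, c] = 1.0
        return grid

    # A C major triad (pitch classes 0, 4, 7) forms a compact cluster
    c_major = tonnetz_slice({0, 4, 7})

Tonally related pitches thus become spatially adjacent, which is precisely the kind of local structure a convolutional layer can exploit.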
Recent deep learning projects in the field of music include the work of Eck and Schmidhuber (2002), in which a recurrent neural network (RNN) with LSTM cells is used to generate improvisations (first chord sequences, followed by monophonic melodies) for 12-bar blues. They represent music as notes whose pitches fall within a range of 25 possible pitches (C3 to C5) and that occur at fixed time intervals. The network therefore has 25 outputs, each of which is considered independently; a decision threshold of 0.5 is used to select each note as a statistically independent event in a chord.
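A minimal sketch of this independent-thresholding scheme (our illustration; the variable names and the C3 = MIDI 48 convention are assumptions, not from the original paper):

    import numpy as np

    # One time step of model output: 25 sigmoid activations, one per
    # pitch in the two-octave range C3..C5 (25 semitones inclusive).
    probs = np.random.rand(25)              # stand-in for the network's output
    active = np.flatnonzero(probs >= 0.5)   # each pitch thresholded independently
    chord = [48 + i for i in active]        # MIDI numbers, assuming C3 = 48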
More recently, a pianoroll representation of 88 keys was used by Boulanger-Lewandowski, Bengio, and Vincent (2012) to train an RNN. The authors integrate the notion of chords by placing restricted Boltzmann machines on top of an RNN to model the distribution of simultaneously played notes in the next time slice, conditioned on the previous time slice.
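For concreteness, a pianoroll slices time into fixed steps and marks, for each of the 88 piano keys, whether it sounds in each step. A minimal sketch of this encoding (our own illustration; MIDI note 21 = A0 is the lowest piano key):

    import numpy as np

    def pianoroll(notes, n_steps, n_keys=88, lowest_midi=21):
        """Binary 88-key pianoroll; notes are (midi_pitch, start, end) tuples."""
        roll = np.zeros((n_steps, n_keys), dtype=np.float32)
        for pitch, start, end in notes:
            roll[start:end, pitch - lowest_midi] = 1.0  # mark the note as sounding
        return roll

    # C major triad (C4, E4, G4) held for four of eight time steps
    roll = pianoroll([(60, 0, 4), (64, 0, 4), (67, 0, 4)], n_steps=8)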
In Huang, Duvenaud, and Gajos (2016), a chord sequence is modeled as a string of symbols. Chord embeddings are learned from a corpus using Word2vec based on the skip-gram model (Mikolov et al. 2013), so that each chord is described by its sequential context. A Word2vec approach is also used in Herremans and Chuan (2017) to model and generate polyphonic music. For a more complete overview of music generation systems, the reader is referred to Herremans, Chuan, and Chew (2017). While music can typically be represented in either audio or symbolic format, the focus of this paper is on the latter.
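As an illustration of the chord-embedding idea (a toy sketch, not the exact setup of either paper; assumes the gensim 4.x API):

    from gensim.models import Word2Vec

    # Each "sentence" is a chord progression; each chord symbol is a token.
    progressions = [
        ["C", "Am", "F", "G"],
        ["C", "F", "G", "C"],
        ["Am", "F", "C", "G"],
    ]
    model = Word2Vec(progressions, vector_size=16, window=2,
                     sg=1,          # sg=1 selects the skip-gram model
                     min_count=1, epochs=200, seed=1)
    print(model.wv.most_similar("C"))  # chords in similar contexts rank high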
The widespread adoption of deep learning in areas such as image recognition is due to the high accuracy of models when abundant data is available, and to end-to-end training, which eliminates the need for hand-crafted features. Music, however, is a domain where well-annotated datasets are relatively scarce, but one with a long history of theoretical deliberation. It is therefore important to explore how such