Learning Music Emotion Primitives via Supervised
Dynamic Clustering
Yang Liu (1,2), Yan Liu (3), Xiang Zhang (3,4), Gong Chen (3), Kejun Zhang (4)
(1) Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, P. R. China
(2) Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, P. R. China
(3) Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, P. R. China
(4) College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, P. R. China
csygliu@comp.hkbu.edu.hk, csyliu@comp.polyu.edu.hk, csxgzhang@comp.polyu.edu.hk,
csgchen@comp.polyu.edu.hk, zhangkejun@zju.edu.cn
ABSTRACT
This paper explores a fundamental problem in music emotion analysis: how to segment a music sequence into a set of basic emotive units, which we name emotion primitives. Existing work on music emotion analysis is mainly based on fixed-length music segments, which often makes accurate emotion recognition difficult. A short music segment, such as an individual music frame, may fail to evoke an emotional response, while a long music segment, such as an entire song, may convey various emotions over time. Moreover, the minimum length of a music segment varies with the type of emotion. To address these problems, we propose a novel method dubbed supervised dynamic clustering (SDC) to automatically decompose a music sequence into meaningful segments of various lengths. First, the music sequence is represented as a set of music frames. Then, the music frames are clustered according to their valence-arousal values in the emotion space, and the clustering results are used to initialize the music segmentation. After that, a dynamic programming scheme is employed to jointly optimize the subsequent segmentation and grouping in the music feature space. Experimental results on a standard dataset show both the effectiveness and the rationality of the proposed method.
Keywords
Music emotion analysis; emotion primitives; supervised dynamic
clustering
1. INTRODUCTION
Music, loosely described as organized sound, can convey and evoke various emotions. Music emotion analysis, which attracts much attention from researchers in various disciplines such as musicology [4], psychology [11], and computer science [27], plays a crucial role in many real-world applications such as music recommendation [24] and music therapy [18].
With the huge amount of available musical data and the rapid
development of computing resources, computational approaches that automate music emotion analysis have taken on increasing interest and importance [15, 29, 22, 24, 25, 12]. Lu et al.
[15] proposed a GMM-based model to detect the moods in music.
Yang et al. [29] presented a regression approach for music emotion
recognition. Trohidis et al. [22] formulated music emotion analy-
sis as a multi-label classification problem. Wang et al. [24] built a
probabilistic model for music recommendation. Wu et al. [25] pro-
posed a multi-label multi-layer multi-instance multi-view learning
scheme for music emotion recognition. Liu et al. [12] introduced a
dimensionality reduction algorithm to model the relations between
low-level music features and high-level emotions.
Although tremendous progress has been made in music emotion
analysis, one of the fundamental problems, i.e., how to segment the
music sequence into a set of plausible units according to the emo-
tions, is seldom investigated. The inherent difficulty in the problem
mainly stems from the variety and complexity of music emotion
labels, a relatively large range of temporal scale for different music
emotions, and the intra-emotion variation of the music sequences.
In this paper, we work on this fundamental problem. We name these basic units music emotion primitives and explore machine learning techniques to learn them from music data with human-annotated emotions. To address the aforementioned challenges, a novel computational model dubbed supervised dynamic clustering (SDC) is presented to jointly optimize the segmentation and clustering of music sequences under the supervision of the emotion information.
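Before the formal development in Section 2, the dynamic-programming idea can be previewed with a simplified sketch: given fixed cluster centroids in the music feature space, every candidate segment is assigned to its best centroid, and the minimum-cost boundary placement (with a minimum segment length) is found by dynamic programming. This is an assumption-laden illustration, not the exact SDC objective; `dp_segment`, `min_len`, and the squared-distance cost are illustrative choices.

```python
import numpy as np

def dp_segment(X, centroids, min_len=5):
    """Segment frame features X (T x d) so that each segment is assigned
    to one centroid and the total within-segment squared distance is
    minimized; boundaries are found by dynamic programming."""
    T, K = len(X), len(centroids)
    # dist[t, k]: squared distance of frame t to centroid k
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # prefix[j] - prefix[i]: per-centroid cost of frames i..j-1
    prefix = np.vstack([np.zeros(K), np.cumsum(dist, axis=0)])

    def seg_cost(i, j):
        return (prefix[j] - prefix[i]).min()  # best single centroid for i..j-1

    D = np.full(T + 1, np.inf)  # D[t]: optimal cost of frames 0..t-1
    D[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for t in range(min_len, T + 1):
        for i in range(0, t - min_len + 1):
            c = D[i] + seg_cost(i, t)
            if c < D[t]:
                D[t], back[t] = c, i
    cuts, t = [], T
    while t > 0:  # backtrack to recover the boundary positions
        cuts.append(t)
        t = back[t]
    return sorted(cuts)
```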
To the best of our knowledge, this is the first work to decompose music sequences into plausible emotion primitives, although learning motion primitives for visual data has already achieved significant progress [6, 9, 13, 14, 30]. One recent related work is [7], which also modeled the dynamics of music emotions over time. However, the objective and methodology in [7] differ substantially from ours: [7] aimed at performing emotion-based music retrieval via dynamic time warping [2], whereas our work aims to discover the emotion primitives of music with the proposed SDC.
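For context, dynamic time warping [2] computes a minimum-cost alignment between two sequences of possibly different lengths. The following is a standard textbook implementation for reference only; it is not the specific formulation used in [7].

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D sequences,
    using squared difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible partial alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```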
The rest of this paper is organized as follows. In Section 2, we propose SDC to learn the music emotion primitives. In Section 3, we schematically illustrate the learning outcomes and statistically evaluate the performance of the proposed method on a standard dataset. Finally, Section 4 concludes the paper and outlines future work.
2. SUPERVISED DYNAMIC CLUSTERING
Most of the computational models for music emotion analysis
are based on two kinds of emotion representations: the categorical