Automatic Emotion Variation Detection in
Continuous Speech
Yuchao Fan, Mingxing Xu, Zhiyong Wu, Lianhong Cai
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: fyc12@mails.tsinghua.edu.cn, {xumx,clh-dcs}@tsinghua.edu.cn, john.zy.wu@gmail.com
Abstract—Though speech emotion recognition has gained increasing interest in the field of Human-Computer Interaction, it is still a challenge to automatically determine the emotion state type and the boundaries of each emotionally salient segment in continuous speech, a task named Automatic Emotion Variation Detection (AEVD). In this task, the input utterances are not pre-segmented and may contain emotion variations. This paper proposes a Multi-timescaled Sliding Window based AEVD method (MSW-AEVD). First, a fixed-length sliding window is employed to segment the continuous speech for classic emotion recognition. An emotion type is assigned to each window-shift according to the recognition results of all the sliding windows containing that window-shift. This basic procedure is then extended to multi-timescaled sliding windows, in which different features are utilized for different scales. Finally, a post-processing step is employed to refine the final outputs. In this work, we focus on the anger-neutral and happiness-neutral cases, which dominate recent studies of AEVD. Performance evaluation is carried out on two databases: the German database EMO-DB and the Chinese database TH1309-DB. Experimental results show that the proposed method significantly outperforms an HMM-based baseline.
I. INTRODUCTION
Speech is among the most effective means of human communication, and understanding the emotion conveyed in speech is important for human interaction. Automatic detection of user emotions has therefore become an attractive task in applications of human-computer interaction (HCI): (a) in customer service, a call center can switch to a human operator when users are detected to be dissatisfied; (b) in education, a spoken tutoring system can adjust its teaching content according to the detected emotion; (c) in medical and emergency applications, emotion detection systems are used to detect stress, pain, fear or panic [1], [2], [3].
A large number of works on speech emotion have been published, focusing on various tasks. Some studies attempt to find the acoustic features most relevant to emotion in speech [4]. Some focus on comparing different timescales for feature extraction [5]. Paralinguistic information has become a hotspot in this area [6]. Another module of interest is the classifier: in addition to common classifiers such as the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM), recent studies also propose Multiple Classifier Systems (MCS) [7] and other models prominent in recent affective computing research, e.g., the Deep Neural Network (DNN) [8] and Long Short-Term Memory (LSTM) [9].
Human emotion changes constantly in everyday communication. Most current studies focus on finding a single emotion state to represent the input speech, which cannot describe such variation. As the study of speech emotion deepens, some researchers attempt to find emotionally salient segments in continuous speech, a task we name Automatic Emotion Variation Detection (AEVD). This task aims to find the information of each emotionally salient segment, e.g., emotion state type, position and duration. Ref. [10] proposed a method based on Hidden Markov Models (HMM) that trains a specific model for each emotion and assigns the most likely emotion to the test utterance. In [11], the positions of emotion change points are detected by modeling the shapes of the F0 and energy contours. Busso et al. [12] focus on tackling inter-speaker variability. The intersections of changing emotion score curves are used to detect emotion changes in [13]. In [14], the emotion change boundaries are detected first to segment the continuous speech; the emotion of each segment is then predicted independently.
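To make the per-emotion HMM scheme of [10] concrete, the following minimal sketch trains one HMM per emotion and labels a test utterance with the most likely model. It is an illustration only, not the implementation of [10]: the hmmlearn library, the state count and the frame-level feature representation are all our own assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_emotion_hmms(train_data, n_states=5):
    """train_data: dict mapping an emotion label to a list of feature
    sequences, each of shape (n_frames, n_features). The state count
    and covariance type are illustrative choices, not those of [10]."""
    models = {}
    for emotion, sequences in train_data.items():
        X = np.vstack(sequences)                   # concatenate all frames
        lengths = [len(seq) for seq in sequences]  # per-sequence lengths
        model = GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)                      # Baum-Welch training
        models[emotion] = model
    return models

def classify_utterance(models, features):
    """Assign the emotion whose HMM yields the highest log-likelihood."""
    return max(models, key=lambda e: models[e].score(features))
```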
Similar tasks are also studied in Music Information Retrieval (MIR), where the problem is named Music Emotion Variation Detection (MEVD). In [15], a sliding window of 10 seconds with 1/3 overlap is used to segment a music piece, and a prediction is made for each segment. System identification techniques are utilized in [16] to model music emotion as a function of a number of musical features.
Many studies have shown that emotion, as a kind of intrinsic expression, should be analyzed over a sufficient length of time; it is still a challenge to directly recognize the exact emotion of a short segment. Meanwhile, the detection task requires precise timing. For these reasons, we propose a sliding window based AEVD method that predicts the emotion of each window-shift interval. The window-shift interval, named a shift in our work, is the minimum time granularity. An emotion recognition module is first applied at the window level; the emotion type of a shift is then assigned by mapping the recognition results of all windows containing that shift (see the sketch below). Various mapping strategies, features and window lengths are examined experimentally. We then fuse the results obtained from multiple window lengths and multiple features to form the multi-timescaled sliding window based system. Finally, we evaluate the performance on multiple databases, including an acted database and a simulated database.
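As a concrete illustration of the shift-level mapping: with a window length of 2 s and a shift of 0.5 s, each interior shift is covered by four overlapping windows, so its label can be obtained, for instance, by majority voting over the four window-level predictions. The sketch below outlines this single-timescale procedure; the window-level classifier is a placeholder, and majority voting stands in for just one of the mapping strategies compared later.

```python
from collections import Counter

def sliding_window_aevd(utt_len, win_len, shift, classify_window):
    """Label every shift-length interval of an utterance.

    utt_len, win_len and shift are in seconds; classify_window(start, end)
    is a placeholder for the window-level emotion recognizer."""
    # Step 1: run classic emotion recognition on every fixed-length window.
    windows = []
    start = 0.0
    while start + win_len <= utt_len:
        windows.append((start, start + win_len,
                        classify_window(start, start + win_len)))
        start += shift

    # Step 2: map window-level results onto shifts. Each shift is labeled
    # by majority vote over all windows containing it (one possible
    # mapping strategy among several).
    labels = []
    t = 0.0
    while t + shift <= utt_len:
        votes = [lab for (s, e, lab) in windows if s <= t and t + shift <= e]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else None)
        t += shift
    return labels
```

For example, sliding_window_aevd(10.0, 2.0, 0.5, clf) yields 20 shift labels for a 10 s utterance. The multi-timescaled extension would run this procedure for several window lengths, with scale-specific features, and fuse the resulting shift-level label sequences.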
This paper is organized as follows. Section 2 describes