Applying PAD Three Dimensional Emotion Model to Convert Prosody of Emotional Speech

Xiaoyong Lu¹,³, Hongwu Yang²*, Aibao Zhou¹
¹College of Psychology, Northwest Normal University, Lanzhou
²College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou
³College of Computer Science and Engineering, Northwest Normal University, Lanzhou
Email: yanghw@nwnu.edu.cn
Abstract—Happiness has attracted much attention from researchers in various fields. This paper realizes prosodic conversion of emotional speech for happiness computing in speech communication. An emotional speech corpus comprising 11 kinds of typical emotional utterances is designed, where each utterance is labeled with its PAD values in a psychological sense. A five-scale tone model is employed to model the pitch contour of emotional utterances at the syllable level. A generalized regression neural network (GRNN) based prosody conversion model is built to realize the transformation of the pitch contour, duration and pause duration of emotional utterances, in which the PAD values of the emotion and context parameters are adopted to predict the prosodic features. Emotional utterances are then re-synthesized with the STRAIGHT algorithm by modifying pitch contour, duration and pause duration. Experimental results on the Emotional Mean Opinion Score (EMOS) demonstrate that speech converted by the proposed method can express the corresponding feelings.

Index Terms—happiness, PAD emotion model, five-scale tone model, generalized regression neural network (GRNN), STRAIGHT, prosody conversion.
I. INTRODUCTION
Happiness is an eternal pursuit of human beings and the ultimate goal of social development. The basic outline of happiness includes not only a cognitive component but also an emotional component [1]. Since happiness is one of the most important aspects of human communication, it has been a hot topic in human-computer speech communication, including speech synthesis and speech recognition. A speech synthesis system can synthesize human-like utterances. Though current speech synthesis systems are generally accepted by users for their intelligibility and naturalness, synthetic speech is primarily presented to users with a neutral intonation that lacks rich emotional expression. Therefore, high-performance speech synthesis has become a hot research topic in speech engineering in recent years [2]. Emotional speech synthesis mainly adopts synthesis methods based on the Hidden Markov Model (HMM) [3] and large-corpus based concatenation methods [4]. Although the former can use speaker adaptation transforms [5-6] to realize emotional speech synthesis, the quality of the synthesized speech is hardly acceptable to users. Though speech synthesis by large-corpus based concatenation can achieve high naturalness, it is very difficult to record corpora for different emotions. Therefore, some studies proposed methods to realize emotional speech synthesis through prosody conversion. Four basic emotions are selected in [7] to realize the conversion of the related prosodic characteristics of emotional speech. The PAD three dimensional emotion model is also employed [8-9] to obtain synthetic emotional speech. Emotional speech conversion has also been achieved using the PAD emotion model [10], and SVR has been used to predict emotional prosody parameters [11]. However, these studies lack modeling of the fundamental frequency contour.
In order to convert the F0 envelope in emotional speech conversion, this paper builds a text corpus of 11 kinds of typical emotions and records the corresponding speech corpus. The PAD values of the speech corpus are labeled with a psychological method. We also build a syllabic F0 model with the five-scale tone model [12]. A model for predicting the prosodic parameters of emotional speech is constructed with a generalized regression neural network (GRNN). The model predicts the prosodic features of the target emotional speech according to the PAD values and contextual features of sentences. Finally, the STRAIGHT [13] algorithm is exploited to achieve emotional speech conversion. The experimental results show that the converted speech can express the target emotion.
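The core of a GRNN, in Specht's formulation, is a Gaussian-kernel weighted average of training targets, which makes the prediction step very compact. The following is a minimal illustrative sketch of how PAD values could map to a prosodic feature; the feature values, target quantity (a hypothetical mean-F0 scaling factor), and smoothing parameter sigma are assumptions for illustration, not the paper's actual data or configuration.

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=0.5):
    """GRNN prediction: a Gaussian-kernel weighted average of the
    training targets, weighted by distance from the query input."""
    # Squared Euclidean distances between the query and all training inputs
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # pattern-layer activations
    return np.dot(w, y_train) / np.sum(w)  # summation / output layers

# Toy example: inputs are (P, A, D) values in [-1, 1], the target is a
# hypothetical mean-F0 scaling factor (all values are illustrative).
X = np.array([[0.8, 0.6, 0.4],    # e.g. a happy utterance
              [-0.6, 0.4, -0.3],  # e.g. a fearful utterance
              [0.0, 0.0, 0.0]])   # neutral
y = np.array([1.3, 1.1, 1.0])

# A query near the "happy" point yields a prediction near its target
print(grnn_predict(X, y, np.array([0.7, 0.5, 0.3])))
```

Because the prediction is a smooth interpolation over stored training pairs, a GRNN needs no iterative training, which suits the relatively small emotional corpora described here; only sigma must be tuned.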
II. PAD THREE DIMENSIONAL EMOTION MODEL
The main methods for describing emotion [14] include categorical representation and dimensional representation. Since the categorical representation has difficulty describing mixed emotions, this paper adopts the PAD three dimensional emotion model to describe emotional speech.

The PAD three dimensional emotion model [15] is composed of three dimensions: 1) Pleasure-Displeasure, which denotes the positive or negative character of the emotional state; 2) Arousal-Nonarousal, which denotes the level of psychological activation and alertness; 3) Dominance-Submissiveness, which denotes the degree of control over, or influence by, others and the external environment.
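The three dimensions above can be represented as a point in a cube with each axis in [-1, 1]. As a minimal sketch, the mapping below assumes a 9-point rating scale per axis (a common choice for PAD annotation instruments); the paper's actual rating scale and any example values here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class PAD:
    """A point in PAD space; each axis lies in [-1, 1]."""
    pleasure: float   # Pleasure-Displeasure: positive vs. negative state
    arousal: float    # Arousal-Nonarousal: activation / alertness level
    dominance: float  # Dominance-Submissiveness: control vs. being controlled

def from_nine_point(p, a, d):
    """Map hypothetical 9-point ratings (1..9, midpoint 5) to [-1, 1]."""
    scale = lambda r: (r - 5.0) / 4.0
    return PAD(scale(p), scale(a), scale(d))

# e.g. a pleasant, fairly aroused, mildly dominant state
print(from_nine_point(8, 7, 6))
```

A dimensional label of this kind lets one utterance carry a mixture of emotions as a single continuous vector, which is exactly what the categorical representation cannot express.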
A. Text corpus
We select one or two common emotions that represent each quadrant of the PAD three dimensional space. These common emotions comprise 11 categories
978-1-4799-6284-6/14/$31.00 © 2014 IEEE