Speech Emotion Recognition Using Fourier Parameters
Kunxia Wang, Ning An, Senior Member, IEEE, Bing Nan Li, Senior Member, IEEE,
Yanyong Zhang, Member, IEEE, Lian Li, Member, IEEE
Abstract—Recently, attention has been paid to harmony features for speech emotion recognition. Our study finds that the first- and second-order differences of harmony features also play an important role in speech emotion recognition. We therefore propose a new Fourier parameter model that uses the perceptual content of voice quality together with its first- and second-order differences for speaker-independent speech emotion recognition. Experimental results show that the proposed Fourier parameter (FP) features are effective in identifying various emotional states in speech signals. They improve recognition rates over methods using Mel-frequency cepstral coefficient (MFCC) features by 16.2 points, 6.8 points and 16.6 points on the German database (EMODB), the Chinese language database (CASIA) and the Chinese elderly emotion database (EESDB), respectively. In particular, when FP features are combined with MFCC, the recognition rates can be further improved by 17.5 points, 10 points and 10.5 points on the aforementioned databases.
Index Terms—Fourier parameter model, speaker-independent, speech emotion recognition, affective computing
1 INTRODUCTION
Speech emotion recognition, defined as extracting the emotional state of a speaker from his or her speech, is attracting more and more attention. It is believed that speech emotion recognition can improve the performance of speech recognition systems [1], and it is therefore very helpful for criminal investigation, intelligent assistance [2], surveillance and detection of potentially hazardous events [3], and health care systems [4]. Speech emotion recognition is particularly useful in man-machine interaction [1],[6].
In order to effectively recognize emotions from speech signals, the intrinsic features must be extracted from raw speech data and transformed into formats suitable for further processing. Extracting efficient speech features is a long-standing challenge in speech emotion recognition, and researchers have conducted numerous studies on it [6]-[12]. First, continuous features, including pitch-related, formant, energy-related and timing features, have been found to deliver important emotional cues [7],[11],[31]. Second, in addition to such time-dependent acoustic features, various spectral features such as linear predictor coefficients (LPC) [32], linear predictor cepstral coefficients (LPCC) [33] and mel-frequency cepstral coefficients (MFCC) [45] play a significant role in speech emotion recognition. Bou-Ghazale et al. [33] showed that features based on cepstral analysis, such as LPCC and MFCC, outperform the linear LPC features in detecting speech emotions. Third, the Teager energy operator (TEO), introduced by Teager [35] and Kaiser [36], can be used to detect stress in speech [37], and further TEO-based features have been proposed for distinguishing neutral from stressed speech [38]. Although the abovementioned features have proven useful for recognizing specific emotions, there is still no sufficiently effective feature set for describing complicated emotional states [13].
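As a rough illustration of the spectral and energy-based features discussed above, the following Python sketch extracts MFCCs together with their first- and second-order differences and computes the discrete Teager energy. The librosa library, the 16 kHz sampling rate and the frame settings are assumptions made for the example; they are not the configuration used in this paper.

import numpy as np
import librosa

def spectral_and_teo_features(wav_path, n_mfcc=13):
    # Load the utterance; 16 kHz is an assumed, illustrative sampling rate.
    y, sr = librosa.load(wav_path, sr=16000)

    # Mel-frequency cepstral coefficients, frame by frame
    # (25 ms windows with 10 ms hops at 16 kHz).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)

    # First- and second-order differences (delta and delta-delta).
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)

    # Discrete Teager energy operator (Kaiser):
    # psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)
    teo = y[1:-1] ** 2 - y[:-2] * y[2:]

    return np.vstack([mfcc, d1, d2]), teo

Stacking the delta and delta-delta coefficients with the static MFCCs in this way is the standard means of adding the first- and second-order dynamics that the abstract refers to.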
It has been demonstrated that voice quality features are related to speech emotions [14],[15],[39],[40],[42],[54]. According to an extensive study by Cowie [11], the acoustic correlates of voice quality can be grouped into voice level, pitch, phrase and feature boundaries, and temporal structures. There are two popular approaches to deriving voice quality terms. The first relies on the fact that speech signals can be modelled as the output of a vocal tract filter excited by a glottal source signal [32]; hence voice quality can be measured by removing the filtering effect of the vocal tract and measuring the parameters of the glottal signal [41]. However, the glottal signal has to be estimated by exploiting the characteristics of the source signal and the vocal tract filter, because neither of them is known [1]. In the second approach, voice quality is represented by parameters estimated directly from the speech signal. In [39], voice quality was represented by jitter and shimmer; the speaker-independent speech emotion recognition system used a continuous hidden Markov model (HMM) as the classifier to detect selected speaking styles: angry, fast, question, slow and soft. The baseline accuracy was 65.5% when using MFCC features only, and it improved to 68.1% when MFCC was combined with jitter, 68.5% when combined with shimmer, and 69.1% when combined with both (a simplified computation of jitter and shimmer is sketched below).
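As a minimal sketch of this second approach, the following Python snippet computes local jitter and shimmer from pre-extracted glottal period durations and per-period peak amplitudes. The pitch-marking step that produces those sequences is not shown, and the formulas follow the common "local" definitions rather than the exact procedure of [39].

import numpy as np

def local_jitter(periods):
    # Local jitter: mean absolute difference between consecutive glottal
    # periods, normalized by the mean period. `periods` is a 1-D array of
    # period durations (seconds) from a separate pitch-marking step.
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    # Local shimmer: mean absolute difference between the peak amplitudes
    # of consecutive periods, normalized by the mean amplitude.
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

Utterance-level jitter and shimmer values obtained this way can then be appended to MFCC-based feature vectors before classification, in the spirit of the MFCC plus jitter/shimmer combinations reported above.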
In [54], the voice quality parameters were estimated by
K.X. Wang is with the School of Computer and Information, Hefei University of Technology, and with the Department of Electronic Engineering, Anhui University of Architecture, Hefei, China. E-mail: kxwang@ahjzu.edu.cn.
N. An and L. Li are with the School of Computer and Information, Hefei University of Technology, Hefei, China. E-mail: ning.g.an@acm.org, llian@hfut.edu.cn.
B.N. Li is with the Department of Biomedical Engineering, Hefei University of Technology, Hefei, China. E-mail: bingoon@ieee.org.
Y.Y. Zhang is with WINLAB, Rutgers University, North Brunswick, NJ, USA. E-mail: yyzhang@winlab.rutgers.edu.