Speech Emotion Recognition Using Fourier Parameters
Kunxia Wang, Ning An, Senior Member, IEEE, Bing Nan Li, Senior Member, IEEE,
Yanyong Zhang, Member, IEEE, Lian Li, Member, IEEE
Abstract—Recently, attention has been paid to harmony features for speech emotion recognition. Our study finds that the first- and second-order differences of harmony features also play an important role in speech emotion recognition. We therefore propose a new Fourier parameter model that uses the perceptual content of voice quality together with its first- and second-order differences for speaker-independent speech emotion recognition. Experimental results show that the proposed Fourier parameter (FP) features are effective in identifying various emotional states in speech signals. They improve recognition rates over methods using Mel-frequency cepstral coefficient (MFCC) features by 16.2 points, 6.8 points and 16.6 points on the German database (EMODB), the Chinese language database (CASIA) and the Chinese elderly emotion database (EESDB), respectively. In particular, when FP features are combined with MFCC, the recognition rates can be further improved by 17.5 points, 10 points and 10.5 points on the aforementioned databases.
Index Terms—Fourier parameter model, speaker-independent, speech emotion recognition, affective computing
1 INTRODUCTION
Speech emotion recognition, defined as extracting the emotional state of a speaker from his or her speech, is attracting more and more attention. It is believed that speech emotion recognition can improve the performance of speech recognition systems [1], and it is therefore very helpful for criminal investigation, intelligent assistance [2], surveillance and detection of potentially hazardous events [3], and health care systems [4]. Speech emotion recognition is particularly useful in man-machine interaction [1],[6].
In order to effectively recognize emotions from speech signals, the intrinsic features must be extracted from raw speech data and transformed into formats suitable for further processing. Extracting efficient speech features is a long-standing challenge in speech emotion recognition, and researchers have conducted numerous studies on it [6]-[12]. First, continuous features, including pitch-related, formant, energy-related and timing features, have been found to deliver important emotional cues [7],[11],[31]. Second, in addition to such time-dependent acoustic features, various spectral features such as linear predictor coefficients (LPC) [32], linear predictor cepstral coefficients (LPCC) [33] and mel-frequency cepstral coefficients (MFCC) [45] play a significant role in speech emotion recognition. Bou-Ghazale et al. [33] showed that features based on cepstral analysis, such as LPCC and MFCC, outperform the linear LPC features in detecting speech emotions. Third, the Teager energy operator (TEO), introduced by Teager [35] and Kaiser [36], can be used to detect stress in speech [37], and further TEO-based features have been proposed for distinguishing neutral from stressed speech [38]. Although the abovementioned features have proven useful for recognizing specific emotions, there is still no sufficiently effective feature set for describing complicated emotional states [13].
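As a rough illustration of the spectral and energy-based features discussed above, the following Python sketch extracts MFCCs together with their first- and second-order differences and computes the discrete Teager energy. The librosa library, the 16 kHz sampling rate and the frame settings are assumptions made for the example; they are not the configuration used in this paper.

import numpy as np
import librosa

def spectral_and_teo_features(wav_path, n_mfcc=13):
    # Load the utterance; 16 kHz is an assumed, illustrative sampling rate.
    y, sr = librosa.load(wav_path, sr=16000)

    # Mel-frequency cepstral coefficients, frame by frame
    # (25 ms windows with 10 ms hops at 16 kHz).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)

    # First- and second-order differences (delta and delta-delta).
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)

    # Discrete Teager energy operator (Kaiser):
    # psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)
    teo = y[1:-1] ** 2 - y[:-2] * y[2:]

    return np.vstack([mfcc, d1, d2]), teo

Stacking the delta and delta-delta coefficients with the static MFCCs in this way is the standard means of adding the first- and second-order dynamics that the abstract refers to.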
It has been demonstrated that voice quality features are related to speech emotions [14],[15],[39],[40],[42],[54]. According to an extensive study by Cowie [11], the acoustic correlates of voice quality can be grouped into voice level, pitch, phrase and feature boundaries, and temporal structures. There are two popular approaches to deriving voice quality terms. The first relies on the fact that speech signals can be modelled as the output of a vocal tract filter excited by a glottal source signal [32]; hence voice quality can be measured by removing the filtering effect of the vocal tract and measuring the parameters of the glottal signal [41]. However, the glottal signal has to be estimated by exploiting the characteristics of the source signal and the vocal tract filter, because neither of them is known [1]. In the second approach, voice quality is represented by parameters estimated directly from the speech signal. In [39], voice quality was represented by jitter and shimmer; the speaker-independent speech emotion recognition system used a continuous hidden Markov model (HMM) as the classifier to detect selected speaking styles: angry, fast, question, slow and soft. The baseline accuracy was 65.5% when using MFCC features only, and it improved to 68.1% when MFCC was combined with jitter, 68.5% when combined with shimmer, and 69.1% when combined with both (a simplified computation of jitter and shimmer is sketched below).
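As a minimal sketch of this second approach, the following Python snippet computes local jitter and shimmer from pre-extracted glottal period durations and per-period peak amplitudes. The pitch-marking step that produces those sequences is not shown, and the formulas follow the common "local" definitions rather than the exact procedure of [39].

import numpy as np

def local_jitter(periods):
    # Local jitter: mean absolute difference between consecutive glottal
    # periods, normalized by the mean period. `periods` is a 1-D array of
    # period durations (seconds) from a separate pitch-marking step.
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    # Local shimmer: mean absolute difference between the peak amplitudes
    # of consecutive periods, normalized by the mean amplitude.
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

Utterance-level jitter and shimmer values obtained this way can then be appended to MFCC-based feature vectors before classification, in the spirit of the MFCC plus jitter/shimmer combinations reported above.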
In [54], the voice quality parameters were estimated by
K.X. Wang is with the School of Computer and Information, Hefei University of Technology, and with the Department of Electronic Engineering, Anhui University of Architecture, Hefei, China. E-mail: kxwang@ahjzu.edu.cn.
N. An and L. Li are with the School of Computer and Information, Hefei University of Technology, Hefei, China. E-mail: ning.g.an@acm.org, llian@hfut.edu.cn.
B.N. Li is with the Department of Biomedical Engineering, Hefei University of Technology, Hefei, China. E-mail: bingoon@ieee.org.
Y.Y. Zhang is with WINLAB, Rutgers University, North Brunswick, NJ, USA. E-mail: yyzhang@winlab.rutgers.edu.