Automatic Emotion Variation Detection in
Continuous Speech
Yuchao Fan, Mingxing Xu, Zhiyong Wu, Lianhong Cai
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: fyc12@mails.tsinghua.edu.cn, {xumx,clh-dcs}@tsinghua.edu.cn, john.zy.wu@gmail.com
Abstract—Though speech emotion recognition has gained increasing interest in the field of Human-Computer Interaction, it is still a challenge to automatically determine the emotion state type and the boundaries of each emotionally salient segment in continuous speech, a task named Automatic Emotion Variation Detection (AEVD). In this task, the input utterances are not pre-segmented and may contain emotion variations. This paper proposes a Multi-timescaled Sliding Window based AEVD method (MSW-AEVD). First, a fixed-length sliding window is employed to segment the continuous speech for classic emotion recognition. An emotion type is assigned to each window-shift according to the recognition results of all the sliding windows containing that window-shift. This basic procedure is then extended to multi-timescaled sliding windows, in which different features are utilized for different scales. Finally, a post-processing step is employed to refine the final outputs. In this work, we focus on the anger-neutral and happiness-neutral cases, which dominate recent studies of AEVD. Performance evaluation is carried out on two databases: the German database EMO-DB and the Chinese database TH1309-DB. Experimental results show that the proposed method significantly outperforms an HMM-based baseline.
I. INTRODUCTION
Speech is among the most effective means of human communication, and understanding the emotion conveyed in speech is important for human interaction. Automatic detection of user emotions has therefore become an attractive task in applications of human-computer interaction (HCI): (a) in customer service, a call center can switch to a human operator when users are detected to be dissatisfied; (b) in education, a spoken tutoring system can adjust its teaching content according to the detected emotion; (c) in medical and emergency applications, emotion detection systems are used to detect stress, pain, fear or panic [1], [2], [3].
A large number of works on speech emotion have been published, focusing on various tasks. Some studies attempt to find the acoustic features most relevant to emotion in speech [4]. Some focus on comparing different timescales for feature extraction [5]. Paralinguistic information has become a hotspot in this area [6]. Another module of interest is the classifier: in addition to common classifiers such as the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM), recent studies also propose Multiple Classifier Systems (MCS) [7] and other models prominent in recent affective computing research, e.g., the Deep Neural Network (DNN) [8] and Long Short-Term Memory (LSTM) [9].
Human emotion changes constantly in everyday communication. Most current studies focus on finding a single emotion state to represent the input speech, which cannot describe such variation. As the study of speech emotion deepens, some researchers attempt to find emotionally salient segments in continuous speech, a task we name Automatic Emotion Variation Detection (AEVD). This task aims to find the information of each emotionally salient segment, e.g., emotion state type, position and duration. Ref. [10] proposed a method based on Hidden Markov Models (HMM) that trains a specific model for each emotion and assigns the most likely emotion to the test utterance. In [11], the positions of emotion change points are detected by modeling the shapes of the F0 and energy contours. Busso et al. [12] focus on tackling inter-speaker variability. The intersections of changing emotion score curves are used to detect emotion changes in [13]. In [14], the emotion change boundaries are detected first to segment the continuous speech; the emotion of each segment is then predicted independently.
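To make the per-emotion HMM scheme of [10] concrete, the following minimal sketch trains one HMM per emotion and labels a test utterance with the most likely model. It is an illustration only, not the implementation of [10]: the hmmlearn library, the state count and the frame-level feature representation are all our own assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_emotion_hmms(train_data, n_states=5):
    """train_data: dict mapping an emotion label to a list of feature
    sequences, each of shape (n_frames, n_features). The state count
    and covariance type are illustrative choices, not those of [10]."""
    models = {}
    for emotion, sequences in train_data.items():
        X = np.vstack(sequences)                   # concatenate all frames
        lengths = [len(seq) for seq in sequences]  # per-sequence lengths
        model = GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)                      # Baum-Welch training
        models[emotion] = model
    return models

def classify_utterance(models, features):
    """Assign the emotion whose HMM yields the highest log-likelihood."""
    return max(models, key=lambda e: models[e].score(features))
```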
Similar tasks are also studied in Music Information Retrieval (MIR), where the problem is named Music Emotion Variation Detection (MEVD). In [15], a sliding window of 10 seconds with 1/3 overlap is used to segment a music piece, and a prediction is made for each segment. System identification techniques are utilized in [16] to model music emotion as a function of a number of musical features.
Many studies have shown that emotion, as a kind of intrinsic expression, should be analyzed over a sufficient length of time; it is still a challenge to directly recognize the exact emotion of a short segment. Meanwhile, the detection task requires precise timing. For these reasons, we propose a sliding window based AEVD method that predicts the emotion of each window-shift interval. The window-shift interval, named a shift in our work, is the minimum time granularity. An emotion recognition module is first applied at the window level; the emotion type of a shift is then assigned by mapping the recognition results of all windows containing that shift (see the sketch below). Various mapping strategies, features and window lengths are examined experimentally. We then fuse the results obtained from multiple window lengths and multiple features to form the multi-timescaled sliding window based system. Finally, we evaluate the performance on multiple databases, including an acted database and a simulated database.
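As a concrete illustration of the shift-level mapping: with a window length of 2 s and a shift of 0.5 s, each interior shift is covered by four overlapping windows, so its label can be obtained, for instance, by majority voting over the four window-level predictions. The sketch below outlines this single-timescale procedure; the window-level classifier is a placeholder, and majority voting stands in for just one of the mapping strategies compared later.

```python
from collections import Counter

def sliding_window_aevd(utt_len, win_len, shift, classify_window):
    """Label every shift-length interval of an utterance.

    utt_len, win_len and shift are in seconds; classify_window(start, end)
    is a placeholder for the window-level emotion recognizer."""
    # Step 1: run classic emotion recognition on every fixed-length window.
    windows = []
    start = 0.0
    while start + win_len <= utt_len:
        windows.append((start, start + win_len,
                        classify_window(start, start + win_len)))
        start += shift

    # Step 2: map window-level results onto shifts. Each shift is labeled
    # by majority vote over all windows containing it (one possible
    # mapping strategy among several).
    labels = []
    t = 0.0
    while t + shift <= utt_len:
        votes = [lab for (s, e, lab) in windows if s <= t and t + shift <= e]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else None)
        t += shift
    return labels
```

For example, sliding_window_aevd(10.0, 2.0, 0.5, clf) yields 20 shift labels for a 10 s utterance. The multi-timescaled extension would run this procedure for several window lengths, with scale-specific features, and fuse the resulting shift-level label sequences.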
This paper is organized as follows. Section 2 describes