End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum

Danwei Cai 1,2, Zhidong Ni 1,2, Wenbo Liu 1, Weicheng Cai 1, Gang Li 3, Ming Li 1,2

1 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
2 SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
3 Jiangsu Jinling Science and Technology Group Limited, Jiangsu, China

liming46@mail.sysu.edu.cn
Abstract
In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using perception aware spectrum as input. Existing studies show that speech under cold has distinct variations of energy distribution in low frequency components compared with speech under ‘healthy’ condition. This motivates us to use the perception aware spectrum as the input to an end-to-end learning framework trained on a small scale dataset. In this work, we try both the Constant Q Transform (CQT) spectrum and the Gammatone spectrum in different end-to-end deep learning networks; both spectra closely mimic human speech perception and transform the signal into 2D images. Experimental results show the effectiveness of the proposed perception aware spectrum with the end-to-end deep learning approach on the Interspeech 2017 Computational Paralinguistics Cold sub-Challenge. The final fusion result of our proposed method is 8% better than that of the provided baseline in terms of UAR.
Index Terms: computational paralinguistics, speech under
cold, deep learning, perception aware spectrum
1. Introduction
Speech paralinguistics studies the non-verbal signals of speech, including accent, emotion, modulation, fluency and other perceptible speech phenomena beyond the pure transcriptional content of spoken speech [1]. With the advent of computational paralinguistics, such phenomena can be analysed by machine learning methods. The Interspeech Computational Paralinguistics Challenge (ComParE) has been an open challenge in the field of computational paralinguistics since 2009. The Interspeech 2017 ComParE Challenge addressed three new problems within the field: the Addressee sub-challenge, the Cold sub-challenge and the Snoring sub-challenge [2].
In this paper, we propose an efficient deep learning architecture for the Cold sub-challenge of the Interspeech 2017 Computational Paralinguistics Challenge [2]. The task aims to differentiate cold-affected speech from ‘normal’ speech.
The baseline of the challenge includes three independent systems. The first two systems use a traditional classification method (i.e., SVM) with the ComParE feature representation [3] and the bag-of-audio-words (BoAW) feature representation [4], and achieve unweighted average recall (UAR) of 64.0% and 64.2%, respectively. The third system employs end-to-end learning but only achieves a UAR of 59.1%. Similar to [5], this system uses a convolutional network to extract features from the raw audio, and a subsequent recurrent network (i.e., LSTM) performs the final classification [2].

* This research was funded in part by the National Natural Science Foundation of China (61401524), Natural Science Foundation of Guangdong Province (2014A030313123), Natural Science Foundation of Guangzhou City (201707010363), the Fundamental Research Funds for the Central Universities (15lgjc12), the National Key Research and Development Program (2016YFC0103905) and an IBM Faculty Award.
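For illustration, a minimal PyTorch sketch of the kind of convolutional-recurrent architecture used in that end-to-end baseline (a 1D CNN front-end on the raw waveform followed by an LSTM classifier) is given below; the layer sizes, kernel widths and two-class output are illustrative assumptions, not the baseline's actual configuration.

```python
import torch
import torch.nn as nn

class RawAudioCRNN(nn.Module):
    """Toy convolutional-recurrent classifier on raw waveforms.

    The 1D convolutions act as a learned front-end over the waveform,
    and an LSTM over the resulting frame sequence feeds a two-class
    output (cold vs. healthy). All hyper-parameters are illustrative.
    """
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, 1, n_samples) raw audio
        feats = self.frontend(wave)       # (batch, 128, n_frames)
        feats = feats.transpose(1, 2)     # (batch, n_frames, 128)
        _, (h_n, _) = self.lstm(feats)    # last hidden state summarises the utterance
        return self.classifier(h_n[-1])   # (batch, n_classes) logits

# Example: a batch of four 1-second waveforms at 16 kHz
model = RawAudioCRNN()
logits = model(torch.randn(4, 1, 16000))
```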
During the past few years, deep learning has made significant progress. Deep learning methods outperform traditional machine learning methods in a variety of speech applications such as speech recognition [6], language recognition [7], text-dependent speaker verification [8], emotion recognition [5] and anti-spoofing tasks. This motivates us to apply deep learning methods to computational paralinguistic tasks.
However, the end-to-end baseline system provided in [2] did not achieve a better UAR than the other two baseline systems. One possible reason is that a small scale dataset may not be able to drive the deep neural network to learn a good, robust feature for classification directly from the waveform. We thus look into frequency representations (i.e., spectrograms) to perform the end-to-end learning. Spectrograms are a widely used audio signal feature representation in deep learning and contain a wealth of acoustic information.
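As a simple illustration of this idea, the sketch below converts a waveform into a log-magnitude spectrogram, i.e. a 2D time-frequency image that a convolutional network can consume; the file path, sampling rate and frame settings are placeholder assumptions, and the librosa library is assumed to be available.

```python
import numpy as np
import librosa

# Load an utterance (path and 16 kHz sampling rate are placeholder assumptions).
wave, sr = librosa.load("utterance.wav", sr=16000)

# Short-time Fourier transform: 25 ms windows, 10 ms hop (illustrative values).
stft = librosa.stft(wave, n_fft=400, hop_length=160)

# Log-magnitude spectrogram: a (frequency x time) "image" for a 2D CNN.
log_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(log_spec.shape)  # (n_fft // 2 + 1, n_frames)
```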
Existing studies show that, compared with speech in ‘healthy’ condition, speech under cold has larger amplitude in low frequency components and lower amplitude in high frequency components [9]. Also, from the viewpoint of the human auditory perceptual system, human ears are more sensitive to small changes at low frequencies [10]. This motivates us to use perception aware spectrograms (i.e., Gammatone spectrograms and Constant Q Transform spectrograms) as the input to an end-to-end deep learning framework when performing computational paralinguistics tasks. The Constant Q Transform employs geometrically spaced frequency bins and ensures a constant Q factor across the entire spectrum. This results in a finer frequency resolution at low frequencies and a higher temporal resolution at high frequencies [11]. The Gammatone spectrum employs Gammatone filters, which are conceived as a simple fit to experimental observations of the mammalian cochlea and have a repeated pole structure leading to an impulse response that is
the product of a gamma envelope g(t) = t^n e^{-t} and a sinusoid (tone) [12, 13].
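As a rough sketch of how such perception aware spectra could be computed, the snippet below derives a CQT spectrogram with librosa and a gammatone-style spectrogram from a hand-built FIR gammatone filterbank; the filter order, ERB-based bandwidths, logarithmically spaced centre frequencies and frame settings are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np
import librosa
from scipy.signal import fftconvolve

wave, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path

# --- Constant Q Transform: geometrically spaced bins, constant Q factor ---
cqt = np.abs(librosa.cqt(wave, sr=sr, hop_length=512,
                         n_bins=84, bins_per_octave=12))
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)

# --- Gammatone spectrogram from an FIR gammatone filterbank ---
def gammatone_fir(fc, sr, order=4, duration=0.05):
    """Gamma envelope times a tone: one common 4th-order parameterisation
    with an ERB bandwidth at centre frequency fc (an assumption here)."""
    t = np.arange(int(duration * sr)) / sr
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sum(np.abs(g))  # crude gain normalisation

n_bands, hop, win = 64, 160, 400                            # illustrative settings
centre_freqs = np.geomspace(50.0, 0.9 * sr / 2, n_bands)    # assumed log spacing
frames = range(0, len(wave) - win, hop)
gt_spec = np.empty((n_bands, len(frames)))
for i, fc in enumerate(centre_freqs):
    band = fftconvolve(wave, gammatone_fir(fc, sr), mode="same")
    gt_spec[i] = [np.log(np.mean(band[s:s + win] ** 2) + 1e-10) for s in frames]

print(cqt_db.shape, gt_spec.shape)  # (frequency bins, time frames) images
```

Both arrays are (frequency x time) images with finer resolution at low frequencies than a linear-frequency STFT, which is the property the cold detection task is expected to benefit from.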
To the best of our knowledge, deep learning frameworks with CQT spectrogram input have been successfully applied to piano music transcription [14], audio scene classification and domestic audio tagging [15]. However, the performance of deep learning frameworks with Gammatone spectrogram input still