End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum

Danwei Cai 1,2, Zhidong Ni 1,2, Wenbo Liu 1, Weicheng Cai 1, Gang Li 3, Ming Li 1,2

1 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
2 SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
3 Jiangsu Jinling Science and Technology Group Limited, Jiangsu, China

liming46@mail.sysu.edu.cn
Abstract
In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using perception aware spectrum as input. Existing studies show that speech under cold has distinct variations of energy distribution in low frequency components compared with speech under ‘healthy’ condition. This motivates us to use the perception aware spectrum as the input to an end-to-end learning framework trained on a small scale dataset. In this work, we try both the Constant Q Transform (CQT) spectrum and the Gammatone spectrum in different end-to-end deep learning networks; both spectra closely mimic human speech perception and transform the signal into 2D images. Experimental results show the effectiveness of the proposed perception aware spectrum with the end-to-end deep learning approach on the Interspeech 2017 Computational Paralinguistics Cold sub-Challenge. The final fusion result of our proposed method is 8% better than that of the provided baseline in terms of UAR.
Index Terms: computational paralinguistics, speech under
cold, deep learning, perception aware spectrum
1. Introduction
Speech paralinguistics studies the non-verbal signals of speech, including accent, emotion, modulation, fluency and other perceptible speech phenomena beyond the pure transcriptional content of spoken speech [1]. With the advent of computational paralinguistics, such phenomena can be analysed by machine learning methods. The Interspeech Computational Paralinguistics Challenge (ComParE) has been an open challenge in the field of computational paralinguistics since 2009. The Interspeech 2017 ComParE Challenge addressed three new problems within the field: the Addressee sub-challenge, the Cold sub-challenge and the Snoring sub-challenge [2].
In this paper, we propose an efficient deep learning architecture for the Cold sub-challenge of the Interspeech 2017 Computational Paralinguistics Challenge [2]. The task aims to differentiate cold-affected speech from ‘normal’ speech.
The baseline of the challenge includes three independent systems. The first two systems use a traditional classification method (i.e., SVM) with the ComParE feature representation [3] and the bag-of-audio-words (BoAW) feature representation [4], and achieve unweighted average recall (UAR) of 64.0% and 64.2%, respectively. The third system employs end-to-end learning but only achieves a UAR of 59.1%. Similar to [5], this system uses a convolutional network to extract features from the raw audio, and a subsequent recurrent network (i.e., LSTM) performs the final classification [2].

* This research was funded in part by the National Natural Science Foundation of China (61401524), Natural Science Foundation of Guangdong Province (2014A030313123), Natural Science Foundation of Guangzhou City (201707010363), the Fundamental Research Funds for the Central Universities (15lgjc12), the National Key Research and Development Program (2016YFC0103905) and an IBM Faculty Award.
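For illustration, a minimal PyTorch sketch of the kind of convolutional-recurrent architecture used in that end-to-end baseline (a 1D CNN front-end on the raw waveform followed by an LSTM classifier) is given below; the layer sizes, kernel widths and two-class output are illustrative assumptions, not the baseline's actual configuration.

```python
import torch
import torch.nn as nn

class RawAudioCRNN(nn.Module):
    """Toy convolutional-recurrent classifier on raw waveforms.

    The 1D convolutions act as a learned front-end over the waveform,
    and an LSTM over the resulting frame sequence feeds a two-class
    output (cold vs. healthy). All hyper-parameters are illustrative.
    """
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, 1, n_samples) raw audio
        feats = self.frontend(wave)       # (batch, 128, n_frames)
        feats = feats.transpose(1, 2)     # (batch, n_frames, 128)
        _, (h_n, _) = self.lstm(feats)    # last hidden state summarises the utterance
        return self.classifier(h_n[-1])   # (batch, n_classes) logits

# Example: a batch of four 1-second waveforms at 16 kHz
model = RawAudioCRNN()
logits = model(torch.randn(4, 1, 16000))
```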
During the past few years, deep learning has made significant progress. Deep learning methods outperform traditional machine learning methods in a variety of speech applications such as speech recognition [6], language recognition [7], text-dependent speaker verification [8], emotion recognition [5] and anti-spoofing tasks. This motivates us to apply deep learning methods to computational paralinguistic tasks.
However, the end-to-end baseline system provided in [2] did not achieve a better UAR than the other two baseline systems. One possible reason is that a small scale dataset may not be able to drive the deep neural network to learn a good, robust feature for classification directly from the waveform. We thus look into frequency representations (i.e., spectrograms) to perform the end-to-end learning. Spectrograms are a widely used audio signal feature representation in deep learning and contain a wealth of acoustic information.
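As a simple illustration of this idea, the sketch below converts a waveform into a log-magnitude spectrogram, i.e. a 2D time-frequency image that a convolutional network can consume; the file path, sampling rate and frame settings are placeholder assumptions, and the librosa library is assumed to be available.

```python
import numpy as np
import librosa

# Load an utterance (path and 16 kHz sampling rate are placeholder assumptions).
wave, sr = librosa.load("utterance.wav", sr=16000)

# Short-time Fourier transform: 25 ms windows, 10 ms hop (illustrative values).
stft = librosa.stft(wave, n_fft=400, hop_length=160)

# Log-magnitude spectrogram: a (frequency x time) "image" for a 2D CNN.
log_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(log_spec.shape)  # (n_fft // 2 + 1, n_frames)
```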
Existing studies show that, compared with speech in ‘healthy’ condition, speech under cold has larger amplitude in low frequency components and lower amplitude in high frequency components [9]. Also, from the viewpoint of the human auditory perceptual system, human ears are more sensitive to small changes at low frequencies [10]. This motivates us to use perception aware spectrograms (i.e., Gammatone spectrograms and Constant Q Transform spectrograms) as the input to an end-to-end deep learning framework when performing computational paralinguistics tasks. The Constant Q Transform employs geometrically spaced frequency bins and ensures a constant Q factor across the entire spectrum. This results in a finer frequency resolution at low frequencies and a higher temporal resolution at high frequencies [11]. The Gammatone spectrum employs Gammatone filters, which are conceived as a simple fit to experimental observations of the mammalian cochlea and have a repeated pole structure leading to an impulse response that is
the product of a gamma envelope g(t) = t^n e^{-t} and a sinusoid (tone) [12, 13].
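As a rough sketch of how such perception aware spectra could be computed, the snippet below derives a CQT spectrogram with librosa and a gammatone-style spectrogram from a hand-built FIR gammatone filterbank; the filter order, ERB-based bandwidths, logarithmically spaced centre frequencies and frame settings are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np
import librosa
from scipy.signal import fftconvolve

wave, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path

# --- Constant Q Transform: geometrically spaced bins, constant Q factor ---
cqt = np.abs(librosa.cqt(wave, sr=sr, hop_length=512,
                         n_bins=84, bins_per_octave=12))
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)

# --- Gammatone spectrogram from an FIR gammatone filterbank ---
def gammatone_fir(fc, sr, order=4, duration=0.05):
    """Gamma envelope times a tone: one common 4th-order parameterisation
    with an ERB bandwidth at centre frequency fc (an assumption here)."""
    t = np.arange(int(duration * sr)) / sr
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sum(np.abs(g))  # crude gain normalisation

n_bands, hop, win = 64, 160, 400                            # illustrative settings
centre_freqs = np.geomspace(50.0, 0.9 * sr / 2, n_bands)    # assumed log spacing
frames = range(0, len(wave) - win, hop)
gt_spec = np.empty((n_bands, len(frames)))
for i, fc in enumerate(centre_freqs):
    band = fftconvolve(wave, gammatone_fir(fc, sr), mode="same")
    gt_spec[i] = [np.log(np.mean(band[s:s + win] ** 2) + 1e-10) for s in frames]

print(cqt_db.shape, gt_spec.shape)  # (frequency bins, time frames) images
```

Both arrays are (frequency x time) images with finer resolution at low frequencies than a linear-frequency STFT, which is the property the cold detection task is expected to benefit from.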
To the best of our knowledge, deep learning frameworks with CQT spectrogram input have been successfully applied to piano music transcription [14], audio scene classification and domestic audio tagging [15]. However, the performance of deep learning frameworks with Gammatone spectrogram input still