Speech bandwidth expansion based on Deep Neural Networks
Yingxue Wang^{1,2}, Shenghui Zhao^1, Wenbo Liu^{3,4}, Ming Li^{3,5}, Jingming Kuang^1

^1 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
^2 School of Computer Science, Carnegie Mellon University
^3 SYSU-CMU Joint Inst. of Eng., Sun Yat-Sen University
^4 Department of ECE, Carnegie Mellon University
^5 SYSU-CMU Shunde International Joint Research Institute

yxwang.bit@gmail.com, shzhao@bit.edu.cn
Abstract
This paper proposes a new speech bandwidth expansion method, which uses Deep Neural Networks (DNNs) to build high-order eigenspaces between the low frequency components and the high frequency components of the speech signal. A four-layer DNN is trained layer-by-layer from a cascade of Neural Networks (NNs) and two Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBMs). The GBRBMs are adopted to model the distributions of the spectral envelopes of the low frequency and the high frequency components, respectively. The NNs are used to model the joint distribution of the hidden variables extracted from the two GBRBMs. The proposed method takes advantage of the strong ability of GBRBMs in modeling the distribution of spectral envelopes. Both the objective and subjective test results show that the proposed method outperforms the conventional GMM-based method.
Index Terms: bandwidth extension, deep neural networks, neural networks, Gaussian-Bernoulli Restricted Boltzmann Machine
1. Introduction
Speech bandwidth expansion (BWE) is a technique that attempts to improve speech quality by recovering the missing high frequency components, exploiting the correlation that exists between the low and high frequency parts of the wide-band speech signal. BWE techniques have been applied to various tasks, such as speech recognition [1] and multicast conferencing [2]. Many approaches to speech bandwidth extension have been proposed during the last decades. Generally, these methods can be classified into two categories: rule-based methods and statistical methods. The rule-based methods directly regenerate the high frequency spectrum based on acoustical knowledge of the speech signal, e.g. by simply copying a portion of the narrow-band spectrum onto the desired extension frequency components [3]. On the other hand, the statistical methods employ statistical models to estimate the mapping function between the low frequency and high frequency spectral features [4, 5, 6, 7]. In contrast to rule-based methods, statistical methods can construct more precise mapping functions using statistical models. Therefore, statistical methods, especially the GMM-based BWE methods, are widely used [5].
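For reference, the classic joint-GMM mapping underlying the GMM-based baseline estimates the high-band features as the conditional expectation E[y | x] under a GMM trained on stacked low/high-band vectors. The sketch below is not the paper's implementation; the function name, variable names, and the joint-Gaussian parameterization are our assumptions.

```python
import numpy as np

def gmm_map(x, weights, means, covs, dx):
    """Map a low-band feature vector x to a high-band estimate E[y | x].

    weights: (K,) mixture weights
    means:   (K, dx+dy) joint means over stacked [low; high] features
    covs:    (K, dx+dy, dx+dy) joint covariances
    dx:      dimensionality of the low-band part
    """
    K = len(weights)
    # responsibilities p(k | x) under the marginal low-band Gaussians
    logp = np.empty(K)
    for k in range(K):
        mu_x = means[k, :dx]
        S_xx = covs[k, :dx, :dx]
        diff = x - mu_x
        _, logdet = np.linalg.slogdet(S_xx)
        logp[k] = (np.log(weights[k])
                   - 0.5 * (logdet + diff @ np.linalg.solve(S_xx, diff)
                            + dx * np.log(2.0 * np.pi)))
    post = np.exp(logp - logp.max())
    post /= post.sum()
    # posterior-weighted per-component conditional means of the high band
    y = np.zeros(means.shape[1] - dx)
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        S_xx = covs[k, :dx, :dx]
        S_yx = covs[k, dx:, :dx]
        y += post[k] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y
```

With a single component this reduces to the usual Gaussian conditional mean, i.e. a linear regression from the low band to the high band.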
Motivated by the success of Deep Neural Networks (DNNs) in speech recognition [8], we propose to utilize a DNN to estimate a robust mapping function for speech bandwidth extension. Different from the conventional non-linear or linear transformation approaches, the DNN learns both a linear and a non-linear relationship between the low frequency and high frequency spectral envelopes. Thus, the DNN can learn a more detailed and precise relationship between the low frequency and the high frequency. In our approach, different from conventional feedforward neural networks for regression tasks, which are usually trained using the back-propagation algorithm under the minimum mean square error criterion, a four-layer DNN is trained layer-by-layer from a cascade of Neural Networks (NNs) and two Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBMs). In the training phase, we first train two exclusive GBRBMs for the low frequency and the high frequency components to obtain deep networks that capture abstractions of each band. Then, low frequency feature vectors and high frequency feature vectors are fed into their corresponding GBRBMs, and the high-order features produced by the GBRBMs are used to train a concatenating neural network between the two GBRBMs. In the reconstruction phase, the low frequency signal is converted through the trained NN in the high-order space, and brought back to the cepstrum space using the inverse process of the high frequency GBRBM.
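The reconstruction phase can be sketched roughly as follows, assuming the common GBRBM parameterization in which p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i / sigma_i^2) and E[v_i | h] = a_i + sum_j w_ij h_j. All function and variable names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruct_high(x_low, W_l, b_l, sigma_l, W_nn, b_nn, W_h, a_h):
    """Map low-band cepstral features to a high-band estimate.

    W_l, b_l, sigma_l: weights, hidden biases, and per-unit std of the
                       low-frequency GBRBM
    W_nn, b_nn:        weights/biases of the concatenating NN
    W_h, a_h:          weights and visible biases of the high-freq. GBRBM
    """
    # 1) low-band GBRBM encoder: mean-field hidden activation given v
    h_low = sigmoid(b_l + W_l.T @ (x_low / sigma_l**2))
    # 2) concatenating NN maps the low hidden code to the high hidden code
    h_high = sigmoid(b_nn + W_nn @ h_low)
    # 3) inverse GBRBM pass: conditional mean of the high-band visibles
    x_high = a_h + W_h @ h_high
    return x_high
```

In practice the hidden activations could also be sampled rather than taken as mean-field values; the deterministic pass above is the simplest variant.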
This paper is organized as follows. Section 2 gives an overview of the RBM and the GBRBM, while Section 3 explains our speech bandwidth extension method. We show our setup and experimental results in Section 4, and Section 5 is our conclusion.
2. Preliminaries
Our speech bandwidth extension method uses GBRBMs to capture high-order features. We briefly review the GBRBM and its fundamental model, the Restricted Boltzmann Machine (RBM), in this section.
2.1. RBM
An RBM is a bipartite undirected graphical model. It has a two-layer structure with one visible layer corresponding to a set of visible stochastic variables $v = [v_1, \ldots, v_V]^T$ and one hidden layer corresponding to a set of hidden stochastic variables $h = [h_1, \ldots, h_H]^T$, where $V$ and $H$ denote the number of units in the visible and hidden layers [9]. The joint probability $p(v, h)$ of binary-valued visible units $v$ and binary-valued hidden units $h$ is defined as follows:

$$p(v, h) = \frac{1}{Z} \exp(-E(v, h)) \quad (1)$$

$$E(v, h) = -a^T v - b^T h - v^T W h \quad (2)$$