from. A directed model generates data by first choosing the
states of the latent variables from a prior distribution and then
choosing the states of the observable variables from their condi-
tional distributions given the latent states. Examples of directed
models with one layer of latent variables are factor analysis, in
which the latent variables are drawn from an isotropic
Gaussian, and GMMs, in which they are drawn from a discrete
distribution. An undirected model has a very different way of
generating data. Instead of using one set of parameters to define
a prior distribution over the latent variables and a separate set
of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E(v, h; W),
$$p(\mathbf{v}, \mathbf{h}; \mathbf{W}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h}; \mathbf{W})}, \qquad Z = \sum_{\mathbf{v}', \mathbf{h}'} e^{-E(\mathbf{v}', \mathbf{h}'; \mathbf{W})}, \qquad (5)$$
where Z is called the partition function.
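To make (5) concrete, here is a small sketch that enumerates every joint configuration of a handful of binary variables under a hypothetical bilinear energy function and normalizes by the partition function; the energy function, the variable counts, and the random parameters are assumptions made purely for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                              # tiny sizes so Z can be enumerated exactly
W = 0.1 * rng.standard_normal((n_v, n_h))    # the single parameter set shared by the joint

def energy(v, h):
    # hypothetical bilinear energy; any E(v, h; W) could play this role in (5)
    return -(v @ W @ h)

configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=n_v)
           for h in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)   # partition function

def p_joint(v, h):
    # equation (5): p(v, h; W) = exp(-E(v, h; W)) / Z
    return np.exp(-energy(v, h)) / Z

print(sum(p_joint(v, h) for v, h in configs))   # the joint probabilities sum to 1
```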
If many different latent variables interact nonlinearly to
generate each data vector, it is difficult to infer the states of
the latent variables from the observed data in a directed
model because of a phenomenon known as “explaining away”
[19]. In undirected models, however, inference is easy pro-
vided the latent variables do not have edges linking them.
Such a restricted class of undirected models is ideal for lay-
erwise pretraining because each layer will have an easy infer-
ence procedure.
We start by describing an approximate learning algorithm
for a restricted Boltzmann machine (RBM) which consists of a
layer of stochastic binary “visible” units that represent binary
input data connected to a layer of stochastic binary hidden units
that learn to model significant nonindependencies between the
visible units [20]. There are undirected connections between
visible and hidden units but no visible-visible or hidden-hidden
connections. An RBM is a type of Markov random field (MRF)
but differs from most MRFs in several ways: it has a bipartite
connectivity graph, it does not usually share weights between
different units, and a subset of the variables are unobserved,
even during training.
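As a sketch of this bipartite parameterization (the layer sizes and the small random initialization are illustrative assumptions, not values from the text), an RBM can be represented by a single visible-to-hidden weight matrix plus one bias vector per layer, with no visible-visible or hidden-hidden weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                   # illustrative layer sizes

# bipartite connectivity: every visible unit connects to every hidden unit,
# but there are no visible-visible or hidden-hidden connections
W = 0.01 * rng.standard_normal((n_visible, n_hidden))   # weights w_ij
a = np.zeros(n_visible)                                 # visible biases
b = np.zeros(n_hidden)                                  # hidden biases
```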
AN EFFICIENT LEARNING PROCEDURE FOR RBMs
A joint configuration, (v, h), of the visible and hidden units of an
RBM has an energy given by
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i \;-\; \sum_{j \in \text{hidden}} b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}, \qquad (6)$$
where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between
them. The network assigns a probability to every possible pair of
a visible and a hidden vector via this energy function as in (5)
and the probability that the network assigns to a visible vector,
v, is given by summing over all possible hidden vectors
$$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \qquad (7)$$
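A minimal sketch of (6) and (7), assuming an RBM small enough that the partition function can be computed by exhaustive enumeration (the sizes and parameter values below are illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3                              # small enough to enumerate exactly
W = 0.1 * rng.standard_normal((n_v, n_h))    # weights w_ij
a = np.zeros(n_v)                            # visible biases a_i
b = np.zeros(n_h)                            # hidden biases b_j

def energy(v, h):
    # equation (6): E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij
    return -(a @ v) - (b @ h) - (v @ W @ h)

all_v = [np.array(c) for c in itertools.product([0, 1], repeat=n_v)]
all_h = [np.array(c) for c in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)   # partition function

def p_visible(v):
    # equation (7): marginalize the joint distribution over all hidden vectors
    return sum(np.exp(-energy(v, h)) for h in all_h) / Z

print(sum(p_visible(v) for v in all_v))   # the probabilities of all visible vectors sum to 1
```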
The derivative of the log probability of a training set with
respect to a weight is surprisingly simple
$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(\mathbf{v}^{n})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}, \qquad (8)$$
where N is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right), \qquad (9)$$
where
$\epsilon$ is a learning rate.
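As a minimal illustration of (9), assuming the two expectations in (8) have already been estimated (the stand-in matrices, shapes, and learning rate below are placeholders, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4

# stand-ins for the two expectations in (8); each is an n_v x n_h matrix of <v_i h_j> values
vh_data = rng.random((n_v, n_h))    # would be estimated from the training data
vh_model = rng.random((n_v, n_h))   # would be estimated from the model's own distribution

epsilon = 0.1                              # learning rate (illustrative)
delta_W = epsilon * (vh_data - vh_model)   # equation (9): increment for every weight w_ij
```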
The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of $\langle v_i h_j \rangle_{\text{data}}$. Given a randomly selected training case, v, the binary state, $h_j$, of each hidden unit, $j$, is set to one with probability
$$p(h_j = 1 \mid \mathbf{v}) = \operatorname{logistic}\!\left(b_j + \sum_i v_i w_{ij}\right) \qquad (10)$$
and $\langle v_i h_j \rangle$
is then an unbiased sample. The absence of direct con-
nections between visible units in an RBM makes it very easy to
get an unbiased sample of the state of a visible unit, given a hid-
den vector
$$p(v_i = 1 \mid \mathbf{h}) = \operatorname{logistic}\!\left(a_i + \sum_j h_j w_{ij}\right). \qquad (11)$$
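Equations (10) and (11) translate directly into a sampling sketch; the layer sizes, random initialization, and stand-in training case below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                              # illustrative layer sizes
W = 0.01 * rng.standard_normal((n_v, n_h))   # weights w_ij
a = np.zeros(n_v)                            # visible biases a_i
b = np.zeros(n_h)                            # hidden biases b_j

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    # equation (10): p(h_j = 1 | v) = logistic(b_j + sum_i v_i w_ij)
    p = logistic(b + v @ W)
    return (rng.random(n_h) < p).astype(float)

def sample_visible(h):
    # equation (11): p(v_i = 1 | h) = logistic(a_i + sum_j h_j w_ij)
    p = logistic(a + W @ h)
    return (rng.random(n_v) < p).astype(float)

v = rng.integers(0, 2, size=n_v).astype(float)   # a stand-in training case
h = sample_hidden(v)
vh_data_sample = np.outer(v, h)                  # an unbiased sample of <v_i h_j>_data
```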
Getting an unbiased sample of
$\langle v_i h_j \rangle_{\text{model}}$, however, is
much more difficult. It can be done by starting at any random
state of the visible units and performing alternating Gibbs sam-
pling for a very long time. Alternating Gibbs sampling consists
of updating all of the hidden units in parallel using (10) fol-
lowed by updating all of the visible units in parallel using (11).
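A sketch of such an alternating Gibbs chain, reusing conditionals of the form (10) and (11); the chain length, sizes, and initialization are illustrative assumptions, and in practice the chain would have to run long enough to approach the model's equilibrium distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.01 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

v = rng.integers(0, 2, size=n_v).astype(float)   # start from a random visible state
for _ in range(10000):                           # "a very long time" (illustrative length)
    h = (rng.random(n_h) < logistic(b + v @ W)).astype(float)   # update all hidden units, as in (10)
    v = (rng.random(n_v) < logistic(a + W @ h)).astype(float)   # update all visible units, as in (11)

vh_model_sample = np.outer(v, h)   # one (approximate) sample contributing to <v_i h_j>_model
```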
A much faster learning procedure called contrastive diver-
gence (CD) was proposed in [20]. This starts by setting the states
of the visible units to a training vector. Then the binary states of
the hidden units are all computed in parallel using (10). Once
binary states have been chosen for the hidden units, a “recon-
struction” is produced by setting each
$v_i$ to one with a probability given by (11). Finally, the states of the hidden units are
updated again. The change in a weight is then given by
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right). \qquad (12)$$
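Putting the pieces together, here is a sketch of one CD update of the form (12) on a small batch; the batch contents, layer sizes, and learning rate are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.01 * rng.standard_normal((n_v, n_h))   # weights w_ij
a, b = np.zeros(n_v), np.zeros(n_h)          # visible and hidden biases
epsilon = 0.1                                # learning rate (illustrative)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

V0 = rng.integers(0, 2, size=(20, n_v)).astype(float)  # visible units set to (stand-in) training vectors

H0 = sample(logistic(V0 @ W + b))        # binary hidden states from (10)
V1 = sample(logistic(H0 @ W.T + a))      # "reconstruction" of the visible units from (11)
H1 = sample(logistic(V1 @ W + b))        # hidden states updated again from (10)

# equation (12): data statistics minus reconstruction statistics, averaged over the batch
vh_data = V0.T @ H0 / len(V0)
vh_recon = V1.T @ H1 / len(V1)
W += epsilon * (vh_data - vh_recon)
```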