Fast and Accurate Recurrent Neural Network Acoustic Models for Speech
Recognition
Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays
Google
{hasim,andrewsenior,kanishkarao,fsb}@google.com
Abstract
We have recently shown that deep Long Short-Term Memory
(LSTM) recurrent neural networks (RNNs) outperform feed-forward
deep neural networks (DNNs) as acoustic models for
speech recognition. More recently, we have shown that the
performance of sequence trained context dependent (CD) hid-
den Markov model (HMM) acoustic models using such LSTM
RNNs can be equaled by sequence trained phone models initial-
ized with connectionist temporal classification (CTC). In this
paper, we present techniques that further improve performance
of LSTM RNN acoustic models for large vocabulary speech
recognition. We show that frame stacking and reduced frame
rate lead to more accurate models and faster decoding. CD
phone modeling leads to further improvements. We also present
initial results for LSTM RNN models outputting words directly.
Index Terms: speech recognition, acoustic modeling, connec-
tionist temporal classification, CTC, long short-term memory
recurrent neural networks, LSTM RNN.
1. Introduction
While speech recognition systems using recurrent and feed-
forward neural networks have been around for more than two
decades [1, 2], it is only recently that they have displaced Gaus-
sian mixture models (GMMs) as the state-of-the-art acoustic
model. More recently, it has been shown that recurrent neural
networks can outperform feed-forward networks on large-scale
speech recognition tasks [3, 4].
Conventional speech systems use cross-entropy training
with HMM CD state targets followed by sequence training.
CTC models use a “blank” symbol between phonetic labels and
are trained with an alternative loss to conventional cross-entropy train-
ing. We recently showed that RNNs for LVCSR trained with
CTC can be improved with the sMBR sequence training crite-
rion and approach state-of-the-art performance [5]. In this paper we
further investigate the use of sMBR-trained CTC models for acoustic
modeling in speech recognition and show that with appropriate features and
the introduction of context dependent phone models they out-
perform the conventional LSTM RNN models by 8% relative
in recognition accuracy. The next section describes the LSTM
RNNs and summarizes the CTC method and sequence training.
We then describe acoustic frame stacking as well as context de-
pendent phone and whole-word modeling. The following sec-
tion describes our experiments and presents results which are
summarized in the conclusions.
2. RNN Acoustic Modeling Techniques
In this work we focus on the LSTM RNN architecture, which
has shown good performance in our previous research, outperforming
feed-forward deep neural networks.
Figure 1: Layer connections in unidirectional (top) and bidirectional
(bottom) 5-layer LSTM RNNs. The unidirectional network has a
640-dimensional input, five forward LSTM layers of 500 cells each, and
an output layer of 42 or 9248 units; the bidirectional network has a
240-dimensional input, five stacked pairs of forward and backward LSTM
layers of 300 cells each, and an output layer of 42 or 9248 units.
RNNs model the input sequence either unidirectionally or
bidirectionally [6]. Unidirectional RNNs (Figure 1 top) estimate the
label posteriors $y^l_t = p(l_t \mid x_t, \overrightarrow{h}_t)$ using only the left context of
the current input $x_t$, processing the input from left to right with a
hidden state $\overrightarrow{h}_t$ in the forward direction.
This is desirable for applications requiring low latency between
inputs and corresponding outputs. Usually output targets are de-
layed with respect to features, giving access to a small amount
of right/future context, improving classification accuracy with-
out incurring much latency.
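To make the output-delay idea concrete, the following is a minimal Python
sketch (not code from the paper) that pairs each feature frame with the
target label from a few frames earlier; the helper name delay_targets and
the 5-frame delay are illustrative assumptions, not values reported here.

import numpy as np

def delay_targets(features, targets, delay=5):
    # Pair feature frame t with the label of frame t - delay, so the model
    # has processed `delay` frames of future context when it emits each label.
    # The delay value is illustrative only.
    return features[delay:], targets[:len(targets) - delay]

feats = np.random.randn(100, 240)        # 100 frames of 240-dim features
labels = np.random.randint(0, 42, 100)   # one phone label per frame
f, t = delay_targets(feats, labels)
assert len(f) == len(t) == 95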
If one can afford the latency of seeing the entire sequence,
bidirectional RNNs (Figure 1 bottom) estimate the label posteriors
$p(l_t \mid x_t, \overrightarrow{h}_t, \overleftarrow{h}_t)$ using separate layers for processing the
input in the forward and backward directions. We use deep
LSTM RNN architectures built by stacking multiple LSTM lay-
ers. These have been shown to perform better than shallow
models for speech recognition [7, 8, 9, 3]. For bidirectional
models, we use two LSTM layers at each depth — one operat-
ing in the forward and another operating in the backward direc-
tion over the input sequence. Both of these layers are connected
to both the previous forward and backward layers. The output
layer is also connected to both of the final forward and back-
ward layers. We experiment with different acoustic units for
the output layer, including context dependent HMM states and
phones, both context independent and context dependent (Sec-
tion 2.4). We train the models in a distributed manner using the
asynchronous stochastic gradient descent (ASGD) optimization
technique, which allows parallelization of training over a large
number of machines in a cluster and enables large-scale training of
neural networks [10, 11, 12, 13, 3]. The weights in all the networks
are randomly initialized from a uniform (-0.04, 0.04) distribution.
We clip the activations of memory cells to [-50, 50] and their
gradients to [-1, 1], which makes CTC training stable.
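As a concrete illustration of the architecture and stabilization settings
described above, the following is a minimal PyTorch sketch of a 5-layer
bidirectional LSTM acoustic model with uniform (-0.04, 0.04) weight
initialization and gradient clipping to [-1, 1]. It is illustrative only,
not the distributed ASGD implementation used for the experiments, and the
class and variable names are assumed; clipping the memory-cell activations
to [-50, 50] would additionally require a custom LSTM cell, which the
built-in nn.LSTM does not expose.

import torch
import torch.nn as nn

class DeepBLSTMAcousticModel(nn.Module):
    # Sketch of a deep bidirectional LSTM acoustic model; sizes follow Figure 1.
    def __init__(self, input_dim=240, hidden=300, layers=5, num_outputs=42):
        super().__init__()
        # Five stacked layers, each with a forward and a backward LSTM of 300 cells.
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        # The output layer is connected to both final forward and backward layers.
        self.output = nn.Linear(2 * hidden, num_outputs)
        # Uniform (-0.04, 0.04) random initialization of all weights.
        for p in self.parameters():
            nn.init.uniform_(p, -0.04, 0.04)

    def forward(self, x):
        # x: (batch, time, input_dim); returns per-frame label logits.
        h, _ = self.lstm(x)
        return self.output(h)

model = DeepBLSTMAcousticModel()
logits = model(torch.randn(8, 100, 240))   # (batch=8, time=100, labels=42)

# After the backward pass, clip gradients to [-1, 1] before the update step.
loss = logits.sum()                         # placeholder loss for illustration
loss.backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)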
2.1. CTC Training
The CTC approach [14] is a technique for sequence labeling
using RNNs where the alignment between the inputs and tar-