EESEN: END-TO-END SPEECH RECOGNITION USING DEEP RNN MODELS AND
WFST-BASED DECODING
Yajie Miao, Mohammad Gowayyed, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
ABSTRACT
The performance of automatic speech recognition (ASR) has
improved tremendously due to the application of deep neu-
ral networks (DNNs). Despite this progress, building a new
ASR system remains a challenging task, requiring various
resources, multiple training stages and significant expertise.
This paper presents our Eesen framework which drastically
simplifies the existing pipeline to build state-of-the-art ASR
systems. Acoustic modeling in Eesen involves learning a
single recurrent neural network (RNN) predicting context-
independent targets (phonemes or characters). To remove the
need for pre-generated frame labels, we adopt the connection-
ist temporal classification (CTC) objective function to infer
the alignments between speech and label sequences. A dis-
tinctive feature of Eesen is a generalized decoding approach
based on weighted finite-state transducers (WFSTs), which
enables the efficient incorporation of lexicons and language
models into CTC decoding. Experiments show that com-
pared with the standard hybrid DNN systems, Eesen achieves
comparable word error rates (WERs), while at the same time
speeding up decoding significantly.
Index Terms— Recurrent neural network, connectionist
temporal classification, end-to-end ASR
1. INTRODUCTION
Automatic speech recognition (ASR) has traditionally lever-
aged the hidden Markov model/Gaussian mixture model
(HMM/GMM) paradigm for acoustic modeling. HMMs act
to normalize the temporal variability, whereas GMMs com-
pute the emission probabilities of HMM states. In recent
years, the performance of ASR has been improved dramat-
ically by the introduction of deep neural networks (DNNs)
as acoustic models [1, 2, 3]. In the hybrid HMM/DNN
approach, DNNs are used to classify speech frames into clus-
tered context-dependent (CD) states (i.e., senones). On a
variety of ASR tasks, DNN models have shown significant
gains over the GMM models. Despite these advances, build-
ing a state-of-the-art ASR system remains a complicated,
expertise-intensive task. First, acoustic modeling typically
requires various resources such as dictionaries and phonetic
questions. Under certain conditions (e.g., in low-resource lan-
guages), these resources may be unavailable, which restricts
or delays the deployment of ASR. Second, in the hybrid
approach, training of DNNs still relies on GMM models to
obtain (initial) frame-level labels. Building GMM models
normally goes through multiple stages (e.g., CI phone, CD
states, etc.), and every stage involves different feature pro-
cessing techniques (e.g., LDA, fMLLR, etc.). Third, the
development of ASR systems highly relies on ASR experts
to determine the optimal configurations of a multitude of
hyper-parameters, for instance, the number of senones and
Gaussians in the GMM models.
Previous work has made various attempts to reduce the
complexity of ASR. In [4, 5], researchers propose to flat-start
DNNs and thus get rid of GMM models. However, this
GMM-free approach still requires iterative procedures such
as generating forced alignments and decision trees. Mean-
while, another line of work [6, 7, 8, 9, 10] has focused on
end-to-end ASR, i.e., modeling the mapping between speech
and labels (words, phonemes, etc.) directly without any in-
termediate components (e.g., GMMs). In this vein, Graves
et al. [11] introduce the connectionist temporal classification
(CTC) objective function to infer speech-label alignments au-
tomatically. This CTC technique is further investigated in
[6, 7, 8, 12] on large-scale acoustic modeling tasks. Although
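To make the alignment inference concrete, the following minimal Python sketch (our own illustration, not Eesen code; all names are ours) computes the CTC log-likelihood of a label sequence via the standard forward algorithm over a blank-interleaved label sequence:

```python
import math

def ctc_forward(log_probs, labels, blank=0):
    """Return log P(labels | inputs) under CTC.

    log_probs: per-frame lists of log-probabilities over symbols.
    labels: target label sequence without blanks.
    """
    # Interleave blanks: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = len(log_probs), len(ext)
    NEG_INF = float("-inf")

    def logadd(a, b):  # log(exp(a) + exp(b)), numerically stable
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # Initialization: a path may start with a blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                     # stay on the same state
            if s > 0:
                a = logadd(a, alpha[s - 1])  # advance one state
            # Skip a blank only between two distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the final label or the trailing blank.
    return logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For example, with two frames, one non-blank label and uniform per-frame probabilities of 0.5, the three alignments (1,1), (blank,1) and (1,blank) each contribute 0.25, so the total probability is 0.75.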
showing promising results, research on end-to-end ASR faces
two major obstacles. First, it is challenging to incorporate
lexicons and language models into decoding. When decod-
ing CTC-trained models, past work [6, 8, 10] has success-
fully constrained search paths with lexicons. However, effi-
ciently integrating word-level language models remains an
open question [10]. Second, the community lacks a
shared experimental platform for the purpose of benchmark-
ing. End-to-end systems described in the literature differ not
only in their model architectures but also in their decoding
methods. For example, [6] and [8] adopt two distinct ver-
sions of beam search for decoding CTC models. These setup
variations hamper rigorous comparisons not only across end-
to-end systems, but also between the end-to-end and existing
hybrid approaches.
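For reference, the simplest decoding strategy for a CTC-trained model, best-path decoding without any lexicon or language model, picks the most probable symbol at each frame and then collapses repeated symbols and removes blanks. A toy sketch of the collapse step (our own illustration, not Eesen's WFST-based decoder):

```python
def ctc_collapse(frame_ids, blank=0):
    """Collapse a frame-level CTC path into an output label sequence:
    drop repeated symbols, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

Note that a blank between two identical labels separates them, e.g. the path (1, 1, blank, 1) collapses to (1, 1). It is exactly this many-to-one path-to-label mapping that the WFST-based decoding discussed below must represent alongside the lexicon and language model.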
In this paper, we resolve these issues by presenting and
publicly releasing our Eesen framework. Acoustic model-
ing in Eesen is viewed as a sequence-to-sequence learning
problem. We exploit deep recurrent neural networks (RNNs)
arXiv:1507.08240v1 [cs.CL] 29 Jul 2015