R-CNN-Based Spectrogram Speech Recognition: Improving Robustness and Efficiency

Updated 2024-08-28 · 515 KB PDF

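Spectrogram-based speech recognition applies deep learning to improve on traditional speech recognition methods. Deep learning has made significant progress in areas such as image classification and natural language processing, but in speech recognition, particularly with time-domain processing, robustness remains a problem. Traditional time-domain speech recognition handles noise poorly, which limits recognition accuracy.

To address this, the paper proposes a target detection method based on the faster region-based convolutional neural network (faster R-CNN). Faster R-CNN is a computer vision model designed to locate and identify objects in images; it performs localization and classification together, which improves recognition efficiency. In this work it is used to analyze the spectrogram along both the time and frequency dimensions, which helps capture key speech features such as the voiceprint.

The researchers observe that local regions of interest in the spectrogram (the clearly visible voiceprint) carry rich speech information, whereas high-frequency noise usually carries no speech features. The proposed algorithm therefore concentrates on these regions and filters out high-frequency noise, improving the performance and robustness of the recognition system. Because it localizes the speech signal more precisely and reduces the influence of background noise, it raises recognition accuracy in noisy environments such as factories.

According to the reported experimental results, the faster R-CNN-based spectrogram recognition method outperforms existing speech recognition techniques: accuracy is higher, and the method remains stable and reliable under a range of noise conditions. This is of practical value for voice interaction systems in applications such as smart homes and intelligent vehicles.

In short, by combining deep learning with target detection, spectrogram-based speech recognition offers an effective strategy for the robustness problem in speech recognition. As the technology matures, combining spectral analysis with deep learning is expected to deliver higher recognition performance.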

Speech Recognition Method based on Spectrogram

Yingying Li 1,*, Siyuan Pi, Nanfeng Xiao

School of Computer Science & Engineering, South China University of Technology, Guangzhou, 510006

*E-mail: Crystaliyy@foxmail.com

Abstract. Deep learning has brought major breakthroughs in the field of artificial intelligence. Currently, speech recognition in the time domain has poor robustness, and the complexity of the spectrograms used for speech recognition in the frequency domain also needs to be greatly reduced. Therefore, this paper presents a faster R-CNN-based target detection method that recognizes the spectrogram for speech recognition in both the time and frequency domains. The presented method focuses only on the local regions of interest (the obvious voiceprint) of the spectrogram, which filters out high-frequency noise and improves performance. The experimental results show that the presented method has higher accuracy and robustness than existing methods and performs well even in a noisy factory.

Keywords: speech recognition, spectrogram, target detection, faster R-CNN

1 Introduction

The Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) has long been dominant in automatic speech recognition (ASR) [1]. Even today, many speech recognition systems still use an HMM to model the temporal variability of speech and a GMM to evaluate the states produced by the HMM [2]. In recent years, however, because of the powerful feature extraction and modeling capabilities of deep neural networks (DNNs) [2], DNNs have gradually begun to replace the traditional GMM for computing the output probabilities [3, 4], combining with the HMM to form the DNN-HMM.

However, speech signals are non-stationary processes that vary in both time and frequency [1]. The HMM models in speech recognition systems have poor robustness, and they focus mainly on the time dimension. Although they perform very well in noiseless environments, their performance is still very poor in noisy ones. Since human voices differ greatly from noise, recognizing the spectrogram in both the time domain and the frequency domain is a good choice for speech recognition. Generally, the spectrogram is obtained by a short-time Fourier transform (STFT) [5]: the STFT transforms each original speech frame, and the time and frequency domains are then recognized by a convolutional neural network (CNN) [6], such as deep
