Deep Heterogeneous Feature Fusion Improves the Efficiency of Audio-Visual Speaker Recognition
Updated 2024-08-27
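This article examines efficient audio-visual speaker recognition via deep heterogeneous feature fusion. Audio-visual speaker recognition (AVSR) is an important technology for reliable identity verification in biometric systems, but because it requires processing multimodal data it remains a challenging research area. The authors propose an efficient solution based on deep heterogeneous feature fusion.

First, they build a dual-branch deep convolutional neural network (CNN) architecture that extracts and fuses high-level semantic features from face and audio data. This step exploits the complementary information in the two modalities: the face and the voice offer different cues to a speaker's identity, so combining them lets the model capture what is distinctive about an individual both visually and acoustically.

Next, to account for the temporal dependency of audio-visual data, the authors embed the fused features into a bidirectional Long Short-Term Memory (LSTM) network. LSTM networks retain long-range dependencies when processing sequences, which helps capture both the consistency and the variation of a speaker over time. Modeling these dynamic characteristics is intended to make recognition more stable and more accurate.

Finally, by combining deep heterogeneous feature fusion with temporal modeling, the authors aim to improve the overall performance of audio-visual speaker recognition, reduce the false recognition rate, and provide more reliable identity verification in practice. The approach is particularly relevant to improving current AVSR systems under noisy or occluded conditions.

The core contribution of the paper is a deep learning framework for audio-visual speaker recognition that fuses and exploits the characteristics of the two modalities to improve the recognition process, offering a direction for future work on biometric and security systems.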
Efficient Audio-Visual Speaker Recognition
via Deep Heterogeneous Feature Fusion
Yu-Hang Liu¹,², Xin Liu¹,² (✉), Wentao Fan¹,², Bineng Zhong¹,², and Ji-Xiang Du¹,²
¹ Department of Computer Science, Huaqiao University, Xiamen 361021, China
xliu@hqu.edu.cn
² Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen 361021, China
Abstract. Audio-visual speaker recognition (AVSR) has long been an
active research area, primarily because the two modalities provide
complementary information for reliable access control in biometric
systems. It remains a challenging problem, mainly owing to its multimodal
nature. In this paper, we present an efficient audio-visual speaker
recognition approach via deep heterogeneous feature fusion. First, we
exploit a dual-branch deep convolutional neural network (CNN) learning
framework to extract and fuse the high-level semantic features of face
and audio data. Further, by considering the temporal dependency of
audio-visual data, we embed the fused features into a bidirectional Long
Short-Term Memory (LSTM) network to produce the recognition result,
through which speakers recorded under different challenging conditions
can be well identified. The experimental results demonstrate the
efficiency of our proposed approach in both audio-visual feature fusion
and speaker recognition.
Keywords: Audio-visual speaker recognition · Deep heterogeneous feature fusion · Dual-branch deep CNN · Bidirectional LSTM
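To make the data flow in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that per-frame face and audio feature vectors have already been produced by the two CNN branches (random numbers stand in for them here), that fusion is a simple concatenation, and that a toy tanh recurrent cell stands in for the LSTM cell. All function names, dimensions, and weights are hypothetical and chosen only for illustration.

```python
import math
import random


def fuse(face_feat, audio_feat):
    """Feature-level fusion by concatenation (an assumption, not the paper's layer)."""
    return list(face_feat) + list(audio_feat)


def make_cell(input_dim, hidden_dim, rng):
    """Random weights for a toy tanh recurrent cell (a stand-in for an LSTM cell)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
            for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
           for _ in range(hidden_dim)]
    return w_in, w_h


def run_cell(cell, sequence):
    """Run the cell over a sequence and return the hidden state at every step."""
    w_in, w_h = cell
    hidden = [0.0] * len(w_h)
    states = []
    for x in sequence:
        hidden = [
            math.tanh(sum(wi * xi for wi, xi in zip(w_in[j], x))
                      + sum(wh * hi for wh, hi in zip(w_h[j], hidden)))
            for j in range(len(w_h))
        ]
        states.append(hidden)
    return states


def bidirectional(fwd_cell, bwd_cell, sequence):
    """Concatenate forward-in-time and backward-in-time states for each frame."""
    forward = run_cell(fwd_cell, sequence)
    # Process the reversed sequence, then flip back so states align with frames.
    backward = run_cell(bwd_cell, sequence[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
num_frames, face_dim, audio_dim, hidden_dim = 5, 4, 3, 2

# Placeholder per-frame features; in the paper these come from the CNN branches.
face_seq = [[rng.random() for _ in range(face_dim)] for _ in range(num_frames)]
audio_seq = [[rng.random() for _ in range(audio_dim)] for _ in range(num_frames)]

fused_seq = [fuse(f, a) for f, a in zip(face_seq, audio_seq)]
fwd_cell = make_cell(face_dim + audio_dim, hidden_dim, rng)
bwd_cell = make_cell(face_dim + audio_dim, hidden_dim, rng)
outputs = bidirectional(fwd_cell, bwd_cell, fused_seq)
```

Each entry of `outputs` combines context from earlier frames (forward pass) and later frames (backward pass), which is the property a bidirectional recurrent layer contributes. A real system would replace the toy cell with trained LSTM cells and add a classifier over speaker identities on top.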
1 Introduction
Multi-modal biometric person recognition has received a lot of attention in recent
years due to the growing security demands in commercial and law enforcement
applications. In particular, speaker recognition is one of the active research
problems in the biometric community, and audio-visual (AV) biometrics generally
offer complementary information sources for characterizing speaker identity. Among
them, face and voice features, which are non-intrusive and easy to acquire,
have become economically feasible, but the appropriate
fusion of these two heterogeneous modalities is still a non-trivial task.
In the past, different kinds of approaches have been exploited to fuse the face
and voice data. In general, the audio-visual integration can be divided into four
categories: sensor-level, feature-level, matching-level and decision-level. Since the
sensor-level based fusion approaches require that the input data types must be
© Springer International Publishing AG 2017
J. Zhou et al. (Eds.): CCBR 2017, LNCS 10568, pp. 575–583, 2017.
https://doi.org/10.1007/978-3-319-69923-3_62
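Three of the fusion levels named in the introduction can be illustrated with a small sketch in plain Python. This is a generic illustration under stated assumptions, not a method taken from the paper: the function names, the equal weighting, the agreement rule, and the example scores are all hypothetical. Sensor-level fusion is omitted because it operates on raw sensor data.

```python
def feature_level(face_feat, voice_feat):
    """Feature-level fusion: join the two feature vectors before classification."""
    return list(face_feat) + list(voice_feat)


def matching_level(face_scores, voice_scores, face_weight=0.5):
    """Matching-level fusion: weighted sum of per-speaker match scores."""
    return {
        speaker: face_weight * face_scores[speaker]
        + (1.0 - face_weight) * voice_scores[speaker]
        for speaker in face_scores
    }


def decision_level(face_scores, voice_scores):
    """Decision-level fusion: accept an identity only if both modalities agree."""
    face_pick = max(face_scores, key=face_scores.get)
    voice_pick = max(voice_scores, key=voice_scores.get)
    return face_pick if face_pick == voice_pick else None


# Hypothetical match scores for two enrolled speakers.
face_scores = {"alice": 0.9, "bob": 0.4}
voice_scores = {"alice": 0.3, "bob": 0.6}

joint_feature = feature_level([1.0, 2.0], [3.0])
fused_scores = matching_level(face_scores, voice_scores)
best_by_score = max(fused_scores, key=fused_scores.get)
agreed_identity = decision_level(face_scores, voice_scores)
```

In this toy case the two modalities disagree, so the agreement rule returns no identity, while score fusion still ranks one speaker highest. The later a system fuses, the less cross-modal information is available at the point of combination.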