Deep Heterogeneous Feature Fusion Improves the Efficiency of Audio-Visual Speaker Recognition

Research paper


Efficient Audio-Visual Speaker Recognition via Deep Heterogeneous Feature Fusion

Yu-Hang Liu (1,2), Xin Liu (1,2, ✉), Wentao Fan (1,2), Bineng Zhong (1,2), and Ji-Xiang Du (1,2)

1 Department of Computer Science, Huaqiao University, Xiamen 361021, China

xliu@hqu.edu.cn

2 Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen 361021, China

Abstract. Audio-visual speaker recognition (AVSR) has long been an active research area, primarily because the two modalities provide complementary information for reliable access control in biometric systems. It remains a challenging problem, mainly owing to its multimodal nature. In this paper, we present an efficient audio-visual speaker recognition approach via deep heterogeneous feature fusion. First, we exploit a dual-branch deep convolutional neural network (CNN) learning framework to extract and fuse the high-level semantic features of face and audio data. Further, by considering the temporal dependency of audio-visual data, we embed the fused features into a bidirectional Long Short-Term Memory (LSTM) network to produce the recognition result, through which speakers recorded under different challenging conditions can be well identified. The experimental results demonstrate the efficiency of our proposed approach in both audio-visual feature fusion and speaker recognition.

Keywords: Audio-visual speaker recognition · Deep heterogeneous feature fusion · Dual-branch deep CNN · Bidirectional LSTM
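As a rough illustration of the pipeline outlined in the abstract, the following PyTorch sketch wires two CNN branches (face and audio) into a concatenation-based fusion step followed by a bidirectional LSTM. It is not the authors' network: the class name DualBranchBiLSTM, all layer sizes, the input shapes, the use of concatenation as the fusion operator, and the mean-over-time pooling before the classifier are assumptions made only for this example, since the excerpt does not specify them.

```python
import torch
import torch.nn as nn


class DualBranchBiLSTM(nn.Module):
    """Hypothetical sketch: two CNN branches, concatenation fusion, BiLSTM.

    All sizes below are illustrative assumptions, not values from the paper.
    """

    def __init__(self, num_speakers: int, n_coeffs: int = 13,
                 feat_dim: int = 64, hidden: int = 32):
        super().__init__()
        # Face branch: small 2-D CNN applied to each grayscale face crop.
        self.face_cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        # Audio branch: 1-D CNN over a short window of acoustic coefficients.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_coeffs, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        # Bidirectional LSTM over the sequence of fused per-step features.
        self.bilstm = nn.LSTM(2 * feat_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_speakers)

    def forward(self, faces: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # faces: (batch, time, 1, height, width)
        # audio: (batch, time, n_coeffs, window_length)
        b, t = faces.shape[:2]
        face_feat = self.face_cnn(faces.flatten(0, 1))    # (b*t, feat_dim)
        audio_feat = self.audio_cnn(audio.flatten(0, 1))  # (b*t, feat_dim)
        # Feature-level fusion by concatenation (an assumed fusion operator).
        fused = torch.cat([face_feat, audio_feat], dim=1).reshape(b, t, -1)
        out, _ = self.bilstm(fused)                       # (b, t, 2*hidden)
        # Average over time, then score each enrolled speaker.
        return self.classifier(out.mean(dim=1))           # (b, num_speakers)


# Usage with random tensors standing in for real face crops and audio windows.
model = DualBranchBiLSTM(num_speakers=10)
faces = torch.randn(2, 5, 1, 32, 32)
audio = torch.randn(2, 5, 13, 20)
logits = model(faces, audio)
print(logits.shape)
```

Running each branch on the flattened (batch × time) axis keeps the two CNNs frame-wise, so the temporal dependency is modeled only by the recurrent layer; other designs, such as 3-D convolutions or attention-based fusion, would be equally plausible readings of the abstract.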

1 Introduction

Multi-modal biometric person recognition has received a lot of attention in recent years due to the growing security demands in commercial and law enforcement applications. In particular, speaker recognition is one of the active research problems in the biometrics community, and audio-visual (AV) biometrics generally offer complementary information sources for characterizing speaker identity. Among them, face and voice features, which combine the advantages of non-intrusiveness and easy acquisition, have become economically feasible, but the appropriate fusion of these two heterogeneous modalities is still a non-trivial task.

In the past, different kinds of approaches have been exploited to fuse the face and voice data. In general, the audio-visual integration can be divided into four categories: sensor-level, feature-level, matching-level and decision-level. Since the sensor-level based fusion approaches require that the input data types must be

© Springer International Publishing AG 2017

J. Zhou et al. (Eds.): CCBR 2017, LNCS 10568, pp. 575–583, 2017.

https://doi.org/10.1007/978-3-319-69923-3_62
