Statistical Machine Learning vs Deep Learning in Information Fusion:
Competition or Collaboration?
Ling Guan, Lei Gao, Nour El Din Elmadany
Ryerson University, Toronto, Canada
lguan@ryerson.ee.ca; iegaolei@gmail.com; nourelmadany@gmail.com
Chengwu Liang
Zhengzhou University, Zhengzhou, China
cliang@ee.ryerson.ca; or liangchengwu0615@126.com
Abstract
Information fusion is the process of coherently and
intelligently combining knowledge extracted from different
sensors/modalities in order to obtain more useful or
discriminant information for multimedia processing and
biometrics, among other purposes. The key to successful
information fusion is to intelligently exploit the intrinsic
relationships between the data of different modalities.
Statistical machine learning (SML) has played a major role
in developing new information fusion methods by
incorporating prior knowledge, entropy metrics, correlation
analysis, inherent statistical structures of the input data,
and nonlinear relations. On the other hand, recent
developments in deep learning (DL) have drawn enormous
attention from the machine learning community. DL
algorithms possess deep structures and require a large
amount of data to train their huge number of parameters,
an extremely expensive process. However, the payoff is
enormous: unprecedented success in many applications.
This paper first reviews recent developments of both SML
and DL in the context of information fusion, then analyzes
their pros and cons and compares their performance in a
number of application domains. Based on preliminary
results, some thoughts are presented on how SML and DL
can work together to bring the study of machine learning to
the next level, better serving human needs.
1 Introduction
Information is obtained through different types of
acquisition techniques and from multiple sources. The
availability of such multimodal information has been
growing at an extremely fast pace[1]. Therefore, information
fusion of multiple sources is becoming an increasingly
important research topic for multimedia analysis, pattern
recognition, computer vision and biometrics[2].
In general, the natural integration of multiple media,
their associated features, and the intermediate
representations or decisions is referred to as information
fusion[3]. Commonly, there are three levels of information
fusion: data level, feature/representation level and
decision level.
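To make the distinction between the levels concrete, the following is a minimal sketch, not taken from the paper, that contrasts feature level and decision level fusion for two modalities. The function names, the toy arrays, and the choice of a weighted score average for decision level fusion are illustrative assumptions; many other combination rules exist.

```python
import numpy as np

def feature_level_fusion(feat_a, feat_b):
    """Serial fusion (assumed rule): concatenate per-sample feature vectors."""
    return np.concatenate([feat_a, feat_b], axis=1)

def decision_level_fusion(scores_a, scores_b, weight_a=0.5):
    """Assumed rule: weighted average of per-class scores, then argmax."""
    fused_scores = weight_a * scores_a + (1.0 - weight_a) * scores_b
    return fused_scores.argmax(axis=1)

# Toy data (hypothetical): 4 samples, modality A has 3 features, B has 2.
feat_a = np.arange(12, dtype=float).reshape(4, 3)
feat_b = np.ones((4, 2))
fused_features = feature_level_fusion(feat_a, feat_b)

# Hypothetical class scores from two modality-specific classifiers.
scores_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.20, 0.80], [0.55, 0.45]])
scores_b = np.array([[0.60, 0.40], [0.70, 0.30], [0.10, 0.90], [0.30, 0.70]])
labels = decision_level_fusion(scores_a, scores_b)

print(fused_features.shape)
print(labels)
```

Feature level fusion hands a single joint representation to one classifier, whereas decision level fusion only combines the outputs of classifiers trained separately on each modality.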
Among the three levels of information fusion,
data/feature level fusion has drawn significant attention
from the research communities of multimedia and biometrics
due to its capacity for preserving information, and
impressive progress has been made[4]. Statistical machine
learning (SML) based approaches have stood out[5][6].
Among them, Bayesian networks (BNs)[7], correlation based
methods, and discriminative analysis have been the
mainstream, since they are able to handle uncertainties by
modeling the dependencies, and to describe the domain
knowledge mathematically in a graphical structure.
For recognition tasks, correlation based methods such as
canonical correlation analysis (CCA)[8], its discriminative
version, discriminative CCA (DCCA)[9], and multi-set CCA
(MCCA)[10] were proposed to identify the correlation and
discriminative information of different feature streams for
visual recognition. Subsequently, discriminative MCCA
(DMCCA) was presented[11, 12] for performance
enhancement.
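As a rough illustration of the correlation based idea, the sketch below implements plain two-set CCA with NumPy and fuses the projected features by concatenation. It is not the DCCA, MCCA, or DMCCA formulation of the cited works; the function name `cca`, the regularization term `reg`, the synthetic data, and the concatenation rule are all assumptions made for the example.

```python
import numpy as np

def cca(X, Y, n_components, reg=1e-6):
    """Plain CCA via whitening and SVD.

    Returns projection matrices Wx, Wy and the leading canonical
    correlations. `reg` is a small ridge term (an assumption) added
    for numerical stability.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    Wx = Sxx_is @ U[:, :n_components]
    Wy = Syy_is @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]

# Synthetic two-modality data sharing one latent factor (hypothetical).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -0.5, 0.8, 0.3]]) + 0.1 * rng.normal(size=(500, 4))
Y = z @ np.array([[0.7, 1.2, -0.4]]) + 0.1 * rng.normal(size=(500, 3))

Wx, Wy, corr = cca(X, Y, n_components=2)
Zx = (X - X.mean(axis=0)) @ Wx
Zy = (Y - Y.mean(axis=0)) @ Wy
fused = np.concatenate([Zx, Zy], axis=1)  # feature level fusion by concatenation
print(corr)
```

Because both synthetic modalities are driven by the same latent factor, the first canonical correlation comes out close to 1, while the second reflects only noise. Discriminative variants additionally use class labels when building the correlation criterion, which this sketch does not attempt.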
Prominent recent applications of SML include biometrics
and video based human action recognition [13][14]. With
the release of cost-effective sensors such as the Kinect
RGB-Depth camera, there has been an increasing interest in
developing new models and methods for recognizing actions
with multimodal information, such as sub-action
segmentation and feature coding by the discriminative
locality-constrained affine subspace coding method[15].
On the other hand, deep learning (DL) based methods,
such as convolutional neural networks (CNNs), recurrent
neural networks (RNNs) and long short-term memory (LSTM)
networks, have dramatically improved the state of the art
in visual object recognition, object detection, action
recognition and other applications[16, 17, 18]. Among
these methods, the CNN is one of the most notable. It has
been found highly effective and is also the most commonly
used in diverse computer vision applications. Chan
et al.[17] proposed a new deep learning architecture named
principal component analysis network (PCANet) whose
2018 IEEE Conference on Multimedia Information Processing and Retrieval
0-7695-6354-6/18/$31.00 ©2018 IEEE
DOI 10.1109/MIPR.2018.00059