KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 8, NO. 2, Feb. 2014 487
Copyright ⓒ 2014 KSII
both RGB and depth channels? Secondly, since RGB and depth images represent one scene in
different modalities, they are complementary to each other, and fusing both for discriminative
feature representation and model construction will benefit human action recognition.
In fact, across different research domains, the fusion of multi-modality or multi-view features
has attracted the attention of many researchers. For example, in web image search [18-20],
video semantic annotation or tagging [21-24], 3D object retrieval [25-28], target tracking [29]
and multi-view object classification [30-34], authors have discussed the importance of fusing
multi-modality or multi-view features, and experiments have also shown that such fusion is very
helpful for the tasks in these domains. Thus, we will first assess the performance when these
descriptors in the RGB and depth channels are combined.
Further, with features from multiple modalities, we also propose a collaborative multi-task
learning method based on transfer learning for human action recognition, in order to assess the
importance of fusing multi-modality features.
In addition, regarding algorithm evaluation, most of the above algorithms are assessed with
only one kind of classification model, which is not adequate. For example, after extracting
different kinds of features, the researchers in [3-6,17] all adopt SVM models to recognize
human actions; approximate string matching [9] and a graph model [10] are employed to identify
human motion; and in Bobick and Davis [1], similarity matching schemes were employed. What is
worse, most current methods are highly dependent on the dataset, so their generalization
ability is severely constrained. To address this problem, some authors have proposed model-free
methods for human action recognition via sparse representation. For example, the authors of
[35-42] extracted different kinds of features for each action and then applied a sparse
representation based classification algorithm directly, without any modification. SRC [41] was
first proposed for face recognition: a testing sample is reconstructed and represented over all
the training samples; then an indicator function is designed for each class to select that
class's coefficients, and the minimum class-wise representation error is adopted to classify
the testing sample. Similar to
SRC, the philosophy of the method proposed in [35-40] is to decompose each video sample
containing one kind of human action into a sparse linear combination of several video samples
containing multiple kinds of human actions, and it has achieved good performance.
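The SRC-style decision rule described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: a greedy matching pursuit stands in for the ℓ1 sparse-coding solver, and the dictionary columns, labels, and sparsity level `k` are hypothetical.

```python
import numpy as np

def sparse_code(D, y, k=5):
    # Greedy matching-pursuit stand-in for the l1 sparse coding step.
    x = np.zeros(D.shape[1])
    support = []
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
    return x

def src_classify(D, labels, y, k=5):
    # Represent y over all training samples, then keep only each class's
    # coefficients and pick the class with minimum reconstruction error.
    x = sparse_code(D, y, k)
    classes = np.unique(labels)
    errors = [np.linalg.norm(y - D @ np.where(labels == c, x, 0.0))
              for c in classes]
    return classes[int(np.argmin(errors))]
```

Each dictionary column is one training sample (e.g. a vectorized action descriptor); the `np.where` mask plays the role of the per-class indicator function described above.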
The reason for this success is that each point's neighborhood structure is fully utilized,
which supplies better similarity measures between the testing data and all the training
samples. After that, Zhang et al. [41] discussed the roles of the ℓ1-norm and the ℓ2-norm
respectively, and concluded that the sparsity in SRC was not so important, while collaborative
representation played a much more important role. Thus, what will happen when these descriptors
are assessed by model-free methods as well as by traditional classification algorithms that are
constrained by and dependent on the dataset?
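The contrast with SRC can be illustrated by a minimal collaborative-representation classifier in the spirit of [41], where the ℓ1 sparse-coding step is replaced by a closed-form ℓ2-regularized solution; the regularization weight `lam` is a hypothetical choice, and the original rule also normalizes each residual by the coefficient norm, which this sketch omits for brevity.

```python
import numpy as np

def crc_classify(D, labels, y, lam=1e-2):
    # Closed-form l2-regularized ("collaborative") coding:
    #   x = (D^T D + lam * I)^{-1} D^T y
    n = D.shape[1]
    x = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    classes = np.unique(labels)
    errors = []
    for c in classes:
        delta = np.where(labels == c, x, 0.0)  # keep only class-c coefficients
        errors.append(np.linalg.norm(y - D @ delta))
    return classes[int(np.argmin(errors))]
```

Unlike the iterative ℓ1 solver in SRC, this coding step is a single linear solve, which is why the collaborative representation discussed in [41] is attractive in practice.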
3. Motion History Image for RGB and Depth Modalities
To represent human motion, the human silhouettes in each frame first need to be accumulated
and encoded. Thus, we construct human motion maps for the RGB and depth channels respectively;
the details are given as follows.
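This accumulation can be sketched with the standard MHI update from [1], in which a pixel that moves in the current frame is set to the duration τ and all other pixels decay toward zero; the frame-differencing threshold used here to obtain the motion mask is a hypothetical choice, not the paper's segmentation method.

```python
import numpy as np

def build_mhi(frames, tau=30.0, thresh=15.0):
    """Accumulate a motion history image over a list of grayscale frames.

    Update rule per pixel:
      H(x, y) = tau             where motion is detected in the current frame,
      H(x, y) = max(H - 1, 0)   elsewhere (older motion fades out).
    """
    mhi = np.zeros(frames[0].shape, dtype=float)
    for prev, cur in zip(frames[:-1], frames[1:]):
        # Hypothetical motion mask: simple frame differencing + threshold.
        moving = np.abs(cur.astype(float) - prev.astype(float)) > thresh
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi
```

Larger values mark more recent motion, so the resulting map encodes both where and when the silhouette moved; the same update applies to the RGB and depth streams.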
3.1 MHI for RGB Modality
To describe human motion, the motion history image (MHI) [1], in which moving human
silhouettes are accumulated and encoded, has been widely employed and has achieved good
performance. However, Bobick and Davis [1] first detected or segmented targets in RGB