the multi-class classification task to two-class classification. The design complexity and the computational requirements of each 3D-CNN are reduced at the same time. However, the storage needed for the network parameters may increase as more 3D-CNNs are included.
The contributions of this paper are: 1) Parallel 3D-CNNs are proposed for multi-class classification. In the proposed parallel structure, each 3D-CNN is used as a two-class classifier for one specific video class. This makes the training of each 3D-CNN much easier and reduces the computational requirements, so that the parallel 3D-CNNs can be implemented on a computer with an ordinary CPU, GPU, and memory configuration while still achieving good performance. 2) Temporally downsampled versions of the videos are used to increase the volume of the dataset, and in particular the number of positive training samples. During the training of a 3D-CNN, each video is downsampled into sub-videos at a fixed frame interval, and these sub-videos represent the video at a low temporal resolution (see the first sketch after this paragraph). Downsampling not only increases the number of positive samples, which helps guarantee the performance of the 3D-CNNs, but also makes it possible to classify videos based only on their sub-videos, which lightens the input load of video classification. 3) The proposed parallel structure can grow as new classes appear. Each 3D-CNN in the proposed model decides whether the input video belongs to its class according to a defined threshold. If a video belongs to none of the existing classes, it is assigned to a new class, and an additional 3D-CNN for that class is constructed in the same way as each of the existing ones (see the second sketch below). The feasibility of the proposed parallel 3D-CNN model for video classification is verified through its application to video copy detection on the CC_WEB_VIDEO dataset.
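As an illustration of the fixed-interval downsampling in contribution 2, the sketch below splits a video of T frames into `interval` sub-videos, where sub-video k keeps frames k, k+interval, k+2*interval, and so on. The function name and array layout are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def temporal_downsample(video: np.ndarray, interval: int) -> list[np.ndarray]:
    """Split a video (frames, height, width, channels) into `interval`
    sub-videos, each holding every `interval`-th frame with a different
    starting offset. Each sub-video is a low temporal-resolution
    version of the original video."""
    return [video[offset::interval] for offset in range(interval)]

# Example: a 120-frame video downsampled with interval 4 yields
# four 30-frame sub-videos, multiplying the training data by 4.
video = np.zeros((120, 60, 40, 3), dtype=np.uint8)
sub_videos = temporal_downsample(video, interval=4)
assert len(sub_videos) == 4 and sub_videos[0].shape[0] == 30
```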
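Contributions 1 and 3 together define the decision rule: every binary 3D-CNN scores the input video, the scores are compared against the per-class thresholds, and a video rejected by every network is treated as a new class whose own 3D-CNN is then trained. A minimal sketch of that rule follows; the scoring functions here are toy stand-ins for trained 3D-CNNs, and all names are hypothetical.

```python
def classify(video, networks, thresholds):
    """networks: dict mapping class label -> scoring function (a trained
    binary 3D-CNN). Returns the accepted label with the highest score,
    or None when every network rejects the video, i.e. it belongs to a
    new class."""
    best_label, best_score = None, float("-inf")
    for label, net in networks.items():
        score = net(video)  # positive-class probability for this label
        if score >= thresholds[label] and score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy stand-ins for two trained binary 3D-CNNs (scores are made up).
networks = {"news": lambda v: 0.3, "sports": lambda v: 0.8}
thresholds = {"news": 0.5, "sports": 0.5}

label = classify("some video", networks, thresholds)  # -> "sports"
# A None result would trigger growth of the parallel structure: train
# one more binary 3D-CNN for the new class and register it in
# `networks` and `thresholds`, in the same way as the existing ones.
```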
2 Parallel 3D-CNNs
2.1 3D-CNN
3D-CNN was proposed by Ji et al. [6] for action recognition. A 3D-CNN can extract features directly from input video streams, which makes it good at deriving local motion information from the video. In the 3D-CNN of [6], a hardwired layer follows the video input and generates multiple channels of information from the input frames: gray, gradient-x, gradient-y, optflow-x, and optflow-y. These features mainly capture motion or motion-induced differences. For video classification, however, these features alone are not enough, so we keep the main structure of the 3D-CNN presented in [6] but remove its hardwired layer and feed the video frames directly into the first convolutional layer, allowing more information to be analyzed. The 3D-CNN model used in this paper is shown in Fig. 1.
The feature map of a convolutional layer is defined as:
$$f_{ij}^{xyz} = \mathrm{sigm}\Bigg(b_{ij} + \sum_{n}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijn}^{pqr}\, f_{(i-1)n}^{(x+p)(y+q)(z+r)}\Bigg) \qquad (1)$$
where $f_{ij}^{xyz}$ is the value at position $(x, y, z)$ of the $j$th feature map in the $i$th layer, $\mathrm{sigm}(\cdot)$ is the sigmoid function, and $b_{ij}$ is the bias of the $j$th feature map in the $i$th layer. $w_{ijn}^{pqr}$ is the $(p, q, r)$th value of the kernel connected to the $n$th feature map in the previous layer, and $(P_i, Q_i, R_i)$ is the kernel size of the $i$th layer.
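Eq. (1) is an ordinary 3D convolution followed by a sigmoid. The direct (unoptimized) NumPy sketch below makes the index arithmetic explicit; the array shapes and function names are our own.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3d_feature_map(prev_maps, kernels, bias):
    """Compute one feature map f_ij of layer i per Eq. (1).
    prev_maps: (N, X, Y, Z)   the N feature maps of layer i-1
    kernels:   (N, P, Q, R)   one (P_i, Q_i, R_i) kernel per previous map
    bias:      scalar b_ij
    Returns:   (X-P+1, Y-Q+1, Z-R+1) feature map."""
    N, X, Y, Z = prev_maps.shape
    _, P, Q, R = kernels.shape
    out = np.full((X - P + 1, Y - Q + 1, Z - R + 1), bias)
    for n in range(N):            # sum over previous feature maps n
        for p in range(P):        # kernel offsets (p, q, r)
            for q in range(Q):
                for r in range(R):
                    out += kernels[n, p, q, r] * prev_maps[
                        n, p:p + out.shape[0], q:q + out.shape[1], r:r + out.shape[2]
                    ]
    return sigm(out)

# Two 8x8x5 input maps and 3x3x2 kernels give a 6x6x4 output map.
f = conv3d_feature_map(np.random.rand(2, 8, 8, 5), np.random.rand(2, 3, 3, 2), 0.1)
assert f.shape == (6, 6, 4)
```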