the lack of a depth channel in the source images/videos. With the innovation of 3D data acquisition technology, RGB-D data has become popular in recent years, which makes it possible to infer the motion sequence of a skeletal joint in 3D space. For example, Shotton et al. [9] proposed an algorithm for obtaining human skeletons in real time with a depth sensor. Wang et al. [10] also proposed an efficient and robust human pose estimation algorithm for RGB videos. Significant advances have been made in human action recognition based on RGB and RGB-D data [11, 12]. With the increasing availability of skeleton acquisition tools, research on human action recognition using skeleton data has generated growing interest.
In this paper, we simultaneously consider the spatial and temporal changes of the human skeleton and propose a more powerful learning model to capture skeleton variability in both the spatial and temporal dimensions. Most existing methods lack the ability to extract spatiotemporal feature representations; in such methods, it is often difficult to extract a single feature representation that can be used to recognize all action classes. Designing a model with greater learning ability for spatiotemporal feature representations is therefore a key problem in human action recognition. Previous methods for recognizing human actions are mainly based on convolutional neural networks (CNNs) [13–15], recurrent neural networks (RNNs) [16–19], or graph convolutional networks (GCNs) [20–23]. Typically, these methods consider only a single feature representation of the human body. In recent years, temporal convolutional networks (TCNs) [24, 25] have shown outstanding ability in processing time-sequence data, and extensive experiments have shown that TCNs are superior to RNNs such as Long Short-Term Memory networks (LSTMs). Based on TCNs, designing a multi-channel network model that learns multiple feature representations simultaneously can improve the accuracy of human action recognition. We consider two important feature representations in the new network: the movements of each skeletal joint between two adjacent action frames and the relative positions of the constituent joints in a single skeletal frame (see the sketches following the contribution list below). The main contributions of our work include the following.
• We propose a novel method that leverages both the inter-frame vector feature representation between adjacent frames and the intra-frame vector feature representation within a single frame. Experiments show that these two vector feature representations mutually promote the recognition of many action classes.
• We redesign residual blocks for TCNs and propose the two-stream temporal convolutional networks (TS-TCNs), which integrate multiple feature representations to bring a notable improvement in recognition performance.
• We perform a comprehensive experimental validation on four well-known datasets: NTU RGB+D [11], NTU RGB+D 120 [26], Northwestern-UCLA [27], and UTKinect-Action [28]. Our results show that the proposed two-stream network achieves superior performance compared with most previous methods.
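To make the two vector feature representations concrete, the following minimal sketch computes them for a skeleton sequence stored as a (T, J, 3) array (T frames, J joints, 3D coordinates). The function names are illustrative, and the choice of a single reference joint for the intra-frame vectors is our assumption; the exact formulation used in TS-TCNs is given later in the paper.

```python
import numpy as np

def inter_frame_vectors(skel):
    # skel: (T, J, 3) array of T frames, J joints, 3D coordinates.
    # Returns (T - 1, J, 3): the movement of every joint between
    # two adjacent action frames.
    return skel[1:] - skel[:-1]

def intra_frame_vectors(skel, ref=0):
    # Relative position of every joint within a single frame,
    # here taken with respect to a reference joint (index `ref`,
    # an illustrative assumption). Returns (T, J, 3).
    return skel - skel[:, ref:ref + 1, :]

# Example on a random 64-frame, 25-joint sequence (NTU RGB+D layout).
seq = np.random.randn(64, 25, 3).astype(np.float32)
motion = inter_frame_vectors(seq)    # shape (63, 25, 3)
relative = intra_frame_vectors(seq)  # shape (64, 25, 3)
```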
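For background on the second contribution, the sketch below outlines a generic TCN residual block in the style of Bai et al. [24]: dilated 1D convolutions applied along the temporal axis, wrapped in a skip connection. It only illustrates the general structure that our work builds on; the redesigned residual blocks used in TS-TCNs are presented later in the paper.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    # Generic TCN residual block: two dilated temporal convolutions
    # plus a skip connection (a sketch, not the TS-TCN block itself).
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # preserve temporal length
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(out_ch),
        )
        # 1x1 convolution matches channel counts on the residual path.
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, frames)
        return self.relu(self.net(x) + self.skip(x))
```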
2 Related Work
In this section, we review relevant literature on human action recognition. First, we present methods for extracting the dynamics feature representation of human actions. We then describe network-based models that process skeleton sequences for human action recognition.
2.1 Dynamics Representation
The human action recognition task consists in identifying human body behaviors from sequence data such as images, videos, and skeletons. The main contents of action behaviors include gestures, actions in daily life, interactions, and group activities. Early research on human action recognition focused on still images and videos [5, 12]. RGB data is rich in color, shape, and texture features, and initial methods for action recognition mainly used the color and texture information in 2D images. However, various factors, such as background clutter and human body occlusion, make this identification task complicated. Liu et al. [29] proposed a deep-learning-based method that uses depth sequences and the corresponding skeleton joint information. Since depth images lack information such as color and texture, related work based on depth maps is limited. Wang et al. [30] proposed a method that coordinates RGB and depth features during training for action recognition. Skeleton data, which has obvious advantages over RGB and depth data, contains 3D information on the joint points of the human body and thus provides higher-level geometric features. Wang et al. [31] developed an action ensemble model that characterizes the conjunctive structure of 3D human actions by capturing the correlations of the joints. Zhang et al. [32] introduced a related geometric feature on joints and