The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition
Bin Li,¹ Xi Li,²∗ Zhongfei Zhang,¹ Fei Wu²
¹ College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, China
² College of Computer Science and Technology, Zhejiang University, Hangzhou, China
{bin li, xilizju, zhongfei}@zju.edu.cn wufei@cs.zju.edu.cn
Abstract
Owing to its representational effectiveness, skeleton-based human
action recognition has received considerable research attention
and has a wide range of real applications. In this area, many
existing methods rely on a fixed physical-connectivity skeleton
structure for recognition, which is incapable of capturing well
the intrinsic high-order correlations among skeleton joints.
In this paper, we propose a novel
spatio-temporal graph routing (STGR) scheme for skeleton-
based action recognition, which adaptively learns the in-
trinsic high-order connectivity relationships for physically-
apart skeleton joints. Specifically, the scheme is composed
of two components: spatial graph router (SGR) and tempo-
ral graph router (TGR). The SGR aims to discover the con-
nectivity relationships among the joints based on sub-group
clustering along the spatial dimension, while the TGR ex-
plores the structural information by measuring the correla-
tion degrees between temporal joint node trajectories. The
proposed scheme is naturally and seamlessly incorporated
into the framework of graph convolutional networks (GCNs)
to produce a set of skeleton-joint-connectivity graphs, which
are further fed into the classification networks. Moreover, an
analysis of the receptive field of graph nodes is provided to
explain the necessity of our method. Experimental results on two
benchmark datasets (NTU-RGB+D and Kinetics) demonstrate the
effectiveness of the method against state-of-the-art approaches.
Introduction
As a challenging problem in computer vision, skeleton-based
human action recognition takes 3D human body coordinates as
input and outputs an action class, and has attracted increasing
attention recently (Wang et al. 2018b). Typically, human body
skeletons characterize the geometric body configuration as a
rigid body, and their dynamics capture motion patterns in a
continuous way. This dynamic geometric structure expresses
relations among the joints not only spatially but also
temporally. A graph representation is therefore a natural way to
express the intrinsic human structure, and it is crucial to
automatically represent joints on the given graph. The recent
success of Spatial Temporal Graph Convolution Networks (ST-GCN)
(Yan, Xiong, and Lin 2018) has demonstrated the effectiveness of a graph
∗Corresponding author: Xi Li
Copyright © 2019, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
[Figure 1 image omitted; labels in the figure: (a), (b), (c); legend: correlated, sub-group, neighbour; layers: Layer L, Layer L + 1]
Figure 1: Illustration of three routing ways: (a) fixed routing
by physical connections; (b) spatial routing by considering
local clustering; (c) temporal routing by modeling the
correlation degrees of node trajectories.
aggregation scheme with the physical human skeleton, compared
with existing approaches such as pseudo-images (Wang et al.
2018a; Xie et al. 2018) and variants of LSTM (Shahroudy et al.
2016; Song et al. 2017; Liu et al. 2017).
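To make the notion of graph aggregation over a fixed physical skeleton concrete, the following is a minimal sketch and not the ST-GCN implementation: it uses plain mean aggregation without learned weights, and the 5-joint chain skeleton, the joint names, and the helper names (build_adjacency, aggregate, layers_until_reached) are all assumptions made for illustration. It shows that, with physical connections only, information from one hand needs several aggregation steps to reach the other hand.

```python
# Hypothetical toy skeleton (not a dataset layout):
# 0 = left hand, 1 = left elbow, 2 = torso, 3 = right elbow, 4 = right hand
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5


def build_adjacency(num_joints, edges):
    """Symmetric adjacency matrix with self-loops for a fixed skeleton."""
    adj = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1.0  # self-loop: a joint keeps its own feature
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    return adj


def aggregate(adj, features):
    """One aggregation step: each joint averages its own feature with
    those of its physically connected neighbours (no learned weights)."""
    dim = len(features[0])
    out = []
    for row in adj:
        degree = sum(row)
        out.append([
            sum(w * f[d] for w, f in zip(row, features)) / degree
            for d in range(dim)
        ])
    return out


def layers_until_reached(adj, source, target):
    """Number of aggregation steps before a signal placed at `source`
    first influences `target`; None if it never does."""
    features = [[1.0] if j == source else [0.0] for j in range(len(adj))]
    for step in range(1, len(adj) + 1):
        features = aggregate(adj, features)
        if features[target][0] > 0.0:
            return step
    return None


if __name__ == "__main__":
    adj = build_adjacency(NUM_JOINTS, EDGES)
    print(layers_until_reached(adj, 0, 1))  # adjacent joints
    print(layers_until_reached(adj, 0, 4))  # hand to hand
```

In this toy chain the two hands are four physical hops apart, so four stacked aggregation steps are needed before either hand's feature can influence the other.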
In general, graph-based methods apply a fixed human skeleton to
the graph convolution operation and iteratively aggregate each
hidden feature with its neighbourhood features. However, it is
challenging to capture the changeable human structure in complex
scenes. This poses three problems for further improvement:
1) The skeleton itself is changeable and depends on the specific
dataset, e.g., 25 joints in NTU-RGB+D (Shahroudy et al. 2016)
but 18 joints in Kinetics (Kay et al. 2017), resulting in
confusion about the real human skeleton; 2) The joint
connections are highly unbalanced: while torso joints become
over-smoothed, limb joints may still be under-smoothed, which
makes feature sharing between two limb joints extremely
difficult; 3) A global graph structure is applied to each
sample, raising the question “one size