Revisiting Skeleton-based Action Recognition
Haodong Duan¹,³   Yue Zhao²   Kai Chen³,⁵   Dahua Lin¹,³   Bo Dai³,⁴
¹The Chinese University of Hong Kong   ²The University of Texas at Austin   ³Shanghai AI Laboratory
⁴S-Lab, Nanyang Technological University   ⁵SenseTime Research
Abstract
Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt GCNs to extract features on top of human skeletons. Despite the positive results shown in these attempts, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseConv3D, a new approach to skeleton-based action recognition. PoseConv3D relies on a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseConv3D is more effective in learning spatiotemporal features, more robust against pose estimation noise, and generalizes better in cross-dataset settings. Also, PoseConv3D can handle multi-person scenarios without additional computation cost. Its hierarchical features can be easily integrated with other modalities at early fusion stages, providing a large design space for boosting performance. PoseConv3D achieves state-of-the-art performance on five of six standard skeleton-based action recognition benchmarks. Once fused with other modalities, it achieves state-of-the-art performance on all eight multi-modality action recognition benchmarks. Code has been made available at: https://github.com/kennymckormick/pyskl.
1. Introduction
Action recognition is a central task in video understanding. Existing studies have explored various modalities for feature representation, such as RGB frames [6, 54, 59], optical flows [47], audio waves [62], and human skeletons [60, 64]. Among these modalities, skeleton-based action recognition has received increasing attention in recent years due to its action-focusing nature and compactness. In practice, human skeletons in a video are mainly represented as a sequence of joint coordinate lists, where the coordinates are extracted by pose estimators. Since only pose information is included, skeleton sequences capture action information alone while being immune to contextual nuisances, such as background variation and lighting changes.
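To make this representation concrete, the sketch below shows one common way a skeleton sequence can be stored as an array of per-frame joint coordinates. The array names and shape values are illustrative assumptions, not code from the paper.

import numpy as np

# A skeleton sequence for one person: T frames, V joints, C channels.
# For 2D poses, C = 3 is a common choice: (x, y, confidence score).
T, V, C = 48, 17, 3  # e.g., 17 COCO keypoints (illustrative values)

skeleton_seq = np.zeros((T, V, C), dtype=np.float32)

# A pose estimator fills one row per frame, e.g.:
# skeleton_seq[t, v] = (x_v, y_v, score_v) for joint v at frame t.

# Multi-person clips simply add a person axis: (M, T, V, C).
M = 2
multi_person_seq = np.zeros((M, T, V, C), dtype=np.float32)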
Figure 1. PoseConv3D takes 2D poses as inputs. In general, 2D poses are of better quality than 3D poses. (a) 2D poses estimated with HRNet for videos in NTU-60 and FineGYM; their quality is clearly better than that of (b) 3D poses collected by Kinect sensors or (c) 3D poses estimated with a state-of-the-art estimator (VIBE).
Table 1. Differences between PoseConv3D and GCN-based methods.

               | Previous Work     | PoseConv3D
Input          | 2D / 3D Skeleton  | 2D Skeleton
Format         | Coordinates       | 3D Heatmap Volumes
Architecture   | GCN               | 3D-CNN
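The 3D heatmap volume in Table 1 can be sketched as follows: for each frame, a Gaussian map centered at each joint (weighted by its confidence score) is rendered, and the per-frame maps are stacked along time into a K x T x H x W volume. The function below is a minimal sketch of this idea; the function name, array shapes, and sigma value are assumptions for illustration, not the authors' exact implementation.

import numpy as np

def pose_to_heatmap_volume(poses, H, W, sigma=0.6):
    """Stack per-frame joint heatmaps into a K x T x H x W volume.

    poses: float array of shape (T, K, 3), each row (x, y, confidence),
           with x in [0, W) and y in [0, H).
    Returns: float array of shape (K, T, H, W).
    """
    T, K, _ = poses.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    volume = np.zeros((K, T, H, W), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y, conf = poses[t, k]
            # Gaussian centered at the joint, scaled by its confidence.
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            volume[k, t] = conf * g
    return volume

# Example: 32 frames, 17 joints, rendered on a 64 x 64 spatial grid.
poses = np.random.rand(32, 17, 3).astype(np.float32)
poses[..., 0] *= 64  # x coordinates
poses[..., 1] *= 64  # y coordinates
vol = pose_to_heatmap_volume(poses, H=64, W=64)
print(vol.shape)  # (17, 32, 64, 64)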
Among all the methods for skeleton-based action recognition [15, 57, 58], graph convolutional networks (GCNs) [64] have been one of the most popular approaches. Specifically, GCNs regard every human joint at every timestep as a node. Neighboring nodes along the spatial and temporal dimensions are connected with edges. Graph convolution layers are then applied to the constructed graph to discover action patterns across space and time. Thanks to their good performance on standard benchmarks, GCNs have become a standard approach for processing skeleton sequences.
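As a concrete illustration of this idea, the sketch below implements a minimal spatial graph-convolution layer in the spirit of ST-GCN: joint features are aggregated over graph neighbors through a normalized adjacency matrix, then linearly projected. The class name, the chain adjacency, and all shapes are illustrative assumptions, not the architecture of any particular method.

import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph-convolution layer over a joint graph (a sketch)."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))  # add self-loops
        deg = A.sum(dim=1)
        self.register_buffer("A_norm", A / deg.unsqueeze(1))  # row-normalize
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        # x: (N, T, V, C) -- batch, frames, joints, channels.
        # Aggregate each joint's features from its graph neighbors.
        x = torch.einsum("uv,ntvc->ntuc", self.A_norm, x)
        return self.proj(x)

# Toy usage: a 17-joint skeleton with a simple chain adjacency (illustrative).
V = 17
A = torch.zeros(V, V)
for v in range(V - 1):
    A[v, v + 1] = A[v + 1, v] = 1
layer = SpatialGraphConv(in_ch=3, out_ch=64, adjacency=A)
out = layer(torch.randn(8, 48, V, 3))
print(out.shape)  # torch.Size([8, 48, 17, 64])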
While encouraging results have been observed, GCN-based methods are limited in the following aspects. (1) Robustness: while GCNs directly handle the coordinates of human joints, their recognition results are easily perturbed by coordinate noise introduced during pose estimation.