TaiChi: A Fine-Grained Action Recognition Dataset
Shan Sun, Feng Wang∗, Qi Liang, Liang He
Shanghai Key Laboratory of Multidimensional Information Processing
Dept. of Computer Science and Technology, East China Normal University
52141201004@stu.ecnu.edu.cn, fwang@cs.ecnu.edu.cn, 51151201039@stu.ecnu.edu.cn, lhe@cs.ecnu.edu.cn
ABSTRACT
In this paper, we introduce TaiChi, a fine-grained action dataset. It consists of unconstrained user-uploaded web videos containing camera motion and partial occlusions, which pose new challenges to fine-grained action recognition compared to the existing datasets. In this dataset, 2,772 samples of 58 fine-grained action classes are manually annotated. Additionally, we provide baseline action recognition results using the state-of-the-art Improved Dense Trajectory features and Fisher Vector representation, achieving a Mean Average Precision (MAP) of 51.39%.
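The baseline above is evaluated with Mean Average Precision (MAP). As a rough illustration of the metric only (this is not the authors' evaluation code, and the function names are ours), MAP averages, over all action classes, the average precision of a confidence-ranked retrieval list:

```python
# Illustrative sketch of Mean Average Precision (MAP).
# Names and structure here are assumptions for exposition,
# not the evaluation code used in the paper.

def average_precision(ranked_relevance):
    """AP for one class: ranked_relevance is a list of 0/1 relevance
    flags, ordered by classifier confidence (most confident first)."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_class_rankings):
    """MAP: the AP averaged over all action classes."""
    aps = [average_precision(r) for r in per_class_rankings]
    return sum(aps) / len(aps)
```

For example, with two classes whose ranked results are `[1, 0, 1]` and `[0, 1]`, the per-class APs are 5/6 and 1/2, giving a MAP of 2/3.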
KEYWORDS
Fine-grained action recognition dataset; Tai Chi; benchmark
dataset
ACM Reference format:
Shan Sun, Feng Wang∗, Qi Liang, Liang He. 2017. TaiChi: A Fine-Grained Action Recognition Dataset. In Proceedings of ICMR ’17, June 6–9, 2017, Bucharest, Romania, 5 pages.
DOI: http://dx.doi.org/10.1145/3078971.3079039
1 INTRODUCTION
With the explosive growth of videos on the Internet, numer-
ous works have been devoted to automatic understanding of
the video content. Among them, human action recognition
attracts a lot of research attention since it is widely used in
various applications such as video surveillance, indexing, and
event recounting. Human action recognition is faced with a
number of challenges such as complex human actions, large
intra-class variability, background motion, and occlusions. A
lot of approaches have been proposed to tackle these issues.
Most existing works focus on classifying coarse-grained actions with relatively large inter-class variations, for instance, distinguishing football from basketball. Meanwhile, fine-grained human action recognition, which aims to distinguish between actions with low inter-class variability, is rarely studied. Compared
∗ Corresponding author.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
ICMR ’17, June 6–9, 2017, Bucharest, Romania
© 2017 ACM. ACM ISBN 978-1-4503-4701-3/17/06...$15.00
DOI: http://dx.doi.org/10.1145/3078971.3079039
to the traditional action recognition, fine-grained actions usually have small spatial and temporal scales. In many cases,
a fine-grained action is a part of a higher-level action, and
shares the same context with other fine-grained actions. For
instance, throwing and slam dunk are two fine-grained actions
in action basketball, and they share the same background,
actors, and objects. The traditional coarse-grained action recognition distinguishes basketball from other actions, but rarely distinguishes throwing from slam dunk. However, the
distinction of highly similar actions is necessary for many
applications. For instance, someone who wants to learn basketball from online videos would like to search for clips containing the action throwing or slam dunk in order to practice specific skills, rather than clips simply labelled as basketball.
Fine-grained action recognition presents more detailed
understanding of the video content. However, it has not been
extensively studied. One reason for the lack of research on
fine-grained action recognition is the absence of benchmark
datasets. Most existing datasets such as the KTH [1], the Weizmann [2], the Hollywood [4, 5], the UCF databases [3, 6, 7, 9] and the CCV [10] are without fine-grained labels. With
these datasets, we can distinguish basketball from football,
but cannot learn how to distinguish throwing from slam dunk. The MPII database [13] released in 2013 is finely
labelled. However, all the videos are captured in a fixed kitchen with a stationary camera and mainly focus on hand actions, which cannot meet the requirements of realistic applications. The FGA-240 [14] dataset is another
] dataset is another
fine-grained action dataset with a very large scale. However,
only the videos containing actions are released, while the
start and the end frames of each action are not annotated.
In this paper, we introduce and release a new fine-grained action dataset called TaiChi, which is composed of videos of Tai Chi sports. Tai Chi is a traditional Chinese art, practiced as a kind of sport composed of slow, soft, and continuously flowing movements. There are currently lots
continuously flowing movements. There are currently lots
of Tai Chi genres in different styles, and [
16
] provides a
glance of Tai Chi styles. The Tai Chi sport is composed of
a number of moves, and each move consists of a few basic
actions of different body parts. Different Tai Chi genres
develop different Tai Chi moves, but they share the same
basic Tai Chi actions. These basic actions can be combined in different permutations to form numerous Tai Chi moves. We take
the basic Tai Chi actions as the fine-grained actions in our
dataset which are very similar to each other. Specifically, the
TaiChi dataset contains 58 fine-grained Tai Chi actions of
different body parts such as hand, arm, leg, and foot. The