Recurrent Modeling of Interaction Context for Collective Activity Recognition
Minsi Wang, Bingbing Ni, Xiaokang Yang
Shanghai Jiao Tong University
mswang1994@gmail.com, {nibingbing,xkyang}@sjtu.edu.cn
Abstract
Modeling high order interactional context, e.g., group interaction, lies at the heart of collective/group activity recognition. However, most previous activity recognition methods do not offer a flexible and scalable scheme for this high order context modeling problem. To address this fundamental bottleneck explicitly, we propose a recurrent interactional context modeling scheme based on an LSTM network. By exploiting the information propagation/aggregation capability of the LSTM, the proposed scheme unifies interactional feature modeling for single-person dynamics, intra-group interactions (i.e., among persons within a group), and inter-group interactions (i.e., group to group). The proposed high order context modeling scheme produces more discriminative/descriptive interactional features. It flexibly handles a varying number of input instances (e.g., different numbers of persons in a group, or different numbers of groups) and scales linearly with the order of the context being modeled. Extensive experiments on two benchmark collective/group activity datasets demonstrate the effectiveness of the proposed method.
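To make the aggregation idea concrete, the following is a minimal sketch of how an LSTM-based aggregator can consume a varying number of per-instance features by pooling them into a fixed-size representation at each timestep before recurrence. This is an illustration of the general mechanism, not the authors' released code; the max-pooling choice, feature dimensions, and module names are our assumptions.

```python
import torch
import torch.nn as nn

class ContextAggregator(nn.Module):
    """Pools a variable number of instance features into a fixed-size
    vector per timestep, then tracks its temporal evolution with an
    LSTM. A sketch only; dimensions and pooling are assumptions."""

    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, instance_feats):
        # instance_feats: list over T timesteps; element t has shape
        # (N_t, feat_dim), where N_t may differ from step to step.
        pooled = torch.stack(
            [f.max(dim=0).values for f in instance_feats]  # (T, feat_dim)
        )
        out, _ = self.lstm(pooled.unsqueeze(0))  # (1, T, hidden_dim)
        return out.squeeze(0)                    # per-step context feature

# Usage: 3 persons at t=0, 5 persons at t=1 -- the pooled LSTM input
# has the same size at every step regardless of the instance count.
agg = ContextAggregator()
feats = [torch.randn(3, 256), torch.randn(5, 256)]
context = agg(feats)  # shape (2, 256)
```

Because the pooling step collapses any number of instances to one vector, adding a person (or a group) changes only the pooling cost, which is where the linear scalability claim comes from.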
1. Introduction
Analyzing collective activities of groups provides useful information for several real-world applications, including social role understanding and social event prediction. The main challenge of collective activity recognition is modeling the interactional context among persons, because the number of persons involved in an interaction varies. Moreover, in most cases a collective activity comprises several interacting sub-groups, and modeling this group to group interaction is even more challenging.
Figure 1. Overview of the proposed framework: a hierarchical recurrent interactional context modeling framework in which stacked LSTM layers capture person-level, group-level, and scene-level context over the input image sequence (frames t-2, t-1, t), modeling both intra-group and inter-group interaction context.

Previous methods for activity recognition mainly focus on modeling unary features, e.g., single-person appearance or dynamics [21, 26], and person to person interactions (i.e., pairwise features) [22]. However, such contextual information modeling schemes are not sufficient for collective activity recognition, because different collective activity categories may share the same type of unary or pairwise features (e.g., "standing alone" occurs in both queueing and discussion, and "facing the same direction" occurs in both walking and crossing). In other words, besides modeling the intra-group interaction (i.e., interaction among the persons within a group), effectively describing the group to group interaction is even more important; low order contextual features alone do not provide sufficient cues to recognize these activities. To address this fundamental problem, most previous methods attempt to encode the high order relationships among persons in the scene by inferring latent graphical structures [9, 8]. However, applying these approaches to collective activity recognition is often infeasible, because inference and learning incur a high computational cost even for tree-structured models, and it is very difficult to generalize graphical-model-based methods to higher order interactional context. Ni et al. [24] proposed a causality analysis framework that encodes unary, pairwise, and group interaction features. However, this method only models human trajectory level information, which is insufficient to recognize finer-grained actions, e.g., those that can only be distinguished by human appearance or local body part dynamics.
A fundamental question thus arises: how can we systematically encode the high order human interactional context, i.e., both the intra-group and the inter-group interactions, within a unified framework?
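The hierarchy in Figure 1 suggests one answer. The sketch below stacks three LSTM levels: per-person LSTMs, a group-level LSTM fed by features pooled within each group, and a scene-level LSTM fed by features pooled over groups. It is a minimal illustration of that structure, not the paper's implementation; the module names, dimensions, max-pooling choice, and number of activity classes are all our assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalContextLSTM(nn.Module):
    """Three-level recurrent context model sketched after Figure 1:
    person level -> group level -> scene level. Names/dims assumed."""

    def __init__(self, feat_dim=256, hidden=256, num_classes=5):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.group_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.scene_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, groups):
        # groups: list of groups; each group is a tensor of shape
        # (num_persons, T, feat_dim). Group sizes and the number of
        # groups may vary freely between inputs.
        group_feats = []
        for persons in groups:
            h, _ = self.person_lstm(persons)       # (N, T, hidden)
            pooled = h.max(dim=0).values           # intra-group pool: (T, hidden)
            g, _ = self.group_lstm(pooled.unsqueeze(0))
            group_feats.append(g.squeeze(0))       # (T, hidden)
        scene_in = torch.stack(group_feats).max(dim=0).values  # inter-group pool
        s, _ = self.scene_lstm(scene_in.unsqueeze(0))           # (1, T, hidden)
        return self.classifier(s[:, -1])           # activity logits

# Two groups with 3 and 4 persons, tracked over T=10 frames:
model = HierarchicalContextLSTM()
logits = model([torch.randn(3, 10, 256), torch.randn(4, 10, 256)])
print(logits.shape)  # torch.Size([1, 5])
```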