Multi-Pose Multi-Target Tracking for Activity Understanding
Hamid Izadinia
University of Central Florida
Orlando, FL
izadinia@eecs.ucf.edu
Varun Ramakrishna, Kris M. Kitani, Daniel Huber
Carnegie Mellon University
Pittsburgh, PA
{vramakri, kkitani, dhuber}@cs.cmu.edu
Abstract
We evaluate the performance of a widely used tracking-by-detection and data association multi-target tracking pipeline applied to an activity-rich video dataset. In contrast to traditional work on multi-target pedestrian tracking, where people are largely assumed to be upright, we use an activity-rich dataset that includes a wide range of body poses derived from actions such as picking up an object, riding a bike, digging with a shovel, and sitting down. For each step of the tracking pipeline, we identify key limitations and offer practical modifications that enable robust multi-target tracking over a range of activities. We show that the use of multiple posture-specific detectors and an appearance-based data association post-processing step can generate the non-fragmented trajectories essential for holistic activity understanding.
1. Introduction
We explore the task of multi-target, multi-pose person tracking in activity-rich surveillance videos using the current tracking paradigm of tracking-by-detection and data association. Advances in robust category-specific object detectors [5, 6] have motivated the tracking-by-detection paradigm, in which robust detectors act as strong observation models in tracking frameworks. In particular, recent work has shown that a single coarse part-based model (e.g., 5 to 15 parts) [7, 10, 22] is well suited for detecting, representing, and tracking upright people. While these approaches are effective in urban scenarios, such as pedestrians walking on sidewalks or people in subway stations, difficulties arise when people perform other activities such as riding a bike, digging a hole, or pushing a cart. Although methods exist for full-body pose estimation [21, 8, 24], they often assume full body-part visibility. In this work, we target surveillance videos that contain a range of human activities beyond walking and standing. We evaluate the strengths and limitations of state-of-the-art multi-target tracking and offer practical modifications to improve performance.
Figure 1. DARPA Mind's Eye Y2 activity dataset (panels: SAFE HOUSE 1, SAFE HOUSE 2, ROAD 1, ROAD 2).
We proceed with our analysis by dividing the tracking pipeline into two stages: person detection and data association. In the person detection stage, we compare the results of standard pedestrian detectors against richer models that encode variations in pose. In particular, we compare four different deformable part models (DPMs) and show that training models explicitly for different postures improves performance. In the data association stage, we use a state-of-the-art multi-target data association framework [20] and examine how the choice of parameters affects the resulting trajectories. Specifically, we evaluate the tradeoff between the recall rate and the number of ID switches as a function of the parameters. To prevent frequent ID switching and to preserve longer trajectories, we propose an instance-specific trajectory merging process as a post-processing step that uses appearance-based cues to make associations over long periods of time.
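To make the two-stage structure concrete, the following is a minimal illustrative sketch of tracking-by-detection with frame-to-frame data association. It is not the paper's implementation: the paper uses a network-flow data association framework [20] and learned appearance regressors, whereas this sketch uses a simple greedy overlap-based association; all names and thresholds here are hypothetical, and the appearance-based merging step is omitted.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(frames, iou_thresh=0.3):
    """Greedily link per-frame detections into tracklets.

    frames: list of per-frame detection lists (detections from the
    posture-specific detectors, pooled per frame); each detection is a box.
    Returns a list of tracklets, each a list of (frame_index, box) pairs.
    """
    tracklets = []
    active = []  # tracklet indices that were extended in the previous frame
    for t, dets in enumerate(frames):
        next_active = []
        unmatched = list(range(len(dets)))
        for ti in active:
            last_box = tracklets[ti][-1][1]
            best, best_iou = None, iou_thresh
            for di in unmatched:
                score = iou(last_box, dets[di])
                if score > best_iou:
                    best, best_iou = di, score
            if best is not None:  # extend the tracklet with the best match
                tracklets[ti].append((t, dets[best]))
                unmatched.remove(best)
                next_active.append(ti)
        for di in unmatched:  # unmatched detections start new tracklets
            tracklets.append([(t, dets[di])])
            next_active.append(len(tracklets) - 1)
        active = next_active
    return tracklets
```

The fragmentation this sketch produces whenever a target is missed for even one frame is exactly what motivates the appearance-based trajectory merging post-processing step described above.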
The contributions of this paper are as follows: (1) a step-by-step analysis of detection-based data association tracking for activity-rich videos; (2) a multi-pose deformable parts model that allows for robust tracking over pose variations; and (3) long-term data association using target-specific appearance-based regressors.
Work on multi-pedestrian tracking is a significant field
978-1-4673-5052-5/12/$31.00 ©2012 IEEE