IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 12, NO. 4, DECEMBER 2011 1037
Predicting Pedestrian Counts in Crowded Scenes
With Rich and High-Dimensional Features
Junping Zhang, Member, IEEE, Ben Tan, Fei Sha, and Li He
Abstract—Estimating the number of pedestrians in surveil-
lance images and videos has important applications in intelligent
transportation systems. This problem is particularly challenging
when the scenes are densely crowded, in which case the techniques of
tracking a single pedestrian have limited effectiveness. Alternative
approaches employ statistical learning algorithms to infer pedes-
trian counts directly from visual features computed on images or
scenes. In this paper, we describe a system for predicting pedes-
trian counts that significantly extends the utility of those ideas.
Our approach incorporates a richer set of features for statistical
modeling. While these features give rise to regression problems in a
high-dimensional space, we leverage learning techniques to reduce
dimensionality while still attaining high accuracy for predict-
ing the number of pedestrians. Empirical results have validated
our strategy. Specifically, our system outperforms state-of-the-art
methods on standard benchmark tasks by a large margin.
Index Terms—Ensemble learning, Gaussian processes, kernel
dimension reduction (KDR), pedestrian counting, statistical land-
scape features (SLFs).
I. INTRODUCTION
ESTIMATING the number of pedestrians has many applications
in intelligent transportation systems. Pedestrian
counts have been used to optimize the design of traffic
infrastructures and manage practices for transportation and
pedestrian safety [1], [2]. In emergency response systems,
counting pedestrians with high accuracy provides timely and
valuable feedback to guide mass evacuation [3]. Of particular
interest is to automatically infer pedestrian counts from surveillance
images and videos. Such interest has been instigated
by widespread deployments of surveillance video cameras in
public areas [4].
Manuscript received March 19, 2010; revised December 3, 2010 and
February 23, 2011; accepted March 10, 2011. Date of publication April 19,
2011; date of current version December 5, 2011. This work was supported
in part by the 973 Program under Project 2006CB705506 and Project
2010CB327900; by the National Science Foundation of China under Grant
60975044; by Fudan University Key Laboratory Senior Visiting Scholarship;
and by the State Key Laboratory of Rail Traffic Control and Safety, Beijing
Jiaotong University, under Contract RCS2008007. The Associate Editor for this
paper was S. Tang.
J. Zhang is with Shanghai Key Laboratory of Intelligent Information
Processing, School of Computer Science, Fudan University, Shanghai 200433,
China, and also with the State Key Laboratory of Rail Traffic Control
and Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail:
jpzhang@fudan.edu.cn).
B. Tan is with Shanghai Key Laboratory of Intelligent Information Processing,
School of Computer Science, Fudan University, Shanghai 200433, China
(e-mail: tanben@yeah.net).
F. Sha is with the Department of Computer Science, Viterbi School of
Engineering, University of Southern California, Los Angeles, CA 90089 USA
(e-mail: feisha@usc.edu).
L. He is with Yahoo! Labs China, Beijing 100083, China (e-mail:
sigmastudio@gmail.com).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2011.2132759
Automatic inference of pedestrian counts from surveillance
images is a challenging task for image processing and computer
vision. Broadly speaking, two types of approaches have been
investigated. The first type relies on reliable tracking of indi-
vidual pedestrians [5], [6]. These methods are well suited for
images with a small number of pedestrians (and moving objects).
When the density of the pedestrian crowd increases, the perfor-
mance of tracking-based methods starts to deteriorate. This is
often attributed to significant occlusion in the scenes, as well
as large variances in pedestrian appearances, including height,
clothing, accessories, etc. With these complicating factors, de-
tecting individual pedestrians quickly becomes impractical.
A more scalable approach is to directly estimate the counts
without identifying individuals in complicated scenes. Intu-
itively, one views the task of estimating pedestrian counts
(or crowd density) as a regression problem. In the regression
model, the inputs (i.e., the covariates) are visual features com-
puted on images, and the output (i.e., the response variable) is
the pedestrian count or the crowd density. Parameters of these
regression models can be estimated from training data, i.e.,
images annotated with a known number of pedestrians.
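The regression view described above can be illustrated with a minimal sketch. It assumes a single hypothetical scalar feature per image (here called the foreground pixel area) and an ordinary least-squares line; the function names and the training values are made up for illustration and are not taken from any of the cited systems.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict a pedestrian count from one feature value."""
    slope, intercept = model
    return slope * x + intercept


# Illustrative (made-up) training data: one feature value per annotated image.
train_areas = [1000.0, 2000.0, 3000.0, 4000.0]   # covariate (hypothetical feature)
train_counts = [3.0, 5.0, 7.0, 9.0]              # response (annotated counts)

model = fit_line(train_areas, train_counts)
print(predict(model, 2500.0))
```

With several features per image, the same idea carries over to multivariate regression, where the fit is obtained by solving a linear least-squares system instead of the closed-form expressions used here.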
Davies et al. examined this kind of approach with geomet-
rical features such as areas (the number of pixels occupied)
and perimeters (the number of pixels in the edges) [7]. They
fitted a linear regression model relating these features to pedestrian
counts. Because apparent object sizes depend on viewing angles and on the
distances between pedestrians and the cameras' imaging planes,
Ma et al. [8] and Chan et al. [9] proposed
methods to normalize the effect of these imaging differences.
There have also been experiments with other types of features.
For instance, the Minkowski fractal dimension of edges, which
describes the irregularity of edges, was shown to correlate with
the density of pedestrians in the images [10].
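The features mentioned in this paragraph can be sketched as follows. The sketch assumes a binary foreground mask given as a list of rows and a set of edge pixel coordinates. It uses a box-counting estimate of the Minkowski dimension, which may differ from the procedure used in [10]. The function names, the 4-neighbor definition of the perimeter, and the box sizes are illustrative assumptions.

```python
import math


def area_and_perimeter(mask):
    """Count foreground pixels (area) and foreground pixels that touch the
    background or the image border through a 4-neighbor (perimeter)."""
    h, w = len(mask), len(mask[0])
    area = 0
    perimeter = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            area += 1
            neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(
                not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]
                for rr, cc in neighbors
            ):
                perimeter += 1
    return area, perimeter


def box_counting_dimension(edge_pixels, box_sizes=(1, 2, 4, 8)):
    """Estimate the fractal dimension of a non-empty set of edge pixels as
    the least-squares slope of log(box count) against log(1 / box size)."""
    log_inv_sizes = []
    log_counts = []
    for s in box_sizes:
        # Boxes of side s that contain at least one edge pixel.
        boxes = {(r // s, c // s) for r, c in edge_pixels}
        log_inv_sizes.append(math.log(1.0 / s))
        log_counts.append(math.log(len(boxes)))
    n = len(box_sizes)
    mean_x = sum(log_inv_sizes) / n
    mean_y = sum(log_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in log_inv_sizes)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(log_inv_sizes, log_counts)
    )
    return sxy / sxx


# Illustrative input: a 3x3 foreground blob inside a 5x5 image.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
print(area_and_perimeter(mask))
```

In this estimate, a straight edge yields a dimension close to 1, and more irregular, space-filling edge patterns yield values closer to 2.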
Dong et al. [11] built a lookup table that maps each silhouette to
a pair consisting of a pedestrian count and the pedestrians'
configuration in the crowd.
Each silhouette is represented by sampling
points along its external boundary and then transforming
these points to the frequency domain using a discrete Fourier
transform. While this method can be fast and accurate, it
works well only when the pedestrian density is low enough
that each connected region contains only a few pedestrians, as
suggested by the empirical studies reported in [11].
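The boundary representation described above can be sketched as a Fourier descriptor. This is a generic illustration, not the procedure of [11]: it assumes the boundary has already been sampled into (x, y) points in counterclockwise order, treats each point as a complex number, and applies a naive discrete Fourier transform. The normalization steps (dropping the zero-frequency term and dividing by the first magnitude) and all names are assumptions made for this example.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform of a complex sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_descriptor(boundary_points, num_coeffs=4):
    """Describe a closed boundary by normalized DFT magnitudes.

    Dropping coefficient 0 removes the effect of translation, taking
    magnitudes removes rotation and starting-point effects, and dividing
    by the first magnitude removes scale (all assumed choices).
    """
    z = [complex(x, y) for x, y in boundary_points]
    coeffs = dft(z)
    mags = [abs(c) for c in coeffs[1:num_coeffs + 1]]
    if mags[0] == 0:
        raise ValueError("first harmonic is zero; cannot normalize")
    return [m / mags[0] for m in mags]


def sample_circle(cx, cy, radius, n=16):
    """Sample n counterclockwise points on a circle (illustrative input)."""
    return [
        (cx + radius * math.cos(2 * math.pi * t / n),
         cy + radius * math.sin(2 * math.pi * t / n))
        for t in range(n)
    ]


print(fourier_descriptor(sample_circle(0.0, 0.0, 1.0)))
```

Under these assumptions, two silhouettes that differ only in position or size produce nearly identical descriptors, which is the property that makes such a representation usable as a lookup key.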
In this paper, we extend these approaches and describe our
system of pedestrian counting for crowded scenes. In particular,
we have experimented with a rich set of features that were