Cross-scene Crowd Counting via Deep Convolutional Neural Networks
Cong Zhang¹,²   Hongsheng Li²,³   Xiaogang Wang²   Xiaokang Yang¹
¹ Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
² Department of Electronic Engineering, The Chinese University of Hong Kong
³ School of Electronic Engineering, University of Electronic Science and Technology of China
{zhangcong0929,lihongsheng}@gmail.com   xgwang@ee.cuhk.edu.hk   xkyang@sjtu.edu.cn
Abstract
Cross-scene crowd counting is a challenging task in which people are counted in new target surveillance crowd scenes unseen in the training set, without requiring laborious data annotation for those scenes. The performance of most existing crowd counting methods drops significantly when they are applied to an unseen scene. To address this problem, we propose a deep convolutional neural network (CNN) for crowd counting that is trained alternately with two related learning objectives, crowd density and crowd count. This switchable learning approach is able to reach a better local optimum for both objectives. To handle an unseen target crowd scene, we present a data-driven method to fine-tune the trained CNN model for the target scene. A new dataset including 108 crowd scenes with nearly 200,000 head annotations is introduced to better evaluate the accuracy of cross-scene crowd counting methods. Extensive experiments on the proposed dataset and two other existing datasets demonstrate the effectiveness and reliability of our approach.
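To make the alternating-objective idea concrete, the sketch below shows one possible training loop in PyTorch: a shared convolutional backbone with a density-map head and a count head, switching the loss between the two objectives on a fixed schedule. The layer configuration, the epoch-level switching schedule, and all names (CrowdCNN, train_switchable) are illustrative assumptions, not the architecture or schedule used in this paper.

```python
import torch
import torch.nn as nn

class CrowdCNN(nn.Module):
    """Shared backbone with a density-map head and a count head (illustrative)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 7, padding=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 7, padding=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.density_head = nn.Conv2d(64, 1, 1)          # per-pixel density map
        self.count_head = nn.Sequential(                 # scalar count per patch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1)
        )

    def forward(self, x):
        f = self.features(x)
        return self.density_head(f), self.count_head(f)

def train_switchable(model, loader, epochs=10, lr=1e-4):
    """Alternate between the density objective and the count objective."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    mse = nn.MSELoss()
    for epoch in range(epochs):
        use_density = (epoch % 2 == 0)   # assumed epoch-level switching schedule
        for patches, density_gt, count_gt in loader:
            density_pred, count_pred = model(patches)
            if use_density:
                loss = mse(density_pred, density_gt)          # density-map regression
            else:
                loss = mse(count_pred.squeeze(1), count_gt)   # count regression
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In this sketch the ground-truth density maps are assumed to be resized to the backbone's output resolution; the point is only that both objectives share the same features and take turns driving the gradient updates.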
1. Introduction
Counting pedestrians in crowd videos attracts considerable attention because of the intense demand from video surveillance, and it is especially important for metropolitan security. Crowd counting is a challenging task due to severe occlusions, scene perspective distortions and diverse crowd distributions. Since pedestrian detection and tracking are difficult in crowded scenes, most state-of-the-art methods [6, 4, 5, 17] are regression based, and their goal is to learn a mapping between low-level features and crowd counts. However, these works are scene-specific, i.e., a crowd counting model learned for a particular scene can only be applied to the same scene. Given an unseen scene or a changed scene layout, the model has to be re-trained with new annotations. Few works focus on cross-scene crowd counting, even though it is important for practical applications.
In this paper, we propose a framework for cross-scene crowd counting that requires no extra annotations for a new target scene. Our goal is to learn a mapping from images to crowd counts, and then to apply this mapping to unseen target scenes for cross-scene crowd counting. To achieve this goal, we need to overcome the following challenges. 1) Effective features must be developed to describe crowds. Previous works used general hand-crafted features, which have low representational power for crowds; new descriptors specially designed or learned for crowd scenes are needed. 2) Different scenes have different perspective distortions, crowd distributions and lighting conditions. Without additional training data, a model trained on one specific scene is difficult to apply to other scenes. 3) For most recent works, foreground segmentation is indispensable for crowd counting, but crowd segmentation is a challenging problem and cannot be obtained accurately in most crowded scenes; a scene may also contain stationary crowds without movement. 4) Existing crowd counting datasets are insufficient to support and evaluate cross-scene counting research. The largest one [8] contains only 50 static images of different crowd scenes collected from Flickr, while the widely used UCSD dataset [4] and the Mall dataset [6] consist only of video clips collected from one or two scenes.
Considering these challenges, we propose a Convolutional Neural Network (CNN) based framework for cross-scene crowd counting. After a CNN is trained with a fixed dataset, a data-driven method is introduced to fine-tune (adapt) the learned CNN to an unseen target scene, where training samples similar to the target scene are retrieved from the training scenes for fine-tuning.
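As a rough sketch of this retrieval idea (not the paper's actual similarity criterion), each candidate training sample could be summarized by a fixed-length scene descriptor, with the samples nearest to the target scene's descriptor kept for fine-tuning. The descriptor contents, the helper names, and the use of Euclidean distance below are all illustrative assumptions.

```python
import numpy as np

def retrieve_similar(train_descriptors, target_descriptor, k=100):
    """Indices of the k training samples whose descriptors are closest to the target scene.

    train_descriptors: (N, D) array of per-sample scene descriptors (assumed given).
    target_descriptor: (D,) descriptor computed from the unseen target scene.
    """
    dists = np.linalg.norm(train_descriptors - target_descriptor, axis=1)
    return np.argsort(dists)[:k]

# The selected subset would then be used to continue training the pretrained
# model so it adapts to the target scene, e.g. (make_loader is hypothetical):
#   idx = retrieve_similar(train_desc, target_desc, k=200)
#   train_switchable(pretrained_model, make_loader([train_samples[i] for i in idx]))
```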
Figure 1 illustrates the overall framework of our proposed method. Our cross-scene crowd density estimation and counting framework has the following advantages:
1. Our CNN model is trained for crowd scenes by a switchable learning process with two learning objectives, crowd density maps and crowd counts. The two different but related objectives can alternately assist each other to