crowd analysis [105, 30, 44, 55, 117]. Zhan et al. [105] and
Junior et al. [44] were among the first to study and review
existing methods for general crowd analysis. Li et al. [55] sur-
veyed different methods for crowded scene analysis tasks such
as crowd motion pattern learning, crowd behavior, activity anal-
ysis and anomaly detection in crowds. More recently, Zitouni et
al. [117] evaluated existing methods across different research
disciplines by inferring key statistical evidence from existing
literature and provided suggestions towards the general aspects
of techniques rather than any specific algorithm. While these
works focussed on the general aspects of crowd analysis, other
researchers have specifically studied crowd counting and density
estimation methods in detail [61, 81, 79]. Loy et al. [61]
provided a detailed description and comparison of video
imagery-based crowd counting methods, along with an evaluation of
different methods using the same protocol. They also analyzed each
processing module to identify potential bottlenecks and to suggest
new directions for further research. In another work, Ryan et al. [79]
presented an evaluation of regression-based methods for crowd
counting across multiple datasets and provided a detailed anal-
ysis of performance of various hand-crafted features. Recently,
Saleh et al. [81] surveyed two main approaches: the direct
approach (i.e., object-based target detection) and the indirect
approach (e.g., pixel-based, texture-based, and corner-point-based
analysis).
Although existing surveys analyze various methods for crowd
analysis and counting, they cover only traditional methods
that use hand-crafted features and do not take into account
the recent advancements driven primarily by CNN-based
approaches [87, 39, 113, 11, 85, 97, 4, 98, 111, 107, 70, 88]
and the creation of new challenging crowd datasets [106, 107, 111].
While CNN-based approaches have achieved drastically lower
error rates, the creation of new datasets has enabled learning of
more generalized models. To keep up with the rapidly advanc-
ing research in crowd counting, we believe it is necessary to an-
alyze these methods in detail in order to understand the trends.
Hence, in this paper, we provide a survey of recent state-of-
the-art CNN-based approaches for crowd counting and density
estimation for single images.
The rest of the paper is organized as follows: Section 2 briefly
reviews the traditional crowd counting and density estimation
approaches with an emphasis on the most recent methods. This
is followed by a detailed survey on CNN-based methods along
with a discussion on their merits and drawbacks in Section 3.
In Section 5, recently published challenging datasets for crowd
counting are discussed in detail along with results of the state-
of-the-art methods. We discuss several promising avenues for
achieving further progress in Section 6. Finally, concluding re-
marks are made in Section 7.
2. Review of traditional approaches
Various approaches have been proposed to tackle the prob-
lem of crowd counting in images [41, 19, 52, 107, 111] and
videos [12, 35, 77, 21]. Loy et al. [61] broadly classi-
fied traditional crowd counting methods based on the approach
into the following categories: (1) Detection-based approaches,
(2) Regression-based approaches, and (3) Density estimation-
based approaches.
Since the focus of this work is on CNN-based approaches,
in this section, we briefly review the detection and regression-
based approaches using hand-crafted features for the sake of
completeness. In addition, we present a review of the recent
traditional methods [41, 52, 75, 99, 102] that have not been an-
alyzed in earlier surveys.
2.1. Detection-based approaches
Most of the initial research was focussed on detection-style
frameworks, where a sliding window detector is used to detect
people in the scene [26] and this information is used to count
the number of people [54]. Detection is usually performed in
either a monolithic or a parts-based style. Monolithic
detection approaches [25, 51, 94, 28] are typically traditional
pedestrian detection methods that train a classifier using
features (such as Haar wavelets [95], histograms of oriented
gradients [25], edgelets [100] and shapelets [80]) extracted from
a full body. Various learning approaches such as Support Vector
Machines, boosting [96] and random forests [34] have been
used with varying degrees of success. Though successful in
low density crowd scenes, these methods are adversely affected
by the presence of high density crowds. Researchers have at-
tempted to address this issue by adopting part-based detection
methods [29, 57, 101], where one constructs boosted classifiers
for specific body parts such as the head and shoulder to esti-
mate the people counts in a designated area [54]. In another
approach using shape learning, Zhao et al. [112] modelled hu-
mans using 3D shapes composed of ellipsoids, and employed a
stochastic process to estimate the number and shape configura-
tion that best explains a given foreground mask in a scene. Ge
and Collins [35] further extended the idea by using flexible and
practical shape models.
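The counting step of a detection-style framework can be sketched as follows. This is only a minimal illustration, not the method of any cited work: it assumes that some detector has already produced candidate boxes with confidence scores, and it uses greedy non-maximum suppression (a common choice that the text above does not prescribe) so that overlapping windows on the same person are counted once. The function names and both thresholds are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def count_by_detection(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Count people as the number of detections that survive greedy NMS.

    Assumes boxes have positive area; thresholds are illustrative values.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= score_thresh            # drop low-confidence windows
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)              # highest score first
    count = 0
    while order.size > 0:
        best, rest = order[0], order[1:]
        count += 1                           # keep the best remaining box
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_thresh]  # suppress boxes overlapping it
    return count

boxes = [[0, 0, 10, 20], [1, 0, 11, 20], [50, 50, 60, 70], [100, 100, 110, 120]]
scores = [0.9, 0.8, 0.95, 0.3]
print(count_by_detection(boxes, scores))
```

In dense crowds, heavy occlusion causes true neighbours to overlap strongly, so a suppression step of this kind tends to undercount, which is consistent with the limitation noted above.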
2.2. Regression-based approaches
Though parts-based and shape-based detectors were used to
mitigate the issues of occlusion, these methods were not suc-
cessful in the presence of extremely dense crowds and high
background clutter. To overcome these issues, researchers
attempted to count by regression, where they learn a mapping
from features extracted from local image patches to their
counts [16, 78, 20]. By counting using regression, these methods
avoid the dependency on learning detectors, which is a relatively
complex task. These methods have two major components:
low-level feature extraction and regression modelling.
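The regression component can be illustrated with a small sketch. It assumes that each image patch has already been reduced to a feature vector, and it uses ridge regression in closed form purely as an example of a regression model; the regressors used in the cited works may differ. The names `fit_ridge`, `predict_count` and the parameter `lam` are hypothetical.

```python
import numpy as np

def _with_bias(features):
    """Append a constant column so the model can learn an offset."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_ridge(features, counts, lam=1e-3):
    """Solve (X^T X + lam * I) w = X^T y for the weights w.

    For simplicity the bias term is regularized too (an assumption of
    this sketch, not a recommendation).
    """
    X = _with_bias(features)
    y = np.asarray(counts, dtype=float)
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def predict_count(weights, features):
    """Map patch features to estimated counts."""
    return _with_bias(features) @ weights

# Synthetic example: counts depend linearly on two made-up features.
rng = np.random.default_rng(0)
train_features = rng.uniform(0, 10, size=(50, 2))
train_counts = train_features @ np.array([2.0, 3.0]) + 1.0
weights = fit_ridge(train_features, train_counts, lam=1e-8)
predicted = predict_count(weights, train_features)
```

The image-level count would then be obtained by summing the predictions over all patches of an image.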
A variety of features such as foreground features, edge fea-
tures, texture and gradient features have been used for encoding
low-level information. Foreground features are extracted from
foreground segments in a video using standard background
subtraction techniques. Blob-based holistic features such as area,
perimeter, perimeter-area ratio, etc. have demonstrated
encouraging results [15, 20, 78]. While these methods capture
global properties of the scene, local features such as edges and
texture/gradient features such as local binary patterns (LBP),
histograms of oriented gradients (HOG), and gray level co-occurrence
matrices (GLCM) have been used to further improve the results.
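As a rough illustration of the blob-based holistic features mentioned above, the sketch below computes area, perimeter and perimeter-area ratio from a binary foreground mask. It assumes the mask has already been produced by background subtraction and treats the whole mask as a single segment. The perimeter definition used here (foreground pixels with at least one background 4-neighbour) is one simple convention among several, and `blob_features` is a hypothetical name.

```python
import numpy as np

def blob_features(mask):
    """Return [area, perimeter, perimeter/area] for a binary foreground mask."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with background so pixels on the image border count as boundary.
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when all four of its neighbours are foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    area = int(mask.sum())
    perimeter = int((mask & ~interior).sum())
    ratio = perimeter / area if area else 0.0
    return np.array([area, perimeter, ratio], dtype=float)

# A 4x4 foreground square inside a 6x6 frame.
mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1:5] = True
square_features = blob_features(mask)
```

A feature vector of this kind could be concatenated with edge or texture descriptors before being passed to a regressor.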