A Survey of Advances in Deep-Learning-Driven Single-Image Crowd Counting and Density Estimation


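"A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation" is a research article that examines recent progress in crowd counting and density estimation from a single image using convolutional neural networks (CNNs). Crowd counting and density estimation are important problems in computer vision, with wide application in surveillance, public safety, and event management. Traditional solutions include detection-based, regression-based, and density estimation-based methods. With the development of deep learning, and the widespread adoption of CNNs in particular, these methods have improved significantly.

CNN-based methods can be grouped by network property into three categories: basic CNNs, scale-aware models, and context-aware models. Basic CNNs typically adapt a classic architecture such as AlexNet, for example by changing the number of neurons in AlexNet's final fully connected layer so that it predicts the count directly. Scale-aware models focus on handling heads of different sizes, while context-aware models take the global context of the image into account, which is essential for understanding how a crowd is distributed.

Input data is also handled in two main ways: patch-based and whole-image-based. Patch-based methods split the image into several parts and process each part separately, which suits dense crowds. For example, the end-to-end deep CNN model described in "Deep people counting in extremely dense crowds" splits the image and applies AlexNet to predict the count. Whole-image methods instead process the entire image globally. One example is "Fast crowd density estimation with convolutional neural networks", which uses a Multi-stage ConvNet to assign an image to one of five density classes and uses two cascaded classifiers to improve estimation accuracy.

The article also mentions other related projects, such as CrowdAnalytics and ImageEnhancement, which indicates that researchers are continuing to explore and refine crowd analysis and image enhancement techniques. The authors, Vishwanath Sindagi, Vishal M. Patel, and colleagues, work in the Department of Electrical and Computer Engineering at Rutgers University. Their contribution has received 95 citations, which indicates the influence of the work. The survey reviews recent CNN-based techniques for single-image crowd counting and density estimation, covers different network architectures and processing strategies, and shows how effective deep learning is on complex vision problems. These methods not only improve counting accuracy but also give a deeper understanding of crowd distribution, which makes them valuable for future artificial intelligence applications.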

crowd analysis [105, 30, 44, 55, 117]. Zhan et al. [105] and Junior et al. [44] were among the first to study and review existing methods for general crowd analysis. Li et al. [55] surveyed different methods for crowded scene analysis tasks such as crowd motion pattern learning, crowd behavior, activity analysis, and anomaly detection in crowds. More recently, Zitouni et al. [117] evaluated existing methods across different research disciplines by inferring key statistical evidence from existing literature and provided suggestions on the general aspects of techniques rather than on any specific algorithm. While these works focused on the general aspects of crowd analysis, other researchers have studied crowd counting and density estimation methods specifically and in detail [61, 81, 79]. Loy et al. [61] provided a detailed description and comparison of video imagery-based crowd counting methods and evaluated them using the same protocol. They also analyzed each processing module to identify potential bottlenecks and to suggest new directions for further research. In another work, Ryan et al. [79] presented an evaluation of regression-based methods for crowd counting across multiple datasets and provided a detailed analysis of the performance of various hand-crafted features. Recently, Saleh et al. [81] surveyed two main approaches: the direct approach (i.e., object-based target detection) and the indirect approach (e.g., pixel-based, texture-based, and corner-point-based analysis).

Though existing surveys analyze various methods for crowd analysis and counting, they cover only traditional methods that use hand-crafted features. They do not take into account the recent advancements driven primarily by CNN-based approaches [87, 39, 113, 11, 85, 97, 4, 98, 111, 107, 70, 88] or the creation of new challenging crowd datasets [106, 107, 111]. While CNN-based approaches have achieved drastically lower error rates, the creation of new datasets has enabled the learning of more generalized models. To keep up with the rapidly advancing research in crowd counting, we believe it is necessary to analyze these methods in detail in order to understand the trends. Hence, in this paper, we provide a survey of recent state-of-the-art CNN-based approaches for crowd counting and density estimation for single images.

The rest of the paper is organized as follows: Section 2 briefly reviews the traditional crowd counting and density estimation approaches with an emphasis on the most recent methods. This is followed in Section 3 by a detailed survey of CNN-based methods along with a discussion of their merits and drawbacks. In Section 5, recently published challenging datasets for crowd counting are discussed in detail along with results of the state-of-the-art methods. We discuss several promising avenues for achieving further progress in Section 6. Finally, concluding remarks are made in Section 7.

2. Review of traditional approaches

Various approaches have been proposed to tackle the problem of crowd counting in images [41, 19, 52, 107, 111] and videos [12, 35, 77, 21]. Loy et al. [61] broadly classified traditional crowd counting methods based on the approach into the following categories: (1) Detection-based approaches, (2) Regression-based approaches, and (3) Density estimation-based approaches.

Since the focus of this work is on CNN-based approaches, in this section, we briefly review the detection and regression-based approaches using hand-crafted features for the sake of completeness. In addition, we present a review of the recent traditional methods [41, 52, 75, 99, 102] that have not been analyzed in earlier surveys.

2.1. Detection-based approaches

Most of the initial research focused on a detection-style framework, where a sliding window detector is used to detect people in the scene [26] and this information is used to count the number of people [54]. Detection is usually performed either in a monolithic style or in a parts-based style. Monolithic detection approaches [25, 51, 94, 28] are typically traditional pedestrian detection methods that train a classifier using features (such as Haar wavelets [95], histogram of oriented gradients [25], edgelet [100] and shapelet [80]) extracted from a full body. Various learning approaches such as Support Vector Machines, boosting [96] and random forest [34] have been used with varying degrees of success. Though successful in low density crowd scenes, these methods are adversely affected by the presence of high density crowds. Researchers have attempted to address this issue by adopting parts-based detection methods [29, 57, 101], where one constructs boosted classifiers for specific body parts such as the head and shoulder to estimate the people counts in a designated area [54]. In another approach using shape learning, Zhao et al. [112] modelled humans using 3D shapes composed of ellipsoids, and employed a stochastic process to estimate the number and shape configuration that best explains a given foreground mask in a scene. Ge and Collins [35] further extended the idea by using flexible and practical shape models.
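The detection-style counting described above can be illustrated with a minimal sketch. It is not the implementation of any cited method: `score_window` is a hypothetical stand-in for a trained classifier (here it simply averages pixel intensity), and the window size, stride, score threshold, and overlap limit are assumed values chosen for a toy image. Overlapping detections are merged with greedy non-maximum suppression, and the count is the number of detections that remain.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int, float]  # (row, col, height, width, score)


def score_window(image: Image, r: int, c: int, h: int, w: int) -> float:
    """Hypothetical stand-in for a trained person classifier: mean intensity."""
    total = sum(image[i][j] for i in range(r, r + h) for j in range(c, c + w))
    return total / (h * w)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    r0, c0 = max(a[0], b[0]), max(a[1], b[1])
    r1 = min(a[0] + a[2], b[0] + b[2])
    c1 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def count_by_detection(
    image: Image,
    win: int = 2,            # assumed window size
    stride: int = 1,         # assumed stride
    threshold: float = 0.9,  # assumed classifier score threshold
    max_iou: float = 0.0,    # assumed overlap limit for suppression
    scorer: Callable[[Image, int, int, int, int], float] = score_window,
) -> int:
    """Slide a window over the image, keep confident windows, suppress overlaps."""
    rows, cols = len(image), len(image[0])
    candidates: List[Box] = []
    for r in range(0, rows - win + 1, stride):
        for c in range(0, cols - win + 1, stride):
            s = scorer(image, r, c, win, win)
            if s >= threshold:
                candidates.append((r, c, win, win, s))
    # Greedy non-maximum suppression: visit boxes by descending score and
    # keep a box only if it does not overlap an already kept box too much.
    candidates.sort(key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if all(iou(box, k) <= max_iou for k in kept):
            kept.append(box)
    return len(kept)


# Toy image with two bright, well-separated blobs.
demo_image = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(count_by_detection(demo_image))
```

In this toy setting, blobs that touch or overlap would be merged into a single detection by the suppression step, which mirrors the undercounting in high density scenes that the text attributes to detection-based methods.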

2.2. Regression-based approaches

Though parts-based and shape-based detectors were used to mitigate the issues of occlusion, these methods were not successful in the presence of extremely dense crowds and high background clutter. To overcome these issues, researchers attempted to count by regression, where they learn a mapping from features extracted from local image patches to their counts [16, 78, 20]. By counting using regression, these methods avoid dependency on learning detectors, which is a relatively complex task. These methods have two major components: low-level feature extraction and regression modelling.

A variety of features such as foreground features, edge features, texture and gradient features have been used for encoding low-level information. Foreground features are extracted from foreground segments in a video using standard background subtraction techniques. Blob-based holistic features such as area, perimeter, perimeter-area ratio, etc. have demonstrated encouraging results [15, 20, 78]. While these methods capture global properties of the scene, local features such as edges and texture/gradient features such as local binary pattern (LBP), histogram of oriented gradients (HOG), and gray level co-occurrence matrices (GLCM) have been used to further improve the results.
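The two components of counting by regression, low-level feature extraction and regression modelling, can be sketched as follows. This is an illustrative toy under stated assumptions, not a method from the cited works: the perimeter is approximated as the number of boundary foreground pixels, the regressor is a plain linear least-squares fit with a small ridge term, and the training feature vectors and counts are made-up numbers.

```python
from typing import List, Sequence, Tuple


def blob_features(mask: List[List[int]]) -> Tuple[int, int, float]:
    """Area, perimeter, and perimeter-area ratio of a binary foreground mask.

    Assumption: the perimeter is the number of foreground pixels with at
    least one 4-neighbour that is background or outside the image.
    """
    rows, cols = len(mask), len(mask[0])
    area = sum(sum(row) for row in mask)
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not mask[rr][cc]:
                    perimeter += 1
                    break
    ratio = perimeter / area if area else 0.0
    return area, perimeter, ratio


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_count_regressor(
    features: Sequence[Sequence[float]],
    counts: Sequence[float],
    ridge: float = 1e-6,  # assumed small regulariser for numerical stability
) -> List[float]:
    """Least-squares fit of count ~ w . features + bias via the normal equations."""
    rows = [list(f) + [1.0] for f in features]  # append a bias column
    d = len(rows[0])
    xtx = [
        [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    xty = [sum(r[i] * y for r, y in zip(rows, counts)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict_count(weights: Sequence[float], feature: Sequence[float]) -> float:
    """Apply the fitted linear model to one feature vector."""
    return sum(w * x for w, x in zip(weights, list(feature) + [1.0]))


# Made-up training data: (area, perimeter) -> count.
train_features = [(20, 16), (45, 30), (60, 50), (90, 56)]
train_counts = [2.8, 6.0, 8.5, 11.8]
weights = fit_count_regressor(train_features, train_counts)

# A 3x3 foreground blob inside a 5x5 mask: area 9, perimeter 8.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
area, perimeter, ratio = blob_features(mask)
print(area, perimeter, round(ratio, 3))
print(round(predict_count(weights, (area, perimeter)), 3))
```

A linear model is used here only to keep the sketch short. Any regressor that maps a feature vector to a count could take its place, and local descriptors such as LBP or HOG histograms would simply be appended to the feature vector.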
