Robust Real-Time Pedestrian Detection on Embedded Devices
Mohamed Afifi∗, Yara Ali, Karim Amer, Mahmoud Shaker, and Mohamed Elhelw
Center for Informatics Science, Nile University, Giza, Egypt
ABSTRACT
Detection of pedestrians on embedded devices, such as those on board robots and drones, has many applications
including road intersection monitoring, security, crowd monitoring and surveillance, to name a few. However,
the problem can be challenging due to the continuously changing camera viewpoint and varying object appearances,
as well as the need for lightweight algorithms suitable for embedded systems. This paper proposes a robust
framework for pedestrian detection in video footage. The framework performs fine and coarse detections on
different image regions and exploits temporal and spatial characteristics to attain enhanced accuracy and real-time
performance on embedded boards. The framework uses the Yolo-v3 object detector [1] as its backbone
and runs on the Nvidia Jetson TX2 embedded board; however, other detectors and/or boards can be
used as well. The performance of the framework is demonstrated on two established datasets, where it achieved
second place in the CVPR 2019 Embedded Real-Time Inference (ERTI) Challenge.
Keywords: Pedestrian detection, UAV, real-time inference.
1. INTRODUCTION
Numerous deep learning architectures have been proposed since Krizhevsky et al. [2] trained a neural network model
of multiple convolutional and feedforward layers on a large dataset of images for object classification.
One family of these architectures is designed for object detection,
which entails predicting bounding boxes that enclose objects of interest in a given image. State-of-the-art
approaches for this task can be roughly divided into two categories. The first includes two-stage models such as
R-CNN [3], Fast R-CNN [4], Faster R-CNN [5] and SPP-net [6], which propose candidate regions and then process and
classify those regions. The second category comprises single-stage models such as Yolo [7] and SSD [8]. Two-stage
object detection models achieve higher accuracy but suffer from slow inference due to their demanding computations;
single-stage models, on the other hand, are faster at the cost of lower accuracy.
In order to deploy the above models on board embedded devices, two important aspects must be taken into
consideration. First, typical embedded devices have limited computational power. Second, a sequence of images
(i.e. video) must be processed. Recent work has aimed to address these constraints by creating lightweight versions
of the original models, such as Tiny-Yolo [1] and SSD300 [8]. Other approaches such as MobileNet [9] and ShuffleNet [10]
employ efficient network designs that attain higher frame rates. Lu et al. [11] incorporated a Long Short-Term
Memory (LSTM) model to make use of the spatio-temporal relation among consecutive frames in a video, while
Broad et al. [12] added a convolutional recurrent layer to the SSD architecture to fuse temporal information.
This paper proposes a novel framework for robust real-time pedestrian detection in videos captured above
street level, such as those from pole-mounted security cameras. The framework uses Yolo-v3 as its backbone
detector but works with other detectors that have similar characteristics. It exploits temporal information in videos while
performing real-time inference by combining deep learning models pre-trained on large-scale datasets of single
images. Multiple input resolutions are used to perform robust pedestrian detection with high throughput,
making the framework suitable for real-time operation on the Nvidia Jetson TX2 and similar embedded boards.
Figure 1 shows an example where the proposed framework clearly achieves improved results compared to the
Yolo-v3 detector.
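To make the multi-resolution idea concrete, the following minimal Python sketch runs a backbone detector twice per frame: a coarse pass over the whole frame at a small input size, and a fine pass over a high-resolution sub-region, with the two detection sets merged by greedy non-maximum suppression. This is an illustrative sketch under stated assumptions, not the paper's exact pipeline: detector is a hypothetical wrapper around a Yolo-v3 model that returns (x1, y1, x2, y2, score) boxes in the coordinates of the image it is given, and fine_region is an assumed region of interest.

def box_iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2, score) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def coarse_fine_detect(frame, detector, fine_region,
                       coarse_size=320, fine_size=608, iou_thresh=0.5):
    # Coarse pass: the whole frame at a small input size catches
    # large (near) pedestrians cheaply.
    detections = list(detector(frame, coarse_size))
    # Fine pass: crop the region where pedestrians appear small
    # (e.g. the far part of the scene) and detect at a larger input size.
    x0, y0, x1, y1 = fine_region
    for bx1, by1, bx2, by2, score in detector(frame[y0:y1, x0:x1], fine_size):
        # Map crop coordinates back to full-frame coordinates.
        detections.append((bx1 + x0, by1 + y0, bx2 + x0, by2 + y0, score))
    # Greedy non-maximum suppression merges duplicates found by both passes.
    detections.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for d in detections:
        if all(box_iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept

On an embedded board, the coarse pass keeps the per-frame cost low while the fine pass recovers small, distant pedestrians; one plausible use of the temporal information the paper mentions is to decide, from recent detections, where the fine-pass region should be placed in the next frame.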
∗Indicates equal contribution
Further author information: (Send correspondence to Karim Amer)
Karim Amer: E-mail: k.amer@nu.edu.eg
https://sites.google.com/site/uavision2019/