A Post-Processing Approach
in Moving Objects Detection via Feature Pyramid Networks
Li Lin, Bin Wang, Yinjuan Gu
School of Communication and Information Engineering
Shanghai University, Shanghai 200072, China
jocelyn_ly@shu.edu.cn
Abstract— Recent work has shown that Convolutional Neural
Networks (CNNs) are highly effective at classification
problems in the pattern recognition field. Moving objects
detection, regarded as a classification process, labels every
pixel as a foreground pixel or a background pixel. In this
paper, we propose an effective post-processing approach,
Residual Background Networks (ResBGNets), to improve the
accuracy of moving objects detection in video sequences.
Instead of learning the ground truth directly, our model
learns the residual pictures between the results of existing
methods and the ground truth. This helps it capture the
characteristic errors of each algorithm and correct the
misclassification. Inside ResBGNets, we build Feature
Pyramid Networks (FPN) to combine the fine spatial detail of
high-resolution levels with the high-level semantic features
of low-resolution levels. Evaluation performed on the
2014 CDnet dataset reveals that, with our approach, most
existing background subtraction methods achieve
better detection results and a significantly higher F-measure (FM) score.
Keywords-Moving objects detection; convolutional neural
networks; background subtraction; residual pictures; feature
pyramids
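The residual idea described in the abstract can be illustrated with a minimal sketch. It assumes binary masks (1 for foreground, 0 for background) stored as nested lists, and it assumes the residual is defined as ground truth minus result; the function names and this sign convention are illustrative assumptions, not taken from the paper.

```python
def residual_picture(result, ground_truth):
    """Per-pixel residual (assumed convention: ground truth minus result).

    +1 marks a missed foreground pixel, -1 a false alarm, 0 a correct label.
    """
    return [[g - r for r, g in zip(res_row, gt_row)]
            for res_row, gt_row in zip(result, ground_truth)]


def apply_residual(result, residual):
    """Correct a detection result by adding a (predicted) residual to it."""
    return [[r + d for r, d in zip(res_row, d_row)]
            for res_row, d_row in zip(result, residual)]


def f_measure(result, ground_truth):
    """F-measure = 2 * precision * recall / (precision + recall)."""
    tp = fp = fn = 0
    for res_row, gt_row in zip(result, ground_truth):
        for r, g in zip(res_row, gt_row):
            if r == 1 and g == 1:
                tp += 1
            elif r == 1 and g == 0:
                fp += 1
            elif r == 0 and g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this toy setting the exact residual restores the ground truth; in practice a network would only approximate the residual, so the correction would be partial.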
I. INTRODUCTION
In the past few years, video surveillance has been applied not
only in traditional security-sensitive areas such as banks,
airports and traffic, but also widely in other aspects of
our daily life. Analysing millions of captured video
sequences manually requires a considerable amount of
time. Fortunately, today's computer technology can
do this efficiently. In vehicle tracking [1], people
counting [2], action recognition [3], and many other
computer vision applications [4-5], moving objects
detection typically serves as the first step. Thus, it
has attracted strong interest from researchers. The
main purpose of motion detection is to separate foreground
and background pixels. Given the complexity and
uncertainty of real-world environments, the traditional
method, which uses a static image as the background
reference, has been replaced by various state-of-the-art
background subtraction methods and supervised machine
learning algorithms.
Background subtraction methods can complete the
detection without any manual intervention. The single
Gaussian model [6] uses just one Gaussian function to
estimate the distribution of a background pixel. Such a
model is only suitable for static scenes. The Gaussian
Mixture Model (GMM) [7-8] is an extension of the single
Gaussian model. It describes a background pixel by a
mixture of K (or an adaptive number of) Gaussian
distributions so that it can deal with a dynamic, complex
background (e.g. rain, swaying tree leaves, and ripples).
Unlike GMM, the non-parametric model based on Kernel
Density Estimation (KDE) [9] estimates its background
probability density function entirely from the most recent
observations. These classical probabilistic
approaches often perform poorly when they
encounter camouflage, cast shadows, sudden
illumination changes, camera motion and so on.
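To make the single Gaussian idea concrete, the following sketch models one greyscale pixel with a running mean and variance and flags values that fall too far from the mean. The learning rate `alpha`, the threshold `k`, the choice to update only on background matches, and the function names are assumptions chosen for illustration; they are not taken from [6].

```python
def classify_and_update(mean, var, pixel, alpha=0.05, k=2.5):
    """Classify one pixel value against its Gaussian model, then update it.

    A value is foreground when it lies more than k standard deviations
    from the mean. The model is updated only on background matches
    (an assumed, conservative update policy).
    """
    diff = pixel - mean
    is_foreground = diff * diff > (k * k) * var
    if not is_foreground:
        mean = mean + alpha * diff
        var = (1.0 - alpha) * var + alpha * diff * diff
    return is_foreground, mean, var


def segment_pixel(values, mean, var):
    """Label a sequence of values observed at one pixel location."""
    labels = []
    for value in values:
        foreground, mean, var = classify_and_update(mean, var, value)
        labels.append(foreground)
    return labels
```

For example, a pixel hovering around 100 with one sudden reading of 180 would have only that reading flagged as foreground. A single Gaussian cannot represent a pixel that alternates between two background values, which is the case a mixture of several Gaussians is meant to handle.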
Some non-mathematical background subtraction
methods [10-13] also achieve an accurate result without
the need for manual intervention. Visual Background
extractor (ViBe) [10] models every pixel with just
twenty colour values, which keeps its memory footprint
small. Rapid and simple ‘one-frame-initialization’
is another notable advantage.
Self-Balanced SENsitivity SEgmenter (SuBSENSE) [11]
improves on ViBe. It characterizes
individual pixels by not only colour
values but also local texture features. The decision
threshold and the update rate, which are fixed in the ViBe
algorithm, are adapted by monitoring background
dynamics and segmentation noise. Bin Wang et al. [12]
proposed a fast and effective Adapting Multi-resolution
Background ExtractoR (AMBER), which uses
efficacies to indicate the matching frequency of each
background value. The innovations of Multimode
Background Subtraction (MBS) [13] include the use of
multiple colour spaces, a background model bank for
background modelling, and Mega-Pixel
formation.
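The sample-based modelling used by ViBe can be sketched for a single greyscale pixel as follows. Only the twenty samples per pixel come from the text above; the matching radius, the minimum number of matches, the subsampling factor, and the initialization by jittering the pixel's own first value are assumptions for illustration (the sketch also omits any propagation of samples to neighbouring pixels).

```python
import random


def init_model(first_value, n_samples=20, noise=10, rng=random):
    """Build a sample set from a single frame.

    Assumption: samples are the first observed value plus small random
    jitter, standing in for a one-frame initialization.
    """
    return [first_value + rng.randint(-noise, noise) for _ in range(n_samples)]


def is_background(samples, pixel, radius=20, min_matches=2):
    """A pixel is background if enough stored samples lie close to it."""
    matches = 0
    for sample in samples:
        if abs(sample - pixel) < radius:
            matches += 1
            if matches >= min_matches:
                return True
    return False


def update_model(samples, pixel, subsampling=16, rng=random):
    """With probability 1/subsampling, overwrite a random sample in place."""
    if rng.randrange(subsampling) == 0:
        samples[rng.randrange(len(samples))] = pixel
```

A typical loop would call `is_background` for each new value and call `update_model` only when the value was classified as background, so that foreground objects do not contaminate the sample set.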
Although most background subtraction methods
have reached a reasonable degree of accuracy, their
F-measures remain lower than those of supervised machine
learning algorithms. Yi Wang et al. [14] proposed a
multi-resolution convolutional neural network with a cascaded
architecture named Cascade CNN. It uses a limited number
of ground truth images, where every foreground moving
object is manually annotated, as the training set. Lim et al.
[15] proposed an encoder-decoder type network model,
which contains a triplet CNN operating at three different
scales for feature encoding and a transposed convolutional
network for decoding. The method in [16] randomly