
Low-Quality Video Face Recognition with Deep
Networks and Polygonal Chain Distance
Christian Herrmann∗†, Dieter Willersinn†, Jürgen Beyerer†∗
∗Vision and Fusion Lab, Karlsruhe Institute of Technology KIT, Karlsruhe, Germany
†Fraunhofer IOSB, Karlsruhe, Germany
{christian.herrmann|dieter.willersinn|juergen.beyerer}@iosb.fraunhofer.de
Abstract—Face recognition under surveillance circumstances
still poses a significant problem due to low data quality.
Nevertheless, automatic analysis is highly desired for criminal
investigations due to the growing number of security cameras
worldwide. We suggest a face recognition system addressing the
typical issues such as motion blur, noise or compression arti-
facts to improve low-quality recognition rates. A low-resolution
adapted residual neural net serves as the face image descriptor. It
is trained on quality-adjusted public training data generated by
data augmentation strategies such as motion blurring or adding
compression artifacts. To further reduce noise effects, a
noise-resistant manifold-based face track descriptor using a polygonal
chain is proposed. This leads to a performance improvement
on in-the-wild surveillance data compared to conventional local
feature approaches or the state-of-the-art high-resolution VGG-
Face network.
I. INTRODUCTION
The increasing availability of security cameras raises the
demand for analysis of the vast amounts of video footage,
specifically for automatic analysis, because manual inspection is
infeasible. While older security cameras lack resolution and
faces are often unrecognizable, with newer camera generations
the faces become clearer and distinguishable. However, the
data quality is usually still far from professional footage such
as TV or press photographs, where automatic face recognition
achieved impressive results recently, surpassing even human
performance in certain setups [1, 2]. Addressing the low-quality
surveillance domain remains a significant challenge for automatic
face recognition approaches. According to [3], the main reasons are
misalignment, noise affection, lack of effective features and
dimensional mismatch between probe and gallery. Recently, effective alignment
methods for low-quality faces were proposed [4] which we
found to be sufficiently accurate. Consequently, in this paper,
we suggest an effective Convolutional Neural Network (CNN)-based
feature which proves to be more efficient than previous solutions.
We address noise affection by data augmentation and by a
noise-resistant track descriptor that utilizes the temporal
information. Data augmentation is necessary because the large face
datasets suitable for training a CNN are not surveillance datasets
and consequently involve a domain gap. We suggest corresponding
augmentation strategies such as adding motion blur or compression
artifacts to close this gap.
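The augmentation idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: the function names, the horizontal box kernel, the Gaussian noise model and all parameter values are assumptions chosen for illustration, and compression artifacts are left out because they would require an actual image codec.

```python
import random


def motion_blur_horizontal(image, length):
    """Blur each row with a horizontal box kernel of odd size `length`.

    `image` is a list of rows of grayscale values. This is a crude stand-in
    for linear motion blur (assumption: purely horizontal motion).
    """
    if length < 1 or length % 2 == 0:
        raise ValueError("length must be a positive odd integer")
    half = length // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            # Clamp indices at the border so the window always has `length` samples.
            window = [row[min(max(x + d, 0), width - 1)]
                      for d in range(-half, half + 1)]
            new_row.append(sum(window) / length)
        blurred.append(new_row)
    return blurred


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def degrade(image, blur_length=5, noise_sigma=4.0, seed=0):
    """Apply blur followed by noise (hypothetical default parameters)."""
    rng = random.Random(seed)
    blurred = motion_blur_horizontal(image, blur_length)
    return add_gaussian_noise(blurred, noise_sigma, rng)
```

Applying `degrade` to each high-quality training image before it is fed to the network would be one way to move the training distribution toward the surveillance domain; a practical implementation would typically use an array library and randomize the blur direction and strength.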
Fig. 1: Qualitative results of the proposed method on low-quality
data. Line thickness and numbers denote face similarity
(inverse of descriptor distance); line color denotes same (blue) or
different (orange) identity.
In detail, the contributions of this paper are threefold:
First, the adaptation of the residual net architecture [5] to
the low-quality face recognition domain by adjusting layer
configuration and setup. Second, a manifold-based and
noise-resistant strategy using a polygonal chain to aggregate facial
information across multiple frames of a face track. Third, a
systematic analysis of the target domain image quality effects
and their reproduction as data augmentation for high-quality
training data from a different domain.
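As a geometric illustration of the second contribution, the sketch below computes the Euclidean distance from a point to a polygonal chain, i.e. a sequence of points joined by straight segments. It is only a generic building block under stated assumptions: how the chain is built from a face track and how two tracks are compared is not specified in this part of the paper, and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        # Degenerate segment: a and b coincide.
        return math.dist(p, a)
    # Projection parameter of p onto the line, clamped to the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def point_chain_distance(p, chain):
    """Minimum distance from p to a polygonal chain given as a vertex list."""
    if not chain:
        raise ValueError("chain must contain at least one vertex")
    if len(chain) == 1:
        return math.dist(p, chain[0])
    return min(point_segment_distance(p, a, b)
               for a, b in zip(chain, chain[1:]))
```

For example, `point_chain_distance((1, 1), [(0, 0), (2, 0), (2, 2)])` evaluates both segments and returns the smaller distance. Because the segments contain the vertices, this distance is never larger than the distance to the nearest vertex.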
II. RELATED WORK
Addressing low-quality video face recognition involves sev-
eral specific problems.
Face recognition with CNNs. Currently, CNNs are applied only
to high-resolution face recognition, where they significantly
improved the performance compared to previously known
approaches, even surpassing human capabilities in certain
setups [1, 2, 6]. Because these networks are mainly based
on solutions for the ImageNet challenge [7], they adopt the
high resolutions of 224×224 pixels and above. Low-resolution
networks tend to lose performance as shown by [1], which
makes it necessary to address this issue.
Low-quality face recognition. Some approaches addressing
this task try to mitigate the low data quality
by preprocessing steps. This includes super-resolution methods,
where low-quality images are upsampled to apply a
conventional high-resolution face-recognition strategy [8, 9]
which proved to be a solid strategy for comparing low-quality
978-1-5090-2896-2/16/$31.00 ©2016 IEEE