FaceBoxes: A CPU Real-time Face Detector with High Accuracy
Shifeng Zhang Xiangyu Zhu Zhen Lei*
Hailin Shi Xiaobo Wang Stan Z. Li
CBSR & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{shifeng.zhang,xiangyu.zhu,zlei,hailin.shi,xiaobo.wang,szli}@nlpr.ia.ac.cn
Abstract
Although tremendous strides have been made in face de-
tection, one of the remaining open challenges is to achieve
real-time speed on the CPU as well as maintain high perfor-
mance, since effective models for face detection tend to be
computationally prohibitive. To address this challenge, we
propose a novel face detector, named FaceBoxes, with supe-
rior performance on both speed and accuracy. Specifically,
our method has a lightweight yet powerful network struc-
ture that consists of the Rapidly Digested Convolutional
Layers (RDCL) and the Multiple Scale Convolutional Lay-
ers (MSCL). The RDCL is designed to enable FaceBoxes
to achieve real-time speed on the CPU. The MSCL aims at
enriching the receptive fields and discretizing anchors over
different layers to handle faces of various scales. In addition,
we propose a new anchor densification strategy that gives
different types of anchors the same density on the
image, which significantly improves the recall rate of small
faces. As a consequence, the proposed detector runs at 20
FPS on a single CPU core and 125 FPS using a GPU for
VGA-resolution images. Moreover, the speed of FaceBoxes
is invariant to the number of faces. We comprehensively
evaluate this method and present state-of-the-art detection
performance on several face detection benchmark datasets,
including AFW, PASCAL face, and FDDB.
1. Introduction
Face detection is one of the fundamental problems in
computer vision and pattern recognition. It plays an im-
portant role in many subsequent face-related applications,
such as face alignment [46], face recognition [47] and face
tracking [12]. With the great progress of the past few
decades, especially the breakthrough of convolutional neural
networks, face detection has been successfully applied in
daily life under various scenarios.
However, there are still some tough challenges in the uncontrolled
face detection problem, especially for CPU devices. The challenges
mainly come from two requirements for face detectors: 1) The large
visual variation of faces in cluttered backgrounds requires face
detectors to accurately address a complicated face versus non-face
classification problem; 2) The large search space of possible face
positions and face sizes further imposes a time efficiency
requirement. These two requirements conflict, since
high-accuracy face detectors tend to be computationally
expensive. Therefore, achieving real-time speed on CPU devices
while maintaining high performance remains an open issue for
practical face detectors.
* Corresponding author
In order to meet these two conflicting requirements, face
detection has been intensely studied mainly in two ways.
The early way is based on hand-crafted features. Following
the pioneering work of the Viola-Jones face detector [37],
most of the early works focus on designing robust features
and training effective classifiers. Besides the cascade
structure, the deformable part model (DPM) has been introduced
into face detection tasks and achieves remarkable performance.
However, these methods depend heavily on non-robust
hand-crafted features and optimize each component
separately, making the face detection pipeline sub-optimal.
In brief, they are efficient on the CPU but not accurate
enough against the large visual variation of faces.
The other way is based on the convolutional neural net-
work (CNN) which has achieved remarkable successes in
recent years, ranging from image classification to object
detection. Recently, CNN has been successfully introduced
into the face detection task as a feature extractor in the
traditional face detection framework [23, 41, 42]. Moreover,
some face detectors [4, 45] have inherited valid techniques
from the generic object detection methods, such as Faster
R-CNN [29]. These CNN-based face detection methods
are robust to the large variation of facial appearances and
demonstrate state-of-the-art performance. But they are too
time-consuming to achieve real-time speed, especially on
CPU devices.
These two ways have their own advantages. The former
is fast, while the latter is highly accurate.
To perform well on both speed and accuracy, one natural
2017 IEEE International Joint Conference on Biometrics (IJCB)978-1-5386-1124-1/17/$31.00 ©2017 IEEE