Fine-Grained Head Pose Estimation Without Keypoints
Nataniel Ruiz, Eunji Chong, James M. Rehg
Georgia Institute of Technology
{nataniel.ruiz, eunjichong, rehg}@gatech.edu
Abstract

Estimating the head pose of a person is a crucial problem with many applications, such as aiding gaze estimation, modeling attention, fitting 3D models to video, and performing face alignment. Traditionally, head pose is computed by estimating keypoints from the target face and solving the 2D-to-3D correspondence problem with a mean human head model. We argue that this is a fragile method because it relies entirely on landmark detection performance, an extraneous head model, and an ad-hoc fitting step. We present an elegant and robust way to determine pose by training a multi-loss convolutional neural network on 300W-LP, a large synthetically expanded dataset, to predict intrinsic Euler angles (yaw, pitch, and roll) directly from image intensities through joint binned pose classification and regression. We present empirical tests on common in-the-wild pose benchmark datasets which show state-of-the-art results. Additionally, we test our method on a dataset usually used for depth-based pose estimation and begin to close the gap with state-of-the-art depth pose methods. We open-source our training and testing code and release our pre-trained models.¹
1. Introduction
The related problems of head pose estimation and facial expression tracking have played an important role over the past 25 years in driving vision technologies for non-rigid registration and 3D reconstruction and in enabling new ways to manipulate multimedia content and interact with users. Historically, there have been several major approaches to face modeling, with two primary ones being discriminative/landmark-based approaches [26, 29] and parameterized appearance models, or PAMs [4, 15] (see [30] for additional discussion). In recent years, methods which directly extract 2D facial keypoints using modern deep learning tools [2, 35, 14] have become the dominant approach to facial expression analysis, due to their flexibility and robustness to occlusions and extreme pose changes. A by-product of keypoint-based facial expression analysis is the ability to recover the 3D pose of the head, by establishing correspondence between the keypoints and a 3D head model and performing alignment. However, in some applications the head pose may be all that needs to be estimated. In that case, is the keypoint-based approach still the best way forward? This question has not been thoroughly addressed using modern deep learning tools, a gap in the literature that this paper attempts to fill.

¹https://github.com/natanielruiz/deep-head-pose
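The keypoint-based recovery step described above can be written as a 2D-to-3D alignment (perspective-n-point) problem; the notation here is illustrative rather than taken from any specific implementation. Given detected 2D landmarks $\mathbf{x}_i$ and their corresponding points $\mathbf{X}_i$ on a mean 3D head model, the pose is the rotation $R$ and translation $\mathbf{t}$ minimizing the reprojection error

$$
R^{*}, \mathbf{t}^{*} \;=\; \arg\min_{R,\,\mathbf{t}} \sum_{i} \left\| \mathbf{x}_i - \pi\!\left( K \left( R\,\mathbf{X}_i + \mathbf{t} \right) \right) \right\|^{2},
$$

where $K$ is the camera intrinsics matrix and $\pi$ is perspective projection; Euler angles are then extracted from $R^{*}$. Errors in any landmark $\mathbf{x}_i$ or in the model points $\mathbf{X}_i$ propagate directly into the recovered pose.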
We demonstrate that a direct, holistic approach to estimating 3D head pose from image intensities using convolutional neural networks delivers superior accuracy compared to keypoint-based methods. While keypoint detectors have recently improved dramatically due to deep learning, head pose recovery is inherently a two-step process with numerous opportunities for error. First, if sufficient keypoints fail to be detected, then pose recovery is impossible. Second, the accuracy of the pose estimate depends on the quality of the 3D head model. Generic head models can introduce errors for any given participant, and the process of deforming the head model to adapt to each participant requires significant amounts of data and can be computationally expensive.
While it is common for deep learning methods based on keypoints to jointly predict head pose along with facial landmarks, the goal in this case is to improve the accuracy of the facial landmark predictions, and the head pose branch is not sufficiently accurate on its own: see, for example, [14, 20, 21], which are studied in Sections 4.1 and 4.3. A conv-net architecture which directly predicts head pose has the potential to be much simpler, more accurate, and faster. While other works have addressed direct regression of pose from images using conv-nets [31, 19, 3], they did not include a comprehensive set of benchmarks or leverage modern deep architectures.
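The joint binned pose classification and regression mentioned in the abstract can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the bin count (66), bin width (3 degrees over roughly [-99, 99]), and the regression weight `alpha` are assumptions for illustration, and in practice the bin logits come from a CNN backbone rather than being handled directly.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the bin logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_angle(logits, num_bins=66, bin_width=3.0, angle_min=-99.0):
    # Decode a continuous angle as the expectation over bin centers,
    # turning the classifier's soft output into a fine-grained estimate.
    probs = softmax(logits)
    centers = [angle_min + bin_width * (i + 0.5) for i in range(num_bins)]
    return sum(p * c for p, c in zip(probs, centers))

def multi_loss(logits, target_angle, alpha=0.5,
               num_bins=66, bin_width=3.0, angle_min=-99.0):
    # Joint loss: cross-entropy on the ground-truth bin plus a weighted
    # mean-squared error on the expectation-decoded continuous angle.
    probs = softmax(logits)
    target_bin = int((target_angle - angle_min) // bin_width)
    ce = -math.log(probs[target_bin])
    pred = expected_angle(logits, num_bins, bin_width, angle_min)
    mse = (pred - target_angle) ** 2
    return ce + alpha * mse
```

One such loss is computed per Euler angle (yaw, pitch, roll); the coarse classification term stabilizes training while the expectation-based regression term recovers fine-grained angles.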
In applications where accurate head pose estimation is required, a common solution is to use RGBD (depth) cameras. These can be very accurate but suffer from a number of limitations. First, because they use active sensing, they can be difficult to use outdoors and in uncontrolled