
Face Detection, Pose Estimation, and Landmark Localization in the Wild
Xiangxin Zhu Deva Ramanan
Dept. of Computer Science, University of California, Irvine
{xzhu,dramanan}@ics.uci.edu
Abstract
We present a unified model for face detection, pose es-
timation, and landmark estimation in real-world, cluttered
images. Our model is based on a mixture of trees with
a shared pool of parts; we model every facial landmark
as a part and use global mixtures to capture topological
changes due to viewpoint. We show that tree-structured
models are surprisingly effective at capturing global elas-
tic deformation, while being easy to optimize unlike dense
graph structures. We present extensive results on standard
face benchmarks, as well as a new “in the wild” annotated
dataset, that suggest our system advances the state-of-the-art,
sometimes considerably, for all three tasks. Though our
model is modestly trained with hundreds of faces, it com-
pares favorably to commercial systems trained with billions
of examples (such as Google Picasa and face.com).
1. Introduction
The problem of finding and analyzing faces is a founda-
tional task in computer vision. Though great strides have
been made in face detection, it is still challenging to ob-
tain reliable estimates of head pose and facial landmarks,
particularly in unconstrained “in the wild” images. Ambi-
guities due to the latter are known to be confounding factors
for face recognition [42]. Indeed, even face detection is ar-
guably still difficult for extreme poses.
These three tasks (detection, pose estimation, and land-
mark localization) have traditionally been approached as
separate problems with a disparate set of techniques, such as
scanning window classifiers, view-based eigenspace meth-
ods, and elastic graph models. In this work, we present a
single model that simultaneously advances the state-of-the-
art, sometimes considerably, for all three. We argue that
a unified approach may make the problem easier; for ex-
ample, much work on landmark localization assumes im-
ages are pre-filtered by a face detector, and so suffers from
a near-frontal bias.
Figure 1: We present a unified approach to face detection,
pose estimation, and landmark estimation. Our model is
based on a mixture of tree-structured part models. To evaluate
all aspects of our model, we also present a new, annotated
dataset of “in the wild” images obtained from Flickr.
Our model is a novel but simple approach to encoding
elastic deformation and three-dimensional structure; we use
mixtures of trees with a shared pool of parts (see Figure 1).
We define a “part” at each facial landmark and use global
mixtures to model topological changes due to viewpoint; a
part will only be visible in certain mixtures/views. We allow
different mixtures to share part templates. This allows us to
model a large number of views with low complexity. Fi-
nally, all parameters of our model, including part templates,
modes of elastic deformation, and view-based topology, are
discriminatively trained in a max-margin framework.
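To make the idea of view-specific mixtures drawing on a shared pool of parts concrete, the following is a minimal illustrative sketch and not the authors' implementation. The `Mixture` class, the landmark names, the two example views, and the helper functions are all hypothetical; a real template would hold learned filter weights rather than a string label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mixture:
    """One view-specific tree (hypothetical structure for illustration)."""
    name: str
    part_ids: tuple  # indices into the shared template pool
    parents: tuple   # parents[j] = position in part_ids of part j's parent, -1 for root


# Hypothetical shared pool: one template per landmark appearance.
# Labels stand in for learned part templates.
pool = ["nose", "left_eye", "right_eye", "mouth", "left_ear"]

# Two assumed views. Each view uses only the parts visible in it,
# and both index into the same pool.
mixtures = [
    Mixture("frontal", part_ids=(0, 1, 2, 3), parents=(-1, 0, 0, 0)),
    Mixture("left_profile", part_ids=(0, 1, 3, 4), parents=(-1, 0, 0, 1)),
]


def visible_parts(mixture):
    """Names of the pooled templates that this view uses."""
    return [pool[i] for i in mixture.part_ids]


def template_count(mixtures, shared=True):
    """Templates to learn: distinct pool entries if shared, else one per (view, part)."""
    if shared:
        return len({i for m in mixtures for i in m.part_ids})
    return sum(len(m.part_ids) for m in mixtures)
```

In this toy setting, sharing means 5 templates are learned instead of 8, and the saving grows as more views reuse the same landmarks.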
Notably, most previous work on landmark estimation uses
densely-connected elastic graphs [39, 9] which are difficult
to optimize. Consequently, much effort in the area has fo-
cused on optimization algorithms for escaping local min-
ima. We show that multi-view trees are an effective alter-
native because (1) they can be globally optimized with dy-
namic programming and (2) surprisingly, they still capture
much relevant global elastic structure.
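The claim that a tree-structured model can be globally optimized with dynamic programming can be illustrated with a generic max-sum sketch. This is a simplified example under stated assumptions, not the authors' implementation: the quadratic `deformation` penalty, its `weight`, the shared list of candidate locations, and all function names are hypothetical choices made for illustration.

```python
def deformation(p, c, anchor, weight=0.1):
    """Hypothetical quadratic spring score: child c should sit near parent p + anchor."""
    dx = c[0] - p[0] - anchor[0]
    dy = c[1] - p[1] - anchor[1]
    return -weight * (dx * dx + dy * dy)


def best_configuration(parents, anchors, locations, unary):
    """Max-sum dynamic programming on a tree of parts.

    parents[i]  : index of part i's parent (-1 for the root, part 0);
                  assumes parents[i] < i for every non-root part.
    anchors[i]  : ideal (dx, dy) offset of part i from its parent.
    locations   : candidate (x, y) positions, shared by all parts.
    unary[i][k] : appearance score of part i at locations[k].
    Returns (best total score, chosen location index for each part).
    """
    n, m = len(parents), len(locations)
    msg = [list(row) for row in unary]     # best score of subtree rooted at i, per location
    argbest = [[0] * m for _ in range(n)]  # best location of part i for each parent location

    # Leaves-to-root pass: children have larger indices than their parents,
    # so each part has received all child messages before it is processed.
    for i in range(n - 1, 0, -1):
        p = parents[i]
        for kp in range(m):
            scores = [
                msg[i][kc] + deformation(locations[kp], locations[kc], anchors[i])
                for kc in range(m)
            ]
            kbest = max(range(m), key=scores.__getitem__)
            argbest[i][kp] = kbest
            msg[p][kp] += scores[kbest]

    # Root-to-leaves pass: pick the best root location, then backtrack.
    choice = [0] * n
    choice[0] = max(range(m), key=msg[0].__getitem__)
    for i in range(1, n):
        choice[i] = argbest[i][choice[parents[i]]]
    return msg[0][choice[0]], choice
```

Because the graph is a tree, the two passes return the exact global maximum. This naive version costs time linear in the number of parts and quadratic in the number of candidate locations. A densely connected graph admits no such exact recursion in general.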
We present an extensive evaluation of our model for
face detection, pose estimation, and landmark estimation.
We compare to the state-of-the-art from both the academic
community and commercial systems such as Google Picasa