arXiv:1508.06708v1 [cs.CV] 27 Aug 2015
Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose
Estimation
Sijin Li
sijin.li@my.cityu.edu.hk
Weichen Zhang
wczhang4-c@my.cityu.edu.hk
Department of Computer Science
City University of Hong Kong
Antoni B. Chan
abchan@cityu.edu.hk
Abstract
This paper focuses on structured-output learning using
deep neural networks for 3D human pose estimation from
monocular images. Our network takes an image and 3D
pose as inputs and outputs a score value, which is high when
the image-pose pair matches and low otherwise. The net-
work structure consists of a convolutional neural network
for image feature extraction, followed by two sub-networks
for transforming the image features and pose into a joint
embedding. The score function is then the dot-product be-
tween the image and pose embeddings. The image-pose
embedding and score function are jointly trained using a
maximum-margin cost function. Our proposed framework
can be interpreted as a special form of structured support
vector machines where the joint feature space is discrimi-
natively learned using deep neural networks. We test our
framework on the Human3.6m dataset and obtain state-of-
the-art results compared to other recent methods. Finally,
we present visualizations of the image-pose embedding
space, demonstrating the network has learned a high-level
embedding of body-orientation and pose-configuration.
1. Introduction
Human pose estimation from images has been studied for
decades. Due to the dependencies among joint points, it can
be considered a structured-output task. In general, human
pose estimation approaches can be divided into two types:
1) prediction-based methods; 2) optimization-based methods.
The first type of approach views pose estimation as a
regression or detection problem [18, 31, 19, 30, 14]. The
goal is to learn the mapping from the input space (image
features) to the target space (2D or 3D joint points), or to
learn classifiers to detect specific body parts in the image.
This type of method is straightforward and usually fast in
the evaluation stage. Toshev et al. [31] trained a cascaded
network to refine the 2D joint locations in an image stage
by stage. However, this approach does not explicitly con-
sider the structured constraints of human pose. Followup
work [14, 30] learned the pairwise relationship between 2D
joint positions, and incorporated them into the joint pre-
dictions. Limitations of prediction-based methods include:
the manually-designed constraints might not be able to fully
capture the dependencies among the body joints; poor scal-
ability to 3D joint estimation when the search space needs
to be discretized; prediction of only a single pose when mul-
tiple poses might be valid due to partial self-occlusion.
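The direct regression view described above can be illustrated with a minimal sketch. Here a single linear least-squares map stands in for the deep regressor, and all dimensions and data are synthetic stand-ins, not the paper's actual setup; note that this formulation produces exactly one pose per image, one of the limitations listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: N image feature vectors, each mapped to the 3D
# coordinates of J body joints (flattened to a 3*J vector).
N, D_FEAT, J = 100, 32, 17
X = rng.standard_normal((N, D_FEAT))
W_true = rng.standard_normal((D_FEAT, 3 * J))
Y = X @ W_true  # target joint coordinates

# Learn the feature-to-pose mapping by least squares
# (a linear stand-in for training a deep regression network).
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict_pose(x):
    # Direct regression: map image features straight to joint
    # coordinates, with no explicit structural constraints.
    return (x @ W_hat).reshape(J, 3)
```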
Instead of estimating the target directly, the second type
of approach learns a score function, which takes both an im-
age and a pose as inputs, and produces a high score for cor-
rect image-pose pairs and low scores for unmatched image-
pose pairs. Given an input image x, the estimated pose y* is
the pose that maximizes the score function, i.e.,

y* = argmax_{y∈Y} f(x, y),    (1)
where Y is the pose space. If the score function can be
properly normalized, then it can be interpreted as a proba-
bility distribution, either a conditional distribution of poses
given the image, or a joint distribution over both images and
joints. One popular model is pictorial structures [9], where
the dependencies between joints are represented by edges
in a probabilistic graphical model [16]. As an alternative
to generative models, structured-output SVM [32] is a dis-
criminative method for learning a score function, which en-
sures a large margin between the score values for correct
input pairs and for incorrect input pairs [24, 10].
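The score-function formulation in Eq. (1) and the max-margin idea can be sketched as follows. This is a toy illustration, not the paper's architecture: the deep image and pose sub-networks are replaced by random linear maps, and inference maximizes over a small finite candidate set rather than the full pose space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; linear maps stand in for the deep sub-networks
# that embed the image and the pose into a joint space.
D_IMG, D_POSE, D_EMB = 16, 8, 4
W_img = rng.standard_normal((D_EMB, D_IMG))
W_pose = rng.standard_normal((D_EMB, D_POSE))

def score(x, y):
    # f(x, y): dot product between the image and pose embeddings.
    return float((W_img @ x) @ (W_pose @ y))

def predict(x, pose_set):
    # Eq. (1): y* = argmax_{y in Y} f(x, y), over a finite candidate set.
    return max(pose_set, key=lambda y: score(x, y))

def hinge_loss(x, y_true, y_wrong, margin=1.0):
    # Max-margin cost: the correct image-pose pair should outscore
    # an incorrect pair by at least `margin`, as in structured SVMs.
    return max(0.0, margin + score(x, y_wrong) - score(x, y_true))
```

Training the embedding weights to minimize this hinge loss over correct/incorrect pairs is what makes the score high for matched pairs and low otherwise.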
As the score function takes both image and pose as input,
there are several ways to fuse the image and pose informa-
tion together. For example, the features can be extracted
jointly according to the image and poses, e.g., the image
features extracted around the input joint positions could be
viewed as the joint feature representation of image and pose
[
9, 26, 34, 8]. Alternatively, features from the image and
pose can be extracted separately and concatenated, and the
score function trained to fuse them together [11, 12]. However,
with these methods, the features are hand-crafted, and