arXiv:1612.00137v4 [cs.CV] 2 Sep 2017
RMPE: Regional Multi-Person Pose Estimation
Hao-Shu Fang¹∗, Shuqin Xie¹, Yu-Wing Tai², Cewu Lu¹§
¹Shanghai Jiao Tong University, China
²Tencent YouTu
fhaoshu@gmail.com qweasdshu@sjtu.edu.cn yuwingtai@tencent.com lucewu@sjtu.edu.cn
Abstract
Multi-person pose estimation in the wild is challenging.
Although state-of-the-art human detectors have demonstrated
good performance, small errors in localization and
recognition are inevitable. These errors can cause failures
for a single-person pose estimator (SPPE), especially for
methods that solely depend on human detection results. In
this paper, we propose a novel regional multi-person pose
estimation (RMPE) framework to facilitate pose estimation
in the presence of inaccurate human bounding boxes. Our
framework consists of three components: Symmetric Spatial
Transformer Network (SSTN), Parametric Pose
Non-Maximum-Suppression (NMS), and Pose-Guided Proposals
Generator (PGPG). Our method is able to handle inaccurate
bounding boxes and redundant detections, allowing it
to achieve 76.7 mAP on the MPII (multi person) dataset [3].
Our model and source code are made publicly available†.
1. Introduction
Human pose estimation is a fundamental challenge for
computer vision. In practice, recognizing the pose of
multiple persons in the wild is a lot more challenging
than recognizing the pose of a single person in an
image [30, 31, 21, 23, 38]. Recent attempts approach this
problem by using either a two-step framework [28, 12] or a
part-based framework [7, 27, 17]. The two-step framework
first detects human bounding boxes and then estimates the
pose within each box independently. The part-based framework
first detects body parts independently and then assembles
the detected body parts to form multiple human poses.
Both frameworks have their advantages and disadvantages.
In the two-step framework, the accuracy of pose estimation
depends heavily on the quality of the detected bounding
boxes. In the part-based framework, the assembled human
poses are ambiguous when two or more persons are too
close together. Also, the part-based framework loses the
capability to recognize body parts from a global pose view
because it relies only on second-order dependencies
between body parts.
∗ Part of this work was done when Hao-Shu Fang was a student intern
at Tencent.
§ Corresponding author: Cewu Lu.
† https://cvsjtu.wordpress.com/rmpe-regional-multi-person-pose-estimation/
Our approach follows the two-step framework. We aim
to detect accurate human poses even when given inaccurate
bounding boxes. To illustrate the problems of previous
approaches, we applied the state-of-the-art object detector
Faster-RCNN [29] and the SPPE Stacked Hourglass model
[23]. Figure 1 and Figure 2 show two major problems:
the localization error problem and the redundant detection
problem. In fact, SPPE is rather vulnerable to bounding
box errors. Even for the cases when the bounding boxes
are considered as correct with IoU > 0.5, the detected
human poses can still be wrong. Since SPPE produces a pose
for each given bounding box, redundant detections result in
redundant poses.
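To make the IoU > 0.5 criterion concrete, the following minimal sketch computes the intersection over union of two axis-aligned boxes. The box coordinates and the `(x1, y1, x2, y2)` convention are assumptions chosen for illustration and are not taken from the paper's experiments. It shows that a detection shifted sideways by 30% of the box width still passes the threshold, even though part of the person may fall outside the box.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: a 100 x 300 ground-truth box and a detection
# shifted 30 pixels to the right (30% of the box width).
ground_truth = (100, 50, 200, 350)
detected = (130, 50, 230, 350)

overlap = iou(ground_truth, detected)
print(f"IoU = {overlap:.3f}, counted as correct: {overlap > 0.5}")
```

Under these assumed coordinates the detection counts as correct, yet the left 30 pixels of the ground-truth region are missing from the crop passed to the SPPE.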
To address the above problems, a regional multi-person
pose estimation (RMPE) framework is proposed. Our
framework improves the performance of SPPE-based human
pose estimation algorithms. We have designed a new
symmetric spatial transformer network (SSTN) which is
attached to the SPPE to extract a high-quality single person
region from an inaccurate bounding box. A novel parallel
SPPE branch is introduced to optimize this network. To
address the problem of redundant detection, a parametric
pose NMS is introduced. Our parametric pose NMS eliminates
redundant poses by using a novel pose distance metric
to compare pose similarity. A data-driven approach is
applied to optimize the pose distance parameters. Lastly,
we propose a novel pose-guided human proposal generator
(PGPG) to augment training samples. By learning the
output distribution of a human detector for different poses,
we can simulate the generation of human bounding boxes,
producing a large sample of training data.
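The idea of removing redundant poses with a pose distance can be sketched as a greedy suppression loop. This is only an illustration under stated assumptions: the function names `pose_distance` and `pose_nms`, the mean joint distance used as the metric, and the hand-picked threshold are all hypothetical stand-ins. The paper's own metric is parametric, with parameters optimized in a data-driven way, and is not reproduced here.

```python
import math


def pose_distance(pose_a, pose_b):
    """Placeholder metric (assumption): mean Euclidean distance between
    corresponding joints. Not the parametric metric of the paper."""
    total = sum(
        math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(pose_a, pose_b)
    )
    return total / len(pose_a)


def pose_nms(poses, scores, threshold):
    """Greedy suppression: visit poses from highest to lowest score and keep
    a pose only if it is farther than `threshold` from every kept pose."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(pose_distance(poses[i], poses[k]) > threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical three-joint poses: the first two are near-duplicates coming
# from redundant detections of one person, the third is a different person.
poses = [
    [(10, 10), (20, 30), (30, 50)],
    [(11, 10), (21, 31), (30, 51)],
    [(100, 10), (110, 30), (120, 50)],
]
scores = [0.9, 0.8, 0.7]

kept_indices = pose_nms(poses, scores, threshold=5.0)
print(kept_indices)
```

With these assumed inputs the lower-scoring duplicate is suppressed and the two distinct poses remain. A fixed threshold on raw pixel distance is sensitive to person scale, which is one reason a learned, parametric distance is attractive.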
Our RMPE framework is general and is applicable to
different human detectors and single person pose estimators.
We applied our framework on the MPII (multi-person)
dataset [3], where it outperforms the state-of-the-art
methods and achieves 76.7 mAP. We have also conducted
ablation studies to validate the effectiveness of each pro-