
networks. Therefore, there has been growing interest in leveraging heatmaps to represent joint locations and in developing effective CNN architectures for HPE, e.g., [53], [54], [39], [55], [56], [38], [40], [57], [58], [59], [60], [61], [62], [63], [64]. Tompson et al. [53] combined a CNN-based body part detector with a part-based spatial model in a unified learning framework for 2D HPE. Lifshitz et al. [55] proposed a CNN-based method that predicts joint locations by incorporating keypoint votes and joint probabilities to determine the pose representation. Wei et al. [40] introduced a sequential convolutional framework named Convolutional Pose Machines (CPM) that predicts the locations of key joints with multi-stage processing: the convolutional networks in each stage take the 2D belief maps produced by previous stages and output increasingly refined predictions of body part locations.
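The multi-stage refinement idea can be illustrated with a minimal sketch (a simplification of CPM; the layer sizes and module names below are our own illustrative choices, not the original implementation): every stage receives the shared image features concatenated with the belief maps of the previous stage and outputs refined belief maps, and a loss is attached to every stage's output.

```python
import torch
import torch.nn as nn


class RefineStage(nn.Module):
    """One refinement stage: image features + previous belief maps -> refined belief maps."""

    def __init__(self, feat_channels, num_joints):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(feat_channels + num_joints, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_joints, 1),  # one belief map per joint
        )

    def forward(self, feats, prev_maps):
        return self.layers(torch.cat([feats, prev_maps], dim=1))


class MultiStagePose(nn.Module):
    """Sequential prediction with intermediate supervision, in the spirit of CPM."""

    def __init__(self, num_joints=16, num_stages=3, feat_channels=128):
        super().__init__()
        self.backbone = nn.Sequential(  # shared image feature extractor (kept tiny here)
            nn.Conv2d(3, feat_channels, 9, stride=4, padding=4), nn.ReLU(inplace=True))
        self.init_head = nn.Conv2d(feat_channels, num_joints, 1)
        self.stages = nn.ModuleList(
            [RefineStage(feat_channels, num_joints) for _ in range(num_stages)])

    def forward(self, img):
        feats = self.backbone(img)
        maps = self.init_head(feats)        # stage-0 belief maps
        outputs = [maps]
        for stage in self.stages:           # each stage refines the previous belief maps
            maps = stage(feats, maps)
            outputs.append(maps)
        return outputs                      # a loss is applied to every stage's output
```

Training then minimizes the sum of per-stage losses (e.g., MSE against ground-truth heatmaps); this intermediate supervision is what allows the deep sequential model to be trained effectively.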
Newell et al. [38] proposed an encoder-decoder network named "stacked hourglass" (the encoder squeezes features through a bottleneck and the decoder then expands them) that repeats bottom-up and top-down processing with intermediate supervision. The stacked hourglass (SHG) network consists of consecutive pooling and upsampling layers to capture information at every scale.
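A minimal sketch of a single hourglass module is shown below (a simplification of [38]; the block design, depth, and channel width are illustrative assumptions): features are recursively downsampled and processed, then upsampled, with a skip connection merging features back in at every scale.

```python
import torch.nn as nn
import torch.nn.functional as F


class Hourglass(nn.Module):
    """One hourglass module: recursive pool -> process -> upsample, with a skip at every scale."""

    def __init__(self, depth=4, channels=256):
        super().__init__()
        self.depth = depth
        block = lambda: nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.skip = nn.ModuleList([block() for _ in range(depth)])  # same-resolution branches
        self.down = nn.ModuleList([block() for _ in range(depth)])  # after each pooling step
        self.up = nn.ModuleList([block() for _ in range(depth)])    # before each upsampling step
        self.bottleneck = block()

    def forward(self, x, level=0):
        skip = self.skip[level](x)                            # keep features at this resolution
        x = self.down[level](F.max_pool2d(x, 2))              # bottom-up: downsample and process
        x = self.bottleneck(x) if level == self.depth - 1 else self.forward(x, level + 1)
        x = F.interpolate(self.up[level](x), scale_factor=2)  # top-down: process and upsample
        return x + skip                                       # merge the two resolutions
```

Stacking several such modules, each predicting its own set of heatmaps for intermediate supervision, yields the full SHG architecture.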
Since then, more elaborate variants of the SHG architecture have been developed for HPE. Chu et al. [65] designed novel Hourglass Residual Units (HRUs), which extend the residual unit with a side branch of filters with larger receptive fields in order to capture features at various scales. Yang et al. [59] designed a multi-branch Pyramid Residual Module (PRM) to replace the residual units in SHG, enhancing the scale invariance of deep CNNs.
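Common to the heatmap-based methods above is the form of the supervision target: each ground-truth joint is rendered as a small 2D Gaussian centered at its pixel location, the network regresses these maps (typically with an MSE loss), and at inference the joint location is read off as the argmax of each predicted map. A minimal sketch (the Gaussian width sigma is a free design choice):

```python
import numpy as np


def render_heatmaps(joints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint; `joints` is a (K, 2) array of (x, y) pixel coords."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(joints), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(joints):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps


def decode_heatmaps(maps):
    """Recover (x, y) joint locations as the argmax of each predicted heatmap."""
    k, h, w = maps.shape
    flat = maps.reshape(k, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1)  # column = x, row = y
```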
With the emergence of Generative Adversarial Networks (GANs) [66], they have been explored in HPE to generate biologically plausible pose configurations and to distinguish high-confidence predictions from low-confidence ones, which helps infer plausible poses for occluded body parts. Chen et al. [67] constructed a structure-aware conditional adversarial network, named Adversarial PoseNet, which contains an hourglass-based pose generator and two discriminators that distinguish reasonable body poses from unreasonable ones. Chou et al. [68] built an adversarial learning-based network in which two stacked hourglass networks with the same structure serve as generator and discriminator, respectively. The generator estimates the location of each joint, and the discriminator distinguishes ground-truth heatmaps from predicted ones.
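This heatmap-level adversarial supervision can be sketched as follows (a simplified version; the sigmoid discriminator, image/heatmap concatenation at a shared resolution, and the loss weight are illustrative assumptions rather than any specific paper's configuration): the pose network acts as the generator, and the discriminator scores whether a set of heatmaps, conditioned on the image, looks like a ground-truth configuration.

```python
import torch
import torch.nn.functional as F


def pose_gan_step(generator, discriminator, img, gt_heatmaps, adv_weight=0.01):
    """One step of heatmap-level adversarial training (discriminator is assumed to end in a sigmoid)."""
    pred = generator(img)  # predicted joint heatmaps

    # Discriminator: real heatmaps -> 1, predicted heatmaps -> 0, conditioned on the image.
    real_score = discriminator(torch.cat([img, gt_heatmaps], dim=1))
    fake_score = discriminator(torch.cat([img, pred.detach()], dim=1))
    d_loss = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) \
           + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))

    # Generator: regress the ground-truth heatmaps and try to fool the discriminator.
    adv_score = discriminator(torch.cat([img, pred], dim=1))
    g_loss = F.mse_loss(pred, gt_heatmaps) \
           + adv_weight * F.binary_cross_entropy(adv_score, torch.ones_like(adv_score))
    return g_loss, d_loss
```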
Different from GAN-based methods that take the HPE network as the generator and use the discriminator to provide supervision, Peng et al. [69] developed an adversarial data augmentation network that jointly optimizes data augmentation and network training, treating the HPE network as the discriminator and the augmentation network as a generator that performs adversarial augmentations.
Besides these efforts in effective network design for HPE, body structure information has also been investigated to provide richer and better supervision for building HPE networks. Yang et al. [70] designed an end-to-end CNN framework for HPE that is able to find hard negatives by incorporating the spatial and appearance consistency among human body parts. A structured feature-level learning framework was proposed in [71] for reasoning about the correlations among human body joints in HPE, which captures richer joint information and improves the learning results. Ke et al. [72] designed a multi-scale structure-aware neural network, which combines multi-scale supervision, multi-scale feature combination, a structure-aware loss, and a keypoint masking training scheme to improve HPE in complex scenarios. Tang et al. [73] built an hourglass-based supervision network, termed the Deeply Learned Compositional Model, to describe the complex and realistic relationships among body parts and to learn compositional patterns (the orientation, scale, and shape of each body part) in human bodies. Tang and Wu [74] observed that not all parts are related to each other and therefore introduced a Part-based Branches Network to learn representations specific to each part group rather than a shared representation for all parts.
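None of the specific structure-aware losses above is reproduced here, but the general idea of injecting body-structure supervision can be illustrated with a simple limb-length consistency term added on top of the usual heatmap loss (the skeleton definition, the L1 penalty, and the weighting are our own illustrative assumptions):

```python
import torch

# Hypothetical skeleton: pairs of joint indices that form limbs (depends on the dataset).
LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]


def structure_loss(pred_joints, gt_joints, limbs=LIMBS):
    """Penalize deviation of predicted limb lengths from ground-truth limb lengths.

    pred_joints, gt_joints: (B, K, 2) tensors of 2D joint coordinates.
    """
    loss = 0.0
    for a, b in limbs:
        pred_len = (pred_joints[:, a] - pred_joints[:, b]).norm(dim=-1)
        gt_len = (gt_joints[:, a] - gt_joints[:, b]).norm(dim=-1)
        loss = loss + (pred_len - gt_len).abs().mean()
    return loss / len(limbs)

# total loss = heatmap_mse + lambda_struct * structure_loss(pred_joints, gt_joints)
```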
Human poses in video sequences are spatio-temporal (2D space plus time) signals. Therefore, modeling the spatio-temporal information is important for HPE from videos. Jain et al. [75] designed a two-branch CNN framework that incorporates both color and motion features from frame pairs to build an expressive spatio-temporal model for HPE. Pfister et al. [76] proposed a convolutional network that utilizes temporal context from multiple frames by using optical flow to align predicted heatmaps from neighbouring frames. Different from previous video-based methods, which are computationally intensive, Luo et al. [60] introduced a recurrent structure for HPE with Long Short-Term Memory (LSTM) [77] to capture temporal geometric consistency and dependencies across frames, which speeds up training of the HPE network on videos. Zhang et al. [78] introduced a key frame proposal network that captures spatial and temporal information from frames, together with a human pose interpolation module for efficient video-based pose estimation.
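The flow-based alignment idea of [76] can be sketched as follows (a simplification; the dense optical flow is assumed to come from an off-the-shelf estimator, and the pooling here is a plain average rather than a learned combination): heatmaps predicted on neighbouring frames are warped into the current frame along the flow and then pooled.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(heatmaps, flow):
    """Warp (B, K, H, W) heatmaps into the current frame.

    `flow` is a dense (B, H, W, 2) field giving, for every current-frame pixel, the (dx, dy)
    offset to its corresponding position in the neighbouring frame.
    """
    b, _, h, w = heatmaps.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().to(heatmaps.device) + flow  # sampling positions
    norm = torch.tensor([w - 1, h - 1], dtype=grid.dtype, device=grid.device)
    grid = 2 * grid / norm - 1  # normalize to [-1, 1] as expected by grid_sample
    return F.grid_sample(heatmaps, grid, align_corners=True)


def temporal_pool(heatmaps_per_frame, flows):
    """Average flow-aligned heatmaps from neighbouring frames (use zero flow for the current frame)."""
    aligned = [warp_with_flow(h, f) for h, f in zip(heatmaps_per_frame, flows)]
    return torch.stack(aligned).mean(dim=0)
```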
3.2 2D multi-person pose estimation
Compared to single-person HPE, multi-person HPE is more challenging because it needs to determine the number of people and their positions, and to group keypoints into individual people. To address these problems, multi-person HPE methods are classified into top-down and bottom-up methods. Top-down methods employ off-the-shelf person detectors to obtain a set of bounding boxes (each corresponding to one person) from the input images, and then apply a single-person pose estimator to each person box to generate multi-person poses. Different from top-down methods, bottom-up methods first locate all the body joints in an image and then group them to the corresponding people. In the top-down pipeline, the number of people in the input image directly affects the computation time. Bottom-up methods are usually faster than top-down methods since they do not need to estimate the pose of each person separately. Fig. 4 shows the general frameworks of 2D multi-person HPE methods.
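The top-down strategy just described can be summarized with a minimal sketch (the detector and single-person estimator are placeholders for any off-the-shelf models): detect person boxes, run the single-person estimator on each crop, and map the resulting keypoints back to image coordinates.

```python
def top_down_pose(image, person_detector, single_person_estimator):
    """Top-down multi-person HPE: detect people, then estimate a pose inside each box."""
    poses = []
    for (x0, y0, x1, y1) in person_detector(image):              # one box per detected person
        crop = image[y0:y1, x0:x1]
        keypoints = single_person_estimator(crop)                 # (K, 2) coords in crop space
        poses.append([(x + x0, y + y0) for x, y in keypoints])   # back to image coordinates
    return poses
```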
3.2.1 Top-down pipeline
In the top-down pipeline as shown in Fig. 4 (a), there are two
important parts: a human body detector to obtain person
bounding boxes and a single-person pose estimator to predict