
networks. Therefore, there has been growing interest in leveraging heatmaps to represent joint locations and in developing effective CNN architectures for HPE, e.g., [53], [54], [39], [55], [56], [38], [40], [57], [58], [59], [60], [61], [62], [63], [64]. Tompson et al. [53] combined a CNN-based body part detector with a part-based spatial model in a unified learning framework for 2D HPE. Lifshitz et al. [55] proposed a CNN-based method that predicts joint locations by incorporating keypoint votes and joint probabilities to determine the pose representation. Wei et al. [40] introduced a sequential convolutional framework named Convolutional Pose Machines (CPM) that predicts the locations of key joints with multi-stage processing: the convolutional networks in each stage take the 2D belief maps produced by previous stages and output increasingly refined predictions of body part locations.
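The multi-stage refinement idea can be illustrated with a minimal sketch (a simplification of CPM; the layer sizes and module names below are our own illustrative choices, not the original implementation): every stage receives the shared image features concatenated with the belief maps of the previous stage and outputs refined belief maps, and a loss is attached to every stage's output.

```python
import torch
import torch.nn as nn


class RefineStage(nn.Module):
    """One refinement stage: image features + previous belief maps -> refined belief maps."""

    def __init__(self, feat_channels, num_joints):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(feat_channels + num_joints, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_joints, 1),  # one belief map per joint
        )

    def forward(self, feats, prev_maps):
        return self.layers(torch.cat([feats, prev_maps], dim=1))


class MultiStagePose(nn.Module):
    """Sequential prediction with intermediate supervision, in the spirit of CPM."""

    def __init__(self, num_joints=16, num_stages=3, feat_channels=128):
        super().__init__()
        self.backbone = nn.Sequential(  # shared image feature extractor (kept tiny here)
            nn.Conv2d(3, feat_channels, 9, stride=4, padding=4), nn.ReLU(inplace=True))
        self.init_head = nn.Conv2d(feat_channels, num_joints, 1)
        self.stages = nn.ModuleList(
            [RefineStage(feat_channels, num_joints) for _ in range(num_stages)])

    def forward(self, img):
        feats = self.backbone(img)
        maps = self.init_head(feats)        # stage-0 belief maps
        outputs = [maps]
        for stage in self.stages:           # each stage refines the previous belief maps
            maps = stage(feats, maps)
            outputs.append(maps)
        return outputs                      # a loss is applied to every stage's output
```

Training then minimizes the sum of per-stage losses (e.g., MSE against ground-truth heatmaps); this intermediate supervision is what allows the deep sequential model to be trained effectively.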
Newell et al. [38] proposed an encoder-decoder network named "stacked hourglass" (the encoder squeezes features through a bottleneck and the decoder then expands them) that repeats bottom-up and top-down processing with intermediate supervision. The stacked hourglass (SHG) network consists of consecutive pooling and upsampling layers to capture information at every scale.
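A minimal sketch of a single hourglass module is shown below (a simplification of [38]; the block design, depth, and channel width are illustrative assumptions): features are recursively downsampled and processed, then upsampled, with a skip connection merging features back in at every scale.

```python
import torch.nn as nn
import torch.nn.functional as F


class Hourglass(nn.Module):
    """One hourglass module: recursive pool -> process -> upsample, with a skip at every scale."""

    def __init__(self, depth=4, channels=256):
        super().__init__()
        self.depth = depth
        block = lambda: nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.skip = nn.ModuleList([block() for _ in range(depth)])  # same-resolution branches
        self.down = nn.ModuleList([block() for _ in range(depth)])  # after each pooling step
        self.up = nn.ModuleList([block() for _ in range(depth)])    # before each upsampling step
        self.bottleneck = block()

    def forward(self, x, level=0):
        skip = self.skip[level](x)                            # keep features at this resolution
        x = self.down[level](F.max_pool2d(x, 2))              # bottom-up: downsample and process
        x = self.bottleneck(x) if level == self.depth - 1 else self.forward(x, level + 1)
        x = F.interpolate(self.up[level](x), scale_factor=2)  # top-down: process and upsample
        return x + skip                                       # merge the two resolutions
```

Stacking several such modules, each predicting its own set of heatmaps for intermediate supervision, yields the full SHG architecture.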
Since then, more elaborate variants of the SHG architecture have been developed for HPE. Chu et al. [65] designed novel Hourglass Residual Units (HRUs), which extend the residual unit with a side branch of filters with larger receptive fields in order to capture features at various scales. Yang et al. [59] designed a multi-branch Pyramid Residual Module (PRM) to replace the residual units in SHG, enhancing the scale invariance of deep CNNs.
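Common to the heatmap-based methods above is the form of the supervision target: each ground-truth joint is rendered as a small 2D Gaussian centered at its pixel location, the network regresses these maps (typically with an MSE loss), and at inference the joint location is read off as the argmax of each predicted map. A minimal sketch (the Gaussian width sigma is a free design choice):

```python
import numpy as np


def render_heatmaps(joints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint; `joints` is a (K, 2) array of (x, y) pixel coords."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(joints), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(joints):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps


def decode_heatmaps(maps):
    """Recover (x, y) joint locations as the argmax of each predicted heatmap."""
    k, h, w = maps.shape
    flat = maps.reshape(k, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1)  # column = x, row = y
```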
With the emergence of Generative Adversarial Networks (GANs) [66], they have been explored in HPE to generate biologically plausible pose configurations and to distinguish high-confidence predictions from low-confidence ones, which helps infer plausible poses for occluded body parts. Chen et al. [67] constructed a structure-aware conditional adversarial network, named Adversarial PoseNet, which contains an hourglass-based pose generator and two discriminators that distinguish reasonable body poses from unreasonable ones. Chou et al. [68] built an adversarial learning-based network in which two stacked hourglass networks with the same structure serve as generator and discriminator, respectively. The generator estimates the location of each joint, and the discriminator distinguishes ground-truth heatmaps from predicted ones.
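This heatmap-level adversarial supervision can be sketched as follows (a simplified version; the sigmoid discriminator, image/heatmap concatenation at a shared resolution, and the loss weight are illustrative assumptions rather than any specific paper's configuration): the pose network acts as the generator, and the discriminator scores whether a set of heatmaps, conditioned on the image, looks like a ground-truth configuration.

```python
import torch
import torch.nn.functional as F


def pose_gan_step(generator, discriminator, img, gt_heatmaps, adv_weight=0.01):
    """One step of heatmap-level adversarial training (discriminator is assumed to end in a sigmoid)."""
    pred = generator(img)  # predicted joint heatmaps

    # Discriminator: real heatmaps -> 1, predicted heatmaps -> 0, conditioned on the image.
    real_score = discriminator(torch.cat([img, gt_heatmaps], dim=1))
    fake_score = discriminator(torch.cat([img, pred.detach()], dim=1))
    d_loss = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) \
           + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))

    # Generator: regress the ground-truth heatmaps and try to fool the discriminator.
    adv_score = discriminator(torch.cat([img, pred], dim=1))
    g_loss = F.mse_loss(pred, gt_heatmaps) \
           + adv_weight * F.binary_cross_entropy(adv_score, torch.ones_like(adv_score))
    return g_loss, d_loss
```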
Different from GAN-based methods that take the HPE network as the generator and use the discriminator to provide supervision, Peng et al. [69] developed an adversarial data augmentation network that jointly optimizes data augmentation and network training, treating the HPE network as the discriminator and the augmentation network as a generator that performs adversarial augmentations.
Besides these efforts in effective network design for HPE, body structure information has also been investigated to provide richer and better supervision for building HPE networks. Yang et al. [70] designed an end-to-end CNN framework for HPE that is able to find hard negatives by incorporating the spatial and appearance consistency among human body parts. A structured feature-level learning framework was proposed in [71] for reasoning about the correlations among human body joints in HPE, which captures richer joint information and improves the learning results. Ke et al. [72] designed a multi-scale structure-aware neural network, which combines multi-scale supervision, multi-scale feature combination, a structure-aware loss, and a keypoint masking training scheme to improve HPE in complex scenarios. Tang et al. [73] built an hourglass-based supervision network, termed the Deeply Learned Compositional Model, to describe the complex and realistic relationships among body parts and to learn compositional patterns (the orientation, scale, and shape of each body part) in human bodies. Tang and Wu [74] observed that not all parts are related to each other and therefore introduced a Part-based Branches Network to learn representations specific to each part group rather than a shared representation for all parts.
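None of the specific structure-aware losses above is reproduced here, but the general idea of injecting body-structure supervision can be illustrated with a simple limb-length consistency term added on top of the usual heatmap loss (the skeleton definition, the L1 penalty, and the weighting are our own illustrative assumptions):

```python
import torch

# Hypothetical skeleton: pairs of joint indices that form limbs (depends on the dataset).
LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]


def structure_loss(pred_joints, gt_joints, limbs=LIMBS):
    """Penalize deviation of predicted limb lengths from ground-truth limb lengths.

    pred_joints, gt_joints: (B, K, 2) tensors of 2D joint coordinates.
    """
    loss = 0.0
    for a, b in limbs:
        pred_len = (pred_joints[:, a] - pred_joints[:, b]).norm(dim=-1)
        gt_len = (gt_joints[:, a] - gt_joints[:, b]).norm(dim=-1)
        loss = loss + (pred_len - gt_len).abs().mean()
    return loss / len(limbs)

# total loss = heatmap_mse + lambda_struct * structure_loss(pred_joints, gt_joints)
```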
Human poses in video sequences are spatio-temporal (2D space plus time) signals. Therefore, modeling the spatio-temporal information is important for HPE from videos. Jain et al. [75] designed a two-branch CNN framework that incorporates both color and motion features from frame pairs to build an expressive spatio-temporal model for HPE. Pfister et al. [76] proposed a convolutional network that utilizes temporal context from multiple frames by using optical flow to align predicted heatmaps from neighbouring frames. Different from previous video-based methods, which are computationally intensive, Luo et al. [60] introduced a recurrent structure for HPE with Long Short-Term Memory (LSTM) [77] to capture temporal geometric consistency and dependencies across frames, which speeds up training of the HPE network on videos. Zhang et al. [78] introduced a key frame proposal network that captures spatial and temporal information from frames, together with a human pose interpolation module for efficient video-based pose estimation.
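The flow-based alignment idea of [76] can be sketched as follows (a simplification; the dense optical flow is assumed to come from an off-the-shelf estimator, and the pooling here is a plain average rather than a learned combination): heatmaps predicted on neighbouring frames are warped into the current frame along the flow and then pooled.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(heatmaps, flow):
    """Warp (B, K, H, W) heatmaps into the current frame.

    `flow` is a dense (B, H, W, 2) field giving, for every current-frame pixel, the (dx, dy)
    offset to its corresponding position in the neighbouring frame.
    """
    b, _, h, w = heatmaps.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().to(heatmaps.device) + flow  # sampling positions
    norm = torch.tensor([w - 1, h - 1], dtype=grid.dtype, device=grid.device)
    grid = 2 * grid / norm - 1  # normalize to [-1, 1] as expected by grid_sample
    return F.grid_sample(heatmaps, grid, align_corners=True)


def temporal_pool(heatmaps_per_frame, flows):
    """Average flow-aligned heatmaps from neighbouring frames (use zero flow for the current frame)."""
    aligned = [warp_with_flow(h, f) for h, f in zip(heatmaps_per_frame, flows)]
    return torch.stack(aligned).mean(dim=0)
```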
3.2 2D multi-person pose estimation
Compared to single-person HPE, multi-person HPE is more challenging because it needs to determine the number of people and their positions, and to group keypoints into individual people. To address these problems, multi-person HPE methods are classified into top-down and bottom-up methods. Top-down methods employ off-the-shelf person detectors to obtain a set of bounding boxes (each corresponding to one person) from the input images, and then apply a single-person pose estimator to each person box to generate multi-person poses. Different from top-down methods, bottom-up methods first locate all the body joints in an image and then group them to the corresponding people. In the top-down pipeline, the number of people in the input image directly affects the computation time. Bottom-up methods are usually faster than top-down methods since they do not need to estimate the pose of each person separately. Fig. 4 shows the general frameworks of 2D multi-person HPE methods.
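The top-down strategy just described can be summarized with a minimal sketch (the detector and single-person estimator are placeholders for any off-the-shelf models): detect person boxes, run the single-person estimator on each crop, and map the resulting keypoints back to image coordinates.

```python
def top_down_pose(image, person_detector, single_person_estimator):
    """Top-down multi-person HPE: detect people, then estimate a pose inside each box."""
    poses = []
    for (x0, y0, x1, y1) in person_detector(image):              # one box per detected person
        crop = image[y0:y1, x0:x1]
        keypoints = single_person_estimator(crop)                 # (K, 2) coords in crop space
        poses.append([(x + x0, y + y0) for x, y in keypoints])   # back to image coordinates
    return poses
```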
3.2.1 Top-down pipeline
In the top-down pipeline as shown in Fig. 4 (a), there are two
important parts: a human body detector to obtain person
bounding boxes and a single-person pose estimator to predict