Monocular 3D Human Pose Estimation by Predicting Depth on Joints
Bruce Xiaohan Nie 1∗, Ping Wei 2,1∗, and Song-Chun Zhu 1
1 Center for Vision, Cognition, Learning, and Autonomy, UCLA, USA
2 Xi'an Jiaotong University, China
xiaohan.nie@gmail.com, pingwei@xjtu.edu.cn, sczhu@stat.ucla.edu
Abstract
This paper aims at estimating full-body 3D human poses from monocular images, where the biggest challenge is the inherent ambiguity introduced by lifting a 2D pose into 3D space. We propose a novel framework that reduces this ambiguity by predicting the depth of human joints from 2D joint locations and body part images. Our approach is built on a two-level hierarchy of Long Short-Term Memory (LSTM) networks that can be trained end-to-end. The first level consists of two components: 1) a skeleton-LSTM, which learns depth information from global human skeleton features; 2) a patch-LSTM, which utilizes the local image evidence around joint locations. Both networks have a tree structure defined by the kinematic relations of the human skeleton, so information at different joints is broadcast through the whole skeleton in a top-down fashion. The two networks are first pre-trained separately on different data sources and then aggregated at the second level for final depth prediction. The empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton and local image patches for depth prediction, as well as our superior quantitative and qualitative performance relative to state-of-the-art methods.
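The paper provides no reference code, but the two-level hierarchy described above is easy to sketch. The following PyTorch-style sketch is a hypothetical realization under our own assumptions: the joint list, module names (TreeLSTM, DepthPredictor), and feature dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical kinematic tree (truncated): each joint's parent
# appears earlier in the list, so a single top-down pass suffices.
JOINTS = ["pelvis", "torso", "head", "l_elbow", "l_hand"]
PARENT = {"pelvis": None, "torso": "pelvis", "head": "torso",
          "l_elbow": "torso", "l_hand": "l_elbow"}

class TreeLSTM(nn.Module):
    """Top-down LSTM over the kinematic tree: each joint's cell takes
    that joint's feature plus its parent's hidden/cell state, so
    information is broadcast from the root to every leaf."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hid_dim)
        self.hid_dim = hid_dim

    def forward(self, feats):  # feats: {joint: (B, feat_dim)}
        B = next(iter(feats.values())).size(0)
        h, c = {}, {}
        for j in JOINTS:  # parents are visited before children
            p = PARENT[j]
            if p is None:
                hp = feats[j].new_zeros(B, self.hid_dim)
                cp = feats[j].new_zeros(B, self.hid_dim)
            else:
                hp, cp = h[p], c[p]
            h[j], c[j] = self.cell(feats[j], (hp, cp))
        return h  # per-joint hidden states

class DepthPredictor(nn.Module):
    """First level: skeleton-LSTM on 2D joint coordinates and
    patch-LSTM on local image-patch features. Second level: a fusion
    LSTM over their concatenated states regresses one depth per joint."""
    def __init__(self, skel_dim=2, patch_dim=256, hid=128):
        super().__init__()
        self.skeleton_lstm = TreeLSTM(skel_dim, hid)
        self.patch_lstm = TreeLSTM(patch_dim, hid)
        self.fusion_lstm = TreeLSTM(2 * hid, hid)
        self.depth_head = nn.Linear(hid, 1)

    def forward(self, joints_2d, patch_feats):
        h_skel = self.skeleton_lstm(joints_2d)
        h_patch = self.patch_lstm(patch_feats)
        fused = {j: torch.cat([h_skel[j], h_patch[j]], 1) for j in JOINTS}
        h = self.fusion_lstm(fused)
        return {j: self.depth_head(h[j]).squeeze(1) for j in JOINTS}
```

In this reading, the top-down broadcast described in the abstract corresponds to feeding each joint's LSTM cell the hidden state of its parent in the kinematic tree.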
1. Introduction
1.1. Motivation and Objective
This paper aims at reconstructing full-body 3D human poses from a single RGB image. Specifically, we want to localize the human joints in 3D space, as shown in Fig. 1. Estimating 3D human pose is a classic task in computer vision and serves as a key component in many human-related practical applications such as intelligent surveillance, human-robot interaction, human activity analysis, and human attention recognition. Some existing works estimate 3D poses in constrained environments from depth images [40, 26] or from RGB images captured simultaneously at multiple viewpoints [2, 10].
∗ Bruce Xiaohan Nie and Ping Wei are co-first authors.
[Figure 1 diagram: the skeleton-LSTM and patch-LSTM form the first level; their outputs feed a second-level LSTM that emits per-joint depths Z_head, Z_elbow, Z_hand, ....]
Figure 1. Our two-level hierarchy of LSTMs for 3D pose estimation. At the first level, the skeleton-LSTM and the patch-LSTM capture information from the global 2D skeleton and from local image patches, respectively. The global and local features are integrated at the second level to predict the depth of each joint. The 3D pose is recovered by attaching the depth values onto the 2D pose. We render the 3D pose for better visualization.
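As the caption notes, the final 3D pose is obtained by attaching the predicted depth to each 2D joint. Assuming known camera intrinsics K, a standard pinhole back-projection suffices; the helper below is a hypothetical sketch of that step, not code from the paper.

```python
import numpy as np

def backproject_joints(joints_2d, depths, K):
    """Lift 2D joints (J, 2) to camera-frame 3D points (J, 3),
    given per-joint depths (J,) and a 3x3 intrinsic matrix K."""
    J = joints_2d.shape[0]
    homog = np.hstack([joints_2d, np.ones((J, 1))])  # (J, 3) homogeneous pixels
    rays = np.linalg.inv(K) @ homog.T                # (3, J) unit-depth rays
    return (rays * depths).T                         # scale each ray by its depth
```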
Unlike those approaches, we focus on recognizing the 3D pose directly from a single RGB image, which is easier to capture in general environments.
Estimating 3D human poses from a single RGB image is challenging for two main reasons: 1) the target person in the image exhibits large appearance and geometric variation due to different clothes, postures, illumination, camera viewpoints, and so on, and the highly articulated human pose also brings about heavy self-occlusion; 2) even when the ground-truth 2D pose is given, recovering the 3D pose is inherently ambiguous, since there are infinitely many 3D poses that project onto the same 2D pose when the depth information is unknown.
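This ambiguity has a standard pinhole-camera formulation (the notation below is ours, stated for completeness, not taken from the paper): a pixel with homogeneous coordinates $\tilde{x} = (u, v, 1)^{\top}$ under intrinsics $K$ corresponds to a one-parameter family of 3D points,

```latex
% Standard pinhole-camera statement of the depth ambiguity
% (our notation, added for illustration; not from the paper).
\[
  X(Z) \;=\; Z \, K^{-1} \tilde{x}, \qquad Z > 0,
\]
% Any assignment of depths \{Z_j\} to the joints therefore yields a
% valid 3D pose with exactly the same 2D projection; predicting Z_j
% per joint removes this remaining degree of freedom.
```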
One inspiration for our work is the great progress in 2D human pose estimation made by recent works based on deep architectures [33, 32, 17, 37, 3]. In those works, appearance and geometric variation are implicitly modeled by the feed-forward computations of networks with hierarchical deep structure. Self-occlusion is also handled well by filters from different layers capturing features at different scales.