arXiv:1508.06708v1 [cs.CV] 27 Aug 2015
Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose
Estimation
Sijin Li
sijin.li@my.cityu.edu.hk
Weichen Zhang
wczhang4-c@my.cityu.edu.hk
Department of Computer Science
City University of Hong Kong
Antoni B. Chan
abchan@cityu.edu.hk
Abstract
This paper focuses on structured-output learning using
deep neural networks for 3D human pose estimation from
monocular images. Our network takes an image and 3D
pose as inputs and outputs a score value, which is high when
the image-pose pair matches and low otherwise. The net-
work structure consists of a convolutional neural network
for image feature extraction, followed by two sub-networks
for transforming the image features and pose into a joint
embedding. The score function is then the dot-product be-
tween the image and pose embeddings. The image-pose
embedding and score function are jointly trained using a
maximum-margin cost function. Our proposed framework
can be interpreted as a special form of structured support
vector machines where the joint feature space is discrimi-
natively learned using deep neural networks. We test our
framework on the Human3.6m dataset and obtain state-of-
the-art results compared to other recent methods. Finally,
we present visualizations of the image-pose embedding
space, demonstrating the network has learned a high-level
embedding of body-orientation and pose-configuration.
1. Introduction
Human pose estimation from images has been studied for
decades. Due to the dependencies among joint points, it can
be considered a structured-output task. In general, human
pose estimation approaches can be divided into two types:
1) prediction-based methods; 2) optimization-based methods.
The first type of approach views pose estimation as a
regression or detection problem [18, 31, 19, 30, 14]. The
goal is to learn the mapping from the input space (image
features) to the target space (2D or 3D joint points), or to
learn classifiers to detect specific body parts in the image.
This type of method is straightforward and usually fast in
the evaluation stage. Toshev et al. [31] trained a cascaded
network to refine the 2D joint locations in an image stage
by stage. However, this approach does not explicitly con-
sider the structured constraints of human pose. Followup
work [14, 30] learned the pairwise relationship between 2D
joint positions, and incorporated them into the joint pre-
dictions. Limitations of prediction-based methods include:
the manually-designed constraints might not be able to fully
capture the dependencies among the body joints; poor scal-
ability to 3D joint estimation when the search space needs
to be discretized; prediction of only a single pose when mul-
tiple poses might be valid due to partial self-occlusion.
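The direct regression view described above can be illustrated with a minimal sketch. Here a single linear least-squares map stands in for the deep regressor, and all dimensions and data are synthetic stand-ins, not the paper's actual setup; note that this formulation produces exactly one pose per image, one of the limitations listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: N image feature vectors, each mapped to the 3D
# coordinates of J body joints (flattened to a 3*J vector).
N, D_FEAT, J = 100, 32, 17
X = rng.standard_normal((N, D_FEAT))
W_true = rng.standard_normal((D_FEAT, 3 * J))
Y = X @ W_true  # target joint coordinates

# Learn the feature-to-pose mapping by least squares
# (a linear stand-in for training a deep regression network).
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict_pose(x):
    # Direct regression: map image features straight to joint
    # coordinates, with no explicit structural constraints.
    return (x @ W_hat).reshape(J, 3)
```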
Instead of estimating the target directly, the second type
of approach learns a score function, which takes both an im-
age and a pose as inputs, and produces a high score for cor-
rect image-pose pairs and low scores for unmatched image-
pose pairs. Given an input image x, the estimated pose y* is
the pose that maximizes the score function, i.e.,

y* = argmax_{y∈Y} f(x, y),    (1)
where Y is the pose space. If the score function can be
properly normalized, then it can be interpreted as a proba-
bility distribution, either a conditional distribution of poses
given the image, or a joint distribution over both images and
joints. One popular model is pictorial structures [9], where
the dependencies between joints are represented by edges
in a probabilistic graphical model [16]. As an alternative
to generative models, structured-output SVM [32] is a dis-
criminative method for learning a score function, which en-
sures a large margin between the score values for correct
input pairs and for incorrect input pairs [24, 10].
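The score-function formulation in Eq. (1) and the max-margin idea can be sketched as follows. This is a toy illustration, not the paper's architecture: the deep image and pose sub-networks are replaced by random linear maps, and inference maximizes over a small finite candidate set rather than the full pose space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; linear maps stand in for the deep sub-networks
# that embed the image and the pose into a joint space.
D_IMG, D_POSE, D_EMB = 16, 8, 4
W_img = rng.standard_normal((D_EMB, D_IMG))
W_pose = rng.standard_normal((D_EMB, D_POSE))

def score(x, y):
    # f(x, y): dot product between the image and pose embeddings.
    return float((W_img @ x) @ (W_pose @ y))

def predict(x, pose_set):
    # Eq. (1): y* = argmax_{y in Y} f(x, y), over a finite candidate set.
    return max(pose_set, key=lambda y: score(x, y))

def hinge_loss(x, y_true, y_wrong, margin=1.0):
    # Max-margin cost: the correct image-pose pair should outscore
    # an incorrect pair by at least `margin`, as in structured SVMs.
    return max(0.0, margin + score(x, y_wrong) - score(x, y_true))
```

Training the embedding weights to minimize this hinge loss over correct/incorrect pairs is what makes the score high for matched pairs and low otherwise.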
As the score function takes both image and pose as input,
there are several ways to fuse the image and pose informa-
tion together. For example, the features can be extracted
jointly according to the image and poses, e.g., the image
features extracted around the input joint positions could be
viewed as the joint feature representation of image and pose
[
9, 26, 34, 8]. Alternatively, features from the image and
pose can be extracted separately and concatenated, and the
score function trained to fuse them together [11, 12]. However,
with these methods, the features are hand-crafted, and