Transformer-Driven Human Pose and Mesh Reconstruction: The METRO Method


template human mesh to preserve the positional information of each query in the input sequence. To be specific, we concatenate the image feature vector $X \in \mathbb{R}^{2048 \times 1}$ with the 3D coordinates $(x_i, y_i, z_i)$ of every body joint $i$. This forms a set of joint queries $Q_J = \{q^J_1, q^J_2, \ldots, q^J_n\}$, where $q^J_i \in \mathbb{R}^{2051 \times 1}$. Similarly, we conduct the same positional encoding for every mesh vertex $j$, and form a set of vertex queries $Q_V = \{q^V_1, q^V_2, \ldots, q^V_m\}$, where $q^V_j \in \mathbb{R}^{2051 \times 1}$.
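
The query construction above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_queries` and the toy template coordinates are hypothetical, and only the 2048 and 2051 dimensions come from the text.

```python
from typing import List, Sequence


def build_queries(image_feature: Sequence[float],
                  template_xyz: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append each template point's (x, y, z) to a copy of the image feature."""
    queries = []
    for xyz in template_xyz:
        if len(xyz) != 3:
            raise ValueError("each template point needs exactly 3 coordinates")
        queries.append(list(image_feature) + list(xyz))
    return queries


# Toy inputs: a 2048-d feature and two template joints (values are made up).
feature = [0.0] * 2048
template_joints = [(0.0, 0.1, 0.2), (0.3, 0.4, 0.5)]
joint_queries = build_queries(feature, template_joints)
```

Every query shares the same image feature and differs only in its last three entries, so the template coordinates are what distinguish one query from another.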

3.3. Masked Vertex Modeling

Prior works [9, 49] use Masked Language Modeling (MLM) to learn the linguistic properties of a training corpus. However, MLM aims to recover the inputs, so it cannot be directly applied to our regression task.

To fully activate the bi-directional attention in our transformer encoder, we design Masked Vertex Modeling (MVM) for our regression task. We randomly mask a percentage of the input queries. Unlike MLM [9], which recovers the masked inputs, we instead ask the transformer to regress all the joints and vertices.

To predict the output corresponding to a missing query, the model has to rely on other relevant queries. This is similar in spirit to simulating occlusions, where some body parts are invisible. As a result, MVM forces the transformer to regress 3D coordinates by taking other relevant vertices and joints into consideration, regardless of their distances and the mesh topology. This facilitates both short- and long-range interactions among joints and vertices for better human body modeling.
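
A minimal sketch of the random masking step is given below. The text does not say what a masked query is replaced with, so the all-zero placeholder is an assumption, as are the helper name `mask_queries` and the 30% ratio used in the example.

```python
import random
from typing import List, Tuple


def mask_queries(queries: List[List[float]], mask_ratio: float,
                 rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Replace a random subset of queries with a placeholder vector."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_masked = int(round(mask_ratio * len(queries)))
    masked_idx = sorted(rng.sample(range(len(queries)), num_masked))
    masked_set = set(masked_idx)
    out = []
    for i, q in enumerate(queries):
        # Assumption: a masked query becomes an all-zero vector.
        out.append([0.0] * len(q) if i in masked_set else list(q))
    return out, masked_idx


queries = [[float(i)] * 4 for i in range(1, 11)]  # 10 toy queries
masked, idx = mask_queries(queries, mask_ratio=0.3, rng=random.Random(0))
```

Only the inputs are masked. The regression targets stay complete, so the loss still covers every joint and vertex, including those whose queries were hidden.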

3.4. Training

To train the transformer encoder, we apply loss functions on top of the transformer outputs and minimize the errors between predictions and ground truths. We are given a dataset $D = \{I^i, \bar{V}^i_{3D}, \bar{J}^i_{3D}, \bar{J}^i_{2D}\}_{i=1}^{T}$, where $T$ is the total number of training images. $I \in \mathbb{R}^{w \times h \times 3}$ denotes an RGB image. $\bar{V}_{3D} \in \mathbb{R}^{M \times 3}$ denotes the ground truth 3D coordinates of the mesh vertices, and $M$ is the number of vertices. $\bar{J}_{3D} \in \mathbb{R}^{K \times 3}$ denotes the ground truth 3D coordinates of the body joints, and $K$ is the number of joints of a person. Similarly, $\bar{J}_{2D} \in \mathbb{R}^{K \times 2}$ denotes the ground truth 2D coordinates of the body joints.

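Let $V_{3D}$ denote the output vertex locations and $J_{3D}$ the output joint locations. We use $L_1$ loss to minimize the errors between predictions and ground truths:

$$\mathcal{L}_V = \frac{1}{M} \sum_{i=1}^{M} \left\| V_{3D} - \bar{V}_{3D} \right\|_1 \tag{1}$$

$$\mathcal{L}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J_{3D} - \bar{J}_{3D} \right\|_1 \tag{2}$$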
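
As a hedged illustration of the losses in Eqs. (1) and (2), the sketch below averages the per-point L1 distance between predicted and ground-truth coordinates. The function name `l1_loss` and the toy coordinates are hypothetical. Some implementations also average over the three coordinate axes, which only changes the value by a constant factor.

```python
from typing import Sequence


def l1_loss(pred: Sequence[Sequence[float]],
            target: Sequence[Sequence[float]]) -> float:
    """Mean over points of the L1 norm of (pred - target)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for p, t in zip(pred, target):
        total += sum(abs(a - b) for a, b in zip(p, t))
    return total / len(pred)


# Toy example with two vertices (values are made up).
pred_vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
gt_vertices = [(0.5, 0.0, 0.0), (1.0, 1.0, 3.0)]
loss_v = l1_loss(pred_vertices, gt_vertices)  # (0.5 + 1.0) / 2
```

The same function applies to joints by passing K joint coordinates instead of M vertex coordinates.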

It is worth noting that the 3D joints can also be calculated from the predicted mesh. Following common practice in the literature [8, 22, 25, 24], we use a pre-defined regression matrix $G \in \mathbb{R}^{K \times M}$ and obtain the regressed 3D joints by $J^{reg}_{3D} = G V_{3D}$. Similar to prior works, we use $L_1$ loss to optimize $J^{reg}_{3D}$:

$$\mathcal{L}^{reg}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J^{reg}_{3D} - \bar{J}_{3D} \right\|_1 \tag{3}$$
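
The joint regression from the mesh is a plain matrix product of a K×M matrix with the M×3 vertex array. The sketch below uses a tiny made-up matrix `G` purely to show the shapes; in practice the regression matrix is pre-defined rather than written by hand, and the name `regress_joints` is hypothetical.

```python
from typing import List, Sequence


def regress_joints(G: Sequence[Sequence[float]],
                   V: Sequence[Sequence[float]]) -> List[List[float]]:
    """Compute J = G V for G of shape K x M and V of shape M x 3."""
    if any(len(row) != len(V) for row in G):
        raise ValueError("each row of G needs one weight per vertex")
    return [[sum(g * v[c] for g, v in zip(row, V)) for c in range(3)]
            for row in G]


# Toy setup with K = 2 joints and M = 3 vertices (values are made up).
G = [[0.5, 0.5, 0.0],   # joint 0: midpoint of vertices 0 and 1
     [0.0, 0.0, 1.0]]   # joint 1: coincides with vertex 2
V = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [1.0, 3.0, 5.0]]
J_reg = regress_joints(G, V)
```

Because each joint is a weighted combination of vertices, a loss on the regressed joints sends gradients back to the predicted mesh itself.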

2D re-projection has been commonly used to enhance the image-mesh alignment [22, 25, 24]. It also helps visualize the reconstruction in an image. Inspired by these prior works, we project the 3D joints to 2D space using the estimated camera parameters, and minimize the errors between the 2D projections and the 2D ground truths:

$$\mathcal{L}^{proj}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J_{2D} - \bar{J}_{2D} \right\|_1 \tag{4}$$

where $J_{2D}$ denotes the projected 2D joints, and the camera parameters are learned by a linear layer on top of the outputs of the transformer encoder.
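
The text does not specify the camera model. The sketch below assumes a weak-perspective camera with a scale `scale` and a 2D translation (`tx`, `ty`), a common choice in related work. Both this parameterization and the function name `project_weak_perspective` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def project_weak_perspective(joints_3d: Sequence[Sequence[float]],
                             scale: float, tx: float,
                             ty: float) -> List[Tuple[float, float]]:
    """Drop depth, then translate and scale x and y (assumed camera model)."""
    return [(scale * (x + tx), scale * (y + ty)) for x, y, _z in joints_3d]


# Toy joints and camera parameters (values are made up).
joints_3d = [(0.0, 0.0, 1.0), (1.0, -1.0, 2.0)]
joints_2d = project_weak_perspective(joints_3d, scale=2.0, tx=0.5, ty=0.0)
```

The projected joints can then be compared with the 2D ground truths using the same L1 distance as the 3D terms.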

To perform large-scale training, it is highly desirable to leverage both 2D and 3D training datasets for better generalization. As explored in the literature [34, 22, 25, 24, 23, 8, 32], we use a mix-training strategy that leverages different training datasets, with or without paired image-mesh annotations. Our overall objective is written as:

$$\mathcal{L} = \alpha \times \left( \mathcal{L}_V + \mathcal{L}_J + \mathcal{L}^{reg}_J \right) + \beta \times \mathcal{L}^{proj}_J \tag{5}$$

where $\alpha$ and $\beta$ are binary flags for each training sample, indicating the availability of 3D and 2D ground truths, respectively.
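
The role of the two binary flags in Eq. (5) can be sketched as follows, with the 3D terms taken to be the losses of Eqs. (1) to (3). The function name `total_loss` and the loss values are hypothetical.

```python
def total_loss(loss_v: float, loss_j: float, loss_j_reg: float,
               loss_j_proj: float, has_3d: bool, has_2d: bool) -> float:
    """Combine per-sample losses, gated by which ground truths exist."""
    alpha = 1.0 if has_3d else 0.0  # 3D vertex/joint annotations available
    beta = 1.0 if has_2d else 0.0   # 2D joint annotations available
    return alpha * (loss_v + loss_j + loss_j_reg) + beta * loss_j_proj


# A sample with both kinds of annotation, and one with 2D joints only.
loss_full = total_loss(0.2, 0.3, 0.1, 0.4, has_3d=True, has_2d=True)
loss_2d_only = total_loss(0.2, 0.3, 0.1, 0.4, has_3d=False, has_2d=True)
```

A sample from a 2D-only dataset therefore contributes just the re-projection term, which is what lets datasets with and without mesh annotations be mixed during training.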

3.5. Implementation Details

Our method is able to process meshes of arbitrary size. However, due to the memory constraints of current hardware, our transformer processes a coarse mesh: (1) we use a coarse template mesh (431 vertices) for positional encoding, and the transformer outputs a coarse mesh; (2) we use learnable Multi-Layer Perceptrons (MLPs) to upsample the coarse mesh to the original mesh (6890 vertices for the SMPL human mesh topology); (3) the transformer and the MLPs are trained end-to-end. Note that the coarse mesh is obtained by sub-sampling the original mesh twice, down to 431 vertices, with a sampling algorithm [42]. As discussed in the literature [25], learning a coarse mesh followed by upsampling helps reduce computation. It also helps avoid redundancy in the original mesh (due to the spatial locality of vertices), which makes training more efficient.
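
The coarse-to-fine step can be pictured as learnable layers that act along the vertex axis and are shared across the x, y and z coordinates. The sketch below is an assumption-heavy toy: it uses two linear stages (mirroring the two sub-sampling steps), random weights standing in for learned parameters, and tiny sizes (4 → 8 → 16) in place of the 431 and 6890 vertices mentioned above. The helper names are hypothetical.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]


def make_linear(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Random weights and zero bias, standing in for learned parameters."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def upsample(vertices: List[List[float]], layer: Layer) -> List[List[float]]:
    """Map n_in vertices to n_out vertices; each output mixes all inputs."""
    w, b = layer
    return [[sum(wij * v[c] for wij, v in zip(row, vertices)) + bias
             for c in range(3)]
            for row, bias in zip(w, b)]


rng = random.Random(0)
coarse = [[rng.random() for _ in range(3)] for _ in range(4)]  # toy coarse mesh
stage1 = make_linear(8, 4, rng)
stage2 = make_linear(16, 8, rng)
fine = upsample(upsample(coarse, stage1), stage2)
```

Keeping the transformer on the coarse mesh and leaving the expansion to such lightweight layers is what keeps the sequence length, and hence the attention cost, manageable.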

4. Experimental Results

We first show that our method outperforms the previous state-of-the-art human mesh reconstruction methods on the Human3.6M and 3DPW datasets. Then, we provide ablation
