Transformer-Driven Human Pose and Mesh Reconstruction: The METRO Method
Updated 2024-07-06
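This paper introduces MEsh TRansfOrmer (METRO), a method for reconstructing 3D human pose and mesh vertices from a single image. At its core, METRO uses a transformer encoder that models both vertex-vertex and vertex-joint interactions and jointly predicts 3D joint coordinates and mesh vertices in an end-to-end manner.

Unlike existing techniques that regress pose and shape parameters, METRO does not depend on a parametric mesh model such as SMPL. This flexibility allows METRO to be extended to other objects, such as hands. Because self-attention lets the transformer attend freely between any two vertices, the model can learn non-local relationships among mesh vertices and joints, which improves its handling of difficult cases such as partial occlusion.

The paper also proposes a Masked Vertex Modeling strategy that further improves the robustness and effectiveness of METRO in challenging situations. On the public Human3.6M and 3DPW datasets, METRO is reported to perform significantly better than the previous state of the art, and the paper presents detailed experimental results for human mesh reconstruction.

METRO offers a new approach to 3D human modeling and illustrates the potential of transformers for image-based, high-precision 3D reconstruction, with possible applications in computer graphics, virtual reality, and augmented reality.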
template human mesh to preserve the positional information of each query in the input sequence. To be specific, we concatenate the image feature vector $X \in \mathbb{R}^{2048 \times 1}$ with the 3D coordinates $(x_i, y_i, z_i)$ of every body joint $i$. This forms a set of joint queries $Q_J = \{q^J_1, q^J_2, \ldots, q^J_n\}$, where $q^J_i \in \mathbb{R}^{2051 \times 1}$. Similarly, we conduct the same positional encoding for every mesh vertex $j$, and form a set of vertex queries $Q_V = \{q^V_1, q^V_2, \ldots, q^V_m\}$, where $q^V_j \in \mathbb{R}^{2051 \times 1}$.
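The query construction described above can be illustrated with a minimal NumPy sketch. The helper name `build_queries`, the random stand-in data, and the joint count of 14 are assumptions made for illustration; only the 2048-dimensional image feature, the 2051-dimensional queries, and the 431-vertex coarse template (Section 3.5) come from the text.

```python
import numpy as np

def build_queries(image_feature, template_coords):
    """Concatenate one global image feature with each template 3D coordinate."""
    image_feature = np.asarray(image_feature, dtype=np.float32).reshape(1, -1)
    template_coords = np.asarray(template_coords, dtype=np.float32)
    # Repeat the image feature once per query, then append (x, y, z).
    tiled = np.repeat(image_feature, template_coords.shape[0], axis=0)
    return np.concatenate([tiled, template_coords], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal(2048)                     # stand-in for the image feature vector
joint_template = rng.standard_normal((14, 3))     # n = 14 joints is an assumed value
vertex_template = rng.standard_normal((431, 3))   # coarse template mesh size from Section 3.5

Q_J = build_queries(X, joint_template)    # one row per joint query
Q_V = build_queries(X, vertex_template)   # one row per vertex query
print(Q_J.shape, Q_V.shape)
```

Each row holds one query: the first 2048 entries are shared across all queries, and the last three entries carry the template position that distinguishes one query from another.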
3.3. Masked Vertex Modeling
Prior works [9, 49] use Masked Language Modeling (MLM) to learn the linguistic properties of a training corpus. However, MLM aims to recover the inputs, which cannot be directly applied to our regression task.
To fully activate the bi-directional attentions in our transformer encoder, we design Masked Vertex Modeling (MVM) for our regression task. We mask a percentage of the input queries at random. Rather than recovering the masked inputs as in MLM [9], we ask the transformer to regress all the joints and vertices.
In order to predict an output corresponding to a missing query, the model has to resort to other relevant queries. This is similar in spirit to simulating occlusions where some body parts are invisible. As a result, MVM forces the transformer to regress 3D coordinates by taking other relevant vertices and joints into consideration, regardless of their distances and the mesh topology. This facilitates both short- and long-range interactions among joints and vertices for better human body modeling.
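A minimal sketch of the random query masking may help make MVM concrete. The text does not state the masking ratio or what a masked query is replaced with, so the ratio of 0.25, the zero placeholder, and the helper name `mask_queries` are all assumptions.

```python
import numpy as np

def mask_queries(queries, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of input queries with a placeholder value."""
    num_queries = queries.shape[0]
    num_masked = int(round(mask_ratio * num_queries))
    # Sample distinct query indices to mask.
    masked_idx = rng.choice(num_queries, size=num_masked, replace=False)
    masked = queries.copy()
    masked[masked_idx] = mask_value  # assumed placeholder; the text does not specify one
    return masked, masked_idx

rng = np.random.default_rng(0)
queries = rng.standard_normal((20, 2051))        # 20 stand-in joint/vertex queries
masked_queries, masked_idx = mask_queries(queries, mask_ratio=0.25, rng=rng)

# Unlike MLM, the regression targets are not restricted to the masked positions:
# the loss would still cover the 3D coordinates of all 20 joints and vertices.
regression_targets = rng.standard_normal((20, 3))
print(len(masked_idx), masked_queries.shape, regression_targets.shape)
```

The key difference from MLM is visible in the last lines: masking only changes the inputs, while the targets remain the full set of 3D coordinates.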
3.4. Training
To train the transformer encoder, we apply loss functions on top of the transformer outputs, and minimize the errors between predictions and ground truths. We are given a dataset $D = \{I^i, \bar{V}^i_{3D}, \bar{J}^i_{3D}, \bar{J}^i_{2D}\}_{i=1}^{T}$, where $T$ is the total number of training images. $I \in \mathbb{R}^{w \times h \times 3}$ denotes an RGB image. $\bar{V}_{3D} \in \mathbb{R}^{M \times 3}$ denotes the ground truth 3D coordinates of the mesh vertices and $M$ is the number of vertices. $\bar{J}_{3D} \in \mathbb{R}^{K \times 3}$ denotes the ground truth 3D coordinates of the body joints and $K$ is the number of joints of a person. Similarly, $\bar{J}_{2D} \in \mathbb{R}^{K \times 2}$ denotes the ground truth 2D coordinates of the body joints.
Let $V_{3D}$ denote the output vertex locations and $J_{3D}$ the output joint locations. We use $L_1$ loss to minimize the errors between predictions and ground truths:

$$\mathcal{L}_V = \frac{1}{M} \sum_{i=1}^{M} \left\| V_{3D} - \bar{V}_{3D} \right\|_1, \tag{1}$$

$$\mathcal{L}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J_{3D} - \bar{J}_{3D} \right\|_1. \tag{2}$$
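As a small worked example of Eqs. (1) and (2), the sketch below reads each summand as the $L_1$ norm of the coordinate error of one vertex or joint, averaged over all points. That per-point reading and the helper name `l1_loss` are assumptions; the numbers are made up for illustration.

```python
import numpy as np

def l1_loss(pred, target):
    """Sum absolute coordinate errors per point, then average over points."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.abs(pred - target).sum(axis=1).mean()

# Two predicted points against their ground truths.
V_pred = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
V_true = [[1.0, 0.0, 0.0], [1.0, 3.0, 1.0]]

# Per-point L1 errors are 1 and 2, so the average is 1.5.
loss_V = l1_loss(V_pred, V_true)
print(loss_V)
```

The same function applies unchanged to joints, with a `(K, 3)` array in place of the `(M, 3)` vertex array.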
It is worth noting that the 3D joints can also be calculated from the predicted mesh. Following the common practice in literature [8, 22, 25, 24], we use a pre-defined regression matrix $G \in \mathbb{R}^{K \times M}$, and obtain the regressed 3D joints by $J^{reg}_{3D} = G V_{3D}$. Similar to prior works, we use $L_1$ loss to optimize $J^{reg}_{3D}$:

$$\mathcal{L}^{reg}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J^{reg}_{3D} - \bar{J}_{3D} \right\|_1. \tag{3}$$
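The joint regression $J^{reg}_{3D} = G V_{3D}$ is a single matrix product, as the toy sketch below shows. The matrix `G` here is a made-up example with $K = 2$ and $M = 4$ in which each joint is the midpoint of two vertices; the real pre-defined matrix comes with the mesh model and is not reproduced here.

```python
import numpy as np

# Toy mesh with M = 4 vertices.
V_3d = np.array([[0.0, 0.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0]])

# Hypothetical K x M regression matrix: each row weights the vertices for one joint.
G = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

J_reg = G @ V_3d                     # shape (K, 3): [[1, 0, 0], [0, 1, 1]]

J_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
# Per-joint L1 errors are 0 and 1, so the averaged loss of Eq. (3) is 0.5.
loss_reg = np.abs(J_reg - J_true).sum(axis=1).mean()
print(J_reg, loss_reg)
```

Because the joints are a linear function of the vertices, this loss also sends gradients back to the predicted mesh.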
2D re-projection has been commonly used to enhance the image-mesh alignment [22, 25, 24]. Also, it helps visualize the reconstruction in an image. Inspired by the prior works, we project the 3D joints to 2D space using the estimated camera parameters, and minimize the errors between the 2D projections and 2D ground truths:

$$\mathcal{L}^{proj}_J = \frac{1}{K} \sum_{i=1}^{K} \left\| J_{2D} - \bar{J}_{2D} \right\|_1, \tag{4}$$

where the camera parameters are learned by using a linear layer on top of the outputs of the transformer encoder.
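The text does not say which camera model is used. The sketch below assumes a weak-perspective camera with a scale `s` and a 2D translation `t`, a common choice in this line of work; the function name `project_weak_perspective` and all numbers are illustrative only.

```python
import numpy as np

def project_weak_perspective(joints_3d, scale, translation):
    """Drop depth, shift by a 2D translation, and scale (assumed camera model)."""
    joints_3d = np.asarray(joints_3d, dtype=np.float64)
    translation = np.asarray(translation, dtype=np.float64)
    return scale * (joints_3d[:, :2] + translation)

J_3d = np.array([[1.0, 2.0, 5.0],
                 [0.0, -1.0, 3.0]])
J_2d = project_weak_perspective(J_3d, scale=2.0, translation=[0.5, 0.0])
# J_2d is [[3, 4], [1, -2]]

J_2d_true = np.array([[3.0, 4.0],
                      [1.0, 0.0]])
# Per-joint L1 errors are 0 and 2, so the averaged loss of Eq. (4) is 1.0.
loss_proj = np.abs(J_2d - J_2d_true).sum(axis=1).mean()
print(J_2d, loss_proj)
```

Under this assumption the camera has only three parameters, which is consistent with predicting them from a single linear layer.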
To perform large-scale training, it is highly desirable to leverage both 2D and 3D training datasets for better generalization. As explored in literature [34, 22, 25, 24, 23, 8, 32], we use a mix-training strategy that leverages different training datasets, with or without the paired image-mesh annotations. Our overall objective is written as:

$$\mathcal{L} = \alpha \times (\mathcal{L}_V + \mathcal{L}_J + \mathcal{L}^{reg}_J) + \beta \times \mathcal{L}^{proj}_J, \tag{5}$$

where $\alpha$ and $\beta$ are binary flags for each training sample, indicating the availability of 3D and 2D ground truths, respectively.
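Eq. (5) can be traced with a few made-up loss values. The sketch below shows how the binary flags switch terms on and off for samples with different annotations; the function name `total_loss` and the numbers are illustrative assumptions.

```python
def total_loss(loss_v, loss_j, loss_reg, loss_proj, has_3d, has_2d):
    """Combine the four losses of Eq. (5) using per-sample availability flags."""
    alpha = 1.0 if has_3d else 0.0   # 3D ground truth available
    beta = 1.0 if has_2d else 0.0    # 2D ground truth available
    return alpha * (loss_v + loss_j + loss_reg) + beta * loss_proj

losses = dict(loss_v=0.5, loss_j=0.25, loss_reg=0.25, loss_proj=2.0)

both = total_loss(**losses, has_3d=True, has_2d=True)       # 1.0 + 2.0 = 3.0
only_2d = total_loss(**losses, has_3d=False, has_2d=True)   # 2.0
only_3d = total_loss(**losses, has_3d=True, has_2d=False)   # 1.0
print(both, only_2d, only_3d)
```

A sample from a 2D-only dataset therefore contributes only the re-projection term, which is what allows datasets without mesh annotations to be mixed into training.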
3.5. Implementation Details
Our method is able to process meshes of arbitrary size. However, due to the memory constraints of current hardware, our transformer processes a coarse mesh: (1) we use a coarse template mesh (431 vertices) for positional encoding, and the transformer outputs a coarse mesh; (2) we use learnable Multi-Layer Perceptrons (MLPs) to upsample the coarse mesh to the original mesh (6890 vertices for the SMPL human mesh topology); (3) the transformer and MLPs are trained end-to-end. Please note that the coarse mesh is obtained by sub-sampling twice to 431 vertices with a sampling algorithm [42]. As discussed in the literature [25], learning a coarse mesh followed by upsampling helps reduce computation. It also helps avoid redundancy in the original mesh (due to the spatial locality of vertices), which makes training more efficient.
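The shape bookkeeping of the coarse-to-fine upsampling can be sketched as follows. The text says only that learnable MLPs are used; a single linear layer acting along the vertex axis, with random stand-in weights, is assumed here purely to show how 431 vertices can be mapped to 6890.

```python
import numpy as np

COARSE, FINE = 431, 6890  # vertex counts stated in the text

def upsample(coarse_mesh, weight, bias):
    """Map (COARSE, 3) vertices to (FINE, 3) with one linear layer over the vertex axis."""
    return weight @ coarse_mesh + bias

rng = np.random.default_rng(0)
weight = rng.standard_normal((FINE, COARSE)) * 0.01  # would be learned in practice
bias = np.zeros((FINE, 1))                           # broadcast over x, y, z
coarse_mesh = rng.standard_normal((COARSE, 3))       # stand-in transformer output

fine_mesh = upsample(coarse_mesh, weight, bias)
print(fine_mesh.shape)  # (6890, 3)
```

Each fine vertex is a learned combination of coarse vertices, so the transformer only has to attend over a few hundred queries rather than several thousand.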
4. Experimental Results
We first show that our method outperforms the previous state-of-the-art human mesh reconstruction methods on the Human3.6M and 3DPW datasets. Then, we provide ablation