Revisiting Skeleton-based Action Recognition
Haodong Duan¹,³   Yue Zhao²   Kai Chen³,⁵   Dahua Lin¹,³   Bo Dai³,⁴
¹The Chinese University of Hong Kong   ²The University of Texas at Austin   ³Shanghai AI Laboratory
⁴S-Lab, Nanyang Technological University   ⁵SenseTime Research
Abstract
Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt GCNs to extract features on top of human skeletons. Despite the positive results shown in these attempts, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseConv3D, a new approach to skeleton-based action recognition. PoseConv3D relies on a 3D heatmap volume instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseConv3D is more effective in learning spatiotemporal features, more robust against pose estimation noise, and generalizes better in cross-dataset settings. Also, PoseConv3D can handle multi-person scenarios without additional computation cost. Its hierarchical features can be easily integrated with other modalities at early fusion stages, providing a large design space for boosting performance. PoseConv3D achieves state-of-the-art performance on five of six standard skeleton-based action recognition benchmarks. Once fused with other modalities, it achieves state-of-the-art performance on all eight multi-modality action recognition benchmarks. Code has been made available at: https://github.com/kennymckormick/pyskl.
1. Introduction
Action recognition is a central task in video understanding. Existing studies have explored various modalities for feature representation, such as RGB frames [6, 54, 59], optical flows [47], audio waves [62], and human skeletons [60, 64]. Among these modalities, skeleton-based action recognition has received increasing attention in recent years due to its action-focusing nature and compactness. In practice, human skeletons in a video are mainly represented as a sequence of joint coordinate lists, where the coordinates are extracted by pose estimators. Since only pose information is included, skeleton sequences capture action information alone while being immune to contextual nuisances, such as background variation and lighting changes.
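To make this representation concrete, the sketch below shows one common way a skeleton sequence can be stored as an array of per-frame joint coordinates. The array names and shape values are illustrative assumptions, not code from the paper.

import numpy as np

# A skeleton sequence for one person: T frames, V joints, C channels.
# For 2D poses, C = 3 is a common choice: (x, y, confidence score).
T, V, C = 48, 17, 3  # e.g., 17 COCO keypoints (illustrative values)

skeleton_seq = np.zeros((T, V, C), dtype=np.float32)

# A pose estimator fills one row per frame, e.g.:
# skeleton_seq[t, v] = (x_v, y_v, score_v) for joint v at frame t.

# Multi-person clips simply add a person axis: (M, T, V, C).
M = 2
multi_person_seq = np.zeros((M, T, V, C), dtype=np.float32)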
Figure 1. PoseConv3D takes 2D poses as inputs. In general, 2D poses are of better quality than 3D poses. (a) 2D poses estimated with HRNet for videos in NTU-60 and FineGYM; their quality is clearly better than that of (b) 3D poses collected by Kinect sensors or (c) 3D poses estimated with a state-of-the-art estimator (VIBE).
Table 1. Differences between PoseConv3D and GCN-based methods.

               | Previous Work     | PoseConv3D
Input          | 2D / 3D Skeleton  | 2D Skeleton
Format         | Coordinates       | 3D Heatmap Volumes
Architecture   | GCN               | 3D-CNN
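The 3D heatmap volume in Table 1 can be sketched as follows: for each frame, a Gaussian map centered at each joint (weighted by its confidence score) is rendered, and the per-frame maps are stacked along time into a K x T x H x W volume. The function below is a minimal sketch of this idea; the function name, array shapes, and sigma value are assumptions for illustration, not the authors' exact implementation.

import numpy as np

def pose_to_heatmap_volume(poses, H, W, sigma=0.6):
    """Stack per-frame joint heatmaps into a K x T x H x W volume.

    poses: float array of shape (T, K, 3), each row (x, y, confidence),
           with x in [0, W) and y in [0, H).
    Returns: float array of shape (K, T, H, W).
    """
    T, K, _ = poses.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    volume = np.zeros((K, T, H, W), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y, conf = poses[t, k]
            # Gaussian centered at the joint, scaled by its confidence.
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            volume[k, t] = conf * g
    return volume

# Example: 32 frames, 17 joints, rendered on a 64 x 64 spatial grid.
poses = np.random.rand(32, 17, 3).astype(np.float32)
poses[..., 0] *= 64  # x coordinates
poses[..., 1] *= 64  # y coordinates
vol = pose_to_heatmap_volume(poses, H=64, W=64)
print(vol.shape)  # (17, 32, 64, 64)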
Among all the methods for skeleton-based action recognition [15, 57, 58], graph convolutional networks (GCNs) [64] have been one of the most popular approaches. Specifically, GCNs regard every human joint at every timestep as a node. Neighboring nodes along the spatial and temporal dimensions are connected with edges. Graph convolution layers are then applied to the constructed graph to discover action patterns across space and time. Thanks to their good performance on standard benchmarks, GCNs have become a standard approach for processing skeleton sequences.
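As a concrete illustration of this idea, the sketch below implements a minimal spatial graph-convolution layer in the spirit of ST-GCN: joint features are aggregated over graph neighbors through a normalized adjacency matrix, then linearly projected. The class name, the chain adjacency, and all shapes are illustrative assumptions, not the architecture of any particular method.

import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph-convolution layer over a joint graph (a sketch)."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))  # add self-loops
        deg = A.sum(dim=1)
        self.register_buffer("A_norm", A / deg.unsqueeze(1))  # row-normalize
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        # x: (N, T, V, C) -- batch, frames, joints, channels.
        # Aggregate each joint's features from its graph neighbors.
        x = torch.einsum("uv,ntvc->ntuc", self.A_norm, x)
        return self.proj(x)

# Toy usage: a 17-joint skeleton with a simple chain adjacency (illustrative).
V = 17
A = torch.zeros(V, V)
for v in range(V - 1):
    A[v, v + 1] = A[v + 1, v] = 1
layer = SpatialGraphConv(in_ch=3, out_ch=64, adjacency=A)
out = layer(torch.randn(8, 48, V, 3))
print(out.shape)  # torch.Size([8, 48, 17, 64])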
While encouraging results have been observed, GCN-based methods are limited in the following aspects. (1) Robustness: while GCNs directly handle the coordinates of human joints, their recognition results are easily perturbed by coordinate noise introduced during pose estimation.