3DV: 3D Dynamic Voxel for Action Recognition in Depth Videos

Yancheng Wang1, Yang Xiao1†, Fu Xiong2, Wenxiang Jiang1, Zhiguo Cao1, Joey Tianyi Zhou3, and Junsong Yuan4

1 National Key Laboratory of Science and Technology on Multi-spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 Megvii Research Nanjing, Megvii Technology, China
3 IHPC, A*STAR, Singapore
4 Department of Computer Science and Engineering, State University of New York at Buffalo, USA
{yancheng wang, Yang Xiao}@hust.edu.cn, xiongfu@megvii.com, {wenx jiang, zgcao}@hust.edu.cn, zhouty@ihpc.a-star.edu.sg, jsyuan@buffalo.edu

† Yang Xiao is the corresponding author (Yang Xiao@hust.edu.cn).

Abstract

To facilitate depth-based 3D action recognition, 3D dynamic voxel (3DV) is proposed as a novel 3D motion representation. With 3D space voxelization, the key idea of 3DV is to compactly encode the 3D motion information within a depth video into a regular voxel set (i.e., 3DV) via temporal rank pooling. Each available 3DV voxel intrinsically carries 3D spatial and motion features jointly. 3DV is then abstracted into a point set and fed into PointNet++ for 3D action recognition in an end-to-end learning way. The intuition for transferring 3DV into the point set form is that PointNet++ is lightweight and effective for deep feature learning on point sets. Since 3DV may lose appearance clues, a multi-stream 3D action recognition approach is also proposed to learn motion and appearance features jointly. To extract richer temporal order information of actions, the depth video is further divided into temporal splits, and this procedure is encoded into 3DV integrally. Extensive experiments on 4 well-established benchmark datasets demonstrate the superiority of our proposition. Impressively, we achieve 82.4% and 93.5% accuracy on NTU RGB+D 120 [13] under the cross-subject and cross-setup test settings respectively. The code of 3DV is available at https://github.com/3huo/3DV-Action.

1. Introduction

Figure 1. A live "Handshaking" 3DV example from the NTU RGB+D 60 dataset [33]. The 3DV motion value reflects the temporal order of the 3D motion components: later motion components take higher values, and vice versa. Local regions rich in motion information have a higher standard deviation of 3DV motion values.

Over the last decade, 3D action recognition has been an active research topic due to the emergence of low-cost depth cameras (e.g., Microsoft Kinect [52]), with wide applications in video surveillance, human-machine interaction, etc. [45, 46]. The state-of-the-art 3D action recognition approaches can generally be categorized into the depth-based [32, 48, 22, 10, 42, 46] and skeleton-based [28, 17, 16, 36, 11, 51, 35, 34] groups. Since accurate and robust 3D human pose estimation is still challenging [47, 21], we focus on the depth-based manner in this paper.

Because humans act in 3D space, capturing 3D motion patterns effectively and efficiently is crucial for depth-based 3D action recognition. An intuitive way is to compute dense scene flow [1]. However, this can be time-consuming [1] and may not be preferred in practice. Recently, the dynamic image [3, 2], which compresses an RGB video into a single image while maintaining motion characteristics via temporal rank pooling [6, 5], has been introduced into the depth domain for 3D action characterization [42, 46]. Consequently, deep convolutional neural network (CNN) models can be well adapted to action classification, thanks to CNN's strong pattern representation power.

However, we argue that the propositions in [42, 46] that transfer the dynamic image to the 3D domain have not fully exploited the 3D descriptive clues within depth video, even though a normal vector [42] or multi-view projection [46] is applied. Our insight is that both methods in [42, 46] ultimately encode the 3D motion information onto 2D image planes to fit CNN. Hence they cannot well answer the question "Where does a specific 3D motion pattern appear within 3D space?", which is critical for effective 3D action characterization, since a human action essentially consists of motion patterns together with their compact spatial structure.

To address this issue, we propose 3D dynamic voxel (3DV), a novel 3D motion representation for 3D action characterization. To extract 3DV, 3D space voxelization is first executed: each depth frame is transferred into a regular voxel set. By observing whether the yielded voxels are occupied, the appearance content within them can be encoded in a binary manner [40]. Temporal rank pooling [6, 5] is then executed over all the binary voxel sets to compress them into a single voxel set, termed 3DV. Accordingly, the 3D motion and spatial characteristics of a 3D action are encoded into 3DV jointly. To reveal this, a live "Handshaking" 3DV example is given in Fig. 1. As shown, each available 3DV voxel possesses a motion value that reflects the temporal order of its corresponding 3D motion component: later motion components take higher values, and vice versa. Meanwhile, local regions rich in 3D motion information exhibit a higher standard deviation of 3DV motion values (e.g., the hand region vs. the head region). At the same time, the position of a 3DV voxel reveals the 3D location of its motion component. The spatial-motion representative capacity of 3DV can thus essentially leverage the characterization of 3D actions. To acquire richer temporal order information, we further divide the depth video into finer temporal splits; this is encoded into 3DV integrally by fusing the motion values from all the temporal splits.

With 3DV in hand, the consequent question is how to choose an adapted deep learning model for 3D action recognition. For voxel sets, 3D CNN [20, 7, 21] is usually used for 3D visual pattern understanding and also fits 3DV. However, it is difficult to train due to its large number of convolutional parameters. Inspired by the recent success of lightweight deep learning models on point sets (e.g., PointNet++ [25]), we propose to transfer 3DV into the point set form as the input of PointNet++ for end-to-end 3D action recognition. That is, each 3DV voxel is abstracted into a point characterized by its 3D position index and motion value. Our intuition is to alleviate the training difficulty and burden. Although 3DV reveals 3D motion information, it may still lose appearance details, as shown in Fig. 1.

(a) Human-object interaction. (b) Self-occlusion.
Figure 2. Failure cases of 3D skeleton extraction in the NTU RGB+D 60 dataset [33], caused by human-object interaction and self-occlusion. The depth frames are shown together with their RGB counterparts.

Since appearance also matters for action recognition, using 3DV alone may weaken performance. To alleviate this, a multi-stream deep learning model based on PointNet++ is also proposed to learn 3D motion and appearance features jointly. In particular, it consists of one motion stream and several appearance streams. The input of the motion stream is 3DV, while the inputs of the appearance streams are the depth frames sampled from the different temporal splits; these are also transferred into point set form to fit PointNet++. Experiments on 2 large-scale 3D action recognition datasets (i.e., NTU RGB+D 120 [13] and 60 [33]) and 2 small-scale ones (i.e., N-UCLA [41] and UWA3DII [26]) verify the superiority of 3DV over the state-of-the-art methods. The main contributions of this paper are:
• 3DV: a novel and compact 3D motion representation for 3D action characterization;
• Applying PointNet++ to 3DV, from the point set perspective, for end-to-end learning based 3D action recognition;
• A multi-stream deep learning model that learns 3D motion and appearance features jointly.
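To make the per-frame step of the pipeline above concrete, the following is a minimal sketch (not the authors' released code) of turning one depth frame into a binary appearance voxel set, as described here and detailed in Sec. 3. The camera intrinsics (fx, fy, cx, cy) and the working-volume bounds are assumptions for illustration; only the 35 mm voxel edge is taken from the paper (Sec. 5).

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, in millimetres) into a 3D point cloud."""
    v, u = np.nonzero(depth)                   # pixel coordinates with valid depth
    z = depth[v, u].astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)         # (N, 3) points in camera space

def binarize_to_voxels(points, bounds_min, bounds_max, voxel=35.0):
    """Occupancy encoding: a voxel is 1 if at least one point falls inside it."""
    dims = np.ceil((bounds_max - bounds_min) / voxel).astype(int)
    grid = np.zeros(dims, dtype=np.uint8)
    idx = np.floor((points - bounds_min) / voxel).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1)   # drop points outside the volume
    grid[tuple(idx[keep].T)] = 1
    return grid                                        # binary 3D appearance voxel set
```

Repeating this for every frame of a depth video yields the binary sequence V_1, ..., V_T that temporal rank pooling (Sec. 3.2) compresses into a single 3DV.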
2. Related work

3D action recognition. The existing 3D action recognition approaches can generally be categorized into the depth-based [23, 48, 22, 10, 42, 46] and skeleton-based [15, 17, 16, 36, 11, 51, 35, 34] groups. Recently, the skeleton-based ones using RNN [15] and GCN [35] have drawn increasing attention, since resorting to the 3D skeleton helps to resist the impact of variations in scene, human attributes, imaging viewpoint, etc. Nevertheless, one critical issue cannot be ignored: accurate and robust 3D human pose estimation is still not a trivial task [47, 21]. To reveal this, we carefully checked the 3D skeletons within NTU RGB+D 60 [33]. In fact, 3D skeleton extraction may still fail even under constrained conditions, as shown in Fig. 2. Hence the depth-based manner currently seems preferable for practical applications, and it is our concern. Most of the paid efforts focus on proposing 3D action representations that capture 3D spatial-temporal appearance or motion patterns. At the early stage, hand-crafted descriptors such as the bag of 3D points [12] and depth motion [23, 37] played an important role in action recognition [...]

3. 3D dynamic voxel (3DV)

Our research motivation on 3DV is to seek a compact 3D motion representation to characterize 3D action, so that deep feature learning can be easily conducted on it. The proposition of 3DV can be regarded as the essential effort of extending temporal rank pooling [6, 5], originally proposed for 2D video, to the 3D domain, in order to capture 3D motion patterns and spatial clues jointly. The main idea of 3DV extraction is shown in Fig. 3. The depth frames are first mapped into point clouds to better reveal the 3D characteristics. Then, 3D voxelization is executed to further transform the disordered point clouds into regular voxel sets. Consequently, the 3D action appearance clue within a certain depth frame can be described by judging whether the voxels are occupied or not, which yields one binary 3D appearance voxel set per frame; temporal rank pooling can then be applied to these sets for 3DV extraction. Meanwhile, the binary voxel-wise representation is of higher tolerance towards the intrinsic sparsity and density variability problem [25] within point clouds, which essentially helps to leverage generalization power.

3.2. 3DV extraction using temporal rank pooling

With the binary 3D appearance voxel sets above, temporal rank pooling is executed to generate 3DV. A linear temporal ranking score function is defined to compress the voxel sets into one voxel set (i.e., 3DV).

Particularly, suppose $V_1, \ldots, V_T$ denote the binary 3D appearance voxel sets, and $\bar{V}_t = \frac{1}{t}\sum_{i=1}^{t} V_i$ is their average up to time $t$. The ranking score function at time $t$ is given by

$$ S(t \mid w) = \langle w, \bar{V}_t \rangle, \qquad (2) $$

where $w \in \mathbb{R}^d$ is the ranking parameter vector. $w$ is learned from the depth video to reflect the ranking relationship among the frames. The criterion is that later frames obtain larger ranking scores:

$$ q > t \Rightarrow S(q \mid w) > S(t \mid w). \qquad (3) $$

The learning of $w$ is formulated as a convex optimization problem using RankSVM [38]:

$$ w^{*} = \arg\min_{w} \ \frac{\lambda}{2}\lVert w \rVert^{2} + \frac{2}{T(T-1)} \sum_{q > t} \max\{0,\ 1 - S(q \mid w) + S(t \mid w)\}. \qquad (4) $$

Specifically, the first term is the regularizer commonly used in SVM, and the second is the hinge loss that soft-counts how many pairs $q > t$ are incorrectly ranked, i.e., do not obey $S(q \mid w) > S(t \mid w) + 1$. Optimizing Eqn. 4 maps the 3D appearance voxel sets $V_1, \ldots, V_T$ to a single vector $w^{*}$. In effect, $w^{*}$ encodes the dynamic evolution information of all the frames. Spatially reordering $w^{*}$ from 1D back to 3D voxel form constructs the 3DV for 3D action characterization. Thus, each 3DV voxel is jointly encoded by the corresponding entry of $w^{*}$ as its motion feature and its regular 3D position index (x, y, z) as its spatial feature. Some more 3DV examples are shown in Fig. 5.

Figure 5. 3DV examples from the NTU RGB+D 60 dataset [33]: (a) Bow, (b) Sit down, (c) Hugging, (d) Pushing.
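To make Eqns. 2-4 concrete, here is a minimal sketch of temporal rank pooling over the flattened binary voxel sets. The paper learns w with RankSVM [38]; this sketch instead runs plain subgradient descent on the same regularized hinge objective, and the hyper-parameters lam, lr and iters are hypothetical.

```python
import numpy as np

def rank_pool(voxel_sets, lam=1e-3, lr=0.01, iters=200):
    """Compress binary voxel sets V_1..V_T into one 3DV (cf. Eqns. 2-4).

    voxel_sets: list of T binary arrays of identical shape (H, W, D).
    Returns an (H, W, D) array of motion values (the reshaped w*).
    """
    shape = voxel_sets[0].shape
    V = np.stack([v.reshape(-1).astype(np.float32) for v in voxel_sets])   # (T, d)
    T = len(voxel_sets)
    V_bar = np.cumsum(V, axis=0) / np.arange(1, T + 1)[:, None]            # running means V̄_t
    w = np.zeros(V.shape[1], dtype=np.float32)

    for _ in range(iters):
        grad = lam * w                          # gradient of the (λ/2)||w||^2 regularizer
        scores = V_bar @ w                      # S(t|w) for every t
        for q in range(1, T):
            for t in range(q):                  # all pairs q > t
                if 1.0 - scores[q] + scores[t] > 0.0:          # hinge is active
                    grad += (2.0 / (T * (T - 1))) * (V_bar[t] - V_bar[q])
        w -= lr * grad
    return w.reshape(shape)                     # each entry is the motion value of a voxel
```

The non-zero entries of the returned grid are the available 3DV voxels; their values encode the temporal order of the motion components, as visualized in Fig. 1 and Fig. 5. The naive O(T^2 d) loop is meant only to mirror the equations and would be slow on full-resolution grids.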
We can intuitively observe that 3DV can actually distinguish the different actions from the motion perspective, even when human-object or human-human interaction happens. Meanwhile, to accelerate 3DV [...]

Figure 7. The procedure of abstracting a 3DV voxel V(x, y, z) into a 3DV point P(x, y, z).

[...] we propose to apply PointNet++ for deep learning on 3DV instead of 3D CNN, concerning effectiveness and efficiency jointly. To this end, 3DV is abstracted into point set form. To our knowledge, using PointNet++ to deal with voxel data has not been well studied before. Meanwhile, since 3DV tends to lose some appearance information as shown in Fig. 4, a multi-stream deep learning model based on PointNet++ is also proposed to learn appearance and motion features for 3D action characterization.

4.1. Review on PointNet++

PointNet++ [25] is derived from PointNet [24], the pioneer of deep learning on point sets. PointNet was proposed mainly to address the disorder problem within point clouds. However, it cannot capture local fine-grained patterns well. PointNet++ alleviates this in a local-to-global hierarchical learning manner. It declares 2 main contributions. First, it partitions the set of points into overlapping local regions to better maintain local fine 3D visual clues. Secondly, it uses PointNet recursively as the local feature learner, and the local features are further grouped into larger units to reveal the global shape characteristics. In summary, PointNet++ generally inherits the merits of PointNet but with stronger local fine-grained descriptive power. Compared with 3D CNN, PointNet++ is generally of more lightweight model size and higher running speed. Meanwhile, it tends to be easier to train.

The intuitions for why we apply PointNet++ to 3DV are threefold. First, we do not want to be trapped in the training challenges of 3D CNN. Secondly, PointNet++ is good at capturing local 3D visual patterns, which is beneficial for 3D action recognition; local 3D motion patterns actually play a vital role in good 3D action characterization, as the hand region in Fig. 1 shows for "Handshaking". Last, applying PointNet++ to 3DV is not a difficult task. What we need to do is to abstract 3DV into point set form, which is illustrated next.

4.2. Abstract 3DV into point set

Suppose the acquired 3DV for a depth video without temporal split is of size H × W × D; each 3DV voxel V(x, y, z) then possesses a global motion value mG given [...]

[...] The inputs of the appearance streams are the raw depth point sets sampled from the T2 temporal splits with action proposal. Particularly, they share the same appearance PointNet++. Motion and appearance features are late fused via concatenation at the fully-connected layer.

5. Implementation details

The 3DV voxel size is set to 35mm × 35mm × 35mm. T1 and T2 are set to 4 and 3 respectively, for multi-temporal motion and appearance feature extraction. For PointNet++, farthest point sampling is used on the centroids of the local regions. The sampled points are grouped with ball query, and the group radius at the first and second level is set to 0.1 and 0.2 respectively. Adam [9] is applied as the optimizer with a batch size of 32. The learning rate begins at 0.001 and decays with a rate of 0.5 every 10 epochs. Training ends at 70 epochs. During training, we perform data augmentation on the 3DV points and raw depth points, including random rotation around the Y and X axes, jittering, and random point dropout.
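To connect Sec. 4.2 with the implementation details above, the following is a minimal sketch, under our own assumptions rather than the released code, of abstracting a 3DV grid into the (x, y, z, motion) point set fed to PointNet++, together with the training-time augmentations just mentioned. The normalization scheme and the rotation, jitter and dropout magnitudes are illustrative choices, not values from the paper.

```python
import numpy as np

def voxels_to_points(dv_grid):
    """Abstract each occupied 3DV voxel V(x, y, z) into a point (x, y, z, motion)."""
    xs, ys, zs = np.nonzero(dv_grid)           # voxels with zero motion value are treated
    pts = np.stack([xs, ys, zs], axis=1).astype(np.float32)   # as unoccupied (a simplification)
    pts -= pts.mean(axis=0)                    # centre the point set
    pts /= (np.abs(pts).max() + 1e-6)          # rough scale normalization (assumption)
    motion = dv_grid[xs, ys, zs].astype(np.float32)[:, None]
    return np.concatenate([pts, motion], axis=1)   # (N, 4): spatial index + motion value

def augment(points, angle_y=0.2, angle_x=0.1, sigma=0.01, drop=0.1):
    """Random rotation about the Y and X axes, jittering and random point dropout."""
    ay = np.random.uniform(-angle_y, angle_y)
    ax = np.random.uniform(-angle_x, angle_x)
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    xyz = points[:, :3] @ (Ry @ Rx).T          # rotate spatial coordinates only
    xyz += np.random.normal(0.0, sigma, xyz.shape)             # jitter
    out = np.concatenate([xyz, points[:, 3:]], axis=1)
    keep = np.random.rand(len(out)) > drop                     # random point dropout
    return out[keep] if keep.any() else out
```

In the full model, a fixed number of such points is sampled per stream and fed to PointNet++, with the motion stream taking the 3DV points and the appearance streams taking raw depth points from the temporal splits.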
The multi-stream network is implemented using PyTorch. Within each stream, PointNet++ samples 2048 points for both motion and appearance feature learning.

6. Experiments

6.1. Experimental setting

Dataset: NTU RGB+D 120 [13]. This is the most recently emerged challenging 3D action recognition dataset, and also the largest one. It contains 114,480 RGB-D action samples of 120 categories captured using Microsoft Kinect v2. The action samples involve large variations in subject, imaging viewpoint and background, which imposes essential challenges on 3D action recognition. The accuracy of the state-of-the-art approaches is not satisfactory (i.e., below 70%) under both the cross-subject and cross-setup evaluation criteria.

Dataset: NTU RGB+D 60 [33]. This is the preliminary version of NTU RGB+D 120; it contains 56,880 RGB-D action samples of 60 categories captured using Microsoft Kinect v2. Before NTU RGB+D 120, it was the largest 3D action recognition dataset. The cross-subject and cross-view evaluation criteria are used for testing.

Dataset: N-UCLA [41]. Compared with NTU RGB+D 120 and NTU RGB+D 60, this is a relatively small-scale 3D action recognition dataset. It only contains 1,475 action samples of 10 action categories, captured using Microsoft Kinect v1 from 3 different viewpoints with relatively high imaging noise. The cross-view evaluation criterion is used for testing.

Dataset: UWA3DII [26]. This is also a small-scale 3D action recognition dataset, with only 1,075 video samples from 30 categories. One essential challenge of this dataset is the limited number of training samples per action category. The samples are captured using Microsoft Kinect v1 with relatively high imaging noise.

Table 1. Performance comparison on action recognition accuracy (%) among different methods on the NTU RGB+D 120 dataset.
Methods | Cross-subject | Cross-setup
Input: 3D skeleton
NTU RGB+D 120 baseline [13] | 55.7 | 57.9
GCA-LSTM [17] | 58.3 | 59.3
FSNet [14] | 59.9 | 62.4
Two stream attention LSTM [16] | 61.2 | 63.3
Body Pose Evolution Map [18] | 64.6 | 66.9
SkeleMotion [4] | 67.7 | 66.9
Input: Depth maps
NTU RGB+D 120 baseline [13] | 48.7 | 40.1
3DV-PointNet++ (ours) | 82.4 | 93.5

Table 2. Performance comparison on action recognition accuracy (%) among different methods on the NTU RGB+D 60 dataset.
Methods | Cross-subject | Cross-view
Input: 3D skeleton
SkeleMotion [4] | 69.6 | 80.1
GCA-LSTM [17] | 74.4 | 82.8
Two stream attention LSTM [16] | 77.1 | 85.1
AGC-LSTM [36] | 89.2 | 95.0
AS-GCN [11] | 86.8 | 94.2
VA-fusion [51] | 89.4 | 95.0
2s-AGCN [35] | 88.5 | 95.1
DGNN [34] | 89.9 | 96.1
Input: Depth maps
HON4D [23] | 30.6 | 7.3
SNV [48] | 31.8 | 13.6
HOG2 [22] | 32.2 | 22.3
Li et al. [10] | 68.1 | 83.4
Wang et al. [42] | 87.1 | 84.2
MVDI [46] | 84.6 | 87.3
3DV-PointNet++ (ours) | 88.8 | 96.3

Input data modality and evaluation metric. In our experiments, the input of the proposed 3DV-based 3D action recognition method is only depth maps. We do not use any other auxiliary information, such as skeleton, RGB image, human mask, etc. The training / test sample splits and testing setups on all 4 datasets are strictly followed for fair comparison. Classification accuracy over all the action samples is reported for performance evaluation.

6.2. Comparison with state-of-the-art methods

NTU RGB+D 120: Our 3DV-based approach is compared with the state-of-the-art skeleton-based and depth-based 3D action recognition methods [13, 17, 16, 18, 4] on this dataset. The performance comparison is listed in Table 1.
We can observe that:

• It is indeed impressive that our proposition achieves breakthrough results on this large-scale challenging dataset for both the cross-subject and cross-setup test settings. Particularly, we achieve 82.4% and 93.5% on these 2 settings respectively, which outperforms the state-of-the-art methods by large margins (i.e., at least 14.7% on cross-subject, and at least 26.6% on cross-setup). This essentially verifies the superiority of our proposition;

• The performance of the other methods is poor, which reveals the great challenge of the NTU RGB+D 120 dataset;

• Our method achieves better performance in the cross-setup case than in the cross-subject one. This implies that 3DV is more sensitive to subject variation.

Table 3. Performance comparison on action recognition accuracy (%) among different depth-based methods on the N-UCLA dataset.
Methods | Accuracy
HON4D [23] | 39.9
SNV [48] | 42.8
AOG [41] | 53.6
HOPC [27] | 80.0
MVDI [46] | 84.2
3DV-PointNet++ (ours) | 95.3

Table 4. Performance comparison on action recognition accuracy (%) among different depth-based methods on the UWA3DII dataset.
Methods | Mean accuracy
HON4D [23] | 28.9
SNV [48] | 29.9
AOG [41] | 26.7
HOPC [27] | 52.2
MVDI [46] | 68.1
3DV-PointNet++ (ours) | 73.2

NTU RGB+D 60: The proposed method is compared with the state-of-the-art approaches [17, 16, 36, 11, 51, 35, 34, 23, 48, 22, 10, 42, 46] on this dataset. The performance comparison is listed in Table 2 [...]