simply a few bilinear-upsampling [11] or transpose convo-
lution [72] layers. (iii) Combination with dilated convolu-
tions. In [27, 51, 35], dilated convolutions are adopted in
the last two stages in the ResNet or VGGNet to eliminate
the spatial resolution loss, followed by a light low-to-high process to further increase the resolution, thus avoiding the expensive computational cost of using only dilated convolutions [11, 27, 51]. Figure 2 depicts four representative pose
estimation networks.
Multi-scale fusion. The straightforward way is to feed
multi-resolution images separately into multiple networks
and aggregate the output response maps [64]. Hour-
glass [40] and its extensions [77, 31] combine low-level
features in the high-to-low process into the same-resolution
high-level features in the low-to-high process progres-
sively through skip connections. In cascaded pyramid net-
work [11], a globalnet combines low-to-high level features
in the high-to-low process progressively into the low-to-
high process, and then a refinenet combines the low-to-high
level features that are processed through convolutions. Our
approach repeats multi-scale fusion, which is partially in-
spired by deep fusion and its extensions [67, 73, 59, 80, 82].
Intermediate supervision. Intermediate supervision or
deep supervision, developed early for image classification [34, 61], is also adopted to help train deep networks and improve the heatmap estimation quality,
e.g., [69, 40, 64, 3, 11]. The hourglass approach [40] and
the convolutional pose machine approach [69] process the
intermediate heatmaps as the input or a part of the input of
the remaining subnetwork.
Our approach. Our network connects high-to-low sub-
networks in parallel. It maintains high-resolution repre-
sentations through the whole process for spatially precise
heatmap estimation. It generates reliable high-resolution
representations through repeatedly fusing the representa-
tions produced by the high-to-low subnetworks. Our ap-
proach is different from most existing works, which need
a separate low-to-high upsampling process and aggregate
low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in terms of computational complexity and the number of parameters.
There are related multi-scale networks for classification
and segmentation [5, 8, 74, 81, 30, 76, 55, 56, 24, 83, 52, 18]. Our work is partially inspired by some of
them [56, 24, 83, 55], and there are clear differences making
them not applicable to our problem. Convolutional neural
fabrics [56] and interlinked CNN [83] fail to produce high-
quality segmentation results because of a lack of careful design of each subnetwork (depth, batch normalization) and of multi-scale fusion. The grid network [18], a combination
of many weight-shared U-Nets, consists of two separate fu-
sion processes across multi-resolution representations: in the first stage, information is only sent from high resolution to low resolution; in the second stage, information is only sent from low resolution to high resolution; it is thus less competitive. The multi-scale densenet [24] does not target, and cannot generate, reliable high-resolution representations.
3. Approach
Human pose estimation, a.k.a. keypoint detection, aims
to detect the locations of K keypoints or parts (e.g., elbow,
wrist, etc.) from an image I of size W × H × 3. The state-of-the-art methods transform this problem to estimating K heatmaps of size W′ × H′, {H_1, H_2, . . . , H_K}, where each heatmap H_k indicates the location confidence of the kth keypoint.
We follow the widely-adopted pipeline [40, 72, 11] to
predict human keypoints using a convolutional network,
which is composed of a stem consisting of two strided con-
volutions decreasing the resolution, a main body outputting
the feature maps with the same resolution as its input fea-
ture maps, and a regressor estimating the heatmaps where
the keypoint positions are chosen and transformed to the
full resolution. We focus on the design of the main body
and introduce our High-Resolution Net (HRNet) that is de-
picted in Figure 1.
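As a sanity check on this resolution bookkeeping, the short sketch below (plain Python; `conv_out` and `stem_resolution` are illustrative helpers, under the assumption that the stem's two strided convolutions are 3×3 with stride 2 and padding 1) traces how the stem quarters the input resolution before the main body:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Output spatial size of a convolution (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1

def stem_resolution(w: int, h: int) -> tuple[int, int]:
    """Two stride-2 convolutions in the stem, each halving the resolution."""
    for _ in range(2):
        w, h = conv_out(w), conv_out(h)
    return w, h

print(stem_resolution(256, 192))  # (64, 48)
```

So for a 256 × 192 input, the main body and the estimated heatmaps operate at 64 × 48, i.e., 1/4 of the input resolution; the chosen keypoint positions are then transformed back to the full resolution.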
Sequential multi-resolution subnetworks. Existing net-
works for pose estimation are built by connecting high-to-
low resolution subnetworks in series, where each subnet-
work, forming a stage, is composed of a sequence of con-
volutions and there is a down-sample layer across adjacent
subnetworks to halve the resolution.
Let N_sr be the subnetwork in the sth stage and r be the resolution index (its resolution is 1/2^(r−1) of the resolution of the first subnetwork). The high-to-low network with S (e.g., 4) stages can be denoted as:

N_11 → N_22 → N_33 → N_44.    (1)
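In this serial design, the resolution schedule is fixed by the stage index alone: stage s holds a single subnetwork N_ss at 1/2^(s−1) of the first stage's resolution. A minimal sketch (the function name is illustrative, not from the method itself):

```python
def serial_resolutions(num_stages: int = 4) -> list[float]:
    """Relative resolution of each stage in the serial high-to-low design:
    stage s holds one subnetwork, N_ss, at 1 / 2**(s - 1) of the
    first subnetwork's resolution."""
    return [1 / 2 ** (s - 1) for s in range(1, num_stages + 1)]

print(serial_resolutions())  # [1.0, 0.5, 0.25, 0.125]
```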
Parallel multi-resolution subnetworks. We start from a
high-resolution subnetwork as the first stage, gradually add
high-to-low resolution subnetworks one by one, forming
new stages, and connect the multi-resolution subnetworks
in parallel. As a result, the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage, and an extra lower one.
An example network structure, containing 4 parallel sub-
networks, is given as follows,
N_11 → N_21 → N_31 → N_41
     ↘ N_22 → N_32 → N_42
           ↘ N_33 → N_43
                 ↘ N_44.    (2)
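The parallel layout can be enumerated in the same spirit: stage s contains s subnetworks, one per resolution index from 1 to s, so each new stage keeps every earlier branch and adds one lower-resolution branch. A small sketch (the helper name is illustrative):

```python
def parallel_stage_subnetworks(num_stages: int = 4) -> list[list[str]]:
    """Stage s holds subnetworks N_s1 .. N_ss; branch r runs at
    1 / 2**(r - 1) of the first subnetwork's resolution."""
    return [[f"N{s}{r}" for r in range(1, s + 1)]
            for s in range(1, num_stages + 1)]

for stage in parallel_stage_subnetworks():
    print(stage)
# The last stage keeps all four resolutions in parallel:
# ['N41', 'N42', 'N43', 'N44']
```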