HRNet: High-Resolution Representations Applied to Pixel and Region Labeling

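"High-Resolution Representations for Labeling Pixels and Regions" examines how high-resolution representations are used for pixel and region labeling, and centers on a deep learning model called HRNet (High-Resolution Network). HRNet plays an important role in vision problems such as human pose estimation and semantic segmentation. Conventional deep networks tend to lose image detail as they downsample; HRNet is designed to avoid that loss.

HRNet's core property is that it maintains a high-resolution feature representation through the whole network. It connects high-to-low resolution convolutions in parallel, so high-resolution information is preserved at every stage. This design lets the network process global and local information together and produce strong high-resolution representations. HRNet further strengthens these representations by repeatedly fusing information across the parallel convolutions.

The authors study high-resolution representations in depth and analyze their advantages for pixel-level and region-level tasks. In pixel labeling tasks, high-resolution features capture pixel-level detail more precisely, which matters for tasks such as edge detection and fine-grained classification. In region labeling tasks such as semantic segmentation, high-resolution representations help separate closely adjacent classes and better preserve object structure.

The paper may include an experimental section that compares HRNet with other popular methods, possibly covering accuracy, computational efficiency, and memory consumption. Through the experimental results, the authors show that HRNet performs better on several vision tasks, especially those that require fine detail and structural understanding. The paper may also discuss variants or extensions of HRNet, such as task-specific optimization strategies, the scalability and modular design of the model, and possible future directions such as applying HRNet to autonomous driving or medical image analysis.

In short, "High-Resolution Representations for Labeling Pixels and Regions" explores why high-resolution features matter in computer vision tasks and proposes HRNet, a network architecture that effectively maintains high-resolution information. The work has both theoretical and practical significance for understanding the role of deep learning in image processing, particularly in tasks that demand high precision and rich detail.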

Figure 2. Multi-resolution block: (a) multi-resolution group convolution and (b) multi-resolution convolution. (c) A normal convolution (left) is equivalent to fully-connected multi-branch convolutions (right).

and interlinked CNNs [132], lack careful design on when to start low-resolution parallel streams and how and when to exchange information across parallel streams, and do not use batch normalization and residual connections, thus not showing satisfactory performance.

GridNet [30] is like a combination of multiple U-Nets and includes two symmetric information exchange stages: the first stage only passes information from high-resolution to low-resolution, and the second stage only passes information from low-resolution to high-resolution. This limits its segmentation quality.

3. Learning High-Resolution Representations

The high-resolution network [91], which we name HRNetV1 for convenience, maintains high-resolution representations by connecting high-to-low resolution convolutions in parallel, where there are repeated multi-scale fusions across parallel convolutions.

Architecture. The architecture is illustrated in Figure 1. There are four stages, and the 2nd, 3rd and 4th stages are formed by repeating modularized multi-resolution blocks. A multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution, which are illustrated in Figure 2 (a) and (b). The multi-resolution group convolution is a simple extension of the group convolution, which divides the input channels into several subsets of channels and performs a regular convolution over each subset over different spatial resolutions separately.

The multi-resolution convolution is depicted in Figure 2 (b). It resembles the multi-branch full-connection manner of the regular convolution, illustrated in Figure 2 (c). A regular convolution can be divided into multiple small convolutions, as explained in [122]. The input channels are divided into several subsets, and the output channels are also divided into several subsets. The input and output subsets are connected in a fully-connected fashion, and each connection is a regular convolution. Each subset of output channels is a summation of the outputs of the convolutions over each subset of input channels.
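The equivalence in Figure 2 (c) can be checked numerically. The following is a minimal sketch, not code from the paper: it uses a naive 1-D convolution in pure Python (the paper deals with 2-D convolutions, but the channel-splitting argument is the same), and the helper names `conv1d` and `multi_branch_conv1d`, the channel grouping, and the tensor sizes are all assumptions made for illustration.

```python
import random


def conv1d(x, w):
    """Naive 'valid' 1-D convolution (cross-correlation, as in deep learning).

    x: input,  indexed x[c_in][position]
    w: kernel, indexed w[c_out][c_in][tap]
    """
    k = len(w[0][0])
    length = len(x[0]) - k + 1
    return [
        [
            sum(w[o][i][t] * x[i][p + t] for i in range(len(x)) for t in range(k))
            for p in range(length)
        ]
        for o in range(len(w))
    ]


def multi_branch_conv1d(x, w, in_groups, out_groups):
    """The same convolution computed as fully-connected small convolutions.

    in_groups / out_groups are lists of channel-index lists. Every
    (input subset, output subset) pair gets its own small convolution, and
    each output subset is the sum over all input subsets.
    """
    out = [None] * len(w)
    for og in out_groups:
        acc = None
        for ig in in_groups:
            x_sub = [x[i] for i in ig]
            w_sub = [[w[o][i] for i in ig] for o in og]
            y = conv1d(x_sub, w_sub)
            if acc is None:
                acc = y
            else:
                acc = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, y)]
        for row, o in zip(acc, og):
            out[o] = row
    return out


random.seed(0)
C_IN, C_OUT, K, LENGTH = 4, 6, 3, 8  # illustrative sizes only
x = [[random.randint(-3, 3) for _ in range(LENGTH)] for _ in range(C_IN)]
w = [[[random.randint(-2, 2) for _ in range(K)] for _ in range(C_IN)]
     for _ in range(C_OUT)]

full = conv1d(x, w)
split = multi_branch_conv1d(x, w, in_groups=[[0, 1], [2, 3]],
                            out_groups=[[0, 1, 2], [3, 4, 5]])
print(full == split)  # True: integer arithmetic, so the match is exact
```

Integer inputs are used so that the two results can be compared exactly; with floating-point weights the two computations would agree only up to rounding.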

The differences are two-fold. (i) In a multi-resolution convolution, each subset of channels is over a different resolution. (ii) The connection between input channels and output channels needs to handle the resulting change in resolution. The resolution decrease is implemented in [91] by using several 2-strided 3 × 3 convolutions. The resolution increase is simply implemented in [91] by bilinear (nearest neighbor) upsampling.
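As a rough illustration of the resampling each cross-resolution connection needs, the sketch below counts the stride-2 3 × 3 convolutions or the upsampling factor between two branches. It assumes that each branch has half the side length of the previous one and that the strided convolutions use zero-padding of 1; the function names `strided_conv_out`, `connection`, and `resample_size` are hypothetical and do not come from the paper or its code.

```python
def strided_conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def connection(src, dst):
    """Describe the resampling needed from branch `src` to branch `dst`.

    Branch 0 is the highest resolution; branch r is assumed to have
    1 / 2**r of its side length.
    """
    if dst > src:
        return ("downsample", dst - src)       # number of stride-2 3x3 convs
    if dst < src:
        return ("upsample", 2 ** (src - dst))  # upsampling factor
    return ("identity", 1)


def resample_size(size, src, dst):
    """Side length after moving a feature map from branch src to branch dst."""
    kind, amount = connection(src, dst)
    if kind == "downsample":
        for _ in range(amount):
            size = strided_conv_out(size)
        return size
    if kind == "upsample":
        return size * amount
    return size


print(connection(0, 2))          # ('downsample', 2)
print(connection(3, 1))          # ('upsample', 4)
print(resample_size(64, 0, 2))   # 16
print(resample_size(8, 3, 0))    # 64
```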

Modification. In the original approach, HRNetV1, only the representation (feature maps) from the high-resolution convolutions in [91] is output, which is illustrated in Figure 3 (a). This means that only a subset of output channels from the high-resolution convolutions is exploited and other subsets from low-resolution convolutions are lost.

We make a simple yet effective modification by exploiting other subsets of channels outputted from low-resolution convolutions. The benefit is that the capacity of the multi-resolution convolution is fully explored. This modification only adds a small parameter and computation overhead.

We rescale the low-resolution representations through bilinear upsampling to the high resolution, and concatenate the subsets of representations, illustrated in Figure 3 (b), resulting in the high-resolution representation, which we adopt for estimating segmentation maps/facial landmark heatmaps. In application to object detection, we construct a multi-level representation by downsampling the high-resolution representation with average pooling to multiple levels, which is depicted in Figure 3 (c). We name the two modifications as HRNetV2 and HRNetV2p, respectively, and empirically compare them in Section 4.4.
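The two output heads can be sketched on tiny feature maps in pure Python. This is only an illustration under stated assumptions: nearest-neighbor replication stands in for the bilinear upsampling described above to keep the code short, the pooling factors `(1, 2, 4)` are hypothetical (the text does not state the number of levels here), and the function names are invented for this example.

```python
def upsample_nearest(channel, factor):
    """Repeat every pixel `factor` times along both axes (stand-in for bilinear)."""
    return [
        [v for v in row for _ in range(factor)]
        for row in channel
        for _ in range(factor)
    ]


def avg_pool(channel, factor):
    """Non-overlapping factor x factor average pooling."""
    h, w = len(channel), len(channel[0])
    return [
        [
            sum(channel[y + dy][x + dx]
                for dy in range(factor) for dx in range(factor)) / (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def hrnetv2_head(branches):
    """Upsample every branch to the first branch's size, then concatenate channels.

    branches[r] is a list of channels; each channel is a 2-D list whose side
    is 1 / 2**r of the highest-resolution branch.
    """
    out = []
    for r, branch in enumerate(branches):
        out.extend(upsample_nearest(ch, 2 ** r) for ch in branch)
    return out


def hrnetv2p_head(branches, factors=(1, 2, 4)):
    """Average-pool the concatenated representation to several levels."""
    high = hrnetv2_head(branches)
    return [[avg_pool(ch, f) for ch in high] for f in factors]


def constant_map(side, value):
    return [[float(value)] * side for _ in range(side)]


C = 1  # toy width; branch r holds C * 2**r channels of side 8 / 2**r
branches = [[constant_map(8 // 2 ** r, r) for _ in range(C * 2 ** r)]
            for r in range(4)]

fused = hrnetv2_head(branches)
print(len(fused), len(fused[0]), len(fused[0][0]))          # 15 8 8
pyramid = hrnetv2p_head(branches)
print([(len(level), len(level[0])) for level in pyramid])   # [(15, 8), (15, 4), (15, 2)]
```

With widths C, 2C, 4C, and 8C the concatenation has 15C channels, which is why the toy example with C = 1 yields 15 channels at every level.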

Instantiation. We instantiate the network in a manner similar to HRNetV1 [91]. The network starts from a stem that consists of two strided 3 × 3 convolutions decreasing the resolution to 1/4. The 1st stage contains 4 residual units where each unit is formed by a bottleneck with the width 64, and is followed by one 3 × 3 convolution reducing the width of feature maps to C. The 2nd, 3rd, 4th stages contain 1, 4, 3 multi-resolution blocks, respectively. The widths (number of channels) of the convolutions of the four resolutions are C, 2C, 4C, and 8C, respectively. Each branch in the multi-resolution group convolution contains 4 residual units and each unit contains two 3 × 3 convolutions in each resolution.
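The numbers in this paragraph can be tabulated with a few lines of arithmetic. The sketch below is an illustration only: it assumes each additional branch halves the resolution of the previous one (1/4, 1/8, 1/16, 1/32 of the input), and the width `C = 32`, the 512 × 1024 input size, and the name `branch_layout` are illustrative choices rather than values prescribed by this section.

```python
from fractions import Fraction

BLOCKS_PER_STAGE = {2: 1, 3: 4, 4: 3}  # multi-resolution blocks per stage (from the text)
UNITS_PER_BRANCH = 4                   # residual units per branch in one block
CONVS_PER_UNIT = 2                     # 3x3 convolutions per residual unit


def branch_layout(C, height, width):
    """Channels and spatial size of the four parallel resolutions."""
    layout = []
    for r in range(4):
        scale = Fraction(1, 4 * 2 ** r)  # assumed: 1/4, 1/8, 1/16, 1/32
        layout.append({
            "channels": C * 2 ** r,      # C, 2C, 4C, 8C
            "scale": scale,
            "size": (int(height * scale), int(width * scale)),
        })
    return layout


C = 32  # illustrative width
layout = branch_layout(C, height=512, width=1024)
for branch in layout:
    print(branch["channels"], branch["scale"], branch["size"])

# Concatenating all four branches gives C + 2C + 4C + 8C = 15C channels.
total = sum(b["channels"] for b in layout)
print(total, total == 15 * C)  # 480 True

# 3x3 convolutions inside the residual units of one branch that runs
# through stages 2 to 4 (fusion convolutions are not counted).
convs = sum(BLOCKS_PER_STAGE.values()) * UNITS_PER_BRANCH * CONVS_PER_UNIT
print(convs)  # 64
```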

In applications to semantic segmentation and facial landmark detection, we mix the output representations (Figure 3 (b)) from all the four resolutions through a 1 × 1 convolution, and produce a 15C-dimensional representation. Then, we pass the mixed representation at each position to a linear classifier/regressor with the softmax/MSE loss to predict the segmentation maps/facial landmark heatmaps. For semantic segmentation, the segmentation maps are upsampled (4 times) to the input size by bilinear upsampling for both training and testing. In application to object detection,

https://github.com/leoxiaobin/deep-high-resolution-net.pytorch
