simply a few bilinear-upsampling [11] or transpose convo-
lution [72] layers. (iii) Combination with dilated convolu-
tions. In [27, 51, 35], dilated convolutions are adopted in
the last two stages in the ResNet or VGGNet to eliminate
the spatial resolution loss, followed by a light low-to-high process to further increase the resolution, thus avoiding the expensive computational cost of using only dilated convolutions [11, 27, 51]. Figure 2 depicts four representative pose
estimation networks.
Multi-scale fusion. The straightforward way is to feed
multi-resolution images separately into multiple networks
and aggregate the output response maps [64]. Hour-
glass [40] and its extensions [77, 31] combine low-level
features in the high-to-low process into the same-resolution
high-level features in the low-to-high process progres-
sively through skip connections. In cascaded pyramid net-
work [11], a globalnet combines low-to-high level features
in the high-to-low process progressively into the low-to-
high process, and then a refinenet combines the low-to-high
level features that are processed through convolutions. Our
approach repeats multi-scale fusion, which is partially in-
spired by deep fusion and its extensions [67, 73, 59, 80, 82].
Intermediate supervision. Intermediate supervision or
deep supervision, developed early for image classification [34, 61], is also adopted to help train deep networks and improve the heatmap estimation quality,
e.g., [69, 40, 64, 3, 11]. The hourglass approach [40] and
the convolutional pose machine approach [69] process the
intermediate heatmaps as the input or a part of the input of
the remaining subnetwork.
Our approach. Our network connects high-to-low sub-
networks in parallel. It maintains high-resolution repre-
sentations through the whole process for spatially precise
heatmap estimation. It generates reliable high-resolution
representations through repeatedly fusing the representa-
tions produced by the high-to-low subnetworks. Our ap-
proach is different from most existing works, which need
a separate low-to-high upsampling process and aggregate
low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in terms of computational complexity and the number of parameters.
There are related multi-scale networks for classification
and segmentation [5, 8, 74, 81, 30, 76, 55, 56, 24, 83, 52, 18]. Our work is partially inspired by some of
them [56, 24, 83, 55], and there are clear differences making
them not applicable to our problem. Convolutional neural
fabrics [56] and interlinked CNN [83] fail to produce high-
quality segmentation results because of a lack of careful design of each subnetwork (depth, batch normalization) and of multi-scale fusion. The grid network [18], a combination
of many weight-shared U-Nets, consists of two separate fu-
sion processes across multi-resolution representations: in the first stage, information is only sent from high resolution to low resolution; in the second stage, information is only sent from low resolution to high resolution; it is thus less competitive. The multi-scale densenet [24] does not target, and cannot generate, reliable high-resolution representations.
3. Approach
Human pose estimation, a.k.a. keypoint detection, aims
to detect the locations of K keypoints or parts (e.g., elbow,
wrist, etc.) from an image I of size W × H × 3. The state-of-the-art methods transform this problem to estimating K heatmaps of size W′ × H′, {H_1, H_2, . . . , H_K}, where each heatmap H_k indicates the location confidence of the kth keypoint.
We follow the widely-adopted pipeline [40, 72, 11] to
predict human keypoints using a convolutional network,
which is composed of a stem consisting of two strided con-
volutions decreasing the resolution, a main body outputting
the feature maps with the same resolution as its input fea-
ture maps, and a regressor estimating the heatmaps where
the keypoint positions are chosen and transformed to the
full resolution. We focus on the design of the main body
and introduce our High-Resolution Net (HRNet) that is de-
picted in Figure 1.
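As a sanity check on this resolution bookkeeping, the short sketch below (plain Python; `conv_out` and `stem_resolution` are illustrative helpers, under the assumption that the stem's two strided convolutions are 3×3 with stride 2 and padding 1) traces how the stem quarters the input resolution before the main body:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Output spatial size of a convolution (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1

def stem_resolution(w: int, h: int) -> tuple[int, int]:
    """Two stride-2 convolutions in the stem, each halving the resolution."""
    for _ in range(2):
        w, h = conv_out(w), conv_out(h)
    return w, h

print(stem_resolution(256, 192))  # (64, 48)
```

So for a 256 × 192 input, the main body and the estimated heatmaps operate at 64 × 48, i.e., 1/4 of the input resolution; the chosen keypoint positions are then transformed back to the full resolution.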
Sequential multi-resolution subnetworks. Existing net-
works for pose estimation are built by connecting high-to-
low resolution subnetworks in series, where each subnet-
work, forming a stage, is composed of a sequence of con-
volutions and there is a down-sample layer across adjacent
subnetworks to halve the resolution.
Let N_sr be the subnetwork in the sth stage and r be the resolution index (its resolution is 1/2^(r−1) of the resolution of the first subnetwork). The high-to-low network with S (e.g., 4) stages can be denoted as:

N_11 → N_22 → N_33 → N_44.    (1)
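In this serial design, the resolution schedule is fixed by the stage index alone: stage s holds a single subnetwork N_ss at 1/2^(s−1) of the first stage's resolution. A minimal sketch (the function name is illustrative, not from the method itself):

```python
def serial_resolutions(num_stages: int = 4) -> list[float]:
    """Relative resolution of each stage in the serial high-to-low design:
    stage s holds one subnetwork, N_ss, at 1 / 2**(s - 1) of the
    first subnetwork's resolution."""
    return [1 / 2 ** (s - 1) for s in range(1, num_stages + 1)]

print(serial_resolutions())  # [1.0, 0.5, 0.25, 0.125]
```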
Parallel multi-resolution subnetworks. We start from a
high-resolution subnetwork as the first stage, gradually add
high-to-low resolution subnetworks one by one, forming
new stages, and connect the multi-resolution subnetworks
in parallel. As a result, the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage, and an extra lower one.
An example network structure, containing 4 parallel sub-
networks, is given as follows,
N_11 → N_21 → N_31 → N_41
     ↘ N_22 → N_32 → N_42
           ↘ N_33 → N_43
                 ↘ N_44.    (2)
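The parallel layout can be enumerated in the same spirit: stage s contains s subnetworks, one per resolution index from 1 to s, so each new stage keeps every earlier branch and adds one lower-resolution branch. A small sketch (the helper name is illustrative):

```python
def parallel_stage_subnetworks(num_stages: int = 4) -> list[list[str]]:
    """Stage s holds subnetworks N_s1 .. N_ss; branch r runs at
    1 / 2**(r - 1) of the first subnetwork's resolution."""
    return [[f"N{s}{r}" for r in range(1, s + 1)]
            for s in range(1, num_stages + 1)]

for stage in parallel_stage_subnetworks():
    print(stage)
# The last stage keeps all four resolutions in parallel:
# ['N41', 'N42', 'N43', 'N44']
```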