4 Chunfang Deng, Mengmeng Wang, Liang Liu, and Yong Liu
pre-defined anchor boxes. Recently, anchor-free frameworks [13,38,31,39] have also
become increasingly popular. Despite the development of deep object detectors,
small object detection remains an unsolved challenge. Dilated convolution [34]
is introduced in [23,17,16] to augment receptive fields for multi-scale detection.
However, general detectors tend to focus more on improving the performance
of easier large instances, since the metric of general object detection is the
average precision over all scales. Detectors specialized for small objects still need more
exploration.
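
As a rough illustration of how dilated convolution enlarges receptive fields, the sketch below computes the effective kernel size of a dilated kernel and the receptive field of a stack of stride-1 convolutions. The function names and layer configurations are hypothetical and chosen only for illustration; they are not taken from the methods cited above.

```python
from typing import List, Tuple


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a kernel whose taps are `dilation` pixels apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs (assumed stride 1),
    ordered from the input to the output of the stack.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the receptive field by (effective size - 1).
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf


# Hypothetical three-layer stacks of 3x3 convolutions.
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
print(plain, dilated)
```

Under these assumptions, the dilated stack covers a wider input region than the plain stack with the same number of weights per layer, which is the property exploited for multi-scale detection.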
2.2 Cross-Scale Features
Utilizing cross-scale features is an effective way to alleviate the problem arising
from object scale variation. Building image pyramids is a traditional approach to
generating cross-scale features. Using features from different layers of the network
is another cross-scale practice. SSD [24] and MS-CNN [4] detect objects
of different scales on different layers of the CNN backbone. FPN [19] constructs
feature pyramids by merging features from lower layers and higher layers via
a top-down pathway. Following FPN, FPN variants explore more information
pathways in feature pyramids. PANet [22] adds an extra bottom-up pathway to
pass shallow localization information up. G-FRNet [1] introduces gate units on
the pathway, which pass crucial information and block ambiguous information.
NAS-FPN [6] searches for an optimal pathway configuration using AutoML. Although
these FPN variants improve the performance of multi-scale object detection, they
continue to use the same number of layers as the original FPN. These layers are
not well suited to small object detection, so performance on small objects
remains poor.
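
The top-down merging used by FPN-style pyramids can be sketched as follows. This is a deliberately simplified, single-channel version on nested lists: it keeps only nearest-neighbor 2x upsampling and element-wise addition, and omits the 1x1 lateral and 3x3 smoothing convolutions of the full FPN. All names and the toy feature maps are hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]


def upsample_nearest_2x(fmap: FeatureMap) -> FeatureMap:
    """Nearest-neighbor upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two feature maps of identical size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_merge(features: List[FeatureMap]) -> List[FeatureMap]:
    """Merge backbone features ordered fine to coarse.

    Assumes each level has exactly half the resolution of the previous one.
    """
    merged = [features[-1]]  # the coarsest level is passed through unchanged
    for lateral in reversed(features[:-1]):
        merged.append(add_maps(lateral, upsample_nearest_2x(merged[-1])))
    merged.reverse()  # return in the same fine-to-coarse order as the input
    return merged


# Toy single-channel backbone outputs at 4x4, 2x2, and 1x1 resolution.
c2 = [[1.0] * 4 for _ in range(4)]
c3 = [[10.0] * 2 for _ in range(2)]
c4 = [[100.0]]
p2, p3, p4 = top_down_merge([c2, c3, c4])
```

In this sketch every pyramid level receives information from all coarser levels, but the number of levels stays fixed by the backbone, which is the limitation noted above for small objects.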
2.3 Super-Resolution in Object Detection
Some studies introduce SR to object detection, since small object detection
always benefits from larger scales. Image-level SR is adopted in some specific
situations where extremely small objects exist, such as satellite images [15] and images
with crowded tiny faces [2]. However, large-scale images are burdensome for
subsequent networks. Instead of super-resolving the whole image, SOD-MTGAN [3]
super-resolves only the RoI regions, but a large number of RoIs still requires
considerable computation. Another approach is to super-resolve features directly.
Li et al. [14] use Perceptual GAN to enhance features of small objects with the
characteristics of large objects. STDN [37] employs sub-pixel convolution on top
layers of DenseNet [12] to detect small objects while reducing network
parameters. Noh et al. [25] super-resolve the whole feature map and introduce
a supervision signal into the training process. Nevertheless, the above-mentioned
feature SR methods all rely on restricted information from a single feature map.
Recent reference-based SR methods [35,36] are capable of enhancing SR images
with textures or contents from reference images. Inspired by reference-based
SR, we design a novel module that super-resolves features under the reference of