Fig. 2. A Faster R-CNN with Feature Pyramid Network. The input image is fed to the backbone network, then the feature pyramid network (light
yellow) computes multi-scale features. The region proposal network proposes candidate boxes, which are filtered with non-maximum suppression
(NMS). Features for the remaining boxes are pooled with RoIAlign and fed to the box head, which predicts object category and refined box
coordinates. Finally, redundant and low-quality predictions are removed with NMS. Blue labels are class names in the detectron2 implementation.
Figure courtesy of Hiroto Honda. https://medium.com/@hirotoschwert/digging-into-detectron-2-47b2e794fabd
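Both detection pipelines above use non-maximum suppression (NMS) to prune redundant overlapping boxes. As an illustration only (function names and the corner-format box convention are ours, not detectron2's API), a minimal greedy NMS with its IoU helper can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it above the threshold, and repeat on the remainder.
    Returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Production detectors use batched, vectorized implementations (e.g. on the GPU), but the greedy logic is the same.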
Fig. 3. The DETR object detector. The image is fed to the backbone, then positional encodings are added to the features and fed to the transformer encoder. The decoder takes as input object query embeddings, attends to the encoded representation, and outputs a fixed number of object detections, which are finally thresholded, without the need for NMS [9]. Image courtesy of Carion et al. [9].
mean average precision (mAP) evaluation metric and the differences between Pascal VOC and MS COCO implementations.
4 FEW-SHOT OBJECT DETECTION
Informally, few-shot object detection (FSOD) is the task of learning to detect new object categories from only one or a few training examples per class. In this section, we describe the FSOD framework, how it differs from few-shot classification, common datasets and evaluation metrics, and FSOD methods. We provide a taxonomy of popular few-shot and self-supervised object detection methods in Figure 1.
4.1 FSOD Framework
We formally introduce the dominant FSOD framework, as formalized by Kang et al. [54] (Figure 4). FSOD partitions objects into two disjoint sets of categories: base (also called known or source) classes, for which a large number of training examples is available; and novel (also called unseen or target) classes, for which only a few training examples (shots) per class are available. In the vast majority of the FSOD literature, the object detector's backbone (usually a ResNet-50 or ResNet-101) is assumed to have been pretrained on an image classification dataset such as ImageNet. Then, the FSOD task is formalized as follows:
(1) Base training.² Annotations are given only for the base classes, with a large number of training examples per class (bikes in the example). We train the FSOD method on the base classes.
2. In the context of self-supervised learning, base training may also be referred to as finetuning or training. This should not be confused with base training in the meta-learning framework; rather, it is similar to the meta-training phase [32].
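The base/novel partition above can be made concrete with a small sketch. Assuming COCO-style per-instance annotation dicts, the helper below (the function name, argument names, and dict keys are illustrative, not from any FSOD codebase) builds the base training set and a K-shot novel set:

```python
import random
from collections import defaultdict

def split_fsod(annotations, base_classes, novel_classes, k_shots, seed=0):
    """Partition detection annotations into a base set (all examples of
    the base classes) and a novel set (at most k_shots annotated
    instances per novel class).

    `annotations` is a list of dicts with at least a "category" key,
    e.g. {"image_id": 3, "category": "bike", "bbox": [x, y, w, h]}.
    """
    rng = random.Random(seed)  # fixed seed: shot sampling affects results
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category"]].append(ann)

    # Base training: every annotation of every base class.
    base_set = [a for c in base_classes for a in by_class[c]]

    # Few-shot set: sample at most k_shots instances per novel class.
    novel_set = []
    for c in novel_classes:
        pool = by_class[c]
        novel_set.extend(rng.sample(pool, min(k_shots, len(pool))))
    return base_set, novel_set
```

In practice, benchmarks such as those of Kang et al. [54] fix the base/novel splits and the sampled shots so that methods are compared on identical data.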