Pattern Recognition 145 (2024) 109878
Available online 18 August 2023
0031-3203/© 2023 Elsevier Ltd. All rights reserved.
Feature disentanglement in one-stage object detection
Wenjie Lin a, Jun Chu a,∗, Lu Leng a, Jun Miao a, Lingfeng Wang b
a Key Laboratory of Jiangxi Province for Image Processing and Pattern Recognition, Nanchang Hangkong University, Nanchang 330063, China
b College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
ARTICLE INFO
Keywords:
Object detection
Feature misalignment
Response alignment
Feature disentanglement
Soft sampling
ABSTRACT
In this paper, an enhanced disentanglement module is proposed to address feature misalignment caused by
inherently irreconcilable conflicts between classification and regression tasks in Convolutional Neural Network-
based object detectors. The proposed method disentangles features in the feature pyramid network (FPN) at the
neck of the architecture. In addition, a response alignment strategy is proposed to reduce inconsistent responses
and suppress inferior predictions. Extensive experiments are performed on the MS COCO and PASCAL VOC
datasets with different backbones, confirming that the proposed method improves performance significantly.
The proposed method has two main advantages over existing solutions: features are disentangled at the
neck instead of at the head, enabling comprehensive resolution of feature misalignment, and independent
outputs of the two tasks after feature disentanglement are avoided, thereby preventing response inconsistencies.
1. Introduction
Recently, convolutional neural networks (CNNs) [1] have been
widely adopted for object detection, with satisfactory performance.
CNN-based object detection models can be classified into two-stage [2,3]
and one-stage detectors [4,5]. Two-stage detectors with region proposal
mechanisms typically exhibit better accuracy. One-stage detectors, on the
other hand, balance speed and accuracy; therefore, they are
commonly used in practical applications.
Unfortunately, an inherently irreconcilable conflict called feature
misalignment occurs between the classification and regression tasks
in object detection architectures, which degrades detection accuracy.
To address this conflict, most modern detectors use two task-specific,
parameter-independent branches (separate head), instead of a single
parameter-sharing branch (shared head), to infer object categories and
bounding boxes. For instance, RetinaNet [4] introduced a separate head
consisting of two lightweight fully convolutional networks.
In [6], Wu et al. studied the effects of using
separate and shared heads on performance. The authors concluded
that a separate head comprising a fully connected network for
classification and a convolutional network for regression yielded the
best performance. However, even in architectures using a separate head,
features are produced based on the same proposal generated by the
region proposal network (RPN); therefore, the conflict persists. Song
et al. [7] employed deformable pooling in the separate head to construct a
task-aware spatial disentanglement head based on Faster R-CNN [2],
which encodes the different features of the two tasks based on the same
proposal in the spatial dimension. However, features are entangled in
the feature pyramid network (FPN) of the neck before being transmitted
into the head; therefore, this conflict cannot be completely overcome
using the aforementioned method.
∗ Corresponding author.
E-mail addresses: chuj@nchu.edu.cn (J. Chu), leng@nchu.edu.cn (L. Leng).
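The distinction between a shared head and a separate head can be illustrated with a minimal sketch. It assumes a single per-location feature vector and plain linear layers standing in for the convolutional towers of real detectors; the class names `SharedHead` and `SeparateHead`, the helper functions, and all dimensions are hypothetical and are not taken from the paper or from any cited detector.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def init(rows, cols, rng):
    """Random weight matrix; stands in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class SharedHead:
    """One tower (shared parameters) feeds both task outputs."""

    def __init__(self, dim, num_classes, rng):
        self.tower = init(dim, dim, rng)            # shared by both tasks
        self.cls_out = init(num_classes, dim, rng)  # class scores
        self.box_out = init(4, dim, rng)            # 4 box offsets

    def __call__(self, feat):
        hidden = relu(linear(self.tower, feat))
        return linear(self.cls_out, hidden), linear(self.box_out, hidden)


class SeparateHead:
    """Two parameter-independent towers, one per task."""

    def __init__(self, dim, num_classes, rng):
        self.cls_tower = init(dim, dim, rng)        # classification only
        self.box_tower = init(dim, dim, rng)        # regression only
        self.cls_out = init(num_classes, dim, rng)
        self.box_out = init(4, dim, rng)

    def __call__(self, feat):
        cls_hidden = relu(linear(self.cls_tower, feat))
        box_hidden = relu(linear(self.box_tower, feat))
        return linear(self.cls_out, cls_hidden), linear(self.box_out, box_hidden)


rng = random.Random(0)
# Hypothetical neck feature at one spatial location (dimension 8).
feat = [rng.uniform(-1.0, 1.0) for _ in range(8)]

shared = SharedHead(dim=8, num_classes=3, rng=rng)
separate = SeparateHead(dim=8, num_classes=3, rng=rng)

cls_shared, box_shared = shared(feat)
cls_separate, box_separate = separate(feat)
```

Note that both heads consume the same input `feat`. In this simplified picture, a separate head decouples the task parameters but not the incoming feature, which is the limitation the paragraph above attributes to head-level disentanglement.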
In this study, we extend the feature-disentanglement operation from
the head to the neck to address this conflict. To this end, an enhanced
disentanglement module (EDM) is proposed to replace the conventional
FPN. As depicted in Fig. 1, EDM produces richer semantic features for
classification and more distinct edge features around object boundaries
than FPN, which facilitates regression. Although feature
disentanglement is typically conducted on two-stage detectors, the RPN
proposal is uncorrelated with the category and favors the regression task.
Therefore, disentanglement of the RPN proposal lacks semantic
information. Fully Convolutional One-Stage Object Detection (FCOS) [5],
a popular and representative one-stage detector, exhibits satisfactory
performance and is easily modified [8]; therefore, it is selected as the
baseline for the evaluation of EDM.
Existing feature disentanglement methods output separate responses
of good quality for classification and regression, but the features of
the two tasks are independent after disentanglement. Therefore, the
responses of the two tasks at the same location are typically inconsistent. As
a result, some inferior predictions exhibit high classification
confidence (score) but low regression accuracy (intersection-over-union,
IoU) at the same spatial point. To resolve the problem of inconsistent
responses, examples that perform well in both classification and regression
should be fully leveraged. To ensure joint representation [9,
https://doi.org/10.1016/j.patcog.2023.109878
Received 30 August 2021; Received in revised form 3 August 2023; Accepted 8 August 2023