
Object detection via inner-inter relational reasoning network
He Liu, Xiuting You, Tao Wang⁎, Yidong Li
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Article info
Article history:
Received 28 September 2022
Accepted 17 December 2022
Available online 29 December 2022
Keywords:
Object detection
Relational reasoning
Attention model
Abstract
Exploiting relationships between objects and (or) labels under a graph message passing mechanism to facilitate object detection has been widely investigated in recent years. However, these methods heavily rely on hand-crafted graph structures, which may introduce unreliable relationships and in turn hurt object detection performance. To address this issue, we propose a novel object detection framework that fully explores the relational representations of objects and labels under a full attention architecture. Specifically, we directly regard the extracted proposals and candidate labels as two independent sets in the visual feature space and the label embedding space, respectively. We design a self-attention module to discover the inner-relationships within the visual feature space or the label embedding space, and develop a cross-attention module to explore the inter-relationships between the two spaces. Both the inner-relationships and inter-relationships are then utilized to enhance the object features and label embedding representations to facilitate object detection. To validate the proposed framework in improving object detection performance, we embed it into several state-of-the-art baselines and perform extensive experiments on two public benchmarks (Pascal VOC and COCO 2017). The experimental results demonstrate the effectiveness and flexibility of the proposed framework.
© 2023 Elsevier B.V. All rights reserved.
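The following is a minimal sketch, not the authors' released implementation, of the inner/inter relational reasoning described in the abstract. It assumes PyTorch, a shared embedding dimension for proposal features and label embeddings, and residual fusion as the enhancement step; the module names, head count, and fusion scheme are illustrative assumptions.

import torch
import torch.nn as nn

class RelationalReasoning(nn.Module):
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        # Inner-relationships: self-attention within each set (proposals, labels).
        self.vis_self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.lab_self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Inter-relationships: cross-attention between the two spaces.
        self.vis_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.lab_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, proposals, labels):
        # proposals: (B, N, d_model) region-proposal features from the detector
        # labels:    (B, C, d_model) candidate label (category) embeddings
        vis, _ = self.vis_self_attn(proposals, proposals, proposals)  # inner (visual)
        lab, _ = self.lab_self_attn(labels, labels, labels)           # inner (label)
        vis_x, _ = self.vis_cross_attn(vis, lab, lab)                 # proposals attend to labels (inter)
        lab_x, _ = self.lab_cross_attn(lab, vis, vis)                 # labels attend to proposals (inter)
        # Residual fusion is an assumption here, not necessarily the paper's exact scheme.
        return proposals + vis + vis_x, labels + lab + lab_x

The enhanced proposal features would then be fed to the detection heads of the host detector, which is consistent with the claim that the framework can be embedded into existing baselines.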
1. Introduction
As a fundamental problem in the image recognition community, object detection aims to localize and classify candidate bounding boxes extracted from a given image, and has been widely used in many realistic tasks such as visual surveillance [1] and automated driving [2]. In general, object detection methods can be divided into two groups: regression-based methods and region-based methods. Given an image, regression-based detection methods take it as input and directly predict the locations and categories of the objects. In contrast, region-based detection methods generally extract a series of region proposals from a Region Proposal Network (RPN) to indicate the coarse locations of candidate objects, and then pass the region proposals into follow-up learnable modules to predict more precise locations and categories.
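As a concrete illustration of this two-stage, region-based pipeline, the short example below runs a pretrained Faster R-CNN from torchvision; this is a generic usage sketch, not the method proposed in this paper, and the use of torchvision and the "DEFAULT" weights (torchvision >= 0.13) are assumptions of the example.

import torch
import torchvision

# Region-based (two-stage) detection: an internal RPN proposes coarse regions,
# then RoI heads refine their locations and predict categories.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)   # dummy RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])      # one dict per image: boxes, labels, scores

print(outputs[0]["boxes"].shape)  # refined bounding boxes, shape (num_detections, 4)
print(outputs[0]["scores"][:5])   # classification confidences for the top detections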
Previous classic methods, including Faster R-CNN [3], Mask R-CNN [4], YOLO [5] and SSD [6], deal with location regression and classification on the extracted proposals individually and pay little attention to the relationships between them, which limits their representation ability and leads to unsatisfactory performance.
Recently, several works have attempted to introduce relations between instances into object detection via graph message passing mechanisms. For example, Liu et al. [7] explore the relationship between the global scene context and individual objects, and enhance the region features using a recurrent neural network (RNN). Li et al. [8] establish relationships between image feature maps in feature pyramid networks (FPN), and propose a dynamic feature fusion method based on graph convolutional networks (GCNs) to enrich the representation of image feature maps. In addition, several works [9–11] establish spatial position relationships among the region proposals extracted from the RPN, and enhance the features of region proposals via GCNs. Similarly, Li et al. [12] introduce global scene features into a region-based relation graph, which makes the region proposals learn both local and global features and enhances the feature representation of regions. Different from the above methods that mainly focus on exploring relationships within the visual feature space, several works [13,14] establish relationships between category labels on a constructed label graph, and enhance the feature representation of regions by fusing information from neighbors to improve the detection performance of the detector.
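To make the graph message passing mechanism referenced above concrete, the sketch below shows one GCN-style layer operating on a hand-crafted relation graph over region proposals, in the spirit of [9–11]; the class name, normalization, and adjacency construction are illustrative assumptions rather than any specific cited implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalGCNLayer(nn.Module):
    # One round of message passing that enhances each proposal's features
    # with information aggregated from related proposals.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (N, dim) features of N region proposals
        # adj: (N, N) hand-crafted relation graph, e.g. built from spatial overlap
        adj = adj + torch.eye(adj.size(0))                          # add self-loops
        deg_inv = adj.sum(dim=1, keepdim=True).clamp(min=1e-6).reciprocal()
        msg = deg_inv * (adj @ x)                                   # normalized neighbor aggregation
        return F.relu(self.linear(msg))                             # enhanced proposal features

Because the adjacency matrix is constructed heuristically, any noise in it propagates directly into the enhanced features, which is precisely the limitation discussed next.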
Although the above methods have effectively improved the detec-
tion performance, they heavily rely on the heuristically generated
graph structure, which may impose noisy relationships in the graph
⁎ Corresponding author.
E-mail addresses: liuhe1996@bjtu.edu.cn (H. Liu), yxting@bjtu.edu.cn (X. You),
twang@bjtu.edu.cn (T. Wang), ydli@bjtu.edu.cn (Y. Li).
https://doi.org/10.1016/j.imavis.2022.104615