SuperGlue: Learning Feature Matching with Graph Neural Networks
Paul-Edouard Sarlin
Daniel DeTone
Tomasz Malisiewicz
Andrew Rabinovich
ETH Zurich
Magic Leap, Inc.
Abstract
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at
1. Introduction
Correspondences between points in images are essential for estimating the 3D structure and camera poses in geometric computer vision tasks such as Simultaneous Localization and Mapping (SLAM) and Structure-from-Motion (SfM). Such correspondences are generally estimated by matching local features, a process known as data association. Large viewpoint and lighting changes, occlusion, blur, and lack of texture are factors that make 2D-to-2D data association particularly challenging.
In this paper, we present a new way of thinking about the feature matching problem. Instead of learning better task-agnostic local features followed by simple matching heuristics and tricks, we propose to learn the matching process from pre-existing local features using a novel neural architecture called SuperGlue. In the context of SLAM, which typically [7] decomposes the problem into the visual feature extraction front-end and the bundle adjustment or pose estimation back-end, our network lies directly in the middle – SuperGlue is a learnable middle-end (see Figure 1).
[Figure 1 diagram labels: Detector & Descriptor; Deep Front-End; Deep Middle-End Matcher; Back-End Optimizer]
Figure 1: Feature matching with SuperGlue. Our approach establishes pointwise correspondences from off-the-shelf local features: it acts as a middle-end between handcrafted or learned front-end and back-end. SuperGlue uses a graph neural network and attention to solve an assignment optimization problem, and handles partial point visibility and occlusion elegantly, producing a partial assignment.
In this work, learning feature matching is viewed as finding the partial assignment between two sets of local features. We revisit the classical graph-based strategy of matching by solving a linear assignment problem, which, when relaxed to an optimal transport problem, can be solved differentiably. The cost function of this optimization is predicted by a Graph Neural Network (GNN). Inspired by the success of the Transformer [55], it uses self- (intra-image) and cross- (inter-image) attention to leverage both spatial relationships of the keypoints and their visual appearance. This formulation enforces the assignment structure of the predictions while enabling the cost to learn complex priors, elegantly handling occlusion and non-repeatable keypoints. Our method is trained end-to-end from image pairs – we learn priors for pose estimation from a large annotated dataset, enabling SuperGlue to reason about the 3D scene and the assignment. Our work can be applied to a variety of multiple-view geometry problems that require high-quality feature correspondences (see Figure
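To make the relaxation described above concrete, the following is a minimal sketch of a partial assignment computed as an entropy-regularized optimal transport problem. It is an illustration under stated assumptions, not the authors' implementation. It uses Sinkhorn iterations, a standard solver for this kind of problem, which this section does not itself name. The function names, the fixed `unmatched_score`, the extra "unmatched" row and column, the matching threshold, and the toy score matrix are all hypothetical; in the method described here the scores would be predicted by the GNN.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sinkhorn_partial_assignment(scores, unmatched_score=0.0, iters=100):
    """Soft partial assignment between M and N points.

    scores: M x N list of lists, higher means a better match (assumed input).
    unmatched_score: score of the extra row/column (an assumption of this sketch).
    Returns an (M+1) x (N+1) matrix; the last row and column hold the mass
    of points that are left unmatched.
    """
    m, n = len(scores), len(scores[0])
    # Augment the scores with one extra row and column for unmatched points.
    s = [row[:] + [unmatched_score] for row in scores]
    s.append([unmatched_score] * (n + 1))
    # Log marginals: each real point carries mass 1; the extra row (column)
    # can absorb every point of the other set.
    log_a = [0.0] * m + [math.log(n)]
    log_b = [0.0] * n + [math.log(m)]
    u = [0.0] * (m + 1)
    v = [0.0] * (n + 1)
    for _ in range(iters):
        # Alternately rescale rows and columns to meet the marginals.
        u = [log_a[i] - logsumexp([s[i][j] + v[j] for j in range(n + 1)])
             for i in range(m + 1)]
        v = [log_b[j] - logsumexp([s[i][j] + u[i] for i in range(m + 1)])
             for j in range(n + 1)]
    return [[math.exp(s[i][j] + u[i] + v[j]) for j in range(n + 1)]
            for i in range(m + 1)]


def mutual_matches(p, threshold=0.2):
    """Keep pairs (i, j) that are mutual best matches above a threshold."""
    m, n = len(p) - 1, len(p[0]) - 1
    matches = []
    for i in range(m):
        j = max(range(n), key=lambda c: p[i][c])
        i_back = max(range(m), key=lambda r: p[r][j])
        if i_back == i and p[i][j] > threshold:
            matches.append((i, j))
    return matches


# Toy example: 2 points in image A, 3 points in image B.
# The third point of B has no good counterpart in A.
toy_scores = [[5.0, 0.0, 0.0],
              [0.0, 5.0, 0.0]]
assignment = sinkhorn_partial_assignment(toy_scores, unmatched_score=1.0)
print(mutual_matches(assignment))
```

Every step is a composition of sums, exponentials, and logarithms, so the same computation written in an automatic-differentiation framework would allow gradients to flow back to the scores, which is what allows the cost to be learned. In the toy example, most of the mass of the third point of B ends up in the extra row, so that point is reported as unmatched.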
Work done at Magic Leap, Inc. for a Master’s degree. The author thanks
his academic supervisors: Cesar Cadena, Marcin Dymczyk, Juan Nieto.