ASD-SLAM: A Novel Adaptive-Scale Descriptor Learning for Visual SLAM
Taiyuan Ma¹, Yafei Wang¹, Zili Wang², Xulei Liu¹ and Huimin Zhang¹
Abstract— Visual Odometry and Simultaneous Localization and Mapping (SLAM) are widely used in autonomous driving. In traditional keypoint-based visual SLAM systems, the feature matching accuracy of the front end plays a decisive role and becomes the bottleneck restricting positioning accuracy, especially in challenging scenarios such as viewpoint variation and highly repetitive scenes. Thus, increasing the discriminability and matchability of feature descriptors is important for improving the positioning accuracy of visual SLAM. In this paper, we propose a novel adaptive-scale triplet loss function and apply it to a triplet network to generate an adaptive-scale descriptor (ASD). Based on ASD, we design our monocular SLAM system (ASD-SLAM), a deep-learning-enhanced system built on the state-of-the-art ORB-SLAM system. The experimental results show that ASD achieves better performance on the UBC benchmark dataset, and at the same time the ASD-SLAM system outperforms current popular visual SLAM frameworks on the KITTI Odometry Dataset.
I. INTRODUCTION
Feature matching is one of the key steps in Simultaneous Localization and Mapping (SLAM), and it in turn depends on the quality of the descriptors. Descriptors are feature abstractions of the original image pixels. Effective descriptors should be able to cope with image transformations, illumination changes and so on while describing the image features. Over the past decade, research focused on hand-crafted keypoint descriptors such as SIFT [1], SURF [2] and ORB [3]. These descriptors still play important roles in current popular visual SLAM frameworks such as ORB-SLAM2 [4]. Among these hand-crafted descriptors, SIFT achieves higher matching precision but is computationally expensive.
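To make the role of the descriptor concrete, consider how matching is typically performed: each keypoint descriptor in one image is compared to all descriptors in the other, and a candidate pair is accepted only if the best match is clearly closer than the second best. The following minimal sketch (illustrative only; the function name and the 0.8 ratio threshold are our own choices, not part of any cited system) implements this nearest-neighbor search with a ratio test over unit-norm descriptors:

    import numpy as np

    def match_descriptors(desc_a, desc_b, ratio=0.8):
        """Nearest-neighbor descriptor matching with a ratio test.

        desc_a: (N, D) unit-norm descriptors from image A.
        desc_b: (M, D) unit-norm descriptors from image B (M >= 2).
        Returns (i, j) index pairs whose best match is unambiguous.
        """
        # For unit vectors, ||a - b||^2 = 2 - 2 * a.b, so the dot
        # product yields Euclidean distances cheaply.
        dists = np.sqrt(np.maximum(2.0 - 2.0 * desc_a @ desc_b.T, 0.0))
        matches = []
        for i, row in enumerate(dists):
            j1, j2 = np.argsort(row)[:2]      # best and second-best candidates
            if row[j1] < ratio * row[j2]:     # reject ambiguous matches
                matches.append((i, j1))
        return matches

The more discriminative the descriptor, the more true correspondences survive the ratio test, which is precisely what the front end of a keypoint-based SLAM system needs.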
The recent rise of deep learning has created the opportunity to develop learning-based, data-driven techniques for keypoint description. According to [8], descriptors produced by trained CNNs outperform hand-crafted descriptors in terms of their invariance properties in patch verification tasks. Among the CNN-based methods for keypoint description [5-7], [11-17], the best-known models are DeepDesc [5], L2-Net [6], CS L2-Net [6] and HardNet [7]; like SIFT and ORB, they produce 128- or 256-dimensional unit feature vectors. Studies of keypoint description with trained CNNs invariably compare against hand-crafted descriptors, and reach the common conclusion that the learned descriptors have superior invariance properties [8].
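As a rough illustration of this family of models (a sketch under our own assumptions; layer sizes are illustrative and not the exact L2-Net or HardNet architecture), such a descriptor is a small convolutional network over grayscale patches whose output is L2-normalized to a fixed-length unit vector:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchDescriptor(nn.Module):
        """L2-Net/HardNet-style CNN: 32x32 grayscale patch -> 128-D unit vector.

        Layer sizes are illustrative, not the exact published architectures.
        """
        def __init__(self, dim=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
                nn.Conv2d(128, dim, 8),  # 8x8 feature map -> 1x1 global projection
            )

        def forward(self, patch):
            x = self.features(patch).flatten(1)  # (B, dim)
            # Unit L2 norm makes descriptors directly comparable by
            # Euclidean distance, like a normalized SIFT vector.
            return F.normalize(x, p=2, dim=1)

Because the output is a unit vector, it can be dropped into the same nearest-neighbor matching pipeline sketched above.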
¹T. Ma, Y. Wang, X. Liu and H. Zhang are with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (corresponding author: Yafei Wang, e-mail: wyfjlu@sjtu.edu.cn).
²Z. Wang is with Xiao Peng, Guangzhou, China.
Although these learning-based descriptors achieve good performance in patch verification tasks, they are not popular in practical applications. In particular, according to recent research [9], in some complicated tasks such as SFM, traditional hand-crafted features (SIFT [1] and its variants [10]) still prevail over the learned ones. The main reason is that most studies did not consider the specific requirements of applications such as SLAM and SFM when designing their loss functions, which makes the resulting descriptors difficult to apply in those settings. Traditionally, most studies focus on data augmentation or on building more suitable datasets to improve robustness to illumination and viewpoint changes in practical applications, and ignore the importance of the loss function.
For example, most learning-based methods adopt Siamese losses [5], [11-13] or triplet losses [6], [7], [14-17], which aim to reduce the distance between similar image patches and increase the distance between dissimilar ones. Triplet losses are generally reported to perform better than Siamese losses [17].
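To make these objectives concrete, a generic margin-based triplet loss over L2-normalized descriptors can be sketched as follows (this is the standard formulation, not our adaptive-scale loss, which is introduced later):

    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        """Generic margin-based triplet loss on (B, D) descriptor batches.

        (anchor, positive) are descriptors of matching patches;
        negative is a non-matching patch for the same anchor.
        """
        d_pos = F.pairwise_distance(anchor, positive)  # matching-pair distances
        d_neg = F.pairwise_distance(anchor, negative)  # non-matching distances
        # Only the relative gap d_neg - d_pos is pushed past `margin`;
        # the absolute scale of the distances themselves is unconstrained.
        return F.relu(d_pos - d_neg + margin).mean()

Note that this loss constrains only the gap between the positive and negative distances, leaving the absolute distance scale free, which leads directly to the problem discussed next.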
However, triplet losses suffer from scale uncertainty [18], which is fatal for feature matching across multiple frames in SLAM and SFM. Therefore, in order to enable the descriptor to adapt to the feature matching of consecutive frames in SLAM, we propose an adaptive-scale triplet loss function and apply it to a triplet network, which better resolves the scale uncertainty problem and yields our adaptive-scale descriptor (ASD).
Moreover, by replacing the front end of the traditional visual SLAM framework with ASD, we design a deep-learning-enhanced SLAM system (ASD-SLAM). We separately evaluate the performance of ASD and the positioning accuracy of ASD-SLAM on public datasets. The experimental results show that ASD achieves better performance in patch verification tasks, and that the positioning results of ASD-SLAM are more accurate than those of influential monocular SLAM systems such as ORB-SLAM and LDSO. In addition, ASD is not limited to SLAM; it can also be extended to other similar fields such as SFM. In summary, our main contributions¹ are the following:
• We propose an adaptive-scale triplet loss function and apply it to a triplet network to generate ASD, which achieves state-of-the-art performance on the public Brown dataset.
• We design a deep-learning-enhanced SLAM system (ASD-SLAM), which obtains better results than state-of-the-art visual SLAM systems such as ORB-SLAM and LDSO.
¹https://github.com/mataiyuan/ASD-SLAM?files=1