be normalized into a fixed orientation and size, hence scale- and orientation-invariant local descriptors can be extracted afterward. Based on SIFT, other similar floating-point descriptors such as SURF [1], PCA-SIFT [12], and the gradient location and orientation histogram (GLOH) [21] have been proposed. According to the results reported in [21], SIFT and one of its extensions, GLOH, generally outperform the other descriptors.
Beyond the aforementioned floating-point descriptors, many efforts have been made in recent years to design efficient and compact alternatives to SIFT and SURF. Some researchers propose low-bitrate descriptors such as BRIEF [4], BRISK [14], CHoG [5], Edge-SIFT [43], and ORB [26], which are fast both to build and to match.
Each bit of BRIEF [4] is computed from the sign of a simple intensity-difference test between a pair of points sampled from the image patch. Despite its clear advantage in speed, BRIEF suffers in terms of reliability and robustness, as it has limited tolerance to image distortions and transformations.
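As a rough illustration, the sketch below builds a BRIEF-like bit string from pixel-pair tests; the random sampling pattern, descriptor length, and function names are illustrative assumptions, not the exact design of [4].

```python
import numpy as np

def brief_like_descriptor(patch, num_bits=256, seed=0):
    """Toy BRIEF-style descriptor: each bit is the sign of an intensity
    difference between a pair of pixels; the pair pattern is fixed by the
    seed so that every patch is tested at the same locations."""
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    ys = rng.integers(0, h, size=(num_bits, 2))
    xs = rng.integers(0, w, size=(num_bits, 2))
    bits = patch[ys[:, 0], xs[:, 0]] < patch[ys[:, 1], xs[:, 1]]
    return np.packbits(bits.astype(np.uint8))  # e.g., 256 bits -> 32 bytes

def hamming(d1, d2):
    """Binary descriptors are matched by Hamming distance (XOR + popcount)."""
    return int(np.unpackbits(d1 ^ d2).sum())
```

Packing the bits keeps the descriptor compact, and the XOR-based distance is what makes matching such binary descriptors fast in practice.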
The BRISK descriptor [14] first efficiently detects interest points in the scale-space pyramid with a detector inspired by FAST [25] and AGAST [19]. Given the detected keypoints, the BRISK descriptor is then composed as a binary string by concatenating the results of simple brightness comparison tests. The adopted detector provides location, scale, and orientation clues for each interest point; hence BRISK achieves orientation and scale invariance. The ORB descriptor is built on BRIEF but is extracted with a novel corner detector, oFAST [26], and is more robust to rotation and image noise. The authors demonstrate that ORB is significantly faster than SIFT while performing as well in many situations [26]. Compared with SIFT, despite their advantage in speed, these compact descriptors show limitations in descriptive power, robustness, or generality.
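For concreteness, here is a short usage sketch with OpenCV's ORB implementation; the image file names are placeholders and the parameter values are illustrative.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)      # placeholder path
img2 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

orb = cv2.ORB_create(nfeatures=1000)  # oFAST keypoints + oriented BRIEF bits
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# ORB descriptors are binary, so they are compared with the Hamming norm.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "cross-checked ORB matches")
```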
B. Local Feature Based Image Matching and Retrieval
Local descriptors like SIFT [18], SURF [1], ORB [26], etc., were originally proposed for image matching. As discussed by Lowe in [18], SIFT descriptors can be matched with two criteria: 1) the cosine similarity of matched SIFT descriptors should be larger than a threshold, e.g., 0.83; and 2) the similarity ratio between the closest match and the second-closest match should be larger than a threshold, e.g., 1.3. However, these two criteria ignore the spatial clues among keypoints, which have been proven important for visual matching. Some works improve the accuracy of local feature matching by verifying the spatial consistency among matched local features, i.e., correctly matched local descriptors should satisfy a consistent spatial relationship. As discussed in [11], the matching accuracy of SIFT, SURF, and PCA-SIFT is substantially improved with RANSAC [7], which estimates an affine transformation between two images and rejects the matched local descriptors that violate this transformation; a sketch combining both steps is given below.
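The following sketch chains the two steps with OpenCV calls: a nearest-to-second-nearest distance-ratio test (the common form of criterion 2 above) followed by RANSAC-based affine verification. The 0.8 ratio and the 5-pixel reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def match_and_verify(kp1, des1, kp2, des2, ratio=0.8):
    """kp*/des* as returned by, e.g., cv2.SIFT_create().detectAndCompute()."""
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Keep a match only if it is clearly better than the runner-up.
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 3:  # an affine model needs at least 3 correspondences
        return []
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC fits an affine transform and flags matches that violate it.
    _, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=5.0)
    if inliers is None:
        return []
    return [m for m, ok in zip(good, inliers.ravel()) if ok]
```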
Matching raw local descriptors is expensive to compute because each image may contain over one thousand high-dimensional descriptors. Meanwhile, exploiting geometric relationships with full geometric verification like RANSAC [7] improves retrieval precision, but is computationally expensive. Therefore, raw SIFT and RANSAC are not ideal for large-scale applications. An efficient solution is to quantize local descriptors into visual words and generate a bag-of-visual-words (BoW) model, as in the sketch below. However, it has been illustrated that a single visual word cannot preserve the spatial information in images, which has been proven important for visual matching and recognition; many works are therefore conducted to combine multiple visual words with spatial information [30], [42], [40], [47], [46], [45], [41], [6].
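A minimal sketch of this quantization step, assuming a k-means vocabulary built with scikit-learn; the vocabulary size and clustering method are illustrative choices, not those of any cited work.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(training_descriptors, num_words=1000):
    """Cluster sampled local descriptors; the centers act as visual words."""
    return MiniBatchKMeans(n_clusters=num_words,
                           random_state=0).fit(training_descriptors)

def bow_vector(vocabulary, image_descriptors):
    """Quantize each descriptor to its nearest visual word and count."""
    words = vocabulary.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)  # L1-normalized BoW histogram
```

Images are then compared through their normalized histograms rather than raw descriptors, which is what makes retrieval scale to large collections.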
Combining visual words with spatial information may be achieved, for example, by using feature pursuit algorithms such as AdaBoost [32], as demonstrated by Liu et al. [17]. Visual word correlograms and correlations [27], adapted from the color correlogram, are utilized to model the spatial relationships among visual words for object recognition. In [38], visual words are bundled, and the corresponding image indexing and visual word matching algorithms are proposed for large-scale near-duplicate image retrieval. In [47], the spatial configurations of visual words are recorded in the index and then utilized for spatial verification during online retrieval. In a recent work [6], Chu et al. utilize a spatial graph model to estimate the spatial consistency among matched visual words during online retrieval; hence, the visual words violating the spatial consistency can be effectively identified and discarded. Visual phrases preserve extra spatial information by involving multiple visual words, and thus generally present more advantages in large-scale image retrieval [8], [42], [45]. For example, the descriptive visual phrase (DVP) is generated in [42] by grouping and selecting two nearby visual words. Similarly, the visual phrases proposed in [45] are extracted by discovering multiple spatially stable visual words. Generally, considering visual words in groups rather than as single visual words yields stronger discriminative power; a toy sketch of such grouping follows.
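The sketch below pairs nearby visual words into second-order phrases; the pairing radius and the unordered word-id pair representation are illustrative assumptions rather than the exact grouping rules of [42] or [45].

```python
import numpy as np
from itertools import combinations

def second_order_phrases(points_xy, word_ids, radius=30.0):
    """Pair every two keypoints within `radius` pixels into a phrase,
    represented as an order-independent pair of visual word ids."""
    pts = np.asarray(points_xy, dtype=np.float32)
    phrases = set()
    for i, j in combinations(range(len(pts)), 2):
        if np.linalg.norm(pts[i] - pts[j]) <= radius:
            phrases.add(tuple(sorted((word_ids[i], word_ids[j]))))
    return phrases
```

Matching two such phrases requires both word ids to agree, which illustrates the rigidity discussed next.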
Notwithstanding the success of existing visual phrase features, they are still limited in flexibility and repeatability. Currently, each visual phrase is commonly treated as a whole cell, and two phrases are matched only if they contain identical single visual words [42], [45]. Firstly, this means only visual phrases containing the same number of visual words can be considered for matching. Secondly, because of quantization error, visually similar descriptors might be quantized into different visual words. Such quantization error is aggregated in visual word combinations and makes visual phrase matches less likely to occur: roughly speaking, if each single word is quantized consistently with probability p, a phrase of k words matches with probability of only about p^k.
Different from existing visual phrase features, the proposed multi-order visual phrase captures rich spatial clues and allows more flexible matching without sacrificing the repeatability of the BoW model. Thus, it is superior in the aspects of flexibility, quantization error, and efficiency. Our approach also differs from existing spatial verification methods in that it does not compute graph models [6] or require multiple iterations of matrix operations as in [47], and it finishes in a cascade manner. Hence, it has the potential to achieve better efficiency. Moreover, our approach does not simply keep or discard a candidate match, but reasonably estimates a confidence of correctness for it. This helps to improve the recall rate and decrease the quantization error. In the following section, we introduce how we extract multi-order visual phrases from SIFT descriptors and binary ORB descriptors.