优化局部图像描述子的判别学习方法

需积分: 13 40 浏览量更新于2024-07-21 收藏 5.34MB PDF 举报

"这篇论文探讨了从训练数据中学习局部图像特征的方法。研究中提出了一组构建图像描述符的基础模块，这些模块可以组合并联合优化，以最小化最近邻分类器的错误。论文考虑了线性和非线性变换，以及降维技术，并利用线性判别分析（LDA）和鲍威尔最小化等判别学习技术来求解参数。通过这些技术，获得了在低维度下超越现有最优性能的描述符。除了新的实验和特征学习建议外，还提供了一个基于多视图立体数据的新而真实的地面实况数据集。关键词包括：图像特征、局部特征、判别学习、SIFT。" 在计算机视觉领域，局部图像特征匹配已成为识别和注册任务的核心方法。传统的局部特征如SIFT（尺度不变特征变换）虽然表现出色，但通常依赖于手工设计，而这篇论文则关注于从数据中学习这些特征。作者提出了一个框架，允许构建可组合和优化的描述符模块，这是对经典特征提取方法的重要改进。判别学习是论文中的核心概念，它不同于传统的生成模型，更专注于直接优化分类或回归任务的性能。在这里，线性判别分析（LDA）被用于寻找最佳投影方向，使得不同类别的样本在新空间中的分离度最大化。同时，鲍威尔最小化算法用于求解非线性问题的参数，这种优化方法对于处理复杂模型和非凸优化问题特别有用。通过这些学习策略，论文能够获得在保持低维度表示的同时，仍能超越当前最佳性能的图像描述符。低维度意味着更高效的存储和计算，这对于大规模图像处理任务尤其重要。此外，论文提供的新数据集为多视图立体数据，这为研究人员提供了更加真实和具有挑战性的环境来测试和验证新的局部特征学习方法。论文的贡献不仅仅是新的算法和理论，还包括对特征学习的实践建议，以及一个宝贵的、基于现实场景的数据集。这将促进后续研究的发展，推动局部特征领域进一步向前，特别是在3D重建、图像配准、目标识别等领域有着广泛的应用潜力。这项工作为提高图像分析的准确性和效率提供了新的途径，对于推动计算机视觉领域的进步具有重要意义。

Certain of these algorithmic combinations give rise to

published descriptors, but many are untested. Using this

structure allows us to examine the contribution of each

building block in detail and obtain a better covering of the

space of possible algorithms.

Our approach to learning descriptors is therefore to put

together a combination of building blocks and then optimize

the parameters of these blocks using learning to obtain the

best match/no-match classification performance. This con-

trasts with prior attempts to hand tune descriptor para-

meters and helps to put each algorithm on the same footing

so that we can obtain and compare best performances.

Fig. 3 shows the overall learning framework for building

robust local image descriptors. The input is a set of image

patches which may be extracted from the neighborhood of

any interest point detector. The processing stages consist of

the following:

. G-block: Gaussian smoothing is applied to the input

patch.

. T-blocks: We perform a range of nonlinear trans-

formations to the smoothed patch. These include

operations such as angle-quantized gradients and

rectified steerable filters, and typically resemble the

“simple-cell” stage in human visual processing.

. S-blocks/E-blocks: We perform spatial pooling of

the above filter responses. S-blocks use parametrized

pooling regions, E-blocks are nonparametric. This

stage resembles the “complex-cell” operations in

visual processing.

. N-blocks: We normalize the output patch to account

for photometric variations. This stage may option-

ally be followed by another E-block to reduce the

number of dimensions at the output.

In general, the T-block stage extracts useful features from

the data like edge or local frequency information and the

S-block stage pools these features locally to make the

representation insensitive to positional shift. These stages

are similar to the simple/complex cells in the human visual

cortex [36]. It’s important that the T-block stage introduces

some nonlinearity, otherwise the smoothing step amounts

to simply blurring the image. Also, the N-block normal-

ization is critical, as many factors such as lighting,

reflectance, and camera response have a large effect on

the actual pixel values.

These processing stages have been combined into three

different pipelines, as shown in the figure. Each stage has

trainable parameters which are learned using our ground

truth data set of match/nonmatch pairs. In the remainder of

this section, we will take a more detailed look at the

parametrization of each of these building blocks.

3.1 Presmoothing (G-Block)

We smooth the image pixels using a Gaussian kernel of

standard deviation 

as a preprocessing stage to allow the

descriptor to adapt to an appropriate scale relative to the

BROWN ET AL.: DISCRIMINATIVE LEARNING OF LOCAL IMAGE DESCRIPTORS 45

Fig. 1. Generating ground truth correspondences. To generate the ground truth image correspondences needed as input to our algorithms, we use

multiview stereo data provided by Goesele et al. [30]. Interest points are detected in the reference image and transferred to each neighboring

image via the depth map. If the projected point is visible, we look for interest points within a specified range of position, orientation, and scale, and

declare these to be matches. Points lying outside of twice this range are declared to be nonmatches. This is the basic input to our learning

algorithms. (a)-(f) Reference image, neighbor image, reference matches, neighbor matches, depth map, and visibility map, respectively.

剩余14页未读，继续阅读

u010453586

粉丝: 0
资源: 4

优化局部图像描述子的判别学习方法

[CVPR 2018]Discriminative Learning of Latent Features for Zero-Shot Recognition

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

discriminative learning

Graph-based discriminative features learning for fine-grained image retrieva

learning deep features for discriminative localization

基于卷积神经网络的图像识别外文翻译

解释以下这段话F types of features from the multiple samples are treated using the iteration solution (3)−(7) to generate discriminative models , respectively.

Bag of Tricks and A Strong Baseline for Deep Person Re-identification

深度学习无监督图像分割综述

最新资源