
As one of the most popular hashing methods, LSH randomly samples hash functions
from a locality sensitive function family [1]. SimHash [4,13] and MinHash [3,20] are
two widely adopted LSH schemes. MinHash is a technique for quickly estimating the Jaccard coefficient of two sets, i.e., the resemblance similarity defined over binary vectors [3]. In contrast, SimHash is an LSH scheme for similarities (e.g., cosine similarity) that are defined on general real-valued data. As indicated in [21], when the data
are high-dimensional and binary, MinHash tends to work better than SimHash. On the
other hand, SimHash achieves better performance than MinHash on real-valued data.
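For intuition, the following is a minimal Python sketch of how MinHash can estimate the Jaccard coefficient of two sets; it is not taken from the cited papers, and the signature length and the salted-hash construction (standing in for true random permutations) are illustrative assumptions.

    import random

    def minhash_signature(items, num_hashes=128, seed=0):
        # One signature entry per salted hash function (a stand-in for a
        # random permutation): the minimum hash value over the set's elements.
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        return [min(hash((salt, x)) for x in items) for salt in salts]

    def estimate_jaccard(sig_a, sig_b):
        # The fraction of positions where the two signatures agree
        # estimates the Jaccard coefficient of the two sets.
        matches = sum(a == b for a, b in zip(sig_a, sig_b))
        return matches / len(sig_a)

    a, b = {"cat", "dog", "fish"}, {"cat", "dog", "bird"}
    sa, sb = minhash_signature(a), minhash_signature(b)
    print(estimate_jaccard(sa, sb))  # close to |a ∩ b| / |a ∪ b| = 0.5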
Specifically, to approximate the cosine similarity, Charikar [4] defined a hash function
h as:
h(q) = \begin{cases} 1, & \text{if } w \cdot q > 0 \\ 0, & \text{if } w \cdot q < 0 \end{cases} \qquad (1)
where w is a random vector drawn from the d-dimensional Gaussian distribution N(0, I_d).
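A minimal Python/NumPy sketch of the random hyperplane hash in Eq. (1) follows; the code length, the data layout, and the handling of the measure-zero case w · q = 0 are illustrative assumptions.

    import numpy as np

    def simhash_codes(X, num_bits=64, seed=0):
        # X: (n, d) real-valued data matrix.
        # Each bit is h(q) = 1 if w . q > 0 else 0, with w ~ N(0, I_d)
        # as in Eq. (1); one random Gaussian vector per bit.
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], num_bits))
        return (X @ W > 0).astype(np.uint8)

    X = np.random.randn(5, 100)
    codes = simhash_codes(X, num_bits=32)
    # For two vectors, the expected fraction of differing bits equals
    # theta / pi, where theta is the angle between them.
    print(codes.shape)  # (5, 32)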
Although they enjoy many nice theoretical properties, these random-projection-based, data-independent hashing methods are not discriminative with respect to the data and typically need very long codes to achieve satisfactory search performance.
Recently, many data-dependent hashing methods [26,18,15,11,24,29,16,7,9] have been proposed to learn data-aware hash functions. As we have mentioned, many of them [26,18,24,29,8,7] are based on the eigendecomposition of a matrix (e.g., a Laplacian matrix). This leads to an imbalance problem, since the amount of information captured by different eigenvectors is unequal. A few recent works have attempted to address this problem.
In [25], instead of learning all the eigenvectors at once, Wang et al. proposed a sequential learning framework (USPLH) in which each hash function is learned to correct the errors made by the previous ones. Inspired by multiclass spectral clustering [28],
in Iterative Quantization (ITQ) [7], Gong et al. proposed an alternating minimization scheme that learns an orthogonal transformation of the PCA-projected data so as to minimize the quantization error of mapping the data to their corresponding binary codes (the vertices of a binary hypercube); a minimal sketch of this alternating scheme is given after this paragraph. In Isotropic Hashing (IsoH) [15], Kong et al. proposed to learn projection functions that produce projected dimensions with equal variances. As in ITQ, they learn an orthogonal transformation of the PCA-projected data, here by iteratively minimizing the reconstruction error between the covariance matrix and a diagonal matrix. A similar idea was adopted in [27], in which the PCA projection was replaced with locality preserving projection (LPP) [10].
In these methods, longer codes often capture much more information and thus give better experimental results than shorter ones. However, to the best of our knowledge, there is still no adequate theoretical guarantee, as there is for LSH, that performance improves as the code size increases.
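The sketch below illustrates the ITQ-style alternating minimization mentioned above; it is not the authors' reference code, and the iteration count, code length, and random orthogonal initialization are assumptions. It alternates between quantizing the rotated data and solving an orthogonal Procrustes problem for the rotation, which is the scheme described for ITQ [7].

    import numpy as np

    def itq_rotation(V, n_iter=50, seed=0):
        # V: (n, c) zero-centered, PCA-projected data.
        # Alternate between (1) fixing R and quantizing B = sign(V R), and
        # (2) fixing B and updating R by solving the orthogonal Procrustes
        # problem min_R ||B - V R||_F via an SVD.
        rng = np.random.default_rng(seed)
        R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))
        for _ in range(n_iter):
            B = np.sign(V @ R)                 # binary codes in {-1, +1}
            U, _, Vt = np.linalg.svd(B.T @ V)  # SVD for the Procrustes step
            R = (U @ Vt).T
        return R

    X = np.random.randn(200, 64)
    X -= X.mean(axis=0)
    # PCA to c = 16 dimensions (top right singular vectors of the centered data).
    _, _, components = np.linalg.svd(X, full_matrices=False)
    V = X @ components[:16].T
    R = itq_rotation(V)
    codes = (np.sign(V @ R) > 0).astype(np.uint8)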
The differences between our method and these previous works are clear. Instead of minimizing the quantization error or strictly requiring each dimension to have equal variance, we leverage the bootstrap sampling scheme and integrate it with PCA. In each round, only the informative top eigenvectors are used to learn a short binary code. Owing to the well-established theory of ensemble learning, our method enjoys several advantages that previous works lack.
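Since this section describes the scheme only at a high level, the following Python fragment is merely one possible illustration of a bootstrap-plus-PCA ensemble and is not the authors' algorithm; the number of rounds, the bits per round, the zero thresholding of the centered projections, and the name bootstrap_pca_codes are all assumptions made for illustration.

    import numpy as np

    def bootstrap_pca_codes(X, n_rounds=8, bits_per_round=4, seed=0):
        # Illustrative sketch only: each round draws a bootstrap sample,
        # runs PCA on it, keeps the top `bits_per_round` eigenvectors,
        # and thresholds the projections of the full (centered) data at zero.
        # The final code concatenates the short per-round codes.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Xc = X - X.mean(axis=0)
        codes = []
        for _ in range(n_rounds):
            idx = rng.integers(0, n, size=n)          # bootstrap sample
            S = Xc[idx]
            cov = S.T @ S / n
            eigvals, eigvecs = np.linalg.eigh(cov)
            top = eigvecs[:, np.argsort(eigvals)[::-1][:bits_per_round]]
            codes.append((Xc @ top > 0).astype(np.uint8))
        return np.hstack(codes)                        # (n, n_rounds * bits_per_round)

    X = np.random.randn(500, 32)
    B = bootstrap_pca_codes(X)
    print(B.shape)  # (500, 32)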