Fitted Spectral Hashing
Yu Wang 1,2, Sheng Tang 1, Yalin Zhang 1,2, JinTao Li 1, DanYi Chen 3
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100190, China
3 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100081, China
{wangyu, ts, zhangyalin, jtli}@ict.ac.cn, rainachen1216@163.com
ABSTRACT
Spectral hashing (SpH) is an efficient and simple binary hashing method, which assumes that data are sampled from a multidimensional uniform distribution. However, this assumption is too restrictive in practice. In this paper we propose an improved method, Fitted Spectral Hashing, to relax this distribution assumption. Our work is based on the fact that one-dimensional data of any distribution can be mapped to a uniform distribution without changing the local neighbor relations among data items. We have found that this mapping on each PCA direction follows a regular pattern and can be fitted well by an S-shaped curve, the Sigmoid function; with more parameters, a Fourier function also fits the data well. Based on the Sigmoid and Fourier functions, we thus propose two binary hashing methods. Experiments show that our methods are efficient and outperform state-of-the-art methods.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval
models
General Terms
Algorithms, Experimentation
Keywords
Spectral hashing, Sigmoid function, Fourier function
1. INTRODUCTION
Similarity search is an essential problem in the fields of machine learning, computer vision and information retrieval. However, with increasing amounts of data, similarity search faces the following challenges: efficiently storing millions of items in memory and quickly finding items similar to a query item. Recent work [1] shows that binary hashing methods are a powerful way to address these challenges:
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM’13, October 21–25, 2013, Barcelona, Spain.
Copyright 2013 ACM 978-1-4503-2404-5/13/10 ...$15.00.
http://dx.doi.org/10.1145/2502081.2502169.
• The highly compressed binary codes can be loaded into main memory efficiently;
• Searching for similar items can be extremely fast with Hamming distances calculated by the bit XOR operation: an ordinary PC today can perform millions of Hamming distance computations in just a few milliseconds.
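The XOR-based distance computation described above can be sketched in a few lines of Python; the helper names here are illustrative, not part of any cited method:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as integers:
    XOR marks the differing bits, and we count the set bits."""
    return bin(a ^ b).count("1")

def nearest_code(query: int, database: list) -> int:
    """Index of the database code closest to the query in Hamming
    distance; a linear scan of cheap bit operations."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))
```

On modern CPUs the popcount step maps to a single instruction, which is what makes scanning millions of codes per millisecond feasible.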
The basic idea of binary hashing methods is to formulate projections from items to binary codes so as to approximately preserve a given similarity function of interest [2]. "Good" binary codes should meet the entropy-maximizing criterion. According to information theory [3], the maximal entropy of a source alphabet is attained with a uniform probability distribution. If the entropy of the binary codes over a data set is small, the data are mapped to only a small number of codes, rendering the codes inefficient.
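The criterion can be made concrete with a small sketch: the empirical entropy of the assigned codewords reaches its maximum of b bits (for b-bit codes) exactly when all 2^b codes are used equally often. The function name below is illustrative:

```python
from collections import Counter
from math import log2

def code_entropy(codes):
    """Empirical entropy (in bits) of a collection of codewords:
    -sum over codes of p(code) * log2(p(code))."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Uniform use of the four 2-bit codes attains the maximum of 2 bits;
# a skewed assignment wastes most of the code space.
uniform = ["00", "01", "10", "11"] * 25
skewed = ["00"] * 97 + ["01", "10", "11"]
```

Here `code_entropy(uniform)` equals 2.0 bits, while the skewed assignment falls far below that bound.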
However, many state-of-the-art methods do not meet this criterion. One of the best-known binary hashing methods is locality sensitive hashing (E2LSH), which calculates binary codes by projecting data onto random vectors with random thresholds; as shown in [4], the Hamming distance between binary codes asymptotically approaches the Euclidean distance between data items. The kernelized version (KLSH) [5] extends E2LSH to generic normalized kernel functions. Rather than using random vectors, researchers have also pursued machine learning approaches, e.g. the restricted Boltzmann machine (RBM) [6] and Boosting [7], to accelerate document and image retrieval.
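A minimal sketch of the random-projection-with-random-threshold scheme, assuming the data are rows of a NumPy array; this is an illustrative toy, not the full E2LSH implementation with multiple hash tables:

```python
import numpy as np

def e2lsh_codes(X, n_bits, seed=0):
    """Toy E2LSH-style binary codes: project each item onto random
    vectors and threshold each projection at a random cut point."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_bits))    # random projection vectors
    proj = X @ W                        # one column per bit
    # random thresholds drawn within the range of each projected column
    t = rng.uniform(proj.min(axis=0), proj.max(axis=0))
    return (proj > t).astype(np.uint8)
```

Because both the directions and the thresholds are random, nothing forces the resulting codes toward a uniform distribution, which is precisely the weakness the entropy-maximizing criterion exposes.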
When data are uniformly distributed in a hyper-rectangle, spectral hashing (SpH) [8], derived from the spectral graph partitioning problem, meets the entropy-maximizing criterion. Bits can be calculated efficiently by the eigenfunctions of the weighted Laplacian defined on R^1. This simple method outperforms the above methods. However, the assumption of SpH is too restrictive in practice. Like SpH,
Self-Taught Hashing (STH) [1] is also related to spectral graph partitioning, but uses the ratio-cut to address the entropy-maximizing criterion and applies a support vector machine (SVM) to yield hash codes for out-of-sample objects.
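The out-of-sample step amounts to training one binary classifier per bit on the in-sample codes. STH uses an SVM per bit; the sketch below substitutes a dependency-free perceptron-style update in its place, and all names are illustrative:

```python
import numpy as np

def fit_bit_classifiers(X, B, epochs=20, lr=0.1):
    """One linear classifier per bit, in the spirit of STH's
    out-of-sample extension (STH trains an SVM per bit; a batch
    perceptron-style update stands in here)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # append a bias feature
    Y = 2.0 * B - 1.0                      # {0,1} bits -> {-1,+1} labels
    W = np.zeros((d + 1, B.shape[1]))      # one weight column per bit
    for _ in range(epochs):
        pred = np.sign(Xb @ W)
        mistakes = (pred != Y)             # update only on misclassified items
        W += lr * Xb.T @ (Y * mistakes)
    return W

def hash_out_of_sample(X, W):
    """Binary codes for unseen items via the learned per-bit classifiers."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W > 0).astype(np.uint8)
```

The per-bit classifiers are what make STH applicable to any data distribution, but training one classifier per bit is also the source of its high computational cost noted below.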
STH works with any data distribution, but suffers from high computational cost. The binarized dimensionality reduction technique Latent Semantic Indexing (LSI) [9] and its improved version Laplacian Co-Hashing (LCH) [10] are efficient ways to obtain binary codes of documents. By setting the threshold to the median value of the left singular vectors of