Personalized Locality Sensitive Hashing on Large-Scale Data: Similarity Joins with MapReduce

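The article "MapReduce-Based Personalized Locality Sensitive Hashing for Similarity Joins on Large-Scale Data" studies how to implement personalized locality sensitive hashing (PLSH) within the MapReduce framework in order to solve the similarity join problem on large-scale, high-dimensional data. Locality sensitive hashing (LSH) is an effective method for finding similar objects in high-dimensional space. Its core idea is to map high-dimensional data into a low-dimensional space with hash functions so that similar items fall into the same hash bucket, which enables fast approximate similarity search.

The strengths of LSH are its efficiency and its approximation rate, but its performance depends on the number of false positives and false negatives it produces. Reducing false positives is critical in many application domains, because it directly affects the precision of the search results and the overall efficiency of the system. In some application scenarios, balancing false positives against false negatives is equally important to keep the search both accurate and efficient.

The authors, Jingjing Wang and Chen Lin, are from the School of Information Science and Technology and the Shenzhen Research Institute of Xiamen University. They propose a MapReduce-based personalized LSH method for similarity joins on large data sets. MapReduce is a distributed computing model that decomposes a big-data processing task into "Map" and "Reduce" phases that can run in parallel, which makes it suitable for processing and storing massive data. In the Map phase, the PLSH algorithm hashes the data in parallel and builds hash tables, reducing the complexity of the data in high-dimensional space. In the Reduce phase, the results of the different Map tasks are merged to further filter out the potentially similar pairs. The personalized element may involve adjusting the hash functions to the characteristics of the data in order to optimize the error rates for a specific application scenario.

The article stresses that applying this personalized strategy in a MapReduce environment keeps the search efficient while giving better control over the number of false positives, which improves the overall quality of the similarity join. Because MapReduce is distributed, the method also scales to larger data sets. In short, the article offers an improved similarity join method for large-scale data that, by combining MapReduce with personalized LSH, addresses an important challenge in high-dimensional data processing: lowering the probability of wrong matches while preserving efficiency. The method is relevant to recommender systems, document analysis, and similar fields, where it can improve the precision and practicality of big-data analysis.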

Computational Intelligence and Neuroscience 

Table: An illustrative example of permutation of feature vectors. 0/1 indicates absence/presence of features in each instance.

Instance   Feature
           b  a  c  d  f  e
A          1  0  0  0  0  1
B          0  1  0  0  1  1
C          1  0  0  1  0  1
D          0  0  1  1  0  0
E          0  0  0  1  1  0

For example, the feature order of the original feature vectors is abcdef; suppose the permuted feature order is bacdfe; then the feature vectors for the five instances become (100001), (010011), (100101), (001100), and (000110), as illustrated in the table above. Thus the minhash values for the five instances, taken as the position of the first 1 in each permuted vector, are 1, 2, 1, 3, and 4, respectively.

We can choose $n$ independent permutations $\pi_1, \pi_2, \ldots, \pi_n$. Suppose the minhash value of an instance $S_i$ for a certain permutation $\pi_j$ is denoted by $\min \pi_j(S_i)$; then the signature denoted by $\mathrm{Sig}(S_i)$ is

$$\mathrm{Sig}(S_i) = \left(\min \pi_1(S_i), \min \pi_2(S_i), \ldots, \min \pi_n(S_i)\right).$$

The approximate similarity between two instances based on their signatures is defined as the percentage of identical values at the same position in the corresponding signatures. For example, given $n = 6$, $\mathrm{Sig}(S_1) = (2, 1, 5, 0, 3, 2)$, and $\mathrm{Sig}(S_2) = (2, 1, 3, 2, 8, 0)$, the approximate Jaccard similarity is $\mathrm{sim}(S_1, S_2) \approx 2/6 = 0.33$, because only the first two positions agree.
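The signature comparison can be written as a short Python sketch. The function name `signature_similarity` is illustrative and not taken from the paper; the two signatures are the ones from the example.

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures hold the same value."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


sig_1 = [2, 1, 5, 0, 3, 2]
sig_2 = [2, 1, 3, 2, 8, 0]
approx_jaccard = signature_similarity(sig_1, sig_2)
print(round(approx_jaccard, 2))
```

With longer signatures the estimate tends to get closer to the exact Jaccard similarity, at the cost of more minhash computations per instance.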

2.2. Banding. Given a large set of signatures generated in Section 2.1, it is still too costly to compare similarities for all signature pairs. Therefore, a banding technique is presented to filter dissimilar pairs.

The banding technique divides each signature into $b$ bands, where each band consists of $r$ elements. For each band of every signature, the banding technique maps the vector of $r$ elements to a bucket array.

As shown in the figure, the $i$th band of each signature maps to bucket array $i$. Intuitively, if a pair of signatures falls into the same bucket in at least one bucket array, then the pair is likely to be similar. For example, the figure contains two such pairs of signatures. A pair that shares a bucket in this way is considered to be a candidate pair and needs to be verified in the banding technique.
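The banding step can be illustrated with a minimal single-machine Python sketch. It is not the paper's MapReduce implementation: the band tuple itself is used as the bucket key instead of a hash value, and the function name, signature names, and parameter values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def banding_candidates(signatures, b, r):
    """Return pairs of names whose signatures share a bucket in at least one band.

    signatures: dict mapping a name to a signature of length b * r.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # one bucket array per band
        for name, signature in signatures.items():
            key = tuple(signature[band * r:(band + 1) * r])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


example_signatures = {
    "sig1": [2, 1, 5, 0, 3, 2],
    "sig2": [2, 1, 3, 2, 8, 0],
    "sig3": [7, 7, 3, 2, 9, 9],
    "sig4": [4, 4, 6, 6, 1, 1],
}
candidate_pairs = banding_candidates(example_signatures, b=3, r=2)
print(sorted(candidate_pairs))
```

Here `sig1` and `sig2` collide in the first band and `sig2` and `sig3` collide in the second, so both pairs become candidates, while `sig4` is never compared with anything.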

3. Personalized LSH

3.1. New Banding Technique. The candidates generated by LSH are not guaranteed to be similar pairs. Chances are that a pair of signatures is projected to identical bucket arrays even if the Jaccard similarity between the pair of instances is not larger than the given threshold. Meanwhile, a pair of instances can be filtered out from the candidates because their corresponding signatures are projected into disjoint bucket arrays even if the Jaccard similarity is larger than the given threshold. The former case is called a false positive, while the latter is called a false negative. Massive false positives will deteriorate the computational efficiency of LSH, because every candidate pair has to be verified, while a large number of false negatives will lead to inaccurate results, because truly similar pairs are missed. To enhance the precision and efficiency of the algorithm, we present here a new banding scheme to filter more dissimilar instance pairs. Intuitively, if two instances are highly alike, it is likely that many bands of the two corresponding signatures are mapped to identical buckets. For example, in the figure, at least three bands of one pair of signatures map to the same buckets in the corresponding bucket arrays.

Therefore, we change the banding scheme as follows. For any pair of instances, if the two corresponding signatures do not map into at least $k$ ($k \in [1, b]$) identical buckets, the pair is filtered out. Otherwise, it is considered to be a candidate pair, and the exact Jaccard similarity is computed and verified. For the signatures shown in the figure, given $k = 3$, two of the signature pairs, one of them involving signature 1, are filtered out.

3.2. Number of False Positives. A candidate pair $S_1, S_2$ is a false positive if $\mathrm{sim}(S_1, S_2)$ is smaller than the given threshold and $S_1, S_2$ share at least $k$ common bucket arrays. Since the efficiency of LSH is mainly dependent on the number of false positives, and most real applications demand a high precision, we first derive the possible number of false positives generated by the new banding technique.

Lemma 1. The upper bound of false positives generated by the new banding technique is equal to that of the original LSH, and the lower bound is approximately 0.

Proof. According to the law of large numbers, the probability that the minhash values of two feature vectors (e.g., $S_1$, $S_2$) are equal under any random permutation $\pi$ is very close to the frequency of observing identical values in the same position in two long signatures of the corresponding feature vectors. That is,

$$P\left(\min \pi(S_1) = \min \pi(S_2)\right) = \lim_{n \to +\infty} \frac{\left|\left\{ r : \mathrm{Sig}(S_1)_r = \mathrm{Sig}(S_2)_r \right\}\right|}{\left|\mathrm{Sig}(S_1)\right|},$$

where $n$ is the length of signatures $\mathrm{Sig}(S_1)$ and $\mathrm{Sig}(S_2)$; $r$ is the position in signatures, $r \in [1, n]$.

Also, the probability that a random permutation of two feature vectors produces the same minhash value equals the Jaccard similarity of those instances []. That is,

$$P\left(\min \pi(S_1) = \min \pi(S_2)\right) = \frac{\left|S_1 \cap S_2\right|}{\left|S_1 \cup S_2\right|} = \mathrm{sim}(S_1, S_2).$$

Based on the above two equations, the probability that two instances with Jaccard similarity $s$ are considered to be a candidate pair by the new banding technique, denoted by $P_{\mathrm{new}}$, is

$$P_{\mathrm{new}}(s) = 1 - \sum_{i=0}^{k-1} \binom{b}{i} \left(s^{r}\right)^{i} \left(1 - s^{r}\right)^{b-i}.$$

