Computational Intelligence and Neuroscience
T : An illustrative example of permutation of feature vectors.
/ indicates absence/presence of features in each instance.
Instance
Feature
bacdf e
A 1
B 1
C 1
D 1
E 1
feature vector in Table is ;supposethepermuted
feature vector is ; then feature vectors for , , , ,
and become (100001),(010011),(100101),(001100),and
(000110)as illustrated in Table .ustheminhashvaluefor
, , , ,andis , , , , and , respectively.
We can choos e independent permutations
1
,
2
,...,
𝑛
.Supposetheminhashvalueofaninstance
𝑖
for a certain
permutation
𝑗
is denoted by min
𝑗
(
𝑖
); then the signature
denoted by Sig(
𝑖
)is
Sig
𝑖
=
min
1
𝑖
,min
2
𝑖
,...,min
𝑛
𝑖
.
()
e approximate similarity between two instances based
on their signatures is dened as the percentage of identical
values at the same position in the corresponding signatures.
For example, given =6,Sig(
1
) = (2,1,5,0,3,2),and
Sig(
2
)=(2,1,3,2,8,0), the approximate Jaccard similarity
is sim(
1
,
2
)≈2/6=0.33.
2.2. Banding. Givenalargesetofsignaturesgeneratedin
Section ., it is still too costly to compare similarities for all
signature pairs. erefore, a banding technique is presented
consequently to lter dissimilar pairs.
e banding technique divides each signature into
bands,whereeachbandconsistsof elements. For each band
of every signature, the banding technique maps the vector of
elements to a bucket array.
As shown in Figure ,theth band of each signature maps
to bucket array .Intuitively,ifforapairofsignatures,the
corresponding bucket arrays have at least one bucket array in
common, then the pair is likely to be similar. For example,
signature and signature and signature and signature
in Figure are similar. Such a pair with common bucket array
is considered to be a candidate pair and needs to be veried
in the banding technique.
3. Personalized LSH
3.1. New Banding Technique. e candidates generated by
LSH are not guaranteed to be similar pairs. Chances are that
apairofsignaturesareprojectedtoidenticalbucketarrays
even if the Jaccard similarity between the pair of instances
is not larger than the given threshold. In the meantime, a
pair of instances can be ltered out from candidates since
their corresponding signatures are projected into disjoint
bucket arrays even if the Jaccard similarity is smaller than the
given threshold. e former case is called false positive, while
the latter one is called false negative. Massive false positives
will lead to inaccurate results, while a large amount of false
negatives will deteriorate computational eciency of LSH. To
enhance the algorithm precision and eciency, we present
here a new banding scheme to lter more dissimilar instance
pairs. Intuitively, if two instances are highly alike, it is possible
that many bands of the two corresponding signatures are
mapped to identical buckets. For example, in Figure ,there
are at least bands (i.e., the st, the th, and the th bands)
of signature and signature which map to the same buckets
(i.e., in the corresponding bucket array , , ).
erefore, we change the banding scheme as follows. For
any pair of instances, if the two corresponding signatures do
not map into at least (∈[1,])identicalbuckets,itwillbe
ltered out. Otherwise, it is considered to be a candidate pair
and the exact Jaccard similarity is computed and veried. For
the signatures shown in Figure ,given=3,signature1and
signature and signature and signature are ltered.
3.2. Number of False Positives. Acandidatepair
1
,
2
is
false positive, if sim(
1
,
2
) < and
1
,
2
share at least
common bucket arrays. Since the eciency of LSH is
mainly dependent on the number of false positives, and most
real applications demand a high precision, we rst derive
the possible number of false positives generated by the new
banding technique.
Lemma 1. e upper bound of false positives generated by the
new banding technique is equal to the original LSH and the
lower bound is approximate to 0.
Proof. According to the law of large numbers, the probability
that the minhash values of two feature vectors (e.g.,
1
,
2
)are
equal under any random permutation , is very close to the
frequency percentage of observing identical value in the same
position at two long signatures of the corresponding feature
vectors. at is,
?min
1
=min
2
= lim
𝑛→+∞
?
Sig
1
𝑟
=Sig
2
𝑟
Sig
1
,
()
where is the length of signatures Sig(
1
)and Sig(
2
); is the
position in signatures, ∈[1,].
Also, the probability that a random permutation of two
featurevectorsproducesthesameminhashvalueequalsthe
Jaccard similarity of those instances []. at is,
?min
1
=min
2
=
1
∩
2
1
∪
2
=sim
1
,
2
.
()
Basedontheabovetwoequations,theprobabilityof
two instances with Jaccard similarity is considered to be a
candidate pair by the new banding technique denoted by
new
as
new
(
)
=1−
𝑘−1
𝑖=0
𝑟
𝑖
1−
𝑟
𝑏−𝑖
,
()