Finding Similar Items: From Data Mining to Text Similarity

Updated 2024-07-22. PDF, 404 KB.

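"Data Mining - Chapter 3: Finding Similar Items (MMDS)"

In data mining, finding similar items is a fundamental task: the goal is to analyze data and identify items that are close to one another. This chapter examines the problem using Web pages as its example, showing how to find near-duplicate pages. Such pages may be plagiarized content, or mirror sites whose content is almost identical but whose host and mirror information differs slightly.

The problem is first recast as one of finding sets with a relatively large intersection. For detecting textually similar documents, the chapter introduces a technique called "shingling", which converts a text into small overlapping fragments (shingles) and thereby turns textual similarity into set similarity. Comparing the shingle sets of different documents gives a measure of how similar the documents are.

Next comes "minhashing", a method for compressing large sets. Minhashing preserves the essential similarity of the sets, so the similarity of the original sets can still be inferred from the compressed versions. This is especially useful for large datasets because it reduces the cost of computing similarity.

When the required degree of similarity is very high, Section 3.9 presents further techniques. They select similar pairs efficiently without comparing every possible pair, which matters at scale because comparing all pairs directly can be too slow to be practical.

The chapter may also touch on distance measures (such as cosine similarity and Jaccard similarity). Related techniques such as clustering and dimensionality reduction, for example principal component analysis (PCA) or singular value decomposition (SVD), can likewise help with high-dimensional data and with finding similar items in large datasets.

In short, Chapter 3 of "Data Mining" discusses strategies and techniques for finding similar items, including shingling and minhashing, and provides both a theoretical basis and practical methods for similarity problems on large data. With these tools, researchers and practitioners can process and analyze large volumes of text, images, or other data more effectively and uncover hidden patterns and associations.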
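The shingling step described in the summary can be sketched as follows. This is a minimal illustration, assuming character-level shingles of a fixed length k; the function names `shingles` and `jaccard` are hypothetical and not taken from the text, and returning 1.0 for two empty sets is a convention chosen here.

```python
def shingles(text, k):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


doc1 = "the quick brown fox"
doc2 = "the quick brown dog"
print(sorted(shingles("abcdabd", 2)))
print(jaccard(shingles(doc1, 3), shingles(doc2, 3)))
```

Documents that share most of their text share most of their shingles, so their shingle sets have a high Jaccard similarity.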

3.3 Similarity-Preserving Summaries of Sets

the signatures give the exact similarity of the sets they represent, but the estimates they provide are close, and the larger the signatures the more accurate the estimates. For example, if we replace the 200,000-byte hashed-shingle sets that derive from 50,000-byte documents by signatures of 1000 bytes, we can usually get within a few percent.
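As a rough illustration of why larger signatures give closer estimates, the sketch below assumes that each of the n values in a signature agrees between two sets independently, with probability equal to their true similarity s. Under that assumption the fraction of agreeing values has the binomial standard error sqrt(s(1 - s)/n). The function name and the figure of 4 bytes per value (so that 1000 bytes hold 250 values) are assumptions made for illustration, not statements from the text.

```python
import math


def estimate_standard_error(similarity, n):
    """Standard error of the fraction of n independent values that agree."""
    return math.sqrt(similarity * (1.0 - similarity) / n)


# Assumption: 4 bytes per value, so a 1000-byte signature holds 250 values.
for n in (25, 250, 2500):
    print(n, round(estimate_standard_error(0.5, n), 4))
```

Under these assumptions, 250 values give an error of roughly three percentage points at s = 0.5, which is in line with "within a few percent".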

3.3.1 Matrix Representation of Sets

Before explaining how it is possible to construct small signatures from large sets, it is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn. There is a 1 in row r and column c if the element for row r is a member of the set for column c. Otherwise the value in position (r, c) is 0.

Element  $S_1$  $S_2$  $S_3$  $S_4$
a          1      0      0      1
b          0      0      1      0
c          0      1      0      1
d          1      0      1      1
e          0      0      1      0

Figure 3.2: A matrix representing four sets

Example 3.6: Fig. 3.2 shows a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here, $S_1 = \{a, d\}$, $S_2 = \{c\}$, $S_3 = \{b, d, e\}$, and $S_4 = \{a, c, d\}$. The top row and leftmost column are not part of the matrix, but are present only to remind us what the rows and columns represent. □
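The characteristic matrix of Example 3.6 can be rebuilt from the four sets with a few lines of Python. This is only a sketch; the names `universe`, `sets`, and `characteristic_matrix` are chosen here for illustration.

```python
universe = ["a", "b", "c", "d", "e"]
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}


def characteristic_matrix(universe, sets):
    """One row per element, one column per set; 1 marks membership."""
    return [
        [1 if element in members else 0 for members in sets.values()]
        for element in universe
    ]


matrix = characteristic_matrix(universe, sets)
for element, row in zip(universe, matrix):
    print(element, *row)
```

The printed rows match the five rows of Fig. 3.2, with columns in the order S1 to S4.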

It is important to remember that the characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. One reason not to store data as a matrix is that these matrices are almost always sparse in practice (they have many more 0's than 1's). It saves space to represent a sparse matrix of 0's and 1's by the positions in which the 1's appear. Another reason is that the data is usually stored in some other format for other purposes. As an example, if rows are products, and columns are customers, represented by the set of products they bought, then this data would really appear in a database table of purchases. A tuple in this table would list the item, the purchaser, and probably other details about the purchase, such as the date and the credit card used.
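To make the purchases example concrete, the sketch below derives the sparse form of the characteristic matrix (for each customer, the positions of the 1's) directly from purchase tuples. The sample rows, the tuple layout (item, purchaser, date), and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase tuples: (item, purchaser, date).
purchases = [
    ("milk", "alice", "2024-01-03"),
    ("bread", "alice", "2024-01-03"),
    ("milk", "bob", "2024-01-04"),
    ("eggs", "carol", "2024-01-05"),
    ("bread", "carol", "2024-01-05"),
]


def columns_from_purchases(rows):
    """Map each purchaser (a column) to the set of items (rows) holding a 1."""
    columns = defaultdict(set)
    for item, purchaser, _date in rows:
        columns[purchaser].add(item)
    return dict(columns)


columns = columns_from_purchases(purchases)
for purchaser, items in sorted(columns.items()):
    print(purchaser, sorted(items))
```

Only the 1's are stored, so the space used grows with the number of purchases rather than with the number of products times the number of customers.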

3.3.2 Minhashing

The signatures we desire to construct for sets are composed of the results of a large number of calculations, say several hundred, each of which is a “minhash”


3. Type Z rows have 0 in both columns.

Since the matrix is sparse, most rows are of type Z. However, it is the ratio of the numbers of type X and type Y rows that determines both $\mathrm{SIM}(S_1, S_2)$ and the probability that $h(S_1) = h(S_2)$. Let there be x rows of type X and y rows of type Y. Then $\mathrm{SIM}(S_1, S_2) = x/(x + y)$. The reason is that x is the size of $S_1 \cap S_2$ (the rows with 1 in both columns) and x + y is the size of $S_1 \cup S_2$ (the rows with 1 in at least one column).

Now, consider the probability that $h(S_1) = h(S_2)$. If we imagine the rows permuted randomly, and we proceed from the top, the probability that we shall meet a type X row before we meet a type Y row is $x/(x + y)$. But if the first row from the top other than type Z rows is a type X row, then surely $h(S_1) = h(S_2)$. On the other hand, if the first row other than a type Z row that we meet is a type Y row, then the set with a 1 gets that row as its minhash value. However the set with a 0 in that row surely gets some row further down the permuted list. Thus, we know $h(S_1) \neq h(S_2)$ if we first meet a type Y row. We conclude the probability that $h(S_1) = h(S_2)$ is $x/(x + y)$, which is also the Jaccard similarity of $S_1$ and $S_2$.
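The argument above can be checked exhaustively on a small case. The sketch below enumerates every permutation of a five-element universe and counts how often two sets receive the same minhash value. The sets {a, d} and {a, c, d} and all function names are chosen here for illustration, and the minhash value is taken to be the first element, in permuted order, that belongs to the set.

```python
from fractions import Fraction
from itertools import permutations


def jaccard(s, t):
    """Exact Jaccard similarity as a fraction."""
    return Fraction(len(s & t), len(s | t))


def minhash(s, order):
    """First element in the permuted order that belongs to s."""
    for element in order:
        if element in s:
            return element
    raise ValueError("an empty set has no minhash value")


def collision_probability(s, t, universe):
    """Fraction of all permutations of universe on which the minhashes agree."""
    orders = list(permutations(universe))
    agree = sum(1 for order in orders if minhash(s, order) == minhash(t, order))
    return Fraction(agree, len(orders))


universe = ["a", "b", "c", "d", "e"]
S1 = {"a", "d"}
S4 = {"a", "c", "d"}
print(jaccard(S1, S4), collision_probability(S1, S4, universe))
```

Here x = 2 (rows a and d) and y = 1 (row c), so both quantities equal 2/3.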

3.3.4 Minhash Signatures

Again think of a collection of sets represented by their characteristic matrix M. To represent sets, we pick at random some number n of permutations of the rows of M. Perhaps 100 permutations or several hundred permutations will do. Call the minhash functions determined by these permutations $h_1, h_2, \ldots, h_n$. From the column representing set S, construct the minhash signature for S, the vector $[h_1(S), h_2(S), \ldots, h_n(S)]$. We normally represent this list of hash-values as a column. Thus, we can form from matrix M a signature matrix, in which the ith column of M is replaced by the minhash signature for (the set of) the ith column.

Note that the signature matrix has the same number of columns as M but only n rows. Even if M is not represented explicitly, but in some compressed form suitable for a sparse matrix (e.g., by the locations of its 1's), it is normal for the signature matrix to be much smaller than M.
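For a small matrix, the construction just described can be carried out literally, with explicit random permutations. The sketch below is illustrative only: each column is assumed to be given as the set of row numbers holding a 1, the minhash value is recorded as the original row number, and `signature_matrix`, `estimated_similarity`, and the `seed` parameter are hypothetical names.

```python
import random


def signature_matrix(columns, num_rows, n, seed=0):
    """Return n rows of minhash values, one value per column in each row."""
    rng = random.Random(seed)
    signatures = []
    for _ in range(n):
        order = list(range(num_rows))
        rng.shuffle(order)  # one explicit random permutation of the rows
        position = {row: index for index, row in enumerate(order)}
        # Minhash of a column: its row that comes first in the permuted order.
        signatures.append(
            [min(column, key=position.__getitem__) for column in columns]
        )
    return signatures


def estimated_similarity(signatures, i, j):
    """Fraction of signature rows in which columns i and j agree."""
    agree = sum(1 for row in signatures if row[i] == row[j])
    return agree / len(signatures)


# Columns of a small example, with rows a..e numbered 0..4.
columns = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
sig = signature_matrix(columns, num_rows=5, n=100)
print(len(sig), len(sig[0]))
print(estimated_similarity(sig, 0, 3))
```

The result has n rows and as many columns as M. The fraction of rows in which two columns agree serves as an estimate of their Jaccard similarity.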

3.3.5 Computing Minhash Signatures

It is not feasible to permute a large characteristic matrix explicitly. Even picking a random permutation of millions or billions of rows is time-consuming, and the necessary sorting of the rows would take even more time. Thus, permuted matrices like that suggested by Fig. 3.3, while conceptually appealing, are not implementable.

Fortunately, it is possible to simulate the effect of a random permutation by a random hash function that maps row numbers to as many buckets as there are rows. A hash function that maps integers 0, 1, . . . , k − 1 to bucket numbers 0 through k − 1 typically will map some pairs of integers to the same bucket and leave other buckets unfilled. However, the difference is unimportant as long as
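The idea of replacing a permutation by a hash function on row numbers can be sketched as follows. The linear form (a*r + b) mod k, the particular constants, and the function names are assumptions made for illustration. The sketch shows one hash function that happens to be a true permutation of the buckets and one that collides and leaves buckets unfilled, and it takes a column's simulated minhash value to be the smallest hash value among its rows.

```python
def make_hash(a, b, k):
    """Return h(r) = (a*r + b) mod k, a hypothetical hash on rows 0..k-1."""
    def h(row):
        return (a * row + b) % k
    return h


def buckets_used(h, k):
    """Sorted list of the distinct buckets hit by rows 0..k-1."""
    return sorted({h(row) for row in range(k)})


def simulated_minhash(column, h):
    """Smallest hash value over the rows in which the column has a 1."""
    return min(h(row) for row in column)


k = 6
h_perm = make_hash(5, 1, k)  # 5 is coprime to 6, so every bucket is hit once
h_coll = make_hash(2, 1, k)  # 2 shares a factor with 6, so rows collide

print(buckets_used(h_perm, k))
print(buckets_used(h_coll, k))
print(simulated_minhash({0, 3, 5}, h_perm))
```

With `h_coll`, rows 0 and 3 land in the same bucket and buckets 0, 2, and 4 stay empty, which is the kind of behaviour the paragraph describes.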

