Optimizing the Maximum Likelihood Estimator of Minwise Hashing for Three Sets


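This paper examines maximum likelihood estimation (MLE) for Minwise Hashing over three sets. When computing the similarity among multiple sets (here, three), Minwise Hash and its variants give efficient and accurate similarity measurements if the three set sizes are close (e.g., f1≈f2≈f3). When resemblance and containment among the three sets are unbalanced, for example when f1 is far larger than f2 and f3 (f1>>f2≈f3≈a), the variance of conventional Minwise Hash becomes too large, which can reduce accuracy. To address this, the authors propose a maximum likelihood estimator of Hash for three sets. The method combines the probabilities of the various events that arise during comparison in order to improve average accuracy, with the goal of reducing estimation error under low resemblance and high containment. Using theoretical derivation and experimental results, the authors show that the improved estimator can significantly raise estimation accuracy in such cases. Specifically, the study covers the following steps:

1. **Theoretical background**: reviews the basic principle of Minwise Hash, including its strengths for measuring set similarity and how it behaves when the set sizes are close.
2. **Problem identification**: points out the limitations of the conventional method for three sets whose sizes differ greatly, and stresses the resulting estimation difficulties.
3. **Proposed method**: designs a new maximum likelihood estimation strategy that accounts for the interactions among the three sets in order to reduce error. This may involve building a probability model, such as estimating a joint probability distribution.
4. **Model optimization**: uses mathematical derivation and statistical analysis to optimize the model parameters so as to maximize the likelihood function and improve estimation accuracy.
5. **Experimental validation**: uses experiments on real data sets to show how the new method improves on the conventional one under different resemblance and containment conditions. This may include comparing metrics such as precision, recall, and F1 score.
6. **Conclusions and applications**: summarizes the results and discusses potential applications, particularly similarity analysis over large data sets or sets with complex relationships.

The paper offers an effective way to estimate similarity among three sets, and is especially suited to cases where set sizes and containment relationships are unbalanced. By introducing maximum likelihood estimation, researchers can better quantify and manage uncertainty and so improve the overall quality of similarity assessment.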

ICIC Express Letters ICIC International

2015 ISSN 1881-803X

Volume 9, Number 7, July 2015 pp. 2039–2044

MAXIMUM LIKELIHOOD ESTIMATOR OF MINWISE HASHING FOR THREE SETS

Xinpan Yuan, Xinhai Sheng, Jun Long^{2,∗}, Zuping Zhang, Changyun Li and Junfeng Man

School of Computer and Communication

Hunan University of Technology

No. 218, Daping Road, Hetang District, Zhuzhou 412000, P. R. China

School of Information Science and Engineering

Central South University

No. 932, Lushan South Road, Changsha 410083, P. R. China

∗Corresponding author: jlong@csu.edu.cn

Received July 2014; accepted October 2014

Abstract. Computing the similarity of three or more sets is a fundamental problem. Minwise Hash and its many variants are efficient and accurate methods of similarity measurement when the sizes of the three collections are almost the same (e.g., $f_1 \approx f_2$). However, with low resemblance and high containment (e.g., $f_1 \gg f_2 \approx f_3 \approx a$), the variance is too large. By combining the probabilities of the various events in the comparison, we propose a maximum likelihood estimator of Hash for three sets to improve the average accuracy, and experimental results demonstrate the effectiveness of this estimator.

Keywords: Similarity estimation, Maximum likelihood estimator, Three set resemblance, Hash

1. Introduction. The explosive expansion of the World Wide Web has resulted in more than 20% redundant web documents [1] and thus created a serious problem for Internet search engines. Duplicate document detection has significant applications in intellectual property protection and information retrieval. When measuring the similarity of collections, the storage and efficiency demands of massive data are important limiting factors; therefore, estimation-based detection has become the mainstream approach.

Minwise Hash [2] is commonly used to estimate the similarity of collections and has a wide range of applications in word correlation [3], data cleaning [4], data mining [5], duplicate web page removal [6], wireless sensor networks [7], text reuse [8] and other fields. Minwise Hash has also seen considerable innovation and development in theoretical and experimental methods [9-15]. b-bit Minwise Hash [12] can reduce storage space. Fractional-bit Minwise Hash [13] offers a wide range of trade-offs between accuracy and storage space. Connected bit Minwise Hash [14] can exponentially reduce the number of comparisons to improve the performance of the algorithm.
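To make the baseline concrete, the following is a minimal sketch of plain Minwise Hash applied to three sets. It assumes idealised random permutations in place of practical hash functions, and the names `minhash_signatures` and `three_way_resemblance` and the example sets are hypothetical, chosen only for illustration. Under a random permutation, the three min-hashes coincide with probability $R = |S_1 \cap S_2 \cap S_3| / |S_1 \cup S_2 \cup S_3|$, so the fraction of coinciding rounds estimates $R$.

```python
import random

def minhash_signatures(sets, k, seed=0):
    """Return one length-k min-hash signature per set.

    Each of the k rounds draws a fresh random permutation of the union
    (an idealised stand-in for the hash functions used in practice).
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    signatures = [[] for _ in sets]
    for _ in range(k):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        rank = dict(zip(universe, ranks))  # element -> position in permutation
        for sig, s in zip(signatures, sets):
            sig.append(min(rank[x] for x in s))
    return signatures

def three_way_resemblance(sig1, sig2, sig3):
    """Fraction of rounds in which all three min-hashes coincide."""
    hits = sum(1 for a, b, c in zip(sig1, sig2, sig3) if a == b == c)
    return hits / len(sig1)

# Hypothetical example sets with a three-way overlap of 600 elements.
s1 = set(range(0, 1000))
s2 = set(range(200, 1200))
s3 = set(range(400, 1400))
sigs = minhash_signatures([s1, s2, s3], k=500)
estimate = three_way_resemblance(*sigs)
exact = len(s1 & s2 & s3) / len(s1 | s2 | s3)  # 600 / 1400
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

With $k$ independent permutations, each round is a Bernoulli trial with success probability $R$, so this estimator has variance $R(1-R)/k$. Its relative standard error, $\sqrt{(1-R)/(kR)}$, grows as $R$ shrinks, which is one way to see why small resemblance values are hard to estimate accurately with a fixed $k$.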

The literature on Minwise Hash has mainly focused on the accuracy of estimation. With low resemblance and high containment, the variance of Minwise Hash and its many variants is too large, so the accuracy is low. We combine the probabilities of seven events in the comparison and propose a maximum likelihood estimator [15] of Minwise Hash for three sets to increase the accuracy.
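The seven events are not defined in this excerpt. One plausible reading, stated here as an assumption rather than as the paper's definition, is that each permutation round is classified by which of the three sets attain the overall minimum hash value. There are seven non-empty subsets of three sets, and under a random permutation each such event occurs with probability equal to the size of the matching Venn region divided by the size of the union. The sketch below only counts these events and compares their frequencies with the exact region proportions; it does not reproduce the paper's estimator. The function names and the example sets are hypothetical.

```python
import random
from collections import Counter

def seven_event_counts(sets, k, seed=0):
    """Count, over k random permutations, which sets attain the overall
    minimum. Keys are membership tuples such as (1, 0, 1)."""
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    counts = Counter()
    for _ in range(k):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        rank = dict(zip(universe, ranks))  # element -> position in permutation
        mins = [min(rank[x] for x in s) for s in sets]
        lowest = min(mins)
        counts[tuple(int(m == lowest) for m in mins)] += 1
    return counts

def exact_region_probs(sets):
    """Exact event probabilities: Venn-region size divided by union size."""
    union = set().union(*sets)
    regions = Counter(tuple(int(x in s) for s in sets) for x in union)
    return {event: n / len(union) for event, n in regions.items()}

# Hypothetical example in which all seven Venn regions are non-empty.
s1 = {x for x in range(1, 301) if x % 2 == 0}
s2 = {x for x in range(1, 301) if x % 3 == 0}
s3 = {x for x in range(1, 301) if x % 5 == 0}
counts = seven_event_counts([s1, s2, s3], k=1000)
probs = exact_region_probs([s1, s2, s3])
for event in sorted(probs):
    print(event, f"observed={counts[event] / 1000:.3f}", f"exact={probs[event]:.3f}")
```

Under this reading the counts follow a multinomial distribution, so a likelihood over the region sizes can be written directly from them. The plain estimator uses only the all-equal event `(1, 1, 1)`, whereas a likelihood-based estimator can also draw on the other six counts.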

The paper is organized as follows. Section 2 introduces Minwise Hash and the maximum likelihood estimator. Section 3 explains how to establish the maximum likelihood estimator for three sets. Section 4 discusses experimental results. Section 5 concludes the paper.
