Distributed Text Clustering Based on Frequent Itemsets: Improving Efficiency and Accuracy in the Big Data Era


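"Research on Distributed Text Clustering Based on Frequent Itemset" is a research paper on improving the efficiency and accuracy of text clustering in the big data era. Text clustering is a core natural language processing technique for processing and organizing large-scale text data, but with massive data volumes, balancing running time against accuracy becomes a serious challenge. The authors, Wenchuan Yang, Qiwei Wu, and Zishuai Cheng of the School of Network Security at Beijing University of Posts and Telecommunications, study text clustering that combines a genetic algorithm, a feedback mechanism, and distributed computing. Traditional methods often cannot meet real-time and accuracy requirements on large-scale data. The authors therefore propose a new distributed text clustering method built on frequent itemset theory. A frequent itemset is a subset of items that occurs frequently across a large dataset; exploiting these frequent patterns makes it easier to uncover the internal structure and associations of text data. The paper covers the following points:

1. The introduction explains the importance of text clustering, particularly for document management and natural language processing. With the rise of big data, classifying massive text collections quickly and accurately has become a pressing problem.
2. To address this challenge, the authors integrate a genetic algorithm into text clustering based on frequent itemsets. The genetic algorithm searches for an optimal solution by simulating natural selection and inheritance, while the frequent itemsets supply the key features of the data, which shrinks the search space and improves efficiency.
3. The methodology section describes how the distributed clustering is implemented: a distributed computing framework (such as Hadoop) splits the task across multiple nodes for parallel processing, which further increases speed. A feedback mechanism refines the clustering results to maintain accuracy.
4. The experimental section reports the performance of the new method on real datasets. Compared with traditional text clustering algorithms, the new method shows significant advantages in speed and accuracy, most notably on large-scale text datasets.
5. The conclusion summarizes the main results, emphasizes the advantages of distributed text clustering based on frequent itemsets for big data workloads, and outlines possible future work, such as further optimizing the algorithm's efficiency and extending it to other natural language processing tasks.

The paper offers an effective strategy for relieving the bottlenecks of text clustering in big data environments and opens new avenues for practical applications in text mining and natural language processing.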

Research on Distributed Text Clustering Based on Frequent Itemset

Wenchuan Yang, Qiwei Wu, Zishuai Cheng

School of Network Security, Beijing University of Posts and Telecommunications, Beijing 100876, China

E-mail: 876196774@qq.com

Abstract: Text clustering, a significant field in natural language processing, is a key technology for processing and organizing massive text data. In the era of big data, however, the sheer volume of data poses great challenges to both the running time and the accuracy of text clustering. This paper focuses on the speed and precision of text clustering, combining a genetic algorithm, feedback, and distributed computing. A distributed text clustering method based on frequent itemsets is proposed. The experimental results show that it finds the global optimal centers more efficiently and makes the clustering more accurate.

Key Words: Text clustering, Frequent Itemset, Correlation analysis, Hadoop



1 Introduction

Text clustering is a key technology for processing and organizing document data, and it plays a vital role in natural language processing. With the rapid development of modern information technology, and especially the rising popularity of electronic publications, all scientific publications can now take electronic form. Most of the information in published articles is stored and displayed as text. Faced with this vast amount of scientific literature, rapid and accurate text clustering is important not only for information access and document classification, but also for recommender systems, collaborative filtering, search engines, and natural language recognition [1]. Improvements in text clustering technology will also drive innovation in information processing technology.

Nowadays, information technology keeps expanding, and a variety of interdisciplinary scientific and technological inventions are appearing in industry. The traditional clear-cut classification of disciplines can no longer accommodate the current boundaries between the liberal arts, science, and technology. Moreover, because interdisciplinary knowledge permeates so many fields, traditional subject classification by keyword filtering has failed to achieve the desired effect. To meet present requirements, document filing and discipline division must follow the dynamic development of science and technology in order to serve the needs of industry. An excellent clustering algorithm for massive amounts of text is therefore urgently needed.

However, as the application scenarios of text clustering change, the technology still has good development prospects as well as challenges to be solved. We must therefore design a text clustering algorithm that computes clustering similarity efficiently and takes the computation speed on massive text into account, in order to meet current demand in this field. This paper designs a clustering algorithm for massive amounts of text based on frequent itemsets, which is a new kind of distributed text clustering algorithm. Drawing on the idea of correlation analysis, it enhances the efficiency and effectiveness of clustering. Distributed parallel clustering and the use of frequent itemsets in text clustering will become a research focus in this area in the future.

This paper is supported by the National Natural Science Foundation of China (No. 61571064, 61471060, 61370176).

2 Technical Analysis

2.1 Frequent itemset rules

Frequent itemset mining is mainly applied in the field of correlation analysis. Its main purpose is to find potential connections and co-occurrence relationships in a large amount of data. Theoretically, association rules describe, through quantitative statistics, the co-occurrence of different elements in the same type of event. The basic theory is as follows:

Definition 1: Itemset

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a collection of items and $T = \{t_1, t_2, \ldots, t_n\}$ a collection of transactions. Each transaction $t_i$ is a collection of items satisfying $t_i \subseteq I$. An association rule is an implication of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. $X$ (or $Y$) is a collection of items, called an itemset [2].

Definition 2: Support

If an itemset $X$ is a subset of a transaction $t_i \in T$, we say that $t_i$ contains $X$. The support count of $X$ in $T$ (written $X.\mathrm{count}$) is the number of transactions in $T$ that contain $X$. The support of the rule $X \rightarrow Y$ is the percentage of transactions in $T$ that contain $X \cup Y$:

$$\mathrm{Support} = \frac{(X \cup Y).\mathrm{count}}{n}$$

Definition 3: Confidence

The confidence of the rule $X \rightarrow Y$ is the percentage of the transactions in $T$ containing $X$ that also contain $X \cup Y$:

$$\mathrm{Confidence} = \frac{(X \cup Y).\mathrm{count}}{X.\mathrm{count}}$$
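As a rough illustration of Definitions 2 and 3, the following Python sketch computes the support count, support, and confidence over a toy transaction set. The function names (`support_count`, `support`, `confidence`) and the sample documents are assumptions made for this example; they do not come from the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """X.count: number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Support of X -> Y: (X ∪ Y).count / n."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    union = frozenset(x) | frozenset(y)
    return support_count(union, transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Confidence of X -> Y: (X ∪ Y).count / X.count."""
    x_count = support_count(x, transactions)
    if x_count == 0:
        raise ValueError("X does not occur in any transaction")
    union = frozenset(x) | frozenset(y)
    return support_count(union, transactions) / x_count


# Toy documents represented as sets of terms (illustrative data only).
docs = [
    frozenset({"text", "clustering", "hadoop"}),
    frozenset({"text", "clustering"}),
    frozenset({"text", "genetic"}),
    frozenset({"hadoop", "genetic"}),
]

# {"text", "clustering"} occurs in 2 of the 4 documents.
print(support({"text"}, {"clustering"}, docs))
# Of the 3 documents containing "text", 2 also contain "clustering".
print(confidence({"text"}, {"clustering"}, docs))
```

Here each document plays the role of a transaction $t_i$ and each term the role of an item, which is one plausible way to apply the definitions to text data.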

Frequent itemset mining is divided into two steps. First, we need to find the minimal frequent itemsets that satisfy the co-occurrence relation (that is, the frequent binomial itemsets). Then we keep those frequent itemsets that meet the minimum confidence threshold. Following the same process used for the frequent binomial itemsets, we can mine the trinomial itemsets from the frequent binomial itemsets. Then we need to screen the trinomial itemsets
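The step from binomial (2-item) to trinomial (3-item) itemsets can be sketched in Python as a standard Apriori-style level-wise search. This is an assumption-laden illustration, not the authors' procedure: it screens candidates with a minimum support count (`min_count`) rather than the confidence threshold mentioned above, and all function names and sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def count_support(candidates: Iterable[Itemset],
                  transactions: List[Itemset]) -> Dict[Itemset, int]:
    """Count, for each candidate, the transactions that contain it."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def frequent_pairs(transactions: List[Itemset], min_count: int) -> Dict[Itemset, int]:
    """Frequent binomial (2-item) itemsets with their support counts."""
    items = sorted(set().union(*transactions))
    candidates = {frozenset(p) for p in combinations(items, 2)}
    counts = count_support(candidates, transactions)
    return {c: n for c, n in counts.items() if n >= min_count}


def frequent_triples(pairs: Dict[Itemset, int],
                     transactions: List[Itemset],
                     min_count: int) -> Dict[Itemset, int]:
    """Grow trinomial (3-item) candidates from frequent pairs, then screen them."""
    candidates = set()
    for a, b in combinations(pairs, 2):
        union = a | b
        # Keep a 3-item union only if all of its 2-item subsets are frequent.
        if len(union) == 3 and all(
            frozenset(s) in pairs for s in combinations(union, 2)
        ):
            candidates.add(union)
    counts = count_support(candidates, transactions)
    return {c: n for c, n in counts.items() if n >= min_count}


# Toy documents represented as sets of terms (illustrative data only).
docs = [
    frozenset({"text", "clustering", "hadoop"}),
    frozenset({"text", "clustering", "hadoop"}),
    frozenset({"text", "clustering"}),
    frozenset({"hadoop", "genetic"}),
]

pairs = frequent_pairs(docs, min_count=2)
triples = frequent_triples(pairs, docs, min_count=2)
print(pairs)
print(triples)
```

Building 3-item candidates only from pairs that are already frequent keeps the candidate set small, since an itemset cannot be frequent unless all of its subsets are.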

Proceedings of the 36th Chinese Control Conference, July 26-28, 2017, Dalian, China
