Research on Distributed Text Clustering Based on Frequent Itemset
Wenchuan Yang^1, Qiwei Wu^{1,2}, Zishuai Cheng^2
School of Network Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
E-mail: 876196774@qq.com
Abstract: Text clustering, an important field of natural language processing, is a key technology for processing and organizing
massive text data. In the era of big data, however, the sheer volume of data poses great challenges to both the running time and
the accuracy of text clustering. This paper focuses on the speed and precision of text clustering, combining genetic algorithms,
feedback and distributed computing. A distributed text clustering method based on frequent itemsets is proposed. The
experimental results show that it finds the global optimal centers more efficiently and makes the clustering more accurate.
Key Words: Text clustering, Frequent Itemset, Correlation analysis, Hadoop
1 Introduction
Text clustering is a key technology for processing and
organizing document data, and it plays a vital role in the
field of natural language processing. With the rapid
development of modern information technology, and
especially the rising popularity of electronic publications,
all scientific publications can now take the form of
electronic information. Most of the information in published
articles is stored and displayed as text. Faced with this vast
amount of scientific literature, achieving rapid and accurate
text clustering is important not only for information access
and document classification, but also for fields such as
recommender systems, collaborative filtering, search engines
and natural language recognition [1]. Improvements in text
clustering technology will also drive innovation in
information processing technology.
Nowadays, information technology is expanding rapidly, and
a variety of interdisciplinary scientific and technological
inventions are appearing in industry. The traditional
clear-cut classification of disciplines can no longer
accommodate the current boundaries between the liberal arts,
science and technology. Moreover, because knowledge now
penetrates across disciplines, traditional subject
classification by keyword filtering fails to achieve the
desired effect. To meet present requirements, document filing
and discipline division must follow the dynamic development
of science and technology in order to serve the needs of
industry. An effective clustering algorithm for massive
amounts of text is therefore urgently needed.
As the application scenarios of text clustering change, the
technology retains good development prospects but also faces
challenges that remain to be solved. We must therefore design
a text clustering algorithm that computes clustering
similarity efficiently and takes into account the computation
speed on massive text, in order to meet current demand in
this field. This paper designs a clustering algorithm for
massive amounts of text based on frequent itemsets, which is
a new kind of distributed text clustering algorithm. Drawing
on the idea of correlation analysis, it enhances the
efficiency and effectiveness of clustering. Distributed
parallel clustering and the use of frequent itemsets in text
clustering are likely to become a research focus in this area
in the future.
*
This paper is supported by the National Natural Science Foundation of
China (No. 61571064, 61471060, 61370176).
2 Technical Analysis
2.1 Frequent itemset rules
Frequent itemset mining is mainly applied in the field of
correlation analysis. Its main purpose is to find potential
connections and co-occurrence relationships in a large amount
of data. Theoretically, association rules describe, through
quantitative statistics, the co-occurrence of different
elements in the same type of event. The basic theory is
described as follows:
Definition 1: Itemset
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items and
$T = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions. Each
transaction $t_i$ is a set of items such that $t_i \subseteq I$.
An association rule is an implication of the form
$X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and
$X \cap Y = \emptyset$. $X$ (or $Y$) is a set of items, called
an itemset [2].
Definition 2: Support
If an itemset $X$ is a subset of a transaction $t_i \in T$, we
say that $t_i$ contains $X$. The support count of $X$ in $T$
(denoted $X.\mathrm{count}$) is the number of transactions in
$T$ that contain $X$. The support of the rule $X \rightarrow Y$
is the percentage of transactions in $T$ that contain
$X \cup Y$:
$$\mathrm{support} = \frac{(X \cup Y).\mathrm{count}}{n}$$
where $n$ is the number of transactions in $T$.
Definition 3: Confidence
The confidence of the rule $X \rightarrow Y$ is the percentage
of the transactions in $T$ containing $X$ that also contain
$X \cup Y$:
$$\mathrm{confidence} = \frac{(X \cup Y).\mathrm{count}}{X.\mathrm{count}}$$
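The support and confidence definitions can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`support_count`, `support`, `confidence`) and the toy transactions are hypothetical and chosen only for illustration. Each transaction is modelled as a set of terms.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset` (X.count)."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Support of the rule X -> Y: (X u Y).count / n."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Confidence of the rule X -> Y: (X u Y).count / X.count."""
    x_count = support_count(x, transactions)
    if x_count == 0:
        # Assumption of this sketch: a rule whose antecedent never occurs gets confidence 0.
        return 0.0
    return support_count(set(x) | set(y), transactions) / x_count


# Hypothetical toy data: each transaction is the set of terms in one document.
transactions: List[Transaction] = [
    frozenset({"text", "clustering", "hadoop"}),
    frozenset({"text", "clustering"}),
    frozenset({"text", "mining"}),
    frozenset({"hadoop", "mapreduce"}),
]

rule_support = support({"text"}, {"clustering"}, transactions)
rule_confidence = confidence({"text"}, {"clustering"}, transactions)
```

In this toy data, {text, clustering} occurs in 2 of the 4 transactions, so the rule text -> clustering has support 2/4; "text" occurs in 3 transactions, so its confidence is 2/3.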
Frequent itemset mining is divided into two processes. First,
we need to find the smallest frequent itemsets that satisfy
the co-occurrence relation (that is, the frequent binomial
itemsets, which contain two items). Then we retain those
frequent itemsets that meet the minimum confidence threshold.
Following the mining process for the frequent binomial
itemsets, we can mine the trinomial (three-item) itemsets
from the frequent binomial itemsets. Then we need to screen
the trinomial itemsets
Proceedings of the 36th Chinese Control Conference, July 26-28, 2017, Dalian, China