
A New Web Search Result Clustering based on True Common Phrase Label
Discovery
Jongkol Janruang * and Worapoj Kreesuradej**
Faculty of Information Technology
King Mongkut’s Institute of Technology Ladkrabang
Bangkok, 15320 Thailand
Email: tawan48@gmail.com*
and worapoj@it.kmitl.ac.th**
Abstract
Web search result clustering helps users navigate search results. A correct cluster label is therefore important, since it indexes a set of web documents. Suffix Tree Clustering (STC) clusters and labels search results quickly and automatically. However, STC is inadequate because its use of the n-gram technique produces interrupted cluster labels. In this paper, we propose an approach for web search result clustering and labeling based on a new suffix tree data structure and a new base cluster combining algorithm with a new partial phrase join operation. The algorithm for constructing the data structure is incremental and runs in linear time. Thus, the proposed approach is suitable for on-the-fly clustering and labeling of web search results. The proposed approach provides cluster labels that are more readable and are true common phrases of the web document clusters, compared with conventional web search result clustering. Experimental results also show that the proposed approach performs better than conventional web search result clustering.
Keywords: web search result clustering, incremental clustering, content-based combining, new suffix tree.
1. Introduction
Several web search result clustering approaches follow the description-comes-first concept, such as those using the Lingo algorithm [1, 2, 3], SHOC [4] and FIHC [5]. However, these are not incremental clustering algorithms.
Unlike these algorithms, the suffix tree clustering (STC) algorithm for clustering search results is incremental. Web search result clustering based on STC is therefore a promising approach for handling the long lists of snippets returned by search engines. The original STC algorithm can often construct a long path in the suffix tree, particularly when identical snippets are fed to it [7, 8, 9, 10, 11, 12]. Hau-Jun Zeng et al. [12] introduced an improved suffix tree with n-grams to deal with this problem of the original suffix tree. However, a suffix tree with n-grams can discover only partial common phrases when the n-gram length is shorter than the length of the true common phrases. For example, given the true common phrase “President William Jefferson Clinton”, a suffix tree with 2-grams can discover only the partial common phrases “President William”, “William Jefferson” and “Jefferson Clinton.” In this case, STC with n-grams produces too many base clusters. In addition, a cluster label obtained from STC with n-grams cannot be a true common phrase when the n-gram length is shorter than the length of the true common phrase.
This paper proposes a new approach for web search result clustering to deal with these problems. The new approach still uses a suffix tree with n-grams. However, it also introduces a new base cluster combining technique with a new partial phrase join operation to find true common phrases. The new approach yields higher precision and cluster labels that are true common phrases of the web document clusters, compared with approaches based on the previous STC algorithms.
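The partial phrase join operation is not defined at this point in the paper, so the following is only an illustrative guess at the general idea: two partial phrases are joined when the trailing words of one equal the leading words of the next. The names join_phrases and join_chain, and the overlap rule itself, are assumptions and should not be read as the authors' operation; in particular, the sketch ignores which documents each base cluster contains.

```python
from typing import Optional


def join_phrases(left: str, right: str) -> Optional[str]:
    """Join two phrases on their longest proper word overlap.

    Assumed rule (not from the paper): the last k words of `left`
    must equal the first k words of `right`, for some k >= 1.
    Returns None when no such overlap exists.
    """
    lw, rw = left.split(), right.split()
    # Try the longest candidate overlap first.
    for k in range(min(len(lw), len(rw)) - 1, 0, -1):
        if lw[-k:] == rw[:k]:
            return " ".join(lw + rw[k:])
    return None


def join_chain(partials: list[str]) -> Optional[str]:
    """Fold join_phrases over an ordered list of partial phrases."""
    if not partials:
        return None
    joined = partials[0]
    for nxt in partials[1:]:
        merged = join_phrases(joined, nxt)
        if merged is None:
            return None
        joined = merged
    return joined


partials = ["President William", "William Jefferson", "Jefferson Clinton"]
label = join_chain(partials)
print(label)  # President William Jefferson Clinton
```

Under this assumed rule, the three 2-gram labels from the introduction's example collapse back into the single four-word phrase, while phrases with no shared boundary words are left unjoined.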
2. A New Web Search Result Clustering
based on True Common Phrase Label
Discovery
Clustering web search results means grouping each snippet returned by a search engine with others that share common content, and generating new cluster labels from the interrupted cluster labels produced by the n-gram technique. The new approach consists of four phases; the algorithm is given in Figure 1.
International Conference on Computational Intelligence for Modelling
Control and Automation,and International Conference on
Intelligent Agents,Web Technologies and Internet Commerce (CIMCA-IAWTIC'06)
0-7695-2731-0/06 $20.00 © 2006