A New Web Search Result Clustering Based on True Common Phrases


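In an era of information overload, search engines have become one of the main tools people use to obtain knowledge and information. Managing and organizing search results effectively matters to users because it directly affects retrieval efficiency. This paper examines a new method for clustering web search results, "A New Web Search Result Clustering based on True Common Phrase Label". Conventional search result clustering methods such as Suffix Tree Clustering (STC) cluster and assign labels quickly and automatically, but they have one main problem: because they rely on the n-gram technique, they can produce interrupted cluster labels. Such interrupted labels can make it hard for users to see how the documents are actually related, which degrades the user experience.

To solve this problem, the researchers propose two changes. First, they design a new suffix tree data structure whose construction algorithm is linear-time and incremental, which makes it suitable for real-time clustering of web search results. This design lets the system stay efficient on large volumes of data while adapting to changing search requests. Second, they introduce a new base cluster combining algorithm with a novel partial phrase join operation. The method aims to find true common phrases, that is, key phrases that occur frequently in the documents and accurately reflect their topic or content, and to use them as cluster labels. Labels produced this way are more intuitive and meaningful, and help users find the information they want faster.

Compared with the conventional n-gram technique, the proposed algorithm improves the consistency and accuracy of labels while preserving clustering speed. With true common phrases as cluster labels, users can locate relevant search results quickly and better understand how the results relate to one another. The paper thus offers a finer-grained and more user-friendly strategy for web search result clustering, using a new data structure and algorithm to improve how search results are organized and presented, and thereby to raise retrieval efficiency and user satisfaction. For search engine designers and developers this is a step forward. Future research may explore combining these techniques with deep learning and natural language processing to provide a more intelligent, personalized search experience.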


A New Web Search Result Clustering based on True Common Phrase Label Discovery

Jongkol Janruang * and Worapoj Kreesuradej**

Faculty of Information Technology

King Mongkut’s Institute of Technology Ladkrabang

Bangkok, 15320 Thailand

Email: tawan48@gmail.com* and worapoj@it.kmitl.ac.th**

Abstract

Web search results clustering serves as a navigator that guides users through search results. A correct cluster label is therefore important, because it indexes a set of web documents. Suffix Tree Clustering (STC) clusters and labels quickly and automatically. However, STC is inadequate because it generates interrupted cluster labels when the n-gram technique is used. In this paper, we propose an approach for web search results clustering and labeling based on a new suffix tree data structure and a new base cluster combining algorithm with a new partial phrase join operation. The algorithm for constructing the data structure is incremental and runs in linear time. The proposed approach is thus suitable for on-the-fly clustering and labeling of web search results. It provides more readable, true common phrases for web document clusters than conventional web search result clustering. Experimental results also show that the proposed approach performs better than conventional web search result clustering.

Keywords: web search results clustering, incremental clustering, content-based combining, a new suffix tree.

1. Introduction

Several web search results clustering approaches follow the description-comes-first concept, such as those using the Lingo algorithm [1, 2, 3], SHOC [4] and FIHC [5]. None of these is an incremental clustering algorithm.

Unlike those algorithms, the suffix tree clustering (STC) algorithm, which is an algorithm for clustering search results, is incremental. Web search result clustering based on STC is therefore a promising approach for working on a long list of snippets returned by search engines. The original STC algorithm can often construct a long path in the suffix tree, particularly when the same snippets are fed to it [7, 8, 9, 10, 11, 12]. Hua-Jun Zeng et al. [12] introduced an improved suffix tree with n-grams to deal with this problem of the original suffix tree. However, the suffix tree with n-grams can discover only partial common phrases when the n-gram length is shorter than the length of the true common phrases. For example, given the true common phrase “President William Jefferson Clinton”, a suffix tree with 2-grams can discover only the partial common phrases “President William”, “William Jefferson” and “Jefferson Clinton”. In this case, STC with n-grams produces too many base clusters. In addition, a cluster label obtained from STC with n-grams cannot be a true common phrase when the n-gram length is shorter than the length of the true common phrase.
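The fragmentation described above can be illustrated with a minimal Python sketch. It is not the suffix tree from [12]; it only enumerates the word-level n-grams of a phrase, and the helper name `word_ngrams` is a hypothetical one chosen for this illustration.

```python
def word_ngrams(phrase, n):
    """Return all contiguous word-level n-grams of `phrase`, in order."""
    words = phrase.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# The true common phrase used as the example in the text.
true_phrase = "President William Jefferson Clinton"

# With n = 2, only two-word fragments are visible, never the full phrase.
bigrams = word_ngrams(true_phrase, 2)
print(bigrams)
# ['President William', 'William Jefferson', 'Jefferson Clinton']
print(true_phrase in bigrams)
# False
```

Each of the three fragments would be a candidate for its own base cluster, which is why a short n-gram length inflates the number of base clusters and prevents the full phrase from appearing as a label.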

This paper proposes a new approach for web search result clustering to deal with these problems. The new approach still uses a suffix tree with n-grams. However, it also introduces a new base cluster combining technique with a new partial phrase join operation to find the true common phrase. The new approach provides higher precision and truer common phrases for web document clusters than approaches based on the previous STC algorithms.
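The excerpt does not define the join operation, so the following Python sketch is only one plausible reading, under a stated assumption: two partial phrases are joined when the last words of one equal the first words of the next. The function names `join_phrases` and `join_chain` and the `overlap` parameter are hypothetical, and the sketch ignores the document sets of the base clusters, which the actual combining technique presumably takes into account.

```python
def join_phrases(left, right, overlap=1):
    """Join two phrases if the last `overlap` words of `left` equal the
    first `overlap` words of `right`; return None otherwise.
    (Assumed joining rule, not taken from the paper.)"""
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    lw, rw = left.split(), right.split()
    if lw[-overlap:] == rw[:overlap]:
        return " ".join(lw + rw[overlap:])
    return None


def join_chain(fragments, overlap=1):
    """Join an ordered list of partial phrases left to right, stopping at
    the first fragment that does not overlap with the phrase built so far."""
    current = fragments[0]
    for frag in fragments[1:]:
        joined = join_phrases(current, frag, overlap)
        if joined is None:
            break
        current = joined
    return current


# 2-gram fragments overlap by one word (n - 1).
fragments = ["President William", "William Jefferson", "Jefferson Clinton"]
print(join_chain(fragments))
# President William Jefferson Clinton
```

For n-grams of length n, consecutive fragments of the same phrase overlap by n - 1 words, which is why `overlap=1` fits the 2-gram example.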

2. A New Web Search Result Clustering based on True Common Phrase Label Discovery

Clustering web search results means grouping each snippet returned by search engines with others that share common content, and generating new cluster labels from the cluster labels that the n-gram technique has interrupted. The new approach is composed of four phases; the algorithm is given in Figure 1.

International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06)

