XML Keyword Search and Relevance Ranking


"Relevance Ranking在XML搜索中的应用与挑战" XML（可扩展标记语言）是一种用于存储和传输结构化数据的标准格式，特别是在web服务和大数据处理中广泛应用。随着信息检索（IR）技术在网页搜索上的成功，XML数据库也开始采用关键词搜索作为查询手段。然而，XML数据库与传统的文本数据库存在显著差异，这带来了三个主要挑战： 1. **识别用户搜索意图**：XML数据具有层次结构，用户可能希望根据节点类型进行搜索。因此，理解用户是想查找特定类型的XML节点（如元素、属性等），还是想通过这些节点进行搜索，是一项关键任务。 2. **解决关键词歧义问题**：在XML文档中，一个关键词可能同时作为标签名和文本值出现，也可能在不同类型的节点中具有不同的含义。例如，"apple"可能代表一个商品名称，也可能指代水果。解析这种歧义以提供准确的结果是一项复杂的工作。 3. **评估子树的相关性**：由于搜索结果通常以XML文档的子树形式返回，因此需要新的评分函数来评估这些子树相对于查询的关联度。传统的IR方法往往无法有效地处理这种结构性数据的复杂性。针对这些问题，论文"Effective XML Keyword Search with Relevance Oriented Ranking"提出了一个信息检索风格的方法。该方法旨在通过以下方式改进XML搜索的质量： - **利用上下文信息**：通过分析关键词出现的上下文，比如其所在节点的位置和结构，以更好地理解用户的搜索意图。 - **处理关键词歧义**：可能使用词性标注和语义分析来区分同一关键词在不同上下文中的含义，从而减少歧义。 - **开发新的相关性评分机制**：设计一种新的评分函数，它考虑了XML文档的结构特性，包括节点的位置、深度以及与查询关键词的关系，以确定子树的相关性。该方法的目标是提高查询结果的关联度，从而提升用户对搜索结果满意度。通过这些策略，论文作者期望能够克服现有方法的局限，提供更高质量的XML搜索体验。

The main contributions are summarized as follows:

1) This is the first work that addresses the keyword ambiguity problem. We also identify three crucial issues that an effective XML keyword search engine should address.
2) We define our own XML TF (term frequency) and XML DF (document frequency), which are the cornerstones of all formulae proposed later.
3) We propose three important guidelines for identifying the user-desired search for node type, and design a formula to compute the confidence level of a certain node type being a desired search for node based on these guidelines.
4) We design formulae to compute the confidence of each candidate node type as the desired search via node to model natural human intuitions, in which we take into account the pattern of keyword co-occurrence in the query.
5) We propose a novel relevance oriented ranking scheme called XML TF*IDF similarity, which captures the hierarchical structure of XML, resolves Ambiguity 1 and Ambiguity 2 in a heuristic way, and distinguishes the similarity computation for leaf nodes from that for internal nodes in XML data. Moreover, our approach is able to handle both semi-structured and unstructured data.
6) We implement the proposed techniques in a keyword search engine prototype called XReal. Extensive experiments show its effectiveness, efficiency and scalability.

The rest of the paper is organized as follows. We present the related work in Section II, and preliminaries on IR and the data model in Section III. Section IV infers the user search intention, and Section V discusses relevance oriented ranking. Section VI presents the search algorithms. Experimental evaluation is given in Section VII, and we conclude in Section VIII.

II. RELATED WORK

Extensive research efforts have been conducted in XML keyword search to find the smallest sub-structures in XML data that each contain all query keywords, in either the tree data model or the directed graph (i.e., digraph) data model.

In the tree data model, LCA (lowest common ancestor) semantics was first proposed and studied in [8], [2] to find XML nodes, each of which contains all query keywords within its subtree. Subsequently, SLCA (smallest LCA) [9], [3] was proposed to find the smallest LCAs, i.e., those that do not contain other LCAs in their subtrees. GDMCT (minimum connecting trees) [5] excludes the subtrees rooted at the LCAs that do not contain query keywords. Sun et al. [10] generalize SLCA to support keyword search involving combinations of AND and OR boolean operators. XSeek [4] generates the return nodes that can be explicitly inferred from the keyword match pattern and the concept of entities in XML data. However, it addresses neither the ranking problem nor the keyword ambiguity problem. Besides, it relies on the concept of entity (i.e., object class) and considers a node type t in the DTD as an entity if t is “*”-annotated in the DTD. As a result, customer, phone, interest and book in Figure 1 are identified as object classes by XSeek. However, this causes multi-valued attributes to be mistakenly identified as entities, making the inferred return node less intuitive; e.g., phone and interest are not intuitive as entities. In fact, the identification of entities depends heavily on the semantics of the underlying database rather than on its DTD, so it usually requires verification and a decision from the database administrator. Therefore, the adoption of entities for keyword search should be optional, although the concept is very useful.
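The LCA and SLCA semantics described above can be illustrated with a small brute-force sketch. It assumes each XML node is identified by a Dewey label (a tuple of child positions from the root), so that the lowest common ancestor of several nodes is the longest common prefix of their labels. The function names and the Dewey encoding are illustrative assumptions, not the algorithms of [8], [2], [9] or [3], which rely on far more efficient index-based methods.

```python
from itertools import product


def common_prefix(labels):
    """Longest common prefix of Dewey labels, i.e. the label of their LCA."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def lca_nodes(match_lists):
    """LCAs of every combination that takes one matching node per keyword.

    match_lists holds one list of Dewey labels per query keyword
    (hypothetical input format chosen for this sketch).
    """
    return {common_prefix(combo) for combo in product(*match_lists)}


def slca_nodes(match_lists):
    """Keep only the LCAs that have no other LCA inside their subtree."""
    lcas = lca_nodes(match_lists)

    def is_proper_ancestor(a, b):
        return len(a) < len(b) and b[:len(a)] == a

    return {a for a in lcas if not any(is_proper_ancestor(a, b) for b in lcas)}


# Two keywords, each matched under both children of the root (0,).
matches = [
    [(0, 0, 0), (0, 1, 0)],     # nodes containing keyword 1
    [(0, 0, 1), (0, 1, 1, 0)],  # nodes containing keyword 2
]
all_lcas = lca_nodes(matches)
smallest = slca_nodes(matches)
```

In this toy input the root (0,) is an LCA because it connects matches from different branches, but it is not an SLCA, since the nodes (0, 0) and (0, 1) below it already cover both keywords.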

In the digraph data model, previous approaches are heuristics-based, as the reduced tree problem on graphs is as hard as NP-complete problems. Li et al. [11] show the reduction from the minimal reduced tree problem to the NP-complete Group Steiner Tree problem on graphs. BANKS [12] uses bidirectional expansion heuristic algorithms to search as small a portion of the graph as possible. BLINKS [13] proposes a bi-level index to prune and accelerate searching for top-k results in digraphs. Cohen et al. [14] study the computational complexity of interconnection semantics. XKeyword [15] provides keyword proximity search that conforms to an XML schema; however, it needs to compute candidate networks and thus is constrained by schemas.

On the issue of result ranking, XRANK [2] extends Google’s PageRank to the XML element level to rank among the LCA results, but no empirical study is done to show the effectiveness of its ranking function. XSEarch [1] adopts a variant of LCA, and combines a simple tf*idf IR ranking with the size of the tree and the node relationship to rank results; however, it requires users to know the XML schema information, which limits query flexibility. EASE [16] combines IR ranking and structural compactness based DB ranking to fulfill keyword search on heterogeneous data. Regarding ranking methods, TF*IDF similarity [7], which was originally designed for flat document retrieval, is insufficient for XML keyword search due to XML’s hierarchical structure and the presence of Ambiguity 1 and Ambiguity 2. Several proposals for XML information retrieval suggest extending the existing XML query languages [17], [18], [19] or using XML fragments [20] to explicitly specify the search intention for result retrieval and ranking.

III. PRELIMINARIES

A. TF*IDF cosine similarity

TF*IDF (Term Frequency * Inverse Document Frequency) similarity is one of the most widely used approaches to measuring the relevance between keywords and a document in keyword search over flat documents. We first review its basic idea, then address its limitations for keyword search in XML. The main idea of TF*IDF is summarized in the following three rules.

• Rule 1: A keyword appearing in many documents should not be regarded as being more important than a keyword appearing in a few.
• Rule 2: A document with more occurrences of a query keyword should not be regarded as being less important for that keyword than a document that has fewer.
• Rule 3: A normalization factor is needed to balance between long and short documents, as Rule 2 discriminates against short documents, which may have less chance to contain more occurrences of keywords.

To combine the intuitions in the above three rules, the TF*IDF similarity is designed:
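The formula itself does not survive in this excerpt. As a rough illustration of how the three rules combine, the following Python sketch implements one common TF*IDF cosine formulation for flat documents. The specific weights used here, ln(1 + N/df) for the query side and 1 + ln(tf) for the document side, as well as the function name and the tokenized-list input format, are assumptions made for this sketch and may differ from the formula in the paper.

```python
import math
from collections import Counter


def tfidf_cosine(query, doc, corpus):
    """Score one tokenized document against a tokenized query.

    corpus is the list of all tokenized documents; it supplies the
    document frequencies (assumed input format for this sketch).
    """
    n_docs = len(corpus)
    doc_tf = Counter(doc)
    query_terms = set(query)

    def idf(term):
        # Rule 1: a term found in many documents gets a smaller weight.
        df = sum(1 for d in corpus if term in d)
        return math.log(1 + n_docs / df) if df else 0.0

    def tf_weight(count):
        # Rule 2: more occurrences weigh more, dampened by the logarithm.
        return 1.0 + math.log(count)

    dot = sum(idf(t) * tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

    # Rule 3: normalize by the query and document vector lengths so that
    # long documents are not favored simply for containing more terms.
    w_q = math.sqrt(sum(idf(t) ** 2 for t in query_terms))
    w_d = math.sqrt(sum(tf_weight(c) ** 2 for c in doc_tf.values()))
    return dot / (w_q * w_d) if w_q and w_d else 0.0


corpus = [
    ["xml", "keyword", "search"],
    ["xml", "xml", "ranking"],
    ["database", "index"],
]
query = ["xml", "ranking"]
scores = [tfidf_cosine(query, d, corpus) for d in corpus]
```

With this toy corpus, the second document scores highest because it contains both query terms, including the rarer one, while the third document contains neither and scores zero.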


Authorized licensed use limited to: Dalian University of Technology. Downloaded on March 10,2010 at 04:23:52 EST from IEEE Xplore. Restrictions apply.
