XDist: An Improved XML Keyword Search System That Uses Distribution-Based Re-ranking to Boost Performance


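"XDist: An Efficient XML Search System Based on Keyword Distribution" is a research paper published in Science China Information Sciences that examines how to improve keyword search over XML data. XML data has a complex structure, and users usually want to retrieve information by typing a few simple keywords, but the inherent ambiguity of keyword queries makes it hard to pinpoint the relevant results. The paper aims to improve both the accuracy and the efficiency of the search system. It targets a limitation of widely used statistical ranking methods such as TF-IDF and BM25: they rely mainly on term frequency, inverse document frequency, and document length, and ignore how the different keywords are distributed and related to one another.

The authors propose a new search system, XDist, built on a two-stage strategy. First, the semantic query model MAXLCA (maximal lowest common ancestor) identifies the candidate results of a query. MAXLCA takes into account the context of the keywords within the XML document structure, which narrows the search space. These candidates are then ranked with BM25, a classic probabilistic ranking function that relies on collection-wide statistics. What sets XDist apart is the subsequent re-ranking step, which uses a combined distribution measurement (CDM). CDM combines four measures: 1) term proximity, which captures how close together the keywords appear; 2) the intersection of keyword classes, which emphasizes the categories the keywords share; 3) the degree of integration among keywords, which captures their joint relevance; 4) the variance of keyword counts, which reflects how evenly the keywords are distributed. The weights of these four measures are not fixed. They are learned with a machine learning method, specifically a listwise learning approach, so that they adapt to different queries and data characteristics.

The purpose of re-ranking is to refine the order produced by BM25 according to how the keywords are actually distributed and how they relate to one another. The approach was evaluated on the INEX evaluation platform, where CDM re-ranking significantly improved search performance, particularly on the iP[0.01] metric, reducing both false positives and missed results and improving the quality and precision of the retrieved results. By combining semantic analysis, statistical ranking, and distribution measurements, XDist offers an effective XML keyword search solution for large, complex XML collections where users need efficient and accurate search.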
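The ranking and re-ranking steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `k1 = 1.2` and `b = 0.75` are common BM25 defaults, the function names are hypothetical, and the four feature values and weights are made-up placeholders, because this summary does not give the formulas for the four CDM measures.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    terms = set(query)
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def rerank(ranked_ids, features, weights, top_k):
    """Re-order only the first top_k ids by a weighted sum of feature values."""
    def combined(doc_id):
        return sum(w * f for w, f in zip(weights, features[doc_id]))
    head = sorted(ranked_ids[:top_k], key=combined, reverse=True)
    return head + ranked_ids[top_k:]

docs = [
    ["xml", "keyword", "search"],
    ["xml", "schema"],
    ["keyword", "search", "ranking", "search"],
]
scores = bm25_scores(["keyword", "search"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Hypothetical values for the four distribution measures of each result,
# and hypothetical weights standing in for the learned ones.
features = {0: (0.9, 0.8, 0.7, 0.9), 1: (1.0, 1.0, 1.0, 1.0), 2: (0.2, 0.1, 0.3, 0.2)}
weights = (0.4, 0.2, 0.2, 0.2)
reranked = rerank(ranked, features, weights, top_k=2)
```

Only the head of the list is re-ordered, so a result outside the top `top_k` keeps its BM25 position regardless of its feature values.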

Gao N, et al. Sci China Inf Sci May 2014 Vol. 57 052107:4

extends the PageRank hyperlink metric to XML ranking. Apparently, PageRank does not deal with the ambiguity problem. In XRANK, a node in the XML tree is designated as a result node only if it contains at least one occurrence of each keyword in its subtree, after excluding the nodes among its descendants that already contain all the keywords. The formal definition is as follows:

Definition 1. Given a keyword query $Q = \{k_1, \ldots, k_q\}$ and an XML tree Xtree, we assume that $V_i$ ($1 \le i \le q$) is the set of nodes that directly contain keyword $k_i$ in Xtree. LCA(Q, Xtree) is defined as

$$\mathrm{LCA}(Q, Xtree) = \{\, n \mid \exists (v_1 \in V_1, \ldots, v_q \in V_q),\ n \text{ is the lowest common ancestor of } \{v_1, \ldots, v_q\} \,\}.$$
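Definition 1 can be illustrated with a brute-force sketch. It assumes each node is identified by a Dewey label, that is, a tuple giving the path of child positions from the root, so the lowest common ancestor of a set of nodes is their longest common prefix. The labeling scheme, the function names, and the sample node sets are assumptions made for illustration and do not come from the paper.

```python
from itertools import product

def common_prefix(labels):
    """Lowest common ancestor of nodes given as Dewey labels (tuples)."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def lca_set(keyword_nodes):
    """All LCAs obtained by picking one node per keyword (V_1 x ... x V_q)."""
    return {common_prefix(combo) for combo in product(*keyword_nodes)}

# Hypothetical example: V1 holds the nodes containing the first keyword,
# V2 the nodes containing the second keyword.
V1 = [(0, 0, 1), (0, 1, 0)]
V2 = [(0, 0, 2), (0, 1, 1)]
lca_nodes = lca_set([V1, V2])
```

Enumerating every combination is exponential in the number of keywords, so this is only meant to make the definition concrete, not to suggest how a search engine would compute it.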

Definition 2. Given a keyword query $Q = \{k_1, \ldots, k_q\}$ and an XML tree Xtree, XRANK(Q, Xtree) is defined as follows: $\mathrm{XRANK}(Q, Xtree) = \{\, n \mid n^{*} \in \mathrm{LCA}(Q, Xtree) \,\}$, where $n^{*}$ is node $n$ after excluding all its descendant nodes which belong to the LCA node set.
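One way to read the XRANK semantics is the sketch below, which follows the prose description (a node must still contain every keyword once descendants that already contain all keywords are excluded). It is an interpretation under assumptions, not XRANK's actual algorithm: the tree is given as a hypothetical `children` mapping, `direct` lists the keywords each node contains directly, and the node names are invented.

```python
def xrank_results(children, direct, query, root):
    """Nodes that still cover every keyword after excluding full-match descendants."""
    query = set(query)
    results = []

    def visit(node):
        # Returns the query keywords found anywhere in the subtree of `node`.
        subtree = set(direct.get(node, ())) & query
        remaining = set(subtree)
        for child in children.get(node, ()):
            child_kw = visit(child)
            subtree |= child_kw
            if child_kw != query:
                # The child does not contain all keywords, so it is not excluded.
                remaining |= child_kw
        if remaining == query:
            results.append(node)
        return subtree

    visit(root)
    return results

# Hypothetical tree: "a" contains both keywords through its children, and the
# root "r" still covers both through its own text and through "b".
children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
direct = {"r": ["search"], "a1": ["xml"], "a2": ["search"], "b1": ["xml"]}
results = xrank_results(children, direct, ["xml", "search"], "r")
```

In this example both `a` and `r` qualify, because `r` still sees both keywords after the subtree of `a` is set aside.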

XKSearch was put forward by Xu et al. [15]. XKSearch defines SLCA as its query semantic model. For a query, a node in the XML tree is considered a result node in SLCA only if it contains at least one occurrence of each keyword in its subtree, and none of its descendants does. However, XKSearch does not address the ranking problem. The formal definition of SLCA is as follows:

Definition 3. Given a keyword query $Q = \{k_1, \ldots, k_q\}$ and an XML tree Xtree, SLCA(Q, Xtree) is defined as

$$\mathrm{SLCA}(Q, Xtree) = \{\, n \mid n \in \mathrm{LCA}(Q, Xtree) \wedge \neg(\exists n' \in \mathrm{LCA}(Q, Xtree),\ n \succ n') \,\},$$

where $n \succ n'$ means $n$ is an ancestor of $n'$.
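Definition 3 amounts to filtering the LCA set down to the nodes that have no other LCA node beneath them. The sketch below assumes nodes are identified by Dewey labels (tuples giving the path of child positions from the root), so that $n \succ n'$ holds exactly when the label of $n$ is a proper prefix of the label of $n'$. The labeling scheme, function names, and sample set are illustrative assumptions.

```python
def is_ancestor(n, m):
    """True if Dewey label n is a proper prefix of Dewey label m."""
    return len(n) < len(m) and m[:len(n)] == n

def slca(lca_nodes):
    """Keep the LCA nodes that are not an ancestor of another LCA node."""
    return {n for n in lca_nodes
            if not any(is_ancestor(n, m) for m in lca_nodes)}

# Hypothetical LCA set: the root (0,) and two of its children.
lca_nodes = {(0,), (0, 0), (0, 1)}
smallest = slca(lca_nodes)
```

Here the root is dropped because both of the other LCA nodes lie below it, which leaves the two smallest subtrees that contain every keyword.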

XSeek was introduced by Liu et al. [16]. In XSeek, nodes in the XML tree are grouped into three categories: entity node, attribute node and connection node.

Definition 4. Entity node: If a node has siblings with the same name, this indicates a many-to-one relationship with its parent node, and the node is considered to represent an entity. E.g. <workshop> and <paper> in Figure 4 are considered entity nodes.

Definition 5. Attribute node: If a node does not have siblings of the same name, and it has one child, which is a value, then it is considered to represent an attribute. E.g. <date>, <title>, <editors> and <author> are defined as attribute nodes.

Definition 6. Connection node: A node is a connection node if it represents neither an entity nor an attribute. E.g. <proceedings> in Figure 4 is a connection node.
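The three categories in Definitions 4 to 6 can be sketched with Python's standard `xml.etree.ElementTree` module. The sample document is hypothetical (Figure 4 is not reproduced in this excerpt), and "has one child, which is a value" is approximated here as "has no child elements and has non-empty text", which is an assumption about how values are stored.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def classify(root):
    """Label every element as 'entity', 'attribute', or 'connection'."""
    labels = {}

    def visit(node, has_same_name_sibling):
        text_only = len(node) == 0 and (node.text or "").strip() != ""
        if has_same_name_sibling:
            labels[node] = "entity"
        elif text_only:
            labels[node] = "attribute"
        else:
            labels[node] = "connection"
        # Count tag names among the children to detect same-name siblings.
        counts = Counter(child.tag for child in node)
        for child in node:
            visit(child, counts[child.tag] > 1)

    visit(root, False)
    return labels

# Hypothetical document loosely modeled on the tags named in the definitions.
doc = ET.fromstring(
    "<proceedings>"
    "<workshop><title>XML Search</title>"
    "<paper><author>A</author></paper>"
    "<paper><author>B</author></paper>"
    "</workshop>"
    "<workshop><title>IR</title></workshop>"
    "</proceedings>"
)
labels = classify(doc)
by_tag = {element.tag: label for element, label in labels.items()}
```

The rules are checked in the order the definitions give them, so a repeated element counts as an entity even when it only holds text.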

Given a keyword query, XSeek scans the candidate result list, and then replaces each non-entity node with the nearest entity node on its path to the root. This process guarantees that each node returned to the user is an entity node. As with XKSearch, the ranking strategy is not discussed in XSeek.
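The replacement step can be sketched as a walk toward the root. The `parent` mapping, the set of entity nodes, and the node names below are hypothetical. Returning `None` when no entity lies on the path is a choice made for this sketch, since the text does not say what XSeek does in that case.

```python
def lift_to_entity(node, parent, entities):
    """Return the node itself if it is an entity, else its nearest entity ancestor."""
    current = node
    while current is not None and current not in entities:
        current = parent.get(current)  # None once the root has been passed
    return current

def lift_all(candidates, parent, entities):
    """Apply the replacement to every candidate result, keeping the order."""
    return [lift_to_entity(n, parent, entities) for n in candidates]

# Hypothetical tree fragment, given as child -> parent links.
parent = {
    "title1": "workshop1",
    "paper1": "workshop1",
    "author1": "paper1",
    "workshop1": "proceedings",
}
entities = {"workshop1", "paper1"}
lifted = lift_all(["title1", "author1", "paper1"], parent, entities)
```

Two candidates can map to the same entity, as `author1` and `paper1` do here, so a caller would likely want to merge duplicates before showing results.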

XReal was put forward by Bao et al. [17]. XReal utilizes the statistics of the underlying XML data to attack the ranking problem. First it identifies the search-for nodes and search-via nodes of a query, and then the search engine ranks the individual matches of all candidate results by using an XML TF*IDF strategy. Nevertheless, XReal is unable to solve the ambiguity problem.

3 Preliminaries

3.1 System framework

Figure 3 describes the framework of our search engine. The inverted index of the data collection is built in the background beforehand. When a user submits a query through the interface, the search engine first retrieves the relevant elements according to the definition of the semantic model maximal lowest common ancestor (MAXLCA). The extracted elements are unordered results of the query. Thus, we use the ranking model BM25 to rank these unordered element results, and the output of this step is a ranked list. To further improve the effectiveness of the ranking module, we re-rank the top several results in the ranked list by distribution measurements. The distribution measurements take into consideration four criteria based on the distribution of keywords, introduced in detail in Section 4, and the weights of these four measurements in the final ranking function are trained by a learning to optimize
