Efficient Keyword Search in XML Databases with the SLCA Algorithm


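The SLCA (Smallest Lowest Common Ancestor) algorithm addresses keyword search in XML databases. The work is by Yu Xu and Yannis Papakonstantinou of the Department of Computer Science and Engineering at UC San Diego, and is presented in the paper "Efficient Keyword Search for Smallest LCAs in XML Databases". The paper models an XML document as a labeled tree and studies how to answer keyword queries over it efficiently.

Keyword search is a user-friendly way to query HTML documents. XML documents have a more complex structure, so the challenge is to find the smallest trees that contain all keywords. Under the SLCA semantics, a result is a tree that contains all keywords and has no subtree that also contains all keywords. This keeps the results compact.

The core contribution of the paper is the Indexed Lookup Eager (IL) algorithm. It exploits key properties of smallest trees and greatly outperforms prior algorithms when the keywords of a query differ significantly in frequency, that is, when some keywords occur far less often in the document than others. For queries whose keywords have similar frequencies, the paper introduces the Scan Eager (SE) variant, which is tuned for that case. The authors analyze both Eager algorithms and evaluate them experimentally against the prior-work Stack algorithm [13], which serves as the baseline. The experiments show the advantage of the Eager algorithms.

For developers and researchers working on XML processing, information retrieval, or database systems, the SLCA algorithms and their optimizations are worth understanding in detail.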

 

   







  

   

Notice that the query result $\mathrm{slca}(S_1, \ldots, S_k) \subseteq \mathrm{lca}(S_1, \ldots, S_k)$ and that $\mathrm{slca}(S_1, \ldots, S_k) = \mathrm{removeAncestor}(\mathrm{lca}(S_1, \ldots, S_k))$, where $\mathrm{removeAncestor}$ removes ancestor nodes from its input.
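As an illustration of what removing ancestor nodes means, the following minimal sketch assumes that each node is identified by its Dewey number stored as a Python tuple (for example `(0, 1, 2)`), so that a node is an ancestor of another exactly when its tuple is a strict prefix. The names `is_ancestor` and `remove_ancestor` are hypothetical and chosen for this sketch only.

```python
def is_ancestor(a, b):
    """True if Dewey number a is a proper ancestor of b (a is a strict prefix of b)."""
    return len(a) < len(b) and b[:len(a)] == a


def remove_ancestor(nodes):
    """Keep only the nodes that are not a proper ancestor of another input node."""
    # Tuple comparison is lexicographic, which matches preorder for Dewey numbers.
    ordered = sorted(set(nodes))
    kept = []
    for i, node in enumerate(ordered):
        # In preorder the descendants of a node follow it directly,
        # so it is enough to look at the next node in the sorted list.
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is None or not is_ancestor(node, nxt):
            kept.append(node)
    return kept


print(remove_ancestor([(0,), (0, 1), (0, 1, 2), (0, 3)]))
```

Sorting first keeps the sketch at one pass over the list after the sort; a pairwise comparison of all nodes would also work but costs quadratic time.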

The function $\mathrm{rm}(v, S)$ computes the right match of $v$ in a set $S$, that is the node of $S$ that has the smallest id that is greater than or equal to $\mathrm{pre}(v)$, the preorder id of $v$; $\mathrm{lm}(v, S)$ computes the left match of $v$ in a set $S$, that is the node of $S$ that has the biggest id that is less than or equal to $\mathrm{pre}(v)$. $\mathrm{rm}(v, S)$ ($\mathrm{lm}(v, S)$) returns null when there is no right (left) match node. The cost of $\mathrm{lm}(v, S)$ ($\mathrm{rm}(v, S)$) is $O(d \log |S|)$ since it takes $O(\log |S|)$ steps (Dewey number comparisons) to find the right (left) match node and the cost of comparing two Dewey numbers is $O(d)$. The function $\mathrm{descendant}(v_1, v_2)$ returns the other argument when one argument is null and returns the descendant node when $v_1$ and $v_2$ have ancestor-descendant relationship. The cost of the function $\mathrm{descendant}$ is $O(d)$.
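The match functions can be sketched with binary search. This is a minimal sketch, assuming that Dewey numbers are Python tuples and that each keyword list is a sorted Python list of such tuples; the function names mirror the text, but the signatures and the use of `None` for null are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right


def rm(v, S):
    """Right match: the smallest node of sorted list S with id >= v, or None."""
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def lm(v, S):
    """Left match: the biggest node of sorted list S with id <= v, or None."""
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def lca(a, b):
    """Lowest common ancestor: the longest common prefix of two Dewey numbers."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def descendant(a, b):
    """Return the other argument if one is None, otherwise the deeper node.

    Assumes a and b lie on one root-to-leaf path, so the longer tuple is the descendant.
    """
    if a is None:
        return b
    if b is None:
        return a
    return a if len(a) >= len(b) else b


S = [(0, 0, 1), (0, 1, 0), (0, 2)]
print(lm((0, 1), S), rm((0, 1), S))
```

Each binary search performs a logarithmic number of tuple comparisons, and each comparison touches at most $d$ components, which is where the $O(d \log |S|)$ cost comes from.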

3. ALGORITHMS FOR FINDING THE SLCA OF KEYWORD LISTS

This section presents the core Indexed Lookup Eager algorithm, its Scan Eager variation and the prior work Stack algorithm [13]. A brute-force solution to the SLCA problem computes the LCAs of all node combinations and then removes ancestor nodes. Its complexity is $O(kd|S_1| \cdots |S_k|)$. Besides being inefficient, the brute-force approach is blocking. After it computes an LCA $v = \mathrm{lca}(v_1, \ldots, v_k)$ for some $v_1 \in S_1, \ldots, v_k \in S_k$, it cannot report $v$ as an answer since there might be another set of $k$ nodes $v'_1, \ldots, v'_k$ such that $v$ is an ancestor of $\mathrm{lca}(v'_1, \ldots, v'_k)$.

The complexity analysis given in this section is for main memory cases. We will give disk access complexity in Section 4 after we discuss the implementation details of how we compress and store keyword lists on disk. In the sequel we choose $S_1$ to be the smallest keyword list since $\mathrm{slca}(S_1, \ldots, S_k) = \mathrm{slca}(S_{j_1}, \ldots, S_{j_k})$, where $j_1, \ldots, j_k$ is any permutation of $1, 2, \ldots, k$, and there is a benefit in using the smallest list as $S_1$ as we will see in the complexity analysis of the algorithms.
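The brute-force solution described above can be sketched as follows. This is an illustrative sketch, assuming Dewey numbers stored as Python tuples; the names `lca`, `is_ancestor` and `brute_force_slca` are hypothetical. It enumerates every node combination, which is why it is both slow and blocking: no result can be reported before all combinations have been seen.

```python
from itertools import product


def lca(*nodes):
    """Lowest common ancestor of several Dewey numbers: their longest common prefix."""
    prefix = []
    for parts in zip(*nodes):          # zip stops at the shortest Dewey number
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def is_ancestor(a, b):
    """True if a is a proper ancestor of b (a is a strict prefix of b)."""
    return len(a) < len(b) and b[:len(a)] == a


def brute_force_slca(*keyword_lists):
    """Compute the LCA of every node combination, then drop LCAs that are ancestors."""
    candidates = {lca(*combo) for combo in product(*keyword_lists)}
    return sorted(c for c in candidates
                  if not any(is_ancestor(c, other) for other in candidates))


S1 = [(0, 0, 0), (0, 1, 0)]
S2 = [(0, 0, 1), (0, 2)]
print(brute_force_slca(S1, S2))
```

Because the result is symmetric in its arguments, `brute_force_slca(S2, S1)` yields the same nodes, which is what allows the smallest list to be chosen as $S_1$.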

3.1 The Indexed Lookup Eager Algorithm (IL)

The Indexed Lookup Eager algorithm is based on four properties of SLCAs, which we explain starting from the simplest case where $k = 2$ and $S_1$ is a singleton $\{v\}$.

Property (1). $\mathrm{slca}(\{v\}, S) = \{\mathrm{descendant}(\mathrm{lca}(v, \mathrm{lm}(v, S)), \mathrm{lca}(v, \mathrm{rm}(v, S)))\}$

According to the above Property (1), we compute the LCA of $v$ and its left match in $S$, the LCA of $v$ and its right match in $S$, and the singleton formed from the deeper node from the two LCAs is $\mathrm{slca}(\{v\}, S)$. Property (1) is based on the following two observations. For any two nodes $v_1, v_2$ to the right (according to preorder) of a node $v$, if $\mathrm{pre}(v) < \mathrm{pre}(v_1) < \mathrm{pre}(v_2)$, then $\mathrm{lca}(v, v_2) \preceq_a \mathrm{lca}(v, v_1)$, that is, $\mathrm{lca}(v, v_2)$ is $\mathrm{lca}(v, v_1)$ or one of its ancestors; similarly, for any two nodes $v_1, v_2$ to the left of a node $v$, if $\mathrm{pre}(v_2) < \mathrm{pre}(v_1) < \mathrm{pre}(v)$, then $\mathrm{lca}(v, v_2) \preceq_a \mathrm{lca}(v, v_1)$.

We generalize to arbitrary $k$ when the first set is a singleton. Notice the recursiveness in Property (2).

Property (2). $\mathrm{slca}(\{v\}, S_2, \ldots, S_k) = \mathrm{slca}(\mathrm{slca}(\{v\}, S_2, \ldots, S_{k-1}), S_k)$ for $k > 2$

The right or left match of a node $v$ in a set $S$ is $v$ itself if $v \in S$. This may happen when a node's label contains multiple keywords.

The two observations apply to inorder and postorder as well.
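Property (1) can be illustrated with a short sketch. It assumes Dewey numbers stored as Python tuples and a keyword list stored as a sorted Python list; the name `slca_single` and the helper signatures are assumptions of this sketch, not taken from the paper. Both LCAs are prefixes of $v$, so they lie on one root-to-leaf path and the longer one is the deeper node.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """Longest common prefix of two Dewey numbers."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def slca_single(v, S):
    """Sketch of Property (1): slca({v}, S) for a sorted list S of Dewey tuples."""
    i = bisect_right(S, v)                     # S[i - 1] is the left match, if any
    j = bisect_left(S, v)                      # S[j] is the right match, if any
    candidates = []
    if i > 0:
        candidates.append(lca(v, S[i - 1]))
    if j < len(S):
        candidates.append(lca(v, S[j]))
    if not candidates:                         # S is empty: no match on either side
        return []
    return [max(candidates, key=len)]          # the deeper of the two LCAs


S = [(0, 0, 1), (0, 1, 2), (0, 3)]
print(slca_single((0, 1, 0), S))
print(slca_single((0, 3), S))
```

The second call shows the case where $v$ is itself in $S$: both matches are $v$, so the result is $\{v\}$.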

Next we generalize to arbitrary $S_1$.

Property (3). $\mathrm{slca}(S_1, \ldots, S_k) = \mathrm{removeAncestor}\left(\bigcup_{v_1 \in S_1} \mathrm{slca}(\{v_1\}, S_2, \ldots, S_k)\right)$

Property (3) straightforwardly leads to an algorithm to compute $\mathrm{slca}(S_1, \ldots, S_k)$: first compute $\{x_i\} = \mathrm{slca}(\{v_i\}, S_2, \ldots, S_k)$ for each $v_i \in S_1$ ($1 \le i \le |S_1|$); then $\mathrm{removeAncestor}(\{x_1, \ldots, x_{|S_1|}\})$ is the answer. Each $x_i$ is computed by using Properties (2) and (1).
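The algorithm that follows from Properties (1) to (3) can be sketched as below. This is a simplified, non-eager illustration under the assumption that Dewey numbers are Python tuples and keyword lists are sorted Python lists; `deepest_lca`, `remove_ancestor` and `slca` are hypothetical names. It computes one candidate per node of the first list and removes ancestors only at the end.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """Longest common prefix of two Dewey numbers."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def deepest_lca(v, S):
    """Property (1): the deeper of lca(v, left match) and lca(v, right match); S non-empty."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    candidates = []
    if i > 0:
        candidates.append(lca(v, S[i - 1]))
    if j < len(S):
        candidates.append(lca(v, S[j]))
    return max(candidates, key=len)


def remove_ancestor(nodes):
    """Drop every node that is a proper ancestor (strict prefix) of another node."""
    ordered = sorted(set(nodes))
    return [n for i, n in enumerate(ordered)
            if i + 1 == len(ordered) or ordered[i + 1][:len(n)] != n]


def slca(S1, *others):
    """Property (3): one candidate per node of S1, then remove ancestors."""
    if not S1 or any(not S for S in others):
        return []
    xs = []
    for v in S1:
        x = v
        for S in others:                       # Property (2): fold in one list at a time
            x = deepest_lca(x, S)
        xs.append(x)
    return remove_ancestor(xs)


S1 = [(0, 0, 0), (0, 1, 0)]
S2 = [(0, 0, 1), (0, 2)]
S3 = [(0, 0, 2)]
print(slca(S1, S2), slca(S1, S2, S3))
```

Only two binary searches per node of `S1` and per other list are needed, instead of one LCA per node combination.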

The benefit of the above algorithm over the brute force approach is that for each node $v_1$ in $S_1$, the algorithm does not compute $\mathrm{lca}(v_1, v_2, \ldots, v_k)$ for all $v_2 \in S_2, \ldots, v_k \in S_k$, but computes a single $\mathrm{lca}(v_1, v_2, \ldots, v_k)$ where each $v_i$ ($2 \le i \le k$) is computed by the match functions ($\mathrm{lm}$ and $\mathrm{rm}$). The complexity of the algorithm is $O(|S_1| \sum_{i=2}^{k} d \log |S_i| + |S_1|^2)$, or $O(|S_1| k d \log |S| + |S_1|^2)$, where $|S_1|$ ($|S|$) is the minimum (maximum) size of keyword lists $S_1$ through $S_k$, because for each node $v_1$ in $S_1$ the algorithm needs to find a left and a right match in each one of the other $k - 1$ keyword lists. Finding a match in list $S_i$ costs $O(d \log |S_i|)$. Hence the total cost of match operations is $O(|S_1| \sum_{i=2}^{k} d \log |S_i|)$. The total cost of the $\mathrm{lca}$ and $\mathrm{descendant}$ operations is $O(|S_1| k d)$ and hence is dominated by the cost of the match operations. The $|S_1|^2$ factor is attributed to the cost of the removing ancestors operation.
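To get a feeling for the difference, consider a worked example with purely hypothetical sizes: $k = 2$, depth $d = 10$, $|S_1| = 10$ and $|S_2| = 10^6$. It takes the bound $O(|S_1| k d \log |S| + |S_1|^2)$ for the algorithm above and $O(kd|S_1| \cdots |S_k|)$ for the brute-force approach, ignores constant factors, and uses base-2 logarithms, so the figures are only indicative.

```latex
\begin{align*}
\text{match operations:} \quad & |S_1| \, d \, \log_2 |S_2| \approx 10 \cdot 10 \cdot 20 = 2 \times 10^{3} \\
\text{removing ancestors:} \quad & |S_1|^2 = 10^{2} \\
\text{brute force:} \quad & k \, d \, |S_1| \, |S_2| = 2 \cdot 10 \cdot 10 \cdot 10^{6} = 2 \times 10^{8}
\end{align*}
```

Under these assumptions the gap is about five orders of magnitude, and it grows with the size of the larger list because that size enters only through a logarithm.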

The subroutine $\mathrm{get\_slca}(S_1, S_2)$, based on the following two lemmas, computes $\mathrm{slca}(S_1, S_2)$ efficiently by removing ancestor nodes on the fly.

LEMMA 1. Given any two nodes $v_1, v_2$ and a set $S$, if $\mathrm{pre}(v_1) < \mathrm{pre}(v_2)$ and $\mathrm{pre}(\mathrm{slca}(\{v_1\}, S)) > \mathrm{pre}(\mathrm{slca}(\{v_2\}, S))$, then $\mathrm{slca}(\{v_2\}, S)$ is an ancestor of $\mathrm{slca}(\{v_1\}, S)$.

LEMMA 2. For any two nodes $v_1, v_2$ and a set $S$ such that $\mathrm{pre}(v_1) < \mathrm{pre}(v_2)$ and $\mathrm{pre}(\mathrm{slca}(\{v_1\}, S)) < \mathrm{pre}(\mathrm{slca}(\{v_2\}, S))$, if $\mathrm{slca}(\{v_1\}, S)$ is not an ancestor of $\mathrm{slca}(\{v_2\}, S)$, then for any $v$ such that $\mathrm{pre}(v) > \mathrm{pre}(v_2)$, $\mathrm{slca}(\{v_1\}, S)$ is not an ancestor of $\mathrm{slca}(\{v\}, S)$.

Consider $S_1 = \{v_1, \ldots, v_l\}$ sorted by id, where $\mathrm{pre}(v_1) < \cdots < \mathrm{pre}(v_l)$. Let $x_1, \ldots, x_l$ be the nodes where $\{x_1\} = \mathrm{slca}(\{v_1\}, S_2)$, ..., $\{x_l\} = \mathrm{slca}(\{v_l\}, S_2)$. According to Lemma 1, if $\mathrm{pre}(x_i) > \mathrm{pre}(x_j)$ where node $x_j$ appears after $x_i$ (that is, $j > i$), then $x_j$ is an ancestor node. Thus when computing the list $x_1, \ldots, x_l$, we can discard the out-of-order nodes such as $x_j$. The resulting list is in order and contains the nodes of $\mathrm{slca}(S_1, S_2)$. However, it is not necessarily ancestor node free.

Consider any two adjacent nodes $x, x'$ where $x'$ is after $x$. If $x$ is not an ancestor of $x'$, then $x$ cannot be an ancestor of any node $x''$ that is after $x'$ (according to Lemma 2), which means $x$ is a SLCA. Lemma 1 and 2 together lead to the subroutine $\mathrm{get\_slca}(S_1, S_2)$ that computes $\mathrm{slca}(S_1, S_2)$ efficiently. Line #5 in $\mathrm{get\_slca}$ applies Lemma 1 to remove out-of-order nodes, and lines #6-8 apply Lemma 2 to identify a SLCA as early as possible. As can be seen from $\mathrm{get\_slca}$, at any time only three nodes ($u$, $v$, $x$) are needed in memory.
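The way the two lemmas remove ancestors on the fly can be sketched as a Python generator. This is an illustration, not the paper's pseudocode: its structure and line numbering differ from the lines #3 to #8 referenced in the text, and the names `get_slca`, `deepest_lca`, `u`, `v` and `x` are assumptions of this sketch. Dewey numbers are assumed to be tuples and both lists are assumed to be sorted.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """Longest common prefix of two Dewey numbers."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def deepest_lca(v, S):
    """Property (1): the deeper of lca(v, left match) and lca(v, right match); S non-empty."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    candidates = []
    if i > 0:
        candidates.append(lca(v, S[i - 1]))
    if j < len(S):
        candidates.append(lca(v, S[j]))
    return max(candidates, key=len)


def get_slca(S1, S2):
    """Yield the nodes of slca(S1, S2) as soon as each one is known to be final."""
    if not S1 or not S2:
        return
    u = None                                   # current candidate SLCA
    for v in S1:                               # increasing Dewey order
        x = deepest_lca(v, S2)
        if u is None:
            u = x
        elif u <= x:                           # Lemma 1: an out-of-order x is an ancestor, skip it
            if x[:len(u)] != u:                # u is neither x nor an ancestor of x
                yield u                        # Lemma 2: no later node can be below u
            u = x
    yield u


S1 = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]
S2 = [(0, 0, 1), (0, 2, 1)]
print(list(get_slca(S1, S2)))
```

Writing the subroutine as a generator makes the non-blocking behavior visible: a caller can consume the first result before the rest of `S1` has been processed, and only `u`, `v` and `x` are held at any time.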

Consider $S_1 = [\ldots]$ and $S_2 = [\ldots]$ (the keyword lists for “John” and “Ben” respectively). In the first iteration of the loop at line #3, $\ldots = 0$ (line #4). At the end of the first iteration $\ldots$ (line #8). In the second iteration, $\ldots$. In the third iteration, $\ldots$. In the fourth iteration, $\ldots$ (line #4). Notice that the condition at
