Keyword Query Approach over RDF Data Based on Tree Template
Qiang Sima
School of Computer Science and Engineering,
Southeast University
Nanjing, P.R.China 210096
220131534@seu.edu.cn
Huiying Li
School of Computer Science and Engineering,
Southeast University
Nanjing, P.R.China 210096
huiyingli@seu.edu.cn
Abstract—With the large increment of semantic data available
in the web, the demand for access to Semantic data is
increasing. Keyword query is regarded as an intuitive
paradigm, especially for the users who are not familiar with
the data and the RDF query language. In this paper, an
approach based on tree template is proposed, which can return
ranked answers to keyword query without the help of data
schema. With the proposed approach, K-step tree and tree
template are presented to support efficient keyword query. K-
step tree can be derived by breadth-first traversal of RDF data
graph, while tree template extracts the schema of K-step tree.
The online process conducted keyword query problem into two
steps: (1) find K-step trees containing all query keywords; (2)
generate minimal sub-trees as candidate results from K-step
trees. Search space of sub-graph is greatly reduced from the
original entire RDF data graph to the K-step tree. The
experimental results show that the approach is feasible and
effective.
Keywords-keyword query; tree template; K-step tree; RDF
data
INTRODUCTION
In recent years, with some projects such as DBPedia and
Linking Open Data in full swing, the online semantic data
has dramatically increased. Faced with such a large scale of
RDF data, the problem that how to help users query data
conveniently is one of the most concerned problems in
current semantic web search. Although W3C organization
has proposed formal queries for RDF data, such as
SPARQL[1], it is too complicated for ordinary users to write
even a relatively simple query without detailed knowledge of
data schema and query language.
The keyword query has proven to be a popular
information search method on the Web because of the fact
that the user does not need to know any query language or
underlying structure of data. Many approaches (e.g. [2], [3],
[4], [5], [6]) implement information retrieval (IR) strategies
on top of traditional database systems. However, none of
them can be directly applied to keyword query on RDF data
because its underlying data model is a graph rather than
relational data. Recently, there are some research work (e.g.
[7], [8]) dealing with keyword query to semantic search.
However, these works mainly focus on translating keyword
queries and cannot deal with schema-less data.
In this paper, we focus on implementing efficient ranked
keyword query on RDF data. On a large data graph, many
substructures may contain the query keywords, we restrict
answers to those connected substructures that are "minimal".
The main contributions of this paper are as follows.
1. We propose a keyword query model on RDF data. We
partition a large data graph into many K-step trees. The
index stores the content nodes as well as edge labels of K-
step trees, which makes the keyword query problem being
converted into a string matching problem. An answer is a
minimal sub-tree of K-step trees containing all query
keywords.
2. We present a keyword query algorithm based on tree
template. We explore the relationship between tree template
and content nodes of K-step trees, which makes us generate
minimal sub-trees as candidates quickly when K-step trees
are determined. We sort candidates and return top-k
candidates as final results to users.
A KEYWORD QUERY MODEL ON RDF DATA
For keyword query methods which support graph-
structure data currently, some are suitable for tree, some are
for graph. Two main steps are involved in these methods.
Firstly, the methods retrieve those parts of the data graph that
match users keywords with a subsequent identification of
any connections across graph segments returned by the query
process. Secondly, the methods rank the combined graph
segments, and top-k results are presented from the candidate
result-set. While locating nodes matching the keywords is
relatively efficient using index information, determining the
connections between graph segments is a complex and time-
consuming task that must be solved on-the-fly at query
processing phase.
In order to avoid time-consuming searching at query
processing phase, paper[9] proposed clustering technique
that reduces the search space by avoiding the exploration of
overlapping solutions, analyzing the paths in the graph and
extracting their schemas as templates. Paths with the same
template are grouped together in a cluster, each cluster being
represented by a template. Solutions, answers to the input
search, are built by composing paths starting with the most
relevant path in the most relevant cluster. Such strategy tends
to produce solutions exhaustive but not optimally specific,
leading to the solution generated at each step may not be the
optimum solution. On this basis, we proposed tree template
query model, using tree template to store the schema of
original data graph so that final candidate results will follow
the original structure, and eventually they can meet the users'