An Efficient Multi-Dimensional Index for Cloud Data
Management
Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, Xiaofeng Meng
SchoolofInformation,RenminUniversityofChina
Beijing,China,100872
zhangxy@live.com{aijingruc,zhywang,xfmeng}@ruc.edu.cnjiahenglu@gmail.com
ABSTRACT
Recently, the cloud computing platform is getting more and more
attentions as a new trend of data management. Currently there
are several cloud computing products that can provide various ser-
vices. However, currently the cloud platforms only support sim-
ple keyword-based queries and can’t answer complex queries effi-
ciently due to lack of efficient index techniques. In this paper we
propose an efficient approach to build multi-dimensional index for
Cloud computing system. We use the combination of R-tree and
KD-tree to organize data records and offer fast query processing
and efficient index maintenance. Our approach can process typ-
ical multi-dimensional queries including point queries and range
queries efficiently. Besides, frequent change of data on big amount
of machines makes the index maintenance a challenging problem,
and to cope with this problem we proposed a cost estimation-based
index update strategy that can effectively update the index struc-
ture. Our experiments show that our indexing techniques improve
query efficiency by an order of magnitude compared with alter-
native approaches, and scale well with the size of the data. Our
approach is quite general and independent from the underlying in-
frastructure and can be easily carried over for implementation on
various Cloud computing platforms.
Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Sys-
tems—distributed applications; H.2.4 [Database Management]:
Systems—concurrency, transaction processing
General Terms
Algorithms
Keywords
multi-dimensional index, distributed index, query processing
1. INTRODUCTION
Internet has been developing at an astonishing speed. Each day
a huge amounts of information is put on the Internet in the form
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CloudDB’09, November 2, 2009, Hong Kong, China.
Copyright 2009 ACM 978-1-60558-802-5/09/11 ...$10.00.
of digital data. Many new Internet applications emerge and most
of them require to process a large scale of data efficiently. How-
ever, traditional data management tools have been insufficient for
this new demands. For example, database systems softwares of-
ten are multi-tenancy, which means that online users must share
the same software’s resources simultaneously. When unexpected
spikes come, users may meet the situation of shortage of resources
and a drop of quality of service. Therefore, scalability is a cru-
cial requirement for future Web applications. Under those circum-
stances, a new computing infrastructure, cloud computing, emerges.
Though the unified definition of cloud computing has not been con-
firmed[1], it is considered as a revolution in IT industry. Systems
supporting cloud computing dynamically allocate computational
resources according to users’ requests. Existing Cloud comput-
ing systems include Amazon’s Elastic Computing Cloud(EC2)[2],
IBM’s Blue Cloud[3] and Google’s MapReduce[4]. They adopt
flexible resources management mechanism and provide good scal-
ability. Scalable data structures can satisfy resource demands of
Cloud systems’ users. Cloud computing systems are usually com-
prised of a large number of computers, store huge amounts of data,
and provide services for millions of users. Resources allocation is
typically elastic in cloud systems, which makes each user feel that
he owns infinite amount of resources. A typical example of scalable
data structure is Google’s BigTable[5].
Currently, most of Cloud infrastructures are based on Distributed
File Systems. DFS usually use key-value storage models to store
data. The data in Cloud systems are organized in the form of key-
value pairs. Therefore, current Cloud systems can only support
keyword search. When a query comes, result data are retrieved
from DFS in accordance with contained keywords. Although many
famous Cloud systems uses this information storage pattern, such
as Google’s GFS[6] and Hadoop’s HDFS[7], they only provide ser-
vices of keyword queries for users. Therefore, users can only ac-
cess information through "point query" which matches records to
satisfy the verbal and/or numerical values.
The emergence of cloud computing is due to the need of increas-
ing advanced data management. And it needs to serve a large va-
riety of applications better for more Web users. Therefore, future
cloud infrastructures should be developed to support more types of
queries with more functions, e.g. muti-dimensional queries.
Cloud computing platforms contain hundreds and thousands of
machine nodes, and they process workloads and tasks in parallel.
This is a typical characteristic of cloud computing infrastructures.
When a user submits a query, result data are retrieved from un-
derlying storage tables and then distributed to a set of processing
nodes for parallel scanning. Without the support of efficient index
structure, query processing is quite time-consuming, especially for
complex queries. Therefore, building more efficient index structure