Parallel vs. Sequential kNN Query Processing in the Cloud: A Comparison on the VIHCO Structure


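As cloud computing matures, the ability to answer k-nearest-neighbor (kNN) queries over large, distributed data sets has become essential for cloud-hosted database services (Database as a Service, DaaS). This paper asks which of two opposite strategies performs better for kNN queries in the cloud: parallel processing or sequential processing. Parallel processing splits a query into subtasks that run on many computing nodes at the same time, exploiting distributed resources to reduce response time. Sequential processing instead forwards the query from node to node, one at a time. To study the question, the authors propose a new distributed index structure, VIHCO (VIcinity-based Hilbert Cloud Overlay), whose main feature is that it can quickly locate the relevant cloud nodes. Two methods are built on VIHCO. The parallel method takes into account the differential cells between two consecutive range queries, so that nodes work simultaneously on independent parts of the search. The sequential method relies on an accurate message delivery algorithm that uses the vicinity relationship between the query point and the cloud nodes. According to the reported experiments, the sequential method is more efficient for small k and small system sizes, while the parallel method suits large k and large numbers of computing nodes. Which method is preferable in practice therefore depends on factors such as data volume, hardware, network bandwidth, and query load, and benchmarking on the target workload is advisable.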

Which is Better for kNN Query Processing in the Cloud: Sequential or Parallel∗

Chong Zhang, Xiaoying Chen, Bin Ge, Weidong Xiao

∗

Science and Technology on Information Systems Engineering Laboratory

National University of Defense Technology, Changsha 410073, China

Collaborative Innovation Center of Geospatial Technology, China

{leocheung8286, chenxiaoying1991}@yahoo.com, gebin1978@gmail.com, wilsonshaw@vip.sina.com

ABSTRACT

With the development of various Cloud systems, providing a powerful kNN query capability for DaaS (Database as a Service) has become an essential requirement for many applications. In this paper, we consider two opposite approaches to processing kNN queries in Cloud systems, parallel processing and sequential processing, and we explore which one performs better. To address this question, we devise a new distributed indexing structure, VIHCO, which is characterized by its ability to locate Cloud nodes quickly. Parallel and sequential processing methods are then designed on top of this structure. For the parallel method, we take the differential cells between two consecutive range queries into consideration; for the sequential method, we design an accurate message delivery algorithm. We verify our ideas through experiments conducted on both synthetic and real datasets. The results show that VIHCO outperforms a previous work, RT-CAN, and that the sequential method is more efficient for small k and small system sizes, while the parallel method suits large k and large numbers of computing nodes.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems - query processing

General Terms

Algorithms, Measurement, Performance

Keywords

kNN query, Cloud, histogram, parallel, sequential

∗This work is supported by NSF of China grants 61303062 and 71331008.

2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

1. INTRODUCTION

With the development of Cloud computing, computing resources at various layers are offered on a pay-as-you-go basis, such as IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service). Database as a Service (DaaS) [2][13] is currently a hot topic for the database community in the Cloud computing era. DaaS users need not be concerned with the location of the database instance, the physical storage of schemas and tables, or even the way data are partitioned and queries are processed; in a word, the internals of the database are transparent to them. They only define the table structure, and insertions, queries and other operations look the same as on a centralized local database. However, the data of a single table may be spread over many computing nodes, so query processing requires the collaboration of these nodes. Moreover, as data volume increases, the database should adapt to the new scale and new query requirements, i.e., it should be elastic.

In this paper, we focus on the kNN query in DaaS, which is an essential function of spatial databases. Given a point in the space, a kNN query finds the k objects nearest to the query point. This topic has been addressed in previous works [13][9][14]; however, the state of the art contains two opposite ideas for solving the problem, namely parallel processing and sequential processing. The parallel method exploits the parallelism of Cloud nodes and makes them work simultaneously, while the sequential method uses the vicinity relationship between the query point and Cloud nodes to deliver query messages accurately. Nevertheless, which is better for DaaS has not been studied before. Hence, in this paper, we extend our work in [14] to answer this question.
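To make the contrast concrete, the following toy sketch partitions points over a grid of cells, one cell per node, and answers the same kNN query in both styles. It is only an illustration under simplifying assumptions, not the algorithms of this paper: the grid layout, the cell size `SIZE`, and all function names are hypothetical, and the "parallel" variant is simulated with an ordinary loop rather than real concurrent nodes.

```python
import heapq
import math

SIZE = 1.0  # side length of each square cell (hypothetical value)

def min_dist(cell, q):
    """Smallest possible distance from query point q to any point in the cell."""
    cx, cy = cell
    dx = max(cx * SIZE - q[0], 0.0, q[0] - (cx + 1) * SIZE)
    dy = max(cy * SIZE - q[1], 0.0, q[1] - (cy + 1) * SIZE)
    return math.hypot(dx, dy)

def local_knn(points, q, k):
    """k nearest points stored on one node, as sorted (distance, point) pairs."""
    return heapq.nsmallest(k, ((math.dist(p, q), p) for p in points))

def parallel_knn(nodes, q, k):
    """Parallel style: every node answers (conceptually at once), then merge."""
    partial = [local_knn(pts, q, k) for pts in nodes.values()]
    merged = heapq.nsmallest(k, (pair for part in partial for pair in part))
    return merged, len(nodes)  # result and number of nodes contacted

def sequential_knn(nodes, q, k):
    """Sequential style: visit nodes nearest-first, one after another, and stop
    once no remaining cell can contain a point closer than the current k-th."""
    best, visited = [], 0
    for cell in sorted(nodes, key=lambda c: min_dist(c, q)):
        if len(best) == k and min_dist(cell, q) >= best[-1][0]:
            break
        visited += 1
        best = heapq.nsmallest(k, best + local_knn(nodes[cell], q, k))
    return best, visited

# A 4 x 4 grid of nodes, each holding two points of its own cell.
nodes = {(cx, cy): [(cx + 0.25, cy + 0.25), (cx + 0.75, cy + 0.75)]
         for cx in range(4) for cy in range(4)}
query, k = (1.6, 1.3), 3
par_result, par_contacted = parallel_knn(nodes, query, k)
seq_result, seq_visited = sequential_knn(nodes, query, k)
print(par_result == seq_result, par_contacted, seq_visited)
```

In this sketch both variants return the same neighbors. The sequential variant contacts fewer nodes but does so one hop at a time, whereas the parallel variant contacts every node in a single round. This is the kind of trade-off between message count and latency that the comparison in this paper is about.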

To compare the two approaches, we use a previous work, RT-CAN, as a baseline and propose a new structure, called VIHCO (VIcinity-based Hilbert Cloud Overlay), to index spatial data in Cloud systems and to process kNN queries. VIHCO not only relies on a fast-lookup routing table (finger table) but also uses vicinity neighbors to quickly locate nearby Cloud nodes. Based on this structure, we present the designs of the parallel and sequential processing algorithms. Experiments on both synthetic and real datasets show that VIHCO outperforms RT-CAN in efficiency and scalability, and that the sequential method is more suitable for small k and small system sizes, while the parallel method suits large k and large numbers of computing nodes.
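The name VIHCO suggests that cells of the data space are ordered along a Hilbert space-filling curve, so that cells close in space tend to receive nearby one-dimensional keys. This excerpt does not describe the mapping, so the following is only a sketch of the standard Hilbert index computation under that assumption; the function name `hilbert_index` and the grid order are hypothetical.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along the Hilbert curve that fills a
    2**order x 2**order grid (standard bitwise conversion)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Order the cells of an 8 x 8 grid by their Hilbert key.
order = 3
curve = sorted((hilbert_index(order, x, y), (x, y))
               for x in range(1 << order) for y in range(1 << order))
print(curve[:4])
```

Consecutive keys on the curve always belong to adjacent cells. An overlay can use this locality to keep spatially close cells on nearby nodes of a one-dimensional key space, such as one navigated with a finger table.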
