探索相似性搜索:度量空间索引结构

需积分: 9 6 下载量 126 浏览量 更新于2024-07-29 收藏 11.61MB PDF 举报
"《Similarity Search: The Metric Space Approach》是关于在大量数据中进行相似性搜索的先进技术的深入探讨。这本书分为两部分,详细介绍了各种指数结构和方法,适用于广泛的领域,包括数据库系统、多媒体应用和网络信息检索。书中讨论了理论基础,并提供了大量特定技术的综合调查,特别关注在大规模数据集中的搜索策略。" 本书首先阐述了在当前信息爆炸的时代,传统的检索技术已经无法满足高效、有效搜索的需求,因此近似搜索(proximity searching)成为了一个关键的计算任务。相似性搜索(Similarity Search)是解决这一问题的重要手段,它着重于在度量空间中寻找与查询对象最相似的项。 第一部分主要涉及理论原则,深入讲解了度量空间的概念和特性。度量空间是一个数学结构,其中的每个元素都有一个度量,用于量化元素间的距离。这部分涵盖了各种应用领域中的特定搜索技术,包括但不限于欧几里得距离、曼哈顿距离、余弦相似度等。这些技术被广泛应用于图像识别、文本相似性分析、语音识别等领域。 第二部分则专注于大规模数据集的搜索方法。这部分首先介绍了流行的集中式磁盘基础的度量索引,如kd树、球树等,这些索引结构能够有效地处理高维数据。随后,书本讨论了近似技术,通过牺牲部分精确度来显著提高搜索速度,这对于实时或大数据量的场景尤为重要。最后,作者探讨了可扩展性和分布式度量结构,这些结构适应于处理不断增长的数据和分布式计算环境,例如MapReduce模型下的相似性搜索。 书中的每一章节都为读者提供了丰富的实际案例和实验结果,以帮助读者理解并应用这些理论和技术。此外,还引用了众多研究者的工作,展示了相似性搜索领域的最新进展和未来可能的研究方向。 《Similarity Search: The Metric Space Approach》是一本对信息技术专业人士极其有价值的资源,无论是数据科学家、数据库管理员还是机器学习工程师,都能从中获得关于如何在海量信息中实现高效相似性搜索的宝贵知识。
2009-07-11 上传
Part I Metric Searching in a Nutshell Overview 3 1. FOUNDATIONS OF METRIC SPACE SEARCHING 5 1 The Distance Searching Problem 6 2 The Metric Space 8 3 Distance Measures 9 3.1 Minkowski Distances 10 3.2 Quadratic Form Distance 11 3.3 Edit Distance 12 3.4 Tree Edit Distance 13 3.5 Jaccard’s Coefficient 13 3.6 Hausdorff Distance 14 3.7 Time Complexity 14 4 Similarity Queries 15 4.1 Range Query 15 4.2 Nearest Neighbor Query 16 4.3 Reverse Nearest Neighbor Query 17 4.4 Similarity Join 17 4.5 Combinations of Queries 18 4.6 Complex Similarity Queries 18 5 Basic Partitioning Principles 20 5.1 Ball Partitioning 20 5.2 Generalized Hyperplane Partitioning 21 5.3 Excluded Middle Partitioning 21 5.4 Extensions 21 6 Principles of Similarity Query Execution 22 6.1 Basic Strategies 22 6.2 Incremental Similarity Search 25 7 Policies for Avoiding Distance Computations 26 7.1 Explanatory Example 27 7.2 Object-Pivot Distance Constraint 28 7.3 Range-Pivot Distance Constraint 30 7.4 Pivot-Pivot Distance Constraint 31 7.5 Double-Pivot Distance Constraint 33 7.6 Pivot Filtering 34 8 Metric Space Transformations 35 8.1 Metric Hierarchies 36 8.1.1 Lower-Bounding Functions 36 8.2 User-Defined Metric Functions 38 8.2.1 Searching Using Lower-Bounding Functions 38 8.3 Embedding Metric Space 39 8.3.1 Embedding Examples 39 8.3.2 Reducing Dimensionality 40 9 Approximate Similarity Search 41 9.1 Principles 41 9.2 Generic Algorithms 44 9.3 Measures of Performance 46 9.3.1 Improvement in Efficiency 46 9.3.2 Precision and Recall 46 9.3.3 Relative Error on Distances 48 9.3.4 Position Error 49 10 Advanced Issues 50 10.1 Statistics on Metric Datasets 51 10.1.1 Distribution and Density Functions 51 10.1.2 Distance Distribution and Density 52 10.1.3 Homogeneity of Viewpoints 54 10.2 Proximity of Ball Regions 55 10.3 Performance Prediction 58 Contents ix 10.4 Tree Quality Measures 60 10.5 Choosing Reference Points 63 2. SURVEY OF EXISTING APPROACHES 67 1 Ball Partitioning Methods 67 1.1 Burkhard-Keller Tree 68 1.2 Fixed Queries Tree 69 1.3 Fixed Queries Array 70 1.4 Vantage Point Tree 72 1.4.1 Multi-Way Vantage Point Tree 74 1.5 Excluded Middle Vantage Point Forest 75 2 Generalized Hyperplane Partitioning Approaches 76 2.1 Bisector Tree 76 2.2 Generalized Hyperplane Tree 77 3 Exploiting Pre-Computed Distances 78 3.1 AESA 78 3.2 Linear AESA 79 3.3 Other Methods 80 4 Hybrid Indexing Approaches 81 4.1 Multi Vantage Point Tree 81 4.2 Geometric Near-neighbor Access Tree 82 4.3 Spatial Approximation Tree 85 4.4 M-tree 87 4.5 Similarity Hashing 88 5 Approximate Similarity Search 89 5.1 Exploiting Space Transformations 89 5.2 Approximate Nearest Neighbors with BBD Trees 90 5.3 Angle Property Technique 92 5.4 Clustering for Indexing 94 5.5 Vector Quantization Index 95 5.6 Buoy Indexing 97 5.7 Hierarchical Decomposition of Metric Spaces 97 5.7.1 Relative Error Approximation 98 5.7.2 Good Fraction Approximation 98 5.7.3 Small Chance Improvement Approximation 98 5.7.4 Proximity-Based Approximation 99 5.7.5 PAC Nearest Neighbor Search 99 x SIMILARITY SEARCH Part II Metric Searching in Large Collections of Data Overview 103 3. CENTRALIZED INDEX STRUCTURES 105 1 M-tree Family 105 1.1 The M-tree 105 1.2 Bulk-Loading Algorithm of M-tree 109 1.3 Multi-Way Insertion Algorithm 112 1.4 The Slim Tree 113 1.4.1 Slim-Down Algorithm 114 1.4.2 Generalized Slim-Down Algorithm 116 1.5 Pivoting M-tree 118 1.6 The M