2.3 A First, Naïve Learned Index
To better understand what is required to replace B-Trees with learned models, we used 200M web-server log records with the goal of building a secondary index over the timestamps using Tensorflow [9]. We trained a two-layer fully-connected neural network with 32 neurons per layer using ReLU activation functions; the timestamps are the input features and the positions in the sorted array are the labels. Afterwards we measured the look-up time for a randomly selected key (averaged over several runs, disregarding the first measurements to exclude warm-up effects) with Tensorflow and Python as the front-end.
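The paper does not publish its training or benchmarking code; the following is a minimal sketch of the setup just described, written against today's tf.keras API. The data is a small synthetic stand-in for the 200M timestamps, and the optimizer, loss, normalization, and epoch count are our assumptions, not the authors' settings.

```python
import time
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the 200M sorted timestamps (smaller here for brevity).
timestamps = np.sort(np.random.uniform(0, 1e9, size=1_000_000)).astype(np.float32)
positions = np.arange(len(timestamps), dtype=np.float32)

# Normalize inputs and labels so the network trains stably (an assumption).
x = (timestamps - timestamps.min()) / (timestamps.max() - timestamps.min())
y = positions / len(positions)

# Two fully-connected hidden layers with 32 ReLU units each, as in the text.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, batch_size=1024, epochs=4, verbose=0)

# Time single-key predictions through the full Tensorflow/Python stack,
# discarding the first calls as warm-up.
key = x[len(x) // 2].reshape(1, 1)
for _ in range(10):
    model.predict(key, verbose=0)
start = time.perf_counter()
n = 100
for _ in range(n):
    model.predict(key, verbose=0)
print((time.perf_counter() - start) / n * 1e9, "ns per prediction")
```

The timing loop deliberately goes through model.predict one key at a time, since that per-invocation overhead is exactly what the numbers below measure.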
In this setting we achieved ≈1,250 predictions per second, i.e., it takes ≈80,000 nanoseconds (ns) to execute the model with Tensorflow, not including the search time (the time to find the actual record from the predicted position). As a comparison point, a B-Tree traversal over the same data takes ≈300ns and binary search over the entire data roughly 900ns. On closer examination, we find that our naïve approach is limited in a few key ways:
1. Tensorflow was designed to run larger models efficiently, not small ones, and thus incurs a significant invocation overhead, especially with Python as the front-end.
2. B-Trees, or decision trees in general, are very good at overfitting the data with a few operations, as they recursively divide the space using simple if-statements. In contrast, other models can be significantly more efficient at approximating the general shape of a CDF, but have problems being accurate at the level of individual data instances. To see this, consider again Figure 2: from a top-level view, the CDF appears very smooth and regular, but zooming in to individual records reveals more and more irregularities; a well-known statistical effect. Thus, models like neural nets or polynomial regression might be more CPU- and space-efficient at narrowing down the position of an item from the entire dataset to a region of thousands of records, but a single neural net usually requires significantly more space and CPU time for the "last mile" of reducing the error further, from thousands down to hundreds (see the search sketch after this list).
3. B-Trees are extremely cache- and operation-efficient, as they always keep the top nodes in cache and access other pages only as needed. In contrast, standard neural nets require all of their weights to compute a prediction, which incurs a high cost in the number of multiplications.
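To make the "last mile" of item 2 concrete, the sketch below (our illustration, not code from the paper) shows the standard pattern of pairing an approximate model with a bounded local search: the model narrows the position down to an error window, and binary search closes the remaining gap.

```python
import bisect

def bounded_lookup(keys, key, predict, max_err):
    # `predict(key)` returns an approximate position in the sorted array
    # `keys`; `max_err` is the worst-case prediction error observed over
    # the data at build time. Binary search only runs on the window of
    # size 2*max_err around the guess, so the model's residual error
    # directly determines the look-up cost.
    pos = int(predict(key))
    lo = max(0, pos - max_err)
    hi = min(len(keys), pos + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < len(keys) and keys[i] == key:
        return i
    return None
```

Since this search costs roughly log2(2·max_err) comparisons, a model that only narrows the position down to an error of thousands still pays on the order of a dozen probes per look-up, which is why the last mile dominates.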
3 The RM-Index
To overcome these challenges and explore the potential of models as index replacements or optimizations, we developed the learning index framework (LIF), recursive-model indexes (RMI), and standard-error-based search strategies. We primarily focus on fully-connected neural nets because of their simplicity and flexibility, but we believe other types of models may provide additional benefits.
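As a preview of the RMI developed below, the following is a deliberately compact two-stage sketch; it is our simplification, with both stages using linear models fit with numpy, while the actual framework can mix model types per stage. A root model routes each key to one of several second-stage models, and each second-stage model predicts a final position together with its own worst-case error bound, which then bounds the final search.

```python
import numpy as np

class TwoStageRMI:
    """Two-stage recursive-model index over a sorted numpy array `keys`."""

    def __init__(self, keys, num_leaf_models=100):
        self.keys, self.n, self.m = keys, len(keys), num_leaf_models
        pos = np.arange(self.n)
        # Stage 1: a single linear model, used only to pick a leaf model.
        self.root = np.polyfit(keys, pos, 1)
        leaf_of = self._route(keys)
        # Stage 2: one linear model plus its worst-case error per partition.
        self.leaves, self.errs = [], []
        for j in range(self.m):
            mask = leaf_of == j
            if mask.sum() >= 2:
                w = np.polyfit(keys[mask], pos[mask], 1)
            else:
                w = self.root            # too little data: reuse the root
            err = 0
            if mask.any():
                pred = np.polyval(w, keys[mask]).round()
                err = int(np.abs(pred - pos[mask]).max())
            self.leaves.append(w)
            self.errs.append(err)

    def _route(self, k):
        # Scale the root prediction from positions [0, n) to leaf ids [0, m).
        p = np.polyval(self.root, k) * self.m / self.n
        return np.clip(p, 0, self.m - 1).astype(int)

    def lookup(self, key):
        j = int(self._route(np.array([key]))[0])
        guess = int(round(np.polyval(self.leaves[j], key)))
        lo = max(0, guess - self.errs[j])
        hi = min(self.n, guess + self.errs[j] + 1)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return i if i < self.n and self.keys[i] == key else None
```

For example, with keys = np.sort(np.random.uniform(0, 1e9, 100_000)), TwoStageRMI(keys).lookup(keys[123]) returns 123. Because each leaf stores its own error bound, a badly fit region only hurts the look-ups that route into it.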
3.1 The Learning Index Framework (LIF)
The LIF can be regarded as an index synthesis system; given an index specification, LIF generates different index configurations, optimizes them, and tests them automatically. While LIF can learn simple models on-the-fly (e.g., linear regression models), it relies on Tensorflow for more complex models (e.g., neural nets). However, it never uses Tensorflow at inference time. Rather, given a trained Tensorflow model, LIF automatically extracts all weights from the model and generates efficient index structures in C++ based on the model specification.
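The paper does not show the generated C++; as a rough illustration of the idea (in Python for brevity), the sketch below extracts the weights from a trained tf.keras model, such as the one in Section 2.3, and evaluates the two-layer ReLU network with plain numpy, bypassing the Tensorflow runtime entirely. LIF's code generation would hard-code the same straight-line computation in C++.

```python
import numpy as np

def extract_weights(model):
    # Pull the (kernel, bias) pairs out of a trained tf.keras model once;
    # after this, inference never touches the Tensorflow runtime again.
    return [layer.get_weights() for layer in model.layers]

def predict_position(params, key):
    # Plain forward pass: matrix multiply + bias per layer, with ReLU on
    # every layer except the output. For a 1 -> 32 -> 32 -> 1 network this
    # is a handful of small matmuls, i.e., roughly a thousand multiplications.
    h = np.array([[key]], dtype=np.float32)
    for i, (kernel, bias) in enumerate(params):
        h = h @ kernel + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return float(h[0, 0])
```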
Our code generation is particularly designed for small models and removes all the unnecessary overhead and instrumentation that Tensorflow needs to manage larger models. Here we leverage ideas from [25], which already showed how to avoid unnecessary overhead from the Spark runtime. As a result,