2.3 A First, Naïve Learned Index
To better understand what is required to replace B-Trees with learned models, we used 200M web-server log records with the goal of building a secondary index over the timestamps using Tensorflow [9]. We trained a two-layer fully-connected neural network with 32 neurons per layer using ReLU activation functions; the timestamps are the input features and the positions in the sorted array are the labels. Afterwards we measured the look-up time for a randomly selected key (averaged over several runs, disregarding the first measurements to exclude warm-up effects) with Tensorflow and Python as the front-end.
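The paper does not publish its training or benchmarking code; the following is a minimal sketch of the setup just described, written against today's tf.keras API. The data is a small synthetic stand-in for the 200M timestamps, and the optimizer, loss, normalization, and epoch count are our assumptions, not the authors' settings.

```python
import time
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the 200M sorted timestamps (smaller here for brevity).
timestamps = np.sort(np.random.uniform(0, 1e9, size=1_000_000)).astype(np.float32)
positions = np.arange(len(timestamps), dtype=np.float32)

# Normalize inputs and labels so the network trains stably (an assumption).
x = (timestamps - timestamps.min()) / (timestamps.max() - timestamps.min())
y = positions / len(positions)

# Two fully-connected hidden layers with 32 ReLU units each, as in the text.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, batch_size=1024, epochs=4, verbose=0)

# Time single-key predictions through the full Tensorflow/Python stack,
# discarding the first calls as warm-up.
key = x[len(x) // 2].reshape(1, 1)
for _ in range(10):
    model.predict(key, verbose=0)
start = time.perf_counter()
n = 100
for _ in range(n):
    model.predict(key, verbose=0)
print((time.perf_counter() - start) / n * 1e9, "ns per prediction")
```

The timing loop deliberately goes through model.predict one key at a time, since that per-invocation overhead is exactly what the numbers below measure.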
In this setting we achieved ≈1,250 predictions per second, i.e., it takes ≈80,000 nanoseconds (ns) to execute the model with Tensorflow, not including the search time (the time to find the actual record from the predicted position). As a comparison point, a B-Tree traversal over the same data takes ≈300ns and binary search over the entire data roughly 900ns. On closer examination, we find that our naïve approach is limited in a few key ways:
1. Tensorflow was designed to run larger models efficiently, not small ones, and thus incurs a significant invocation overhead, especially with Python as the front-end.
2. B-Trees, or decision trees in general, are very good at overfitting the data with a few operations, as they recursively divide the space using simple if-statements. In contrast, other models can be significantly more efficient at approximating the general shape of a CDF, but have problems being accurate at the level of individual data instances. To see this, consider again Figure 2: from a top-level view, the CDF appears very smooth and regular, but zooming in to individual records reveals more and more irregularities; a well-known statistical effect. Thus, models like neural nets or polynomial regression might be more CPU- and space-efficient at narrowing down the position of an item from the entire dataset to a region of thousands of records, but a single neural net usually requires significantly more space and CPU time for the "last mile" of reducing the error further, from thousands down to hundreds (see the search sketch after this list).
3. B-Trees are extremely cache- and operation-efficient, as they always keep the top nodes in cache and access other pages only as needed. In contrast, standard neural nets require all of their weights to compute a prediction, which incurs a high cost in the number of multiplications.
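To make the "last mile" of item 2 concrete, the sketch below (our illustration, not code from the paper) shows the standard pattern of pairing an approximate model with a bounded local search: the model narrows the position down to an error window, and binary search closes the remaining gap.

```python
import bisect

def bounded_lookup(keys, key, predict, max_err):
    # `predict(key)` returns an approximate position in the sorted array
    # `keys`; `max_err` is the worst-case prediction error observed over
    # the data at build time. Binary search only runs on the window of
    # size 2*max_err around the guess, so the model's residual error
    # directly determines the look-up cost.
    pos = int(predict(key))
    lo = max(0, pos - max_err)
    hi = min(len(keys), pos + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < len(keys) and keys[i] == key:
        return i
    return None
```

Since this search costs roughly log2(2·max_err) comparisons, a model that only narrows the position down to an error of thousands still pays on the order of a dozen probes per look-up, which is why the last mile dominates.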
3 The RM-Index
To overcome these challenges and explore the potential of models as index replacements or optimizations, we developed the learning index framework (LIF), recursive-model indexes (RMI), and standard-error-based search strategies. We primarily focus on fully-connected neural nets because of their simplicity and flexibility, but we believe other types of models may provide additional benefits.
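As a preview of the RMI developed below, the following is a deliberately compact two-stage sketch; it is our simplification, with both stages using linear models fit with numpy, while the actual framework can mix model types per stage. A root model routes each key to one of several second-stage models, and each second-stage model predicts a final position together with its own worst-case error bound, which then bounds the final search.

```python
import numpy as np

class TwoStageRMI:
    """Two-stage recursive-model index over a sorted numpy array `keys`."""

    def __init__(self, keys, num_leaf_models=100):
        self.keys, self.n, self.m = keys, len(keys), num_leaf_models
        pos = np.arange(self.n)
        # Stage 1: a single linear model, used only to pick a leaf model.
        self.root = np.polyfit(keys, pos, 1)
        leaf_of = self._route(keys)
        # Stage 2: one linear model plus its worst-case error per partition.
        self.leaves, self.errs = [], []
        for j in range(self.m):
            mask = leaf_of == j
            if mask.sum() >= 2:
                w = np.polyfit(keys[mask], pos[mask], 1)
            else:
                w = self.root            # too little data: reuse the root
            err = 0
            if mask.any():
                pred = np.polyval(w, keys[mask]).round()
                err = int(np.abs(pred - pos[mask]).max())
            self.leaves.append(w)
            self.errs.append(err)

    def _route(self, k):
        # Scale the root prediction from positions [0, n) to leaf ids [0, m).
        p = np.polyval(self.root, k) * self.m / self.n
        return np.clip(p, 0, self.m - 1).astype(int)

    def lookup(self, key):
        j = int(self._route(np.array([key]))[0])
        guess = int(round(np.polyval(self.leaves[j], key)))
        lo = max(0, guess - self.errs[j])
        hi = min(self.n, guess + self.errs[j] + 1)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return i if i < self.n and self.keys[i] == key else None
```

For example, with keys = np.sort(np.random.uniform(0, 1e9, 100_000)), TwoStageRMI(keys).lookup(keys[123]) returns 123. Because each leaf stores its own error bound, a badly fit region only hurts the look-ups that route into it.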
3.1 The Learning Index Framework (LIF)
The LIF can be regarded as an index synthesis system; given an index specification, LIF generates different index configurations, optimizes them, and tests them automatically. While LIF can learn simple models on-the-fly (e.g., linear regression models), it relies on Tensorflow for more complex models (e.g., neural nets). However, it never uses Tensorflow at inference time. Rather, given a trained Tensorflow model, LIF automatically extracts all weights from the model and generates efficient index structures in C++ based on the model specification.
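The paper does not show the generated C++; as a rough illustration of the idea (in Python for brevity), the sketch below extracts the weights from a trained tf.keras model, such as the one in Section 2.3, and evaluates the two-layer ReLU network with plain numpy, bypassing the Tensorflow runtime entirely. LIF's code generation would hard-code the same straight-line computation in C++.

```python
import numpy as np

def extract_weights(model):
    # Pull the (kernel, bias) pairs out of a trained tf.keras model once;
    # after this, inference never touches the Tensorflow runtime again.
    return [layer.get_weights() for layer in model.layers]

def predict_position(params, key):
    # Plain forward pass: matrix multiply + bias per layer, with ReLU on
    # every layer except the output. For a 1 -> 32 -> 32 -> 1 network this
    # is a handful of small matmuls, i.e., roughly a thousand multiplications.
    h = np.array([[key]], dtype=np.float32)
    for i, (kernel, bias) in enumerate(params):
        h = h @ kernel + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return float(h[0, 0])
```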
Our code generation is particularly designed for small models and removes all the unnecessary overhead and instrumentation that Tensorflow needs to manage larger models. Here we leverage ideas from [25], which already showed how to avoid unnecessary overhead from the Spark runtime. As a result,