The bigdata® RDF Database Technical Whitepaper
SYSTAP, LLC Page 5 of 25 5/29/2013
In bigdata, an index maps unsigned byte[] keys to byte[] values
6
. Mechanisms are provided
which support the encoding of single and multi-field numeric, ASCII, and Unicode data.
Likewise, extensible mechanisms provide for (de)serialization of application data as byte[]s for
values. An index entry is known as a “tuple”. In addition to the key and value, a tuple contains
a “deleted” flag which is used to prevent reads through to historical data in index views,
discussed below, and a revision timestamp, which supports optional transaction processing
based on Multi-Version Concurrency Control (MVCC)
7
. The IndexMetadata object is used to
configure both local and scale-out indices. Some of its most important attributes are the index
name, index UUID, branching factor, objects that know how to serialize application keys and
both serialize and deserialize application values store in the index, and the key and value coder
objects.
The B+Tree never overwrites records (nodes or leaves) on the disk. Instead, it uses copy-on-
write for clean records, expands them into Java objects for fast mutation and places them onto a
hard reference ring buffer for that B+Tree instance. On eviction from the ring buffer, and during
checkpoint operations, records are coded into their binary format and written on the backing
store.
Records can be directly accessed in their coded form. The default key coding technique is
front coding, which supports fast binary search with good compression. Canonical Huffman
8
9
coding is supported for values. Custom coders may be defined, and can be significantly faster
for specific applications.
The high-level API for the B+Tree includes methods that operate on a single key-value pair
(insert, lookup, contains, remove), methods which operate on key ranges (rangeCount,
rangeIterator), and a set of methods to submit Java procedures that are mapped against the
index and execute locally on the appropriate data services (see below). Scale-out applications
make extensive use of the key-range methods, mapped index procedures, and asynchronous
write buffers to ensure high performance with distributed data.
The rangeCount(fromKey,toKey) method is of particular relevance for query planning. The
B+Tree nodes internally track the #of tuples spanned by a separator key. Using this
information, the B+Tree can report the cardinality of a key-range on an index using only two key
probes against the index. This range count will be exact unless delete markers are being used,
in which case it will be an upper bound (the range count includes the tuples with delete
markers). Fast range counts are also available on a federation, where a key-range may span
multiple index partitions.
Scale-Up Architecture
The Journal manages a backing store, provides low-level mechanisms for writing and reading
allocations on that file, and has higher-level mechanisms for registering and operating on
indices. There are several different backing store models for the Journal. The most important
are described below.
6
We are reviewing this design decision with respect to column-wise storage.
7
Reed, D.P.. "Naming and Synchronization in a Decentralized Computer System". MIT dissertation.
http://www.lcs.mit.edu/publications/specpub.php?id=773
8
Huffman coding, http://en.wikipedia.org/wiki/Huffman_coding
9
Canonical Huffman coding, http://en.wikipedia.org/wiki/Canonical_Huffman_code