206 X. Cheng et al.
differentiated semantic of absent values and a 4-valued logic are introduced. We
present an experimental comparison of our model to several open source data stores in
section 5. Finally, the Section 6 discusses our conclusions and future work.
2 Related Work
Due to the tremendous increase in the scale of generated sparse data, various methods
have been proposed in order to model incomplete data. Based on the adopted logical or
physical layout, they can be grouped into two types:
⑴
The first category tries to
capture the sparse characteristic with distinct logic structure. The native idea is
to decompose a sparse table into a number of smaller and denser tables. One way is to
store a few “dense” attributes that most rows defined in a horizontal table, then
relegates the rest of attribute-value pairs to a large text file. It is on the premise that the
distribution of the non-null values must conform to this multi-table schema[3]. Another
way is to use the 3-ary vertical representation[4]. A single row in a horizontal table is
split into as many rows as the number of non-null attributes. The schema evolution
is just an addition and deletion of a row. However writing SQL queries on vertical table
is much more difficult than on the horizontal table. Decomposed storage model (DSM)
[13] and BigTable [5] are also of this kind.
⑵
The second type try to decouple the
logical and physical storage of entities. It remains the logical relational schema of upper
layer unchanged and tackles the problem by way of delicate physical storage strategy.
For the row-oriented storage[2], we can either use a placeholder to replace each
appearance of the absent values or omit the missing values all together. While the
former option is a waste of space, the latter one slows down tuple random access[21].
Positional format and interpreted attribute layout[1] belong to this kind. By contrast, in
the column-oriented storage, such as C-Store[11], MonetDB[10], the popular way to
deal with NULL is to treat it as a special value and resort to different compression
methods to compress the values[3]. While improving query efficiency and facilitating
schema evolution, this method generally suffers from higher cost of inserts(tuple
fragmentation) and record reconstruction[21]. In addition, other works[7][8][9] make
choice of an eclectic method to combine the formats of NSM and DSM, also known as
hybrid representation. For a given relation, PAX[8] stores the same data on each page
as NSM. Within each page, however, it groups all the values of a particular attribute
together on a minipage. Similarly, RCFile[7] applies the concept of “first
horizontally-partition, then vertically-partition” as well. With the complicated internal
structure, it suffers from high overhead of schema evolution and serves as a storage
structure for the almost read-only data warehouse system.
3 The Layered and Configurable Storage Structure – Dynamic
Table
To address the problem of sparse and dense data representation in the cloud, we resort
to a layered and configurable storage structure to describe the dataset. As illustrated in
Fig. 1, the storage structure is composed of 3 layers. While the upper two layers give
the logical layout of a dataset, the underlying layer defines the physical storage format.
When we said the Storage Structure is configurable, we do mean that the tabular