Optimization of Column-oriented Storage Compression Strategy Based on HBase
Jingchao Sun
School of Information Technology & Network Security
People’s Public Security University of China
Beijing, China
Tianliang Lu
School of Information Technology & Network Security
People’s Public Security University of China
Beijing, China
Abstract—In order to solve the problem of high learning cost
and low compression efficiency caused by large data dispersion,
this paper presents a sorted-based compression strategy
selection method for HBase. First, a method for sorting the data
in each column is designed according to the characteristics of
HBase to make the data more compact. Second, according
to the characteristics of the data, a column-based compression
strategy is proposed to recommend a compression scheme.
Experiments on the TPC-DS standard dataset show that the method
performs competitively with other state-of-the-art
methods.
Keywords-Column-oriented storage; Data compression;
HBase; Selection method of compression strategy
I. INTRODUCTION
Data compression is commonly used in database storage
systems to improve performance. It saves storage space, packs
the data more densely to reduce seek times, increases the data
transfer rate and the pool hit rate, and reduces I/O, thereby
easing the mismatch between the development of CPU and disk
performance and improving query efficiency. Compression
strategies based on the characteristics of column stores have
already produced many research results. However, previous work
has not recognized the importance of sorting. The data in the
various parts of a column are widely distributed and highly
dispersed, which makes them poorly suited to compression. The
choice of compression granularity has tended toward
small-granularity strategies, but a small-granularity strategy
must collect statistical information for each sector, which
results in a high computation cost. Since the selection of the
compression scheme has received little attention, the
compression rate cannot be guaranteed.
To address the large dispersion of the data distribution in
existing column-oriented databases and the high learning
cost, we have carried out a detailed study
and make the following contributions:
1) A method of sorting data prior to compression is
proposed. The columns are split so that the data can be
sorted in accordance with the order in which they are stored
in the region while avoiding hot-spot issues. The data are
thereby arranged compactly, which minimizes differences in
the local data distribution.
2) After a careful examination of lightweight and
heavyweight compression schemes, the compression
schemes suitable for HBase are selected.
3) A sorted-based compression strategy selection
method is proposed. This method takes advantage of the
compactness of the data and uses different
compression schemes according to different data
characteristics, which achieves a good compression rate while
keeping the computation cost low.
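The effect behind contribution 1) can be illustrated with a minimal sketch. It assumes run-length encoding as the lightweight scheme and uses a made-up nine-value column; the function name run_length_encode and the sample data are hypothetical and are not taken from the method described in this paper.

```python
from itertools import groupby


def run_length_encode(values):
    """Return a list of (value, run_length) pairs for consecutive equal values."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical column: the same three values scattered over nine cells.
column = ["b", "a", "c", "a", "b", "a", "c", "b", "a"]

# Without sorting, no two neighbours are equal, so every cell is its own run.
unsorted_runs = run_length_encode(column)

# After sorting, equal values are adjacent and collapse into three runs.
sorted_runs = run_length_encode(sorted(column))

print(len(unsorted_runs), "runs before sorting")
print(len(sorted_runs), "runs after sorting:", sorted_runs)
```

In this toy case sorting reduces the number of runs from nine to three, which is the kind of local compactness that the sorting step is intended to create.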
The rest of this paper is organized as follows. Section 2
reviews prior work on column-oriented storage
compression strategies. A novel sorted-based compression
strategy selection method is proposed in Section 3. An
experimental evaluation of the performance of the method is
given in Section 4, and Section 5 concludes the paper.
II. RELATED WORK
Research on column storage compression strategies
began with the compression subsystem of C-Store
[1]. Abadi et al. [2] proposed a column compression
model based on a decision tree. The model
builds a compression scheme decision tree to choose
the best compression scheme for each column. However,
this method ignores the impact of local data features and
of the data distribution on compression.
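The general idea of choosing a scheme per column with a decision tree can be sketched as follows. This is only an illustration: the statistics, the thresholds (run_threshold, distinct_threshold), the order of the tests, and the scheme labels are assumptions made for this example and do not reproduce the tree of [2].

```python
def column_stats(values):
    """Compute simple whole-column statistics used by the decision rules."""
    n = len(values)
    # A new run starts wherever a value differs from its predecessor.
    changes = sum(1 for a, b in zip(values, values[1:]) if a != b)
    runs = changes + 1 if n else 0
    return {
        "distinct": len(set(values)),
        "avg_run": n / runs if runs else 0.0,
        "sorted": all(a <= b for a, b in zip(values, values[1:])),
    }


def choose_scheme(values, run_threshold=4.0, distinct_threshold=16):
    """Walk a small, hypothetical decision tree and return a scheme label."""
    stats = column_stats(values)
    if stats["avg_run"] >= run_threshold:
        return "run-length"        # long runs of repeated values
    if stats["distinct"] <= distinct_threshold:
        return "dictionary"        # few distinct values
    if stats["sorted"] and all(isinstance(v, int) for v in values):
        return "delta"             # ascending integers
    return "general-purpose"       # fall back to a heavyweight scheme


print(choose_scheme([7] * 8 + [9] * 8))
print(choose_scheme([1, 2, 1, 2, 1, 2]))
print(choose_scheme(list(range(100))))
print(choose_scheme(list(range(100, 0, -1))))
```

Because the statistics are computed once over the whole column, a tree of this kind cannot react to local features of the data, which is the limitation noted above.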
Wang Zhenxi et al. [3] proposed a sector-based
compression strategy that divides the data into sectors and chooses
the most suitable compression scheme based on the
correlation and difference between sectors. This method
can choose different compression schemes according to the
properties of different sectors to ensure the compression rate,
but if the similarity between sectors is too large, it
results in a large amount of computation.
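A sector-based selection of this kind can be sketched as below. The sector size, the rule inside pick_scheme_for_sector, and the sample column are hypothetical choices for illustration and are not the procedure of [3].

```python
def split_into_sectors(values, sector_size):
    """Cut a column into consecutive sectors of at most sector_size values."""
    return [values[i:i + sector_size] for i in range(0, len(values), sector_size)]


def pick_scheme_for_sector(sector):
    """Hypothetical rule based on the number of distinct values in one sector."""
    distinct = len(set(sector))
    if distinct == 1:
        return "run-length"        # the sector holds a single repeated value
    if distinct <= len(sector) // 2:
        return "dictionary"        # at most half of the values are distinct
    return "general-purpose"


# Hypothetical column with three sectors of four values each.
column = [5, 5, 5, 5, 1, 2, 1, 2, 3, 9, 4, 7]
sectors = split_into_sectors(column, 4)

# Statistics are gathered once per sector, so the cost grows with the sector count.
plan = [pick_scheme_for_sector(sector) for sector in sectors]
print(plan)
```

Each sector receives its own scheme, at the price of one round of statistics per sector; with small sectors this per-sector work is the computation cost discussed in the text.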
Idreos et al. [4] proposed a dynamic selection strategy for
compression schemes based on Bayesian classification.
Compression schemes are chosen for different data sectors
using the Bayesian formula to obtain the best compression effect.
However, the accuracy of this method depends greatly on
the training samples, and it does not establish an evaluation
layer to evaluate the compression scheme based on the
feedback results.
Wang Haiyan et al. [5] proposed a compression strategy
selection method based on hot and cold data classification.
First, based on the frequency of data access, the HBase
data was divided into hot and cold data. Second, a new
compression classification method was proposed by
combining Bayesian classification compression with an added
evaluation layer and the sector-based compression strategy.
International Conference on Big Data and Artificial Intelligence
2018
978-1-5386-6136-9/18/$31.00 ©2018 IEEE