HBase Column-Oriented Storage Compression Strategy with Sorting Optimization: Improving Performance and Efficiency


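This article summarizes the paper "Optimization of Column-oriented Storage Compression Strategy Based on Hbase". The authors, Jingchao Sun and Tianliang Lu of the School of Information Technology & Network Security at the People's Public Security University of China, propose a method to address the high learning cost and low compression efficiency that HBase faces when handling large volumes of dispersed data.

First, the core of the paper is a data sorting method designed around the characteristics of HBase. Because HBase uses a column-oriented storage structure, the method aims to improve compression by increasing data clustering, which reduces unnecessary storage space and improves overall performance. Sorting the data by specific rules places similar or related values next to each other, which makes compression algorithms more applicable and more efficient.

Second, the authors propose a column-based compression strategy recommendation system. It takes the characteristics of the data into account and selects the most suitable scheme for each type of data, such as dictionary encoding, Huffman coding, or LZ77/78. This tailored strategy helps minimize storage requirements while preserving data integrity and readability.

To validate the method, the experiments use the TPC-DS standard dataset for comparative tests. The results show that the scheme performs competitively against current state-of-the-art compression techniques. This indicates that, for large-scale HBase environments with dispersed data, the strategy can significantly reduce storage cost, improve resource utilization, and improve response times for real-time queries and data analysis.

In summary, the paper proposes an effective data sorting and compression strategy selection method tailored to HBase's column-oriented storage. It addresses a practical data management problem and offers ideas and practical guidance for compression optimization in other database management systems. Future work could explore how to adjust the compression strategy dynamically as data characteristics and workloads change.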

Optimization of Column-oriented Storage Compression Strategy Based on Hbase

Jingchao Sun

School of Information Technology & Network Security

People’s Public Security University of China

Beijing, China

Tianliang Lu

School of Information Technology & Network Security

People’s Public Security University of China

Beijing, China

Abstract—To address the high learning cost and low compression efficiency caused by large data dispersion, this paper presents a sorted-based compression strategy selection method for HBase. First, a method for sorting the data in each column is designed according to the characteristics of HBase to make the data more compact. Second, a column-based compression strategy is proposed that recommends a compression scheme according to the characteristics of the data. Experiments on the TPC-DS standard dataset show competitive performance compared with other state-of-the-art methods.

Keywords-Column-oriented storage; Data compression; HBase; Selection method of compression strategy

I. INTRODUCTION

Data compression is widely used in database storage systems to improve performance: it saves storage space, packs data more densely so that seek times fall, raises data transfer rates and buffer pool hit rates, and reduces I/O, thereby easing the mismatch between CPU and disk speeds and improving query efficiency. Compression strategies that exploit the characteristics of column stores have already been studied extensively. However, prior work has not recognized the importance of sorting. The data in each part is widely distributed and highly dispersed, which makes it poorly suited to compression. The choice of compression granularity has tended toward fine-grained strategies, but a fine-grained strategy must collect statistics for every sector, which results in a high computation cost. Because the selection of the compression scheme has received little attention, the compression rate cannot be guaranteed.
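
The effect of dispersion on compressibility can be illustrated with a small sketch. The column contents, the value range, and the use of zlib as a stand-in codec are assumptions made for illustration; they are not the datasets or codecs evaluated in this paper.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size, in bytes, of a column of integers."""
    payload = "\n".join(str(v) for v in values).encode("utf-8")
    return len(zlib.compress(payload))


# Hypothetical column: 10,000 values drawn from 50 distinct integers.
rng = random.Random(0)
column = [rng.randrange(50) for _ in range(10_000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting places equal values next to each other, so the codec sees long runs instead of scattered repeats, and the sorted column should compress to a smaller size in this setup.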

To address the large data dispersion in existing column-store databases and the high learning cost, we carried out a detailed study and make the following contributions:

1) A method of sorting data prior to compression is proposed. The column structure is split so that the data can be sorted in accordance with the order in which it is stored in the region while avoiding hot-spot issues. The data is thereby arranged closely, which minimizes local differences in data distribution.

2) After examining lightweight and heavyweight compression schemes, we select the compression schemes suitable for HBase.

3) A sorted-based compression strategy selection method is proposed. This method takes advantage of the compact nature of compressed data and applies different compression schemes according to different data characteristics, which achieves a good compression rate while keeping the computation cost low.
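
The idea behind contributions 1) and 3) can be sketched roughly as follows: sort a column, compute a few inexpensive statistics, and map them to a scheme. The function names, the statistics, the thresholds, and the candidate scheme names below are all hypothetical; the selection rules actually used by the method are not reproduced here.

```python
def column_stats(values):
    """Compute simple statistics on a sorted copy of the column."""
    ordered = sorted(values)
    count = len(ordered)
    distinct = len(set(ordered))
    # In a sorted column, every distinct value forms exactly one run,
    # so the average run length is count / distinct.
    avg_run = count / distinct if distinct else 0.0
    all_ints = all(isinstance(v, int) for v in ordered)
    return {"count": count, "avg_run": avg_run, "all_ints": all_ints}


def recommend_scheme(values, rle_threshold=4.0, dict_threshold=2.0):
    """Recommend a scheme name from column statistics (hypothetical rules)."""
    stats = column_stats(values)
    if stats["count"] == 0:
        return "none"
    if stats["avg_run"] >= rle_threshold:
        return "run-length"       # long runs after sorting
    if stats["avg_run"] >= dict_threshold:
        return "dictionary"       # values repeat, but runs are short
    if stats["all_ints"]:
        return "delta"            # mostly unique, sorted integers
    return "general-purpose"      # fall back to a heavyweight codec


print(recommend_scheme([7] * 100))
print(recommend_scheme(list(range(100))))
```

Because the statistics are computed once per column rather than once per sector, a rule of this kind keeps the selection cost low, which is the trade-off the contribution list describes.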

The rest of this paper is organized as follows: Section 2 introduces prior work on column-oriented storage compression strategies. A novel sorted-based compression strategy selection method is proposed in Section 3. An experimental evaluation of the performance of the method is given in Section 4, and Section 5 concludes this paper.

II. RELATED WORK

Research on column-store compression strategies began with the compression subsystem of C-Store [1]. Abadi et al. [2] proposed a column compression model based on a decision tree. The model builds a decision tree of compression schemes to choose the best scheme for each column. However, this method ignores the impact of local data features and of the data distribution on compression.

Wang Zhenxi et al. [3] proposed a sector-based compression strategy that divides the data into sectors and chooses the most suitable compression scheme based on the correlation and difference between partitions. This method can choose different compression schemes according to the properties of different sectors to ensure the compression rate, but if the similarity between sectors is too large, it results in a large amount of computation.
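
A sector-based strategy of this general kind might be sketched as below. The sector size, the candidate codecs (Python's standard zlib, bz2, and lzma modules), and the try-every-codec selection rule are assumptions for illustration, not the procedure of [3]; the sketch mainly shows why per-sector selection multiplies the work.

```python
import bz2
import lzma
import zlib

# Candidate codecs from the Python standard library (an assumption).
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def split_sectors(data, sector_size):
    """Split a byte string into consecutive fixed-size sectors."""
    return [data[i:i + sector_size] for i in range(0, len(data), sector_size)]


def choose_per_sector(data, sector_size=4096):
    """Pick, for each sector, the codec with the smallest output.

    Returns (choices, trials), where trials counts compression attempts.
    """
    choices = []
    trials = 0
    for sector in split_sectors(data, sector_size):
        sizes = {}
        for name, compress in CODECS.items():
            sizes[name] = len(compress(sector))
            trials += 1
        choices.append(min(sizes, key=sizes.get))
    return choices, trials


choices, trials = choose_per_sector(b"abc" * 5000)
print(len(choices), "sectors,", trials, "compression attempts")
```

The number of attempts grows with the number of sectors times the number of candidate schemes, which is the computation cost the paragraph above refers to.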

Idreos et al. [4] proposed a dynamic selection strategy for compression schemes based on Bayesian classification. Compression schemes are chosen for different data sectors using the Bayesian formula to obtain the best compression effect. However, the accuracy of this method depends heavily on the training samples, and it does not establish an evaluation layer to evaluate the compression scheme based on feedback results.

Wang Haiyan et al. [5] proposed a compression strategy selection method based on hot and cold data classification. First, the HBase data is divided into hot and cold data based on the frequency of data access. Second, a new compression classification method is proposed by combining Bayesian classification compression, extended with an evaluation layer, with the sector-based compression strategy.
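
The first step, dividing data into hot and cold sets by access frequency, could look roughly like this. The access log format, the threshold, and the function name are hypothetical; [5] may define hot and cold data differently.

```python
from collections import Counter


def classify_hot_cold(access_log, hot_threshold=3):
    """Split row keys into hot and cold sets by access count.

    access_log: iterable of row keys, one entry per access.
    hot_threshold: minimum number of accesses for a key to count as hot.
    """
    counts = Counter(access_log)
    hot = {key for key, n in counts.items() if n >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold


hot, cold = classify_hot_cold(["r1", "r2", "r1", "r3", "r1", "r2"])
print("hot:", sorted(hot), "cold:", sorted(cold))
```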

International Conference on Big Data and Artificial Intelligence 2018
