Dynamic Table: A Storage Solution for Sparse and Dense Data in the Cloud


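This paper proposes a three-layer storage structure called "Dynamic Table" to address the challenge of storing both sparse and dense data in big-data settings. Dynamic Table is designed for distributed storage in the cloud and supports a hybrid row and column layout, letting users mix and match the two physical storage formats as needed. The paper also introduces a 4-valued logic to capture the original semantics of missing values, which is essential for understanding and analyzing the missing values in sparse datasets. In the experiments, the Dynamic Table approach combines the advantages of column-oriented and row-oriented storage.

In the big-data era, the rapid growth and diversity of data create new storage requirements. Traditional dense data structures are no longer sufficient for the growing volume of sparse data. Sparse data, meaning data with a large share of null or missing values, has become a major component of massive datasets. Dynamic Table is proposed in response to this challenge.

Dynamic Table uses a three-layer storage structure. The upper two layers give the logical layout of a dataset, while the underlying layer defines the physical storage format. This layering lets the system optimize access to both dense and sparse data and so meet the needs of different applications.

To handle missing values, the paper introduces a 4-valued logic, described as an extension of two-valued logic (present or absent). In two-valued logic a value either exists or does not, but in sparse data a missing value can mean different things, such as not collected, invalid, or unknown. The 4-valued logic distinguishes these cases and gives a more precise representation of the data, which matters especially for data analysis and mining.

Another key feature of Dynamic Table is its hybrid layout, which supports a combination of row-oriented and column-oriented storage. Row-oriented storage suits frequent whole-row access, while column-oriented storage favors aggregation and statistical analysis. In the cloud, the hybrid layout can be adjusted to the task at hand, which improves the flexibility and performance of the storage system.

In experiments on synthetic and real datasets, Dynamic Table is reported to combine the efficient access of row-oriented storage with the analytical advantages of column-oriented storage. The results indicate good performance on sparse data without sacrificing support for dense data.

"Dynamic Table: A Layered and Configurable Storage Structure in the Cloud" offers a storage model suited to the diverse data needs of big-data scenarios, in particular the handling of missing values in sparse data. It is a useful reference for research on cloud storage and data analysis.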


206 X. Cheng et al.

differentiated semantics of absent values and a 4-valued logic are introduced. We present an experimental comparison of our model with several open-source data stores in Section 5. Finally, Section 6 discusses our conclusions and future work.

2 Related Work

Due to the tremendous increase in the scale of generated sparse data, various methods have been proposed to model incomplete data. Based on the logical or physical layout they adopt, these methods fall into two types:

⑴ The first category tries to capture the sparse characteristic with a distinct logical structure. The naive idea is to decompose a sparse table into a number of smaller and denser tables. One way is to store the few "dense" attributes that most rows define in a horizontal table and relegate the remaining attribute-value pairs to a large text file. This works only on the premise that the distribution of the non-null values conforms to this multi-table schema [3]. Another way is to use the 3-ary vertical representation [4]: a single row in a horizontal table is split into as many rows as it has non-null attributes. Schema evolution then amounts to just adding or deleting a row. However, writing SQL queries on a vertical table is much more difficult than on a horizontal table. The decomposed storage model (DSM) [13] and BigTable [5] are also of this kind.
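The 3-ary vertical representation described above can be illustrated with a minimal sketch. It assumes rows are plain Python dictionaries with `None` standing for an absent value; the names `to_vertical`, `to_horizontal`, and `row_count` are hypothetical and do not come from the paper or from any cited system.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Optional[Any]]   # horizontal row: attribute -> value, None if absent
Triple = Tuple[int, str, Any]    # vertical row: (object id, attribute, value)


def to_vertical(rows: List[Row]) -> List[Triple]:
    """Split each horizontal row into one triple per non-null attribute."""
    triples: List[Triple] = []
    for oid, row in enumerate(rows):
        for attr, value in row.items():
            if value is not None:  # absent values are simply not stored
                triples.append((oid, attr, value))
    return triples


def to_horizontal(triples: List[Triple], attributes: List[str],
                  row_count: int) -> List[Row]:
    """Rebuild horizontal rows; attributes without a triple become None."""
    rows: List[Row] = [dict.fromkeys(attributes) for _ in range(row_count)]
    for oid, attr, value in triples:
        rows[oid][attr] = value
    return rows


if __name__ == "__main__":
    table = [
        {"name": "a", "color": None, "weight": 3},
        {"name": "b", "color": "red", "weight": None},
    ]
    for triple in to_vertical(table):
        print(triple)
```

In this sketch, adding a new attribute to one object means appending a single triple, which mirrors the cheap schema evolution noted above. Rebuilding full rows requires regrouping triples by object id, which hints at why queries over a vertical table are harder to write than queries over a horizontal one.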

⑵ The second type tries to decouple the logical and physical storage of entities. It keeps the logical relational schema of the upper layer unchanged and tackles the problem through a carefully designed physical storage strategy. For row-oriented storage [2], we can either use a placeholder for each occurrence of an absent value or omit the missing values altogether. While the former option wastes space, the latter slows down random tuple access [21]. Positional format and interpreted attribute layout [1] belong to this kind.

By contrast, in column-oriented storage, such as C-Store [11] and MonetDB [10], the popular way to deal with NULL is to treat it as a special value and rely on different compression methods to compress the values [3]. While it improves query efficiency and facilitates schema evolution, this method generally suffers from a higher cost of inserts (tuple fragmentation) and record reconstruction [21].

In addition, other works [7][8][9] take an eclectic approach that combines the NSM (N-ary Storage Model, i.e. row-oriented) and DSM formats, also known as hybrid representation. For a given relation, PAX [8] stores the same data on each page as NSM. Within each page, however, it groups all the values of a particular attribute together on a minipage. Similarly, RCFile [7] applies the concept of "first horizontally-partition, then vertically-partition". With its complicated internal structure, it suffers from a high overhead for schema evolution and serves as a storage structure for almost read-only data warehouse systems.
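The PAX idea of "same records per page as NSM, but grouped by attribute inside the page" can be sketched as follows. This is an illustration under simplifying assumptions, not the layout of PAX or RCFile themselves: pages are Python lists rather than fixed-size byte blocks, `None` stands for NULL, and the names `paginate`, `nsm_page`, `pax_page`, and `reconstruct` are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Optional[Any]]


def paginate(rows: List[Row], page_size: int) -> List[List[Row]]:
    """Horizontal partitioning: cut the relation into pages of page_size rows."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


def nsm_page(rows: List[Row], attributes: List[str]) -> List[Tuple[Any, ...]]:
    """NSM page: whole records stored one after another."""
    return [tuple(row[a] for a in attributes) for row in rows]


def pax_page(rows: List[Row], attributes: List[str]) -> Dict[str, List[Any]]:
    """PAX-style page: the same records, regrouped into one minipage per attribute."""
    return {a: [row[a] for row in rows] for a in attributes}


def reconstruct(page: Dict[str, List[Any]], attributes: List[str], slot: int) -> Row:
    """Rebuild one record by reading the same slot from every minipage."""
    return {a: page[a][slot] for a in attributes}


if __name__ == "__main__":
    attrs = ["id", "price"]
    data = [{"id": 1, "price": 10.0}, {"id": 2, "price": None}, {"id": 3, "price": 30.0}]
    pages = [pax_page(chunk, attrs) for chunk in paginate(data, 2)]
    # A column scan touches only the "price" minipage of each page.
    total = sum(v for page in pages for v in page["price"] if v is not None)
    print(pages, total)
```

Because every page holds complete records, rebuilding a tuple never leaves the page, while a scan over one attribute reads a single minipage per page. The sketch leaves out the internal bookkeeping of real formats, which is the part that makes schema evolution expensive.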

3 The Layered and Configurable Storage Structure – Dynamic Table

To address the problem of sparse and dense data representation in the cloud, we resort to a layered and configurable storage structure to describe the dataset. As illustrated in Fig. 1, the storage structure is composed of three layers. While the upper two layers give the logical layout of a dataset, the underlying layer defines the physical storage format. When we say the storage structure is configurable, we mean that the tabular
