lists are small, highly-compressible data structures that can be operated on directly with very little overhead. For example, 32 (or 64, depending on processor word size) positions can be intersected at once when ANDing together two position lists represented as bit-strings. Note, however, that one problem with this late materialization approach is that it requires re-scanning the base columns to form tuples, which can be slow (though they are likely to still be in memory upon re-access if the query is properly pipelined).
The main contribution of this paper is not to introduce new materialization strategies (as described in the related work, many of these strategies have been used in other column-stores). Rather, it is to systematically explore the trade-offs between different strategies and to provide a foundation for choosing a strategy for a particular query. We focus on standard warehouse-style queries: read-only workloads with selections, aggregations, and joins. We extend the C-Store column-oriented DBMS [14] with a variety of materialization strategies, and experimentally evaluate the effects of varying selectivities, compression techniques, and query plans on these strategies. Further, we provide a model that can be used (for example) in a query optimizer to select a materialization strategy. Our results show that, on some workloads, late materialization can be an order of magnitude faster than early materialization, while on other workloads, early materialization outperforms late materialization.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the C-Store query executor. We illustrate the trade-offs between materialization strategies in Section 3 and then present both pseudocode and an analytical model for example query plans using each strategy in Section 4. We validate our models experimentally (using a version of C-Store we extended) in Section 5. Finally, we describe related work in Section 6 and conclude in Section 7.
2 The C-Store Query Executor
We chose to use C-Store as the column-oriented DBMS to extend and experiment with since we were already familiar with the source code. Since this is the system in which the various materialization strategies were implemented for this study, we now provide a brief overview of the relevant details of the C-Store query executor, which is more fully described in previous work [4, 14] and available in an open source release [2]. The components of the query executor most relevant to this paper are the on-disk layout of data, the access methods provided for reading data from disk, the data structures provided for representing data in the DBMS, and the operators for manipulating data.
Each column is stored in a separate file on disk as a series of 64KB blocks and can optionally be encoded using a variety of compression techniques. In this paper we experiment with column-specific compression techniques (run-length encoding and bit-vector encoding) and with uncompressed columns. In a run-length encoded file, each block contains a series of RLE triples (V, S, L), where V is the value, S is the start position of the run, and L is the length of the run.
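As an illustration of this layout (the struct and function names are ours, not C-Store's), the following sketch expands a sequence of RLE triples back into a flat column, assuming 0-based positions:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An RLE triple (V, S, L) as described in the text: value V, start
// position S, and run length L.
struct RLETriple { int value; std::size_t start; std::size_t length; };

// Expand triples into a flat column of n values. Assumes the runs
// together cover positions [0, n) without overlap.
std::vector<int> decodeRLE(const std::vector<RLETriple>& triples, std::size_t n) {
    std::vector<int> column(n);
    for (const auto& t : triples)
        for (std::size_t i = 0; i < t.length; ++i)
            column[t.start + i] = t.value;
    return column;
}
```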
A bit-vector encoded file representing a column of size n with k distinct values consists of k bit-strings of length n, one per distinct value, stored sequentially. The bit-string for value v has a 1 in the i-th position if the column it represents has the value v in the i-th position.
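The following sketch (our own, with illustrative names) builds this representation for a small column, producing one n-bit string per distinct value:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Bit-vector encode a column of n values with k distinct values:
// one bit-string of length n per distinct value, with a 1 in position i
// iff the column holds that value in position i.
std::map<int, std::vector<bool>> bitVectorEncode(const std::vector<int>& column) {
    std::map<int, std::vector<bool>> bitStrings;
    for (std::size_t i = 0; i < column.size(); ++i) {
        auto& bits = bitStrings[column[i]];
        bits.resize(column.size(), false);  // allocate the n-bit string lazily
        bits[i] = true;
    }
    return bitStrings;
}
```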
C-Store provides an access method (or data source) for each encoding type. All C-Store data sources support two basic operations: reading positions from a column and reading (position, value) pairs from a column. Additionally, all C-Store data sources accept predicates to restrict the set of results returned. In order to minimize CPU overhead, C-Store data sources and operators are block-oriented. Data sources return data from the underlying files as blocks of encoded data, wrapped inside a C++ object that provides iterator-style (hasNext() and getNext() methods) and vector-style [8] (asArray()) access to the data.
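A minimal sketch of that block interface might look as follows. The method names hasNext(), getNext(), and asArray() come from the text; the implementation is our own, and for simplicity the block holds decoded int values rather than encoded data:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A block of column values offering both iterator-style and
// vector-style access, as described in the text.
class Block {
    std::vector<int> values_;
    std::size_t pos_ = 0;
public:
    explicit Block(std::vector<int> v) : values_(std::move(v)) {}
    bool hasNext() const { return pos_ < values_.size(); }
    int getNext() { return values_[pos_++]; }                    // iterator-style
    const std::vector<int>& asArray() const { return values_; } // vector-style
};
```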
In Section 4.1 we give pseudocode for the C-Store operators relevant to this paper: DataSource (Select), AND, and Merge. These operators are used to construct the query plans we experiment with. We also describe the Join operator in Section 5.3. The DataSource operator reads in a column of data and produces the column values that pass a predicate. AND accepts input position lists and produces an output position list representing their intersection. Finally, the n-ary Merge operator combines n lists of (position, value) pairs into a single output list of n-attribute tuples.
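As a rough sketch of what Merge does (a two-input version with our own names, assuming each input list is sorted by position; not C-Store's actual operator), matching positions are zipped into value tuples:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using PosVal = std::pair<std::size_t, int>;

// Two-input sketch of Merge: combine two position-sorted
// (position, value) lists into (value1, value2) pairs for the
// positions present in both inputs.
std::vector<std::pair<int, int>> merge2(const std::vector<PosVal>& a,
                                        const std::vector<PosVal>& b) {
    std::vector<std::pair<int, int>> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first == b[j].first) {              // matching positions
            out.emplace_back(a[i].second, b[j].second);
            ++i; ++j;
        } else if (a[i].first < b[j].first) {
            ++i;                                     // position absent from b
        } else {
            ++j;                                     // position absent from a
        }
    }
    return out;
}
```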
3 Materialization Strategy Trade-offs
In this section we present some of the trade-offs that are made between materialization strategies. A materialization strategy needs to be in place whenever more than one attribute from any given relation is accessed (which is the case for most queries). Since a column-oriented DBMS stores each attribute independently, it must have some mechanism for stitching together multiple attributes from the same logical tuple into a physical tuple. Every proposed column-oriented architecture accomplishes this by attaching either physical or virtual tuple identifiers, or positions, to column values. To reconstruct a tuple from multiple columns of a relation, the DBMS simply needs to find matching positions. Modern column-oriented systems [7, 8, 14] store columns in position order; i.e., to reconstruct the i-th tuple, one uses the i-th value from each column. This accelerates the tuple reconstruction process.
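In a position-ordered store, reconstruction thus reduces to positional indexing. A minimal sketch (the column types and function name are illustrative, not C-Store's):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// With all columns stored in the same position order, the i-th tuple
// is simply the i-th entry of each column: O(1) per attribute, with
// no matching or searching required.
std::tuple<int, std::string> reconstructTuple(const std::vector<int>& colA,
                                              const std::vector<std::string>& colB,
                                              std::size_t i) {
    return {colA[i], colB[i]};
}
```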
As described in the introduction, tuple reconstruction can occur at different points in a query plan. Early materialization (EM) constructs tuples as soon as (or sometimes before) tuple values are needed in the query plan. Late