Optimizing Wide Table Layout: Column Ordering and Duplication Strategies

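"Wide Table Layout Optimization based on Column Ordering and Duplication" is a research paper on how, in big data analytics, optimizing column order and column duplication can improve the storage and query performance of wide tables. Wide tables typically have hundreds to thousands of columns and are a common data structure in analytics tasks. Column stores are regarded as the ideal data format for wide tables and analytical workloads, but the paper points out that the effect of the physical column order on I/O performance has not been studied sufficiently.

Column order matters because reading the columns of a single horizontal partition of a wide table may involve multiple disk seeks. An ideal column order minimizes the cumulative disk seek cost of the queries applied to the data and thereby maximizes I/O performance. The authors therefore study two problems for column stores on HDFS (Hadoop Distributed File System): column ordering and column duplication.

Column ordering seeks an arrangement of the columns that lowers I/O cost. The goal is a permutation that minimizes the total disk seek cost of the workload, which speeds up data reads and queries.

Column duplication replicates selected columns to reduce I/O further. In the excerpt below, popular columns are duplicated within a storage headroom, and the replicas are inserted at carefully selected positions in the column order so that queries can read a nearby replica and avoid seeks.

The paper proposes new optimization strategies and validates them experimentally. As the excerpt below shows, column ordering is handled by SCOA, an algorithm based on simulated annealing, and column duplication is formulated as an optimization problem under a storage constraint.

The paper is a useful reference for practitioners in big data analytics and database systems: it gives concrete methods for optimizing wide table layout through column ordering and duplication. These techniques are most useful for workloads with large datasets and complex queries, where they reduce resource consumption and speed up analysis.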

Algorithm 1: SCOA

Input: The set of queries Q = {q_1, q_2, ..., q_m};
       The initial column order S_0 = {c_1, c_2, ..., c_n}
Output: The optimized column order S;

S := S_0, e := Cost(Q, S_0), t := t_0;
for k := 1 to k_max do
    t := Temperature(t, cooling_rate);
    S' := Neighbor(S);
    e' := Cost(Q, S');
    if (e' < e) || (exp((e − e') / t) > random(0, 1)) then
        S := S';
        e := e';
return S;

proposed in [35]. The Temperature function is the core function of the annealing schedule. In this algorithm, the temperature shrinks at a rate of (1 − cooling_rate). Function Neighbor(S) generates a candidate neighboring state S' from the current state S by swapping the positions of two randomly picked columns in S. Parameter settings of SCOA are discussed in Appendix C.
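The annealing loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the cost function `query_seek_cost` (sum of `f` over the byte gaps between consecutively accessed columns, initial seek ignored), the default values of `t0`, `cooling_rate` and `k_max`, and the floor on `t` are all hypothetical choices made for illustration.

```python
import math
import random


def query_seek_cost(query, order, sizes, f):
    """Assumed cost model: sum f(gap) over the byte gaps between
    consecutively accessed columns; the initial seek is ignored."""
    cost, gap, seen_first = 0.0, 0, False
    for col in order:
        if col in query:
            if seen_first and gap > 0:
                cost += f(gap)
            seen_first, gap = True, 0
        elif seen_first:
            gap += sizes[col]  # a skipped column widens the current gap
    return cost


def workload_cost(queries, order, sizes, f):
    """Cost(Q, S): total seek cost of all queries under one column order."""
    return sum(query_seek_cost(q, order, sizes, f) for q in queries)


def scoa(queries, initial_order, sizes, f,
         t0=1.0, cooling_rate=0.01, k_max=5000, seed=0):
    """Simulated annealing over column orders, following Algorithm 1."""
    rng = random.Random(seed)
    order = list(initial_order)
    e = workload_cost(queries, order, sizes, f)
    t = t0
    for _ in range(k_max):
        # Temperature(t, cooling_rate); the floor avoids division by zero
        # (the floor is an assumption of this sketch).
        t = max(t * (1.0 - cooling_rate), 1e-12)
        # Neighbor(S): swap two randomly picked columns in place.
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        e_new = workload_cost(queries, order, sizes, f)
        if e_new < e or math.exp((e - e_new) / t) > rng.random():
            e = e_new  # accept the neighboring state
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return order, e
```

Calling `scoa(queries, order, sizes, f)` with each query given as a set of column names returns the final order and its cost. Like the pseudocode, the sketch returns the last accepted state rather than the best state seen, and it recomputes the full workload cost on every iteration.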

3.3 Incremental Computation of Seek Cost

When the access pattern of a query follows the global column order (as adopted by existing systems such as HDFS), we can incrementally compute the seek cost of a query to speed up SCOA, given that a neighboring state S' is derived from the current state S by randomly swapping two columns. Consider the example in Figure 2. Query q accesses 4 columns, the set C_q. When deriving a new state by swapping two columns in C_q (e.g., Figure 2(a)), the seek cost of this query clearly remains unchanged: in both states it consists of the same three seek terms, one of the form f(s(c)) for a single skipped column and two of the form f(s(c) + s(c')) for two adjacent skipped columns, for reading a row group.

Figure 2: Three cases of the delta query cost (panels (a) and (b) each show a swap of two columns)

A more complex case occurs when neither of the two swapped columns is accessed by the query q (e.g., Figure 2(b)). The pseudo code for handling this case is presented in Algorithm 2. The SeekCost2ndCase function takes as input the current state S and two swapped columns c_i and c_j, and outputs the seek cost of the neighboring state S' for q. Let suc(c_i) be the first succeeding column of c_i in C_q, and pre(c_i) be the first preceding column of c_i in C_q. Figure 2 gives a concrete example of suc and pre.

According to Algorithm 2, it is clear that Cost(q, S') = Cost(q, S) if suc(c_i) = suc(c_j). Otherwise, at most two terms in Equation 2 will be affected, and they are updated according to Lines 4-6 and Lines 7-9, respectively.

Algorithm 2: SeekCost2ndCase

Input: A query q and sorted set C_q;
       Current column order S, and its seek cost Cost(q, S);
       Two swapped columns, c_i ∉ C_q and c_j ∉ C_q
Output: The seek cost of the neighboring state S', Cost(q, S')

1  if suc(c_i) = suc(c_j) then
2      return Cost(q, S);
3  delta := 0;
4  if pre(c_i) ≠ null and suc(c_i) ≠ null then
5      delta −= f(b(suc(c_i)) − e(pre(c_i)));
6      delta += f(b(suc(c_i)) − e(pre(c_i)) − s(c_i) + s(c_j));
7  if pre(c_j) ≠ null and suc(c_j) ≠ null then
8      delta −= f(b(suc(c_j)) − e(pre(c_j)));
9      delta += f(b(suc(c_j)) − e(pre(c_j)) − s(c_j) + s(c_i));
10 return Cost(q, S) + delta;

Note on Neighbor(S): We have also tested various other neighboring state selection heuristics, including substantially more complicated ones. However, none of them outperformed the simple 'column-swap' heuristic. For the sake of simplicity, we thus limit ourselves to the presentation of this most basic version of the algorithm.

The last case occurs when exactly one swapped column is accessed by q (e.g., Figure 2(c)), which can be handled in a similar way to Algorithm 2. An important difference from the previous two cases is that C_q will be updated if the SA algorithm accepts this neighboring state S'.
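The case where neither swapped column is accessed (Algorithm 2) can be sketched in Python and checked against a full recomputation. This is an illustrative sketch only: `layout`, `full_cost` and the argument names are hypothetical, the seek function `f` is assumed to satisfy f(0) = 0, and the byte offsets are rebuilt on every call for readability, whereas a real implementation would maintain them across iterations.

```python
from bisect import bisect_left
from itertools import accumulate


def layout(order, sizes):
    """Return {column: (begin, end)} byte offsets within one row group."""
    ends = list(accumulate(sizes[c] for c in order))
    return {c: (end - sizes[c], end) for c, end in zip(order, ends)}


def full_cost(query, order, sizes, f):
    """Recompute the seek cost of one query from scratch (initial seek ignored)."""
    off = layout(order, sizes)
    accessed = [c for c in order if c in query]
    return sum(f(off[nxt][0] - off[prv][1])
               for prv, nxt in zip(accessed, accessed[1:]))


def seek_cost_2nd_case(query, order, sizes, f, cost, ci, cj):
    """Cost of `query` after swapping ci and cj, neither of which it accesses."""
    off = layout(order, sizes)
    pos = {c: k for k, c in enumerate(order)}
    accessed_pos = sorted(pos[c] for c in query)  # stand-in for the sorted set C_q

    def pre_suc(c):
        k = bisect_left(accessed_pos, pos[c])
        pre = order[accessed_pos[k - 1]] if k > 0 else None
        suc = order[accessed_pos[k]] if k < len(accessed_pos) else None
        return pre, suc

    (pre_i, suc_i), (pre_j, suc_j) = pre_suc(ci), pre_suc(cj)
    if suc_i == suc_j:  # both columns sit in the same gap: nothing changes
        return cost
    delta = 0.0
    for pre, suc, leaving, entering in ((pre_i, suc_i, ci, cj),
                                        (pre_j, suc_j, cj, ci)):
        if pre is not None and suc is not None:
            gap = off[suc][0] - off[pre][1]
            delta -= f(gap)  # old seek term for this gap
            delta += f(gap - sizes[leaving] + sizes[entering])  # new seek term
    return cost + delta
```

Only the gaps that contain a swapped column are touched, so at most two seek terms are re-evaluated per query, which mirrors Lines 4-6 and Lines 7-9 of the pseudocode.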

Time Complexity. To maintain the sorted set C_q efficiently, we use a balanced binary search tree to insert, remove and query preceding and succeeding elements. All these operations run in O(log R) time. The overall time complexity of computing seek costs is O(|Q| · log R), where R is the average number of columns accessed by a query. Compared to the naive approach of sorting all the columns for every new ordering, this incremental approach is R times faster. On the production data we tested, R is 32 and SCOA with incremental seek cost computation only requires a few minutes to converge.
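The operations needed on the sorted set (predecessor, successor, and an update when an accepted swap moves an accessed column) can be sketched with Python's standard `bisect` module. This is a stand-in, not the balanced binary search tree described above: lookups on a sorted list are O(log R), but insertion and removal are O(R). The class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class SortedPositions:
    """Sorted positions of the columns accessed by one query."""

    def __init__(self, positions):
        self._pos = sorted(positions)

    def pre(self, p):
        """Largest accessed position strictly before p, or None."""
        k = bisect_left(self._pos, p)
        return self._pos[k - 1] if k > 0 else None

    def suc(self, p):
        """Smallest accessed position strictly after p, or None."""
        k = bisect_right(self._pos, p)
        return self._pos[k] if k < len(self._pos) else None

    def move(self, old, new):
        """Record that an accessed column moved from position old to new."""
        k = bisect_left(self._pos, old)
        if k == len(self._pos) or self._pos[k] != old:
            raise KeyError(old)
        del self._pos[k]
        insort(self._pos, new)
```

With R around 32, as reported for the production data, a sorted list is likely adequate in practice, but the O(log R) bound for updates only holds with a balanced tree.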

Besides simulated annealing (SA), we have tried several other meta heuristics. In particular, we applied a genetic algorithm (GA) [37, 51] (Appendix D) and the AutoPart algorithm [41] (Appendix E). Results show that SA performs much better.

4. STORAGE CONSTRAINED COLUMN DUPLICATION

If we have extra storage headroom, we may be able to further reduce the overall seek cost by duplicating some popular columns and inserting them into carefully selected positions within the derived column orders. Consider the simple example in Figure 3. In Fig. 3(a), the seek cost of two of the three queries is 0, while the seek cost of the third is f(s(c) + s(c')) for two skipped columns c and c' (the initial seek cost can be ignored, as it is constant). In Fig. 3(b), however, if we duplicate one column, insert the replica between two other columns, and let that third query access the new replica, the seek cost of all three queries becomes 0.
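The effect of a replica on seek cost can be reproduced with a small Python sketch. The layout below is hypothetical and is not the one in Figure 3: the column names, sizes, queries, the replica name `c1_dup`, the headroom value and the linear seek function are all assumptions chosen so that the arithmetic is easy to follow.

```python
from itertools import accumulate


def seek_cost(accessed, order, sizes, f):
    """Sum f(gap) over byte gaps between consecutively accessed columns
    (the constant initial seek is ignored)."""
    ends = list(accumulate(sizes[c] for c in order))
    spans = [(end - sizes[c], end) for c, end in zip(order, ends) if c in accessed]
    return sum(f(nxt[0] - prv[1]) for prv, nxt in zip(spans, spans[1:]))


f = lambda distance: distance  # assumed seek function with f(0) = 0

sizes = {"c1": 4, "c2": 8, "c3": 2, "c4": 6, "c5": 3}
sizes["c1_dup"] = sizes["c1"]  # the replica is as large as the original

# Layout (a): no duplication. The third query must skip c2 and c3.
order_a = ["c1", "c2", "c3", "c4", "c5"]
queries_a = [{"c1", "c2", "c3"}, {"c4", "c5"}, {"c1", "c4"}]
costs_a = [seek_cost(q, order_a, sizes, f) for q in queries_a]

# Layout (b): c1 is duplicated and the replica is placed between c3 and c4.
# The third query now reads the replica, which sits next to c4.
order_b = ["c1", "c2", "c3", "c1_dup", "c4", "c5"]
queries_b = [{"c1", "c2", "c3"}, {"c4", "c5"}, {"c1_dup", "c4"}]
costs_b = [seek_cost(q, order_b, sizes, f) for q in queries_b]

headroom = 5  # hypothetical storage headroom, in the same unit as sizes
extra_storage = sizes["c1_dup"]
fits_headroom = extra_storage <= headroom
```

In layout (a) the third query pays f(s(c2) + s(c3)); in layout (b) every query reads a contiguous run of columns. The price is the extra storage for the replica, which is what the storage headroom bounds.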

We formally define the column duplication problem as follows.

DEFINITION 7 (COLUMN DUPLICATION PROBLEM). Given a workload Q and the storage headroom, identify a set of duplicated columns with an ordering strategy such that 1) the total size of duplicated columns is not greater than the storage headroom and 2) the seek cost of Q is minimized.

In this section, we first introduce the basic idea of the duplication process in Section 4.1 and then provide details of how to optimize it in Section 4.2.
