Overview of Join Operation Optimization in Relational Databases


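In relational databases, join processing is a core query operation and is central to understanding data integration and data warehousing. A join retrieves information from two relations (tables) based on a relationship between their attributes; conceptually, it selects from the Cartesian product of the two relations those tuples that satisfy the join condition. In network and hierarchical systems this operation is relatively straightforward, because those systems predefine the links between entities. In a relational database, by contrast, the principle of data independence means that such links are established dynamically by the query condition.

The difficulty of the join lies in the absence of a predefined access path: related tuples from different relations must be matched and combined at run time. Because joins are both frequent and expensive, optimizing them is essential. To improve join efficiency, researchers have developed many methods and techniques, including indexing, partitioning, parallelization, hash joins, sort-merge joins, and nested-loops joins.

First, joins can be classified by how the relations are connected:

1. Equijoin: the join condition tests equality between attributes of the two relations; this is the most common form of inner join.
2. Non-equijoin: the join condition uses a comparison other than equality (<, >, ≠, and so on). Independently of the comparison used, a join may be inner or outer; outer joins (left, right, and full) also keep tuples that have no match in the other relation.
3. Self join: a relation is joined with itself, which is useful for data with a hierarchical or recursive structure.
4. Natural join: an equijoin on all attributes that the two relations have in common, with each common attribute appearing once in the result.

Second, in terms of implementation techniques:

- Indexing: B-trees, hash indexes, and similar structures speed up the lookup of matching records.
- Partitioning: a large table is split into smaller pieces (partitions), which narrows the range of data scanned and speeds up queries.
- Parallelization: the join is decomposed across multiple processors or nodes to raise throughput.
- Optimizer algorithms: selectivity estimation, cost models, and rule-based strategies are used to choose the best join order and join method.
- Caching: frequently used intermediate results are stored to avoid recomputation.

In addition, with the growth of big data and cloud computing, modern database systems are exploring more advanced join techniques, such as distributed joins under the MapReduce model, non-traditional join methods on NoSQL storage engines, and GPU-accelerated parallel joins.

Join processing plays a key role in relational databases. Its performance affects not only query response time but also the scalability and resource utilization of the whole system. A solid understanding of how joins work and how they are optimized is a necessary skill for database administrators, data analysts, and software developers.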

Join Processing in Relational Databases

In other words, the result relation Q is the maximal subset of the projection of R (onto the attributes of R that do not appear in S) such that the Cartesian product of relations Q and S is contained in R. To illustrate the division operation, let us redefine the relations.

Relation R

A     B
A1    B1
A2    B1
A3    B1
A1    B2
A2    B2
A3    B2
A1    B3

Relation S

Then, the result relation Q is as follows:

Relation Q = R ÷ S

Several hash-based algorithms for performing division are presented in Graefe [1989]. An implementation of the division operation on a shuffle-exchange network is described in Baba et al. [1987].
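The definition of division can be checked with a few lines of Python. This is only a minimal sketch: it uses the relation R listed above, but the contents of S are not legible in this copy, so `S_ASSUMED` is a hypothetical value chosen for illustration, and the function name `divide` is likewise an assumption.

```python
def divide(r, s):
    """Return R ÷ S, where r is a set of (a, b) pairs and s is a set of b values."""
    candidates = {a for a, _ in r}  # projection of R onto attribute A
    # Keep an A value only if it is paired in R with every B value of S,
    # i.e. the largest Q for which Q x S is contained in R.
    return {a for a in candidates if all((a, b) in r for b in s)}


R = {
    ("A1", "B1"), ("A2", "B1"), ("A3", "B1"),
    ("A1", "B2"), ("A2", "B2"), ("A3", "B2"),
    ("A1", "B3"),
}
S_ASSUMED = {"B1", "B2"}  # hypothetical: the source's S is not available

Q = divide(R, S_ASSUMED)
```

With this assumed S, every A value qualifies; if S also contained B3, only A1 would remain, because A1 is the only value paired with B3 in R.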

IMPLEMENTATIONS OF JOINS

The techniques and methods used to implement joins are discussed in the following sections. Unless otherwise noted, the algorithms are used to implement the theta-join. The description of a method includes the basic algorithm, general discussion of the method, special data structures (if any), and applicability and performance of the technique. General problems that apply to a whole class of join techniques, such as the effect of clustering or the effect of collisions on the hash join techniques, are discussed separately in Section 7.

2.1 Nested-Loops Join

Nested-loops join is the simplest join method. It follows from the definition of the join operation. One of the relations being joined is designated as the inner relation, and the other is designated as the outer relation. For each tuple of the outer relation, all tuples of the inner relation are read and compared with the tuple from the outer relation. Whenever the join condition is satisfied, the two tuples are concatenated and placed in the output buffer.

Algorithm

The algorithm for performing $R \bowtie_{r(a)\,\theta\,s(b)} S$ is as follows:

for each tuple s do
{   for each tuple r do
    {   if r(a) θ s(b) then
            concatenate r and s
            place in relation Q }}

Note that for efficiency, the relation with higher cardinality (R in this case) is chosen to be the inner relation.
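The pseudocode above translates almost line for line into Python. The sketch below is illustrative only: representing tuples as dictionaries, the attribute names `a` and `b`, and the function name `nested_loops_join` are assumptions, and the theta operator is passed in as an ordinary comparison function.

```python
import operator


def nested_loops_join(R, S, theta=operator.eq):
    """Tuple-at-a-time nested-loops theta-join of R and S on r(a) θ s(b)."""
    Q = []
    for s in S:          # outer relation
        for r in R:      # inner relation, scanned once per outer tuple
            if theta(r["a"], s["b"]):
                Q.append({**r, **s})  # concatenate r and s
    return Q


R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}, {"a": 3, "x": "r3"}]
S = [{"b": 2, "y": "s1"}, {"b": 3, "y": "s2"}]

equi = nested_loops_join(R, S)                  # r(a) = s(b)
less = nested_loops_join(R, S, operator.lt)     # r(a) < s(b)
```

Because the comparison is a parameter, the same loop handles equijoins and other theta-joins; every pair of tuples is still examined exactly once.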

Discussion

In practice, a nested-loops join is implemented as a nested-block join; that is, tuples are retrieved in units of blocks rather than individually [El-Masri and Navathe 1989]. This implementation can be briefly described as follows. The inner relation is read one block at a time. The number of main memory blocks available determines the number of blocks read from the outer relation. Then all tuples in the inner relation's block are joined with all the tuples in the outer relation's blocks. This process is repeated with all blocks of the inner relation before the next set of outer relation blocks is read in. The amount of reduction in I/O activity (compared to a simple tuple-oriented implementation) depends on the size of the available main memory.

ACM Computing Surveys, Vol. 24, No. 1, March 1992

P. Mishra and M. H. Eich

A further step toward efficiency consists of “rocking” the inner relation [Kim 1980]. In other words, the inner relation is read from top to bottom for one tuple of the outer relation and bottom to top for the next. This saves on some I/O overhead since the last page of the inner relation which is retrieved in one loop is also used in the next loop.
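The block-oriented scheme and the rocking idea can be combined in a small simulation. This is a sketch under stated assumptions, not the implementation from the cited works: pages are modeled as Python lists, rocking is applied once per set of buffered outer blocks rather than once per outer tuple, and the names `nested_block_join`, `outer_buffer_pages`, and `inner_page_reads` are hypothetical.

```python
def nested_block_join(outer_pages, inner_pages, outer_buffer_pages, match):
    """Join two paged relations; return (result pairs, inner page reads)."""
    result = []
    inner_page_reads = 0
    buffered_inner = None  # index of the inner page still held in memory
    forward = True
    for start in range(0, len(outer_pages), outer_buffer_pages):
        # Load as many outer pages as the buffer allows.
        chunk = [t for page in outer_pages[start:start + outer_buffer_pages]
                 for t in page]
        indices = range(len(inner_pages))
        order = indices if forward else reversed(indices)
        for idx in order:
            if idx != buffered_inner:
                inner_page_reads += 1  # page has to be fetched again
            buffered_inner = idx
            for i in inner_pages[idx]:
                for o in chunk:
                    if match(o, i):
                        result.append((o, i))
        forward = not forward  # rock: scan the other way on the next pass
    return result, inner_page_reads


outer_pages = [[1, 2], [3, 4], [5, 6], [7, 8]]
inner_pages = [[2, 4], [6, 8], [10, 1]]
pairs, reads = nested_block_join(outer_pages, inner_pages, 2,
                                 lambda o, i: o == i)
```

In this run the outer relation is processed in two passes. The second pass starts on the inner page that the first pass ended on, so five inner page reads are counted instead of the six that two plain top-to-bottom scans would need.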

Performance

In the above algorithm, it is seen that each tuple of the inner relation is compared with every tuple of the outer relation. Therefore, the simplest implementation of this algorithm requires $O(n \times m)$ time for execution of joins, where n and m are the cardinalities of the two relations.

The block-oriented implementation of the nested-loops join optimizes on I/O overhead in the following way. Since the inner relation is read once for each tuple in the outer relation, the operation is most efficient when the relation with the lower cardinality is chosen as the outer relation. This reduces the number of times the inner loop is executed and, consequently, the amount of I/O associated with reading the inner relation. An analysis of buffer management for the nested-loops method with rocking shows that buffering an equal number of pages for both relations is the best strategy [Hagmann 1986].

If the join attributes can be accessed via an index, the algorithm can be made much more efficient. Such an implementation has been described in Blasgen and Eswaran [1977].
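To see why an index helps, the inner scan can be replaced by a lookup. The following is a minimal sketch, not the implementation from the cited paper: a Python dictionary stands in for the index on the inner relation's join attribute, which limits it to equality conditions (an ordered index such as a B-tree would also support range conditions), and the names `build_index` and `index_nested_loops_join` are assumptions.

```python
from collections import defaultdict


def build_index(relation, key):
    """Map each join-attribute value to the tuples that carry it."""
    index = defaultdict(list)
    for t in relation:
        index[t[key]].append(t)
    return index


def index_nested_loops_join(outer, inner_index, outer_key):
    """Equijoin: probe the inner relation's index once per outer tuple."""
    result = []
    for o in outer:
        # Only matching inner tuples are touched; no full inner scan.
        for i in inner_index.get(o[outer_key], []):
            result.append({**o, **i})
    return result


inner = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}, {"a": 2, "x": "r3"}]
outer = [{"b": 2, "y": "s1"}, {"b": 5, "y": "s2"}]

joined = index_nested_loops_join(outer, build_index(inner, "a"), "b")
```

Each outer tuple now costs one index probe plus the matching tuples, instead of a pass over the whole inner relation.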

Applicability

The exhaustive matching performed in this method makes it unsuitable for joining large relations unless the join selectivity factor, the ratio of the number of tuples in the result of the join to the total number of tuples in the Cartesian product, is high. If the selectivity factor is low, the effort of comparing every tuple in one relation with every tuple in the other is further unjustified.
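The selectivity factor defined above is a simple ratio, and a small calculation makes the trade-off concrete. The relation sizes and result size below are made-up numbers for illustration, and the function name `join_selectivity` is an assumption.

```python
def join_selectivity(result_size, r_size, s_size):
    """Ratio of result tuples to tuples in the Cartesian product R x S."""
    product_size = r_size * s_size
    if product_size == 0:
        return 0.0  # convention chosen here for empty inputs
    return result_size / product_size


# Hypothetical sizes: |R| = 1,000, |S| = 2,000, 500 tuples in the join result.
selectivity = join_selectivity(result_size=500, r_size=1_000, s_size=2_000)
comparisons = 1_000 * 2_000  # pairs examined by a plain nested-loops join
```

With these numbers, a nested-loops join performs two million comparisons to produce 500 result tuples, which is the kind of imbalance the paragraph above describes.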

The simplicity of this algorithm has made it a popular choice for hardware implementation in database machines [Su 1988]. It has been found that this algorithm can be parallelized with great advantage. The parallel version of this algorithm is found to be more efficient than most other methods. Thus, we see that for the nested-loops join, a parallel implementation of an inefficient serial algorithm looks good. More details concerning the parallel approach can be found in Section 6.

This algorithm is also chosen in a proposed model for main memory databases called the DBGraph storage model [Pucheral et al. 1990]. The entire database is represented in terms of a graph-based data structure called the DBGraph. A set of primitive operations is defined to traverse the graph, and all database operations can be performed using these primitive operations. Advantages of this model are efficient processing of all database operations and complex queries, compact storage, and uniform treatment of permanent and transient data.

2.2 Sort-Merge Join

The sort-merge join is executed in two stages. First, both relations are sorted on the join attributes. Then, both relations are scanned in the order of the join attributes, and tuples satisfying the join condition are merged to form a single relation. Whenever a tuple from the first relation matches a tuple from the second relation, the tuples are concatenated and placed in the output relation.

Algorithm

The exact algorithm for performing a sort-merge join depends on whether or not the join attributes are nonkey attributes and on the theta operator. In all cases, however, it is necessary that the two relations be physically ordered on their respective join attributes.


The algorithm for performing equijoins is as follows:

Stage 1: Sort process
    sort R on r(a);
    sort S on s(b);

Stage 2: Merge process
    read first tuple from R;
    read first tuple from S;
    for each tuple r do
    {   while s(b) < r(a)
            read next tuple from S;
        if r(a) = s(b) then
            join r and s
            place in output relation Q };

Discussion

The merge process varies slightly depending on whether the join attributes are primary key attributes, secondary key attributes, or nonkey attributes. If the join attributes are not the primary key attributes, several tuples with the same attribute values may exist. This necessitates several passes over the same set of tuples of the inner relation. The process is described below.

Let there be two tuples, r1 and r2, in R that have a given value x of the join attribute r(a) and three tuples, s1, s2, and s3, in S that have the same value x of the join attribute s(b). If the above join algorithm is used then when r2 is the current tuple in R, the current tuple in S would be the tuple following s3. Now the result relation must also include the join of r2 with s1, s2, and s3. To achieve this, the above algorithm must be modified to remember the last r(a) value and the point in S where it started the last inner loop. Whenever it encounters a duplicate r(a) value, it backtracks to the previous starting point in S. This backtracking can be especially expensive in terms of the I/O if the set of tuples does not fit into the available main memory and the tuples have to be retrieved from secondary storage for each pass.
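The modified merge described above, which remembers the last r(a) value and the starting point in S, can be sketched in Python as follows. This is an in-memory illustration under assumptions: tuples are dictionaries, both relations fit in memory (so backtracking is just an index reset), and the names `sort_merge_join`, `group_start`, and `last_value` are hypothetical.

```python
_NO_VALUE = object()  # sentinel: no r(a) value has been seen yet


def sort_merge_join(R, S, r_key, s_key):
    """Equijoin of R and S that handles duplicate join-attribute values."""
    R = sorted(R, key=lambda t: t[r_key])   # Stage 1: sort both relations
    S = sorted(S, key=lambda t: t[s_key])
    Q = []
    j = 0                   # current position in S
    group_start = 0         # where the last run of matching S tuples began
    last_value = _NO_VALUE  # last r(a) value processed
    for r in R:             # Stage 2: merge
        if r[r_key] == last_value:
            j = group_start  # duplicate r(a): backtrack in S
        else:
            while j < len(S) and S[j][s_key] < r[r_key]:
                j += 1       # skip S tuples that are too small
            group_start = j
            last_value = r[r_key]
        while j < len(S) and S[j][s_key] == r[r_key]:
            Q.append({**r, **S[j]})
            j += 1
    return Q


R = [{"a": 5, "r": "r1"}, {"a": 5, "r": "r2"}, {"a": 7, "r": "r3"}]
S = [{"b": 5, "s": "s1"}, {"b": 5, "s": "s2"},
     {"b": 5, "s": "s3"}, {"b": 6, "s": "s4"}]

Q = sort_merge_join(R, S, "a", "b")
```

Here r1 and r2 share the value 5, so after r1 has been joined with s1, s2, and s3, the position in S is reset and r2 is joined with the same three tuples, giving six result tuples in total.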

Performance

If the relations are presorted, this algorithm has a major advantage over the brute force approach of the nested-loops method. The advantage is that each relation is scanned only once. Further, if the join selectivities are low, the number of tuples compared is considerably lower than in the case of the nested-loops join. It has been shown that this algorithm is most efficient for processing on a uniprocessor system [Blasgen and Eswaran 1977].

The processing time depends on the sorting and merging algorithms used. If the files are already sorted on the join attributes, the cost is simply the cost of merging the two relations. In general, the overall execution time is more dependent on the sorting time, which is usually $O(n \log n)$ for each relation, where n is the cardinality of the relation.

Execution is further simplified if the join attributes are indexed in both relations. The Simple TID algorithm starts by scanning the join attribute indices and making a list of tuple-id pairs corresponding to the tuple pairs that participate in the join [Blasgen and Eswaran 1977]. In the next stage, the tuples themselves are fetched and physically joined. This approach reduces the number of tuples read into main memory and, as a result, the amount of I/O needed. If the index is not the primary index, however, retrieval of the records may be rather inefficient [El-Masri and Navathe 1989].
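The two stages of a TID-based join can be sketched as follows. This is only an illustration of the idea as described above, not the algorithm as published: each index is modeled as a list of (join value, tuple-id) pairs sorted by join value, each relation as a dictionary from tuple-id to tuple, and the name `simple_tid_join` is an assumption.

```python
def simple_tid_join(R, S, r_index, s_index):
    """Return (tid pairs, joined tuples) for an equijoin driven by two indexes."""
    # Stage 1: merge the two sorted indexes to find matching tuple-id pairs.
    tid_pairs = []
    i = j = 0
    while i < len(r_index) and j < len(s_index):
        rv, sv = r_index[i][0], s_index[j][0]
        if rv < sv:
            i += 1
        elif rv > sv:
            j += 1
        else:
            # Find the full run of equal values in each index.
            i_end = i
            while i_end < len(r_index) and r_index[i_end][0] == rv:
                i_end += 1
            j_end = j
            while j_end < len(s_index) and s_index[j_end][0] == rv:
                j_end += 1
            for _, r_tid in r_index[i:i_end]:
                for _, s_tid in s_index[j:j_end]:
                    tid_pairs.append((r_tid, s_tid))
            i, j = i_end, j_end
    # Stage 2: fetch only the participating tuples and join them.
    joined = [(R[r_tid], S[s_tid]) for r_tid, s_tid in tid_pairs]
    return tid_pairs, joined


R = {10: ("r1", 5), 11: ("r2", 7), 12: ("r3", 5)}
S = {20: ("s1", 7), 21: ("s2", 5), 22: ("s3", 9)}
r_index = sorted((t[1], tid) for tid, t in R.items())
s_index = sorted((t[1], tid) for tid, t in S.items())

tid_pairs, joined = simple_tid_join(R, S, r_index, s_index)
```

Note that the tuple with id 22 is never fetched: it appears in no tuple-id pair, which is where the saving in tuples read comes from.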

Applicability

If no indexes exist on the join attributes, if not much is known about the selectivities, and if there is no basis for choosing a particular join algorithm, then this algorithm is often found to be the best choice [Blasgen and Eswaran 1977; Su 1988].

With the help of hardware sorters, this algorithm makes a good candidate for hardware implementation. Several database machines, such as VERSO [Bancilhon et al. 1983], use this as the primary join method.

The sort-merge join algorithm can also be used to implement the full-outerjoin. The algorithm for performing the
