Bloom filter and then proceed to search its B+-tree only if its associated Bloom filter reports a positive answer. Alternatively, a Bloom filter can be built for each leaf page of a disk component. In this design, a point lookup query can first search the non-leaf pages of a B+-tree to locate the leaf page, where the non-leaf pages are assumed to be small enough to be cached, and then check the associated Bloom filter before fetching the leaf page to reduce disk I/Os. Note that the false positives reported by a Bloom filter do not impact the correctness of a query, but a query may waste some I/O searching for non-existent keys. The false positive rate of a Bloom filter can be computed as $(1 - e^{-kn/m})^k$, where $k$ is the number of hash functions, $n$ is the number of keys, and $m$ is the total number of bits [18]. Furthermore, the optimal number of hash functions that minimizes the false positive rate is $k = \frac{m}{n} \ln 2$. In practice, most systems use 10 bits/key as a default configuration, which gives a false positive rate of approximately 1%. Since Bloom filters are very small and can often be cached in memory, the number of disk I/Os for point lookups is greatly reduced by their use.
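To make the formula concrete, the following sketch (with hypothetical helper names, assuming only the standard analysis cited above) computes the false positive rate for a given bits-per-key budget together with the corresponding optimal number of hash functions:

```python
import math

def optimal_num_hashes(bits_per_key: float) -> int:
    # k = (m/n) * ln 2, rounded to a whole number of hash functions
    return max(1, round(bits_per_key * math.log(2)))

def false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    # FPR = (1 - e^{-kn/m})^k, where m/n = bits_per_key
    k = num_hashes
    return (1.0 - math.exp(-k / bits_per_key)) ** k

if __name__ == "__main__":
    bits_per_key = 10                        # common default configuration
    k = optimal_num_hashes(bits_per_key)     # 7 hash functions for 10 bits/key
    print(k, false_positive_rate(bits_per_key, k))  # ~0.008, i.e. roughly 1%
```

With 10 bits/key this yields $k = 7$ and a false positive rate of about 0.8%, matching the roughly 1% figure quoted above.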
Partitioning. Another commonly adopted optimization is to range-partition the disk components of LSM-trees into multiple (usually fixed-size) small partitions. To minimize the potential confusion caused by different terminologies, we use the term SSTable to denote such a partition, following the terminology from LevelDB [4]. This optimization has several advantages. First, partitioning breaks a large component merge operation into multiple smaller ones, bounding the processing time of each merge operation as well as the temporary disk space needed to create new components. Moreover, partitioning can optimize for workloads with sequentially created keys or skewed updates by only merging components with overlapping key ranges. For sequentially created keys, essentially no merge is performed since there are no components with overlapping key ranges. For skewed updates, the merge frequency of the components with cold update ranges can be greatly reduced. It should be noted that the original LSM-tree [52] automatically takes advantage of partitioning because of its rolling merges. However, due to the implementation complexity of its rolling merges, today's LSM-tree implementations typically opt for actual physical partitioning rather than rolling merges.
An early proposal that applied partitioning to LSM-trees
is the partitioned exponential file (PE-file) [38]. A PE-file
contains multiple partitions, where each partition can be logically viewed as a separate LSM-tree. A partition can be
further split into two partitions when it becomes too large.
However, this design enforces strict key range boundaries
among partitions, which reduces the flexibility of merges.
We now discuss the partitioning optimization used in
today’s LSM-tree implementations. It should be noted that
partitioning is orthogonal to merge policies; both leveling
and tiering (as well as other emerging merge policies) can
be adapted to support partitioning. To the best of our knowledge, only the partitioned leveling policy has been fully implemented by industrial LSM-based storage systems, such as LevelDB [4] and RocksDB [6]. Some recent papers [12,50,58,76,79] have proposed various forms of a partitioned tiering merge policy to achieve better write performance.⁴

[Fig. 4: Partitioned leveling merge policy. Each SSTable is labeled with its key range; the figure shows levels 0-2 before and after merging the SSTable labeled 0-30 at level 1 into level 2.]
In the partitioned leveling merge policy, pioneered by LevelDB [4], the disk component at each level is range-partitioned into multiple fixed-size SSTables, as shown in Figure 4. Each SSTable is labeled with its key range in the figure. Note that the disk components at level 0 are not partitioned since they are directly flushed from memory. This design can also help the system to absorb write bursts since it can tolerate multiple unpartitioned components at level 0. To merge an SSTable from level L into level L+1, all of its overlapping SSTables at level L+1 are selected, and these SSTables are merged with it to produce new SSTables still at level L+1. For example, in the figure, the SSTable labeled 0-30 at level 1 is merged with the SSTables labeled 0-15 and 16-32 at level 2. This merge operation produces new SSTables labeled 0-10, 11-19, and 20-32 at level 2, and the old SSTables will then be garbage-collected. Different policies can be used to select which SSTable to merge next at each level. For example, LevelDB uses a round-robin policy (to minimize the total write cost).
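As a rough illustration of this step (a minimal sketch with hypothetical types, not LevelDB's actual implementation), the following selects the overlapping SSTables at level L+1 for a victim SSTable from level L, merges their entries, and re-cuts the result into new fixed-size SSTables:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SSTable:
    min_key: int
    max_key: int
    entries: List[Tuple[int, str]]  # sorted (key, value) pairs

def overlaps(a: SSTable, b: SSTable) -> bool:
    # Two SSTables overlap if their key ranges intersect.
    return a.min_key <= b.max_key and b.min_key <= a.max_key

def merge_into_next_level(victim: SSTable, next_level: List[SSTable],
                          max_entries_per_sstable: int) -> List[SSTable]:
    """Merge `victim` (from level L) with its overlapping SSTables at level L+1
    and return the new contents of level L+1."""
    overlapping = [t for t in next_level if overlaps(victim, t)]
    untouched = [t for t in next_level if not overlaps(victim, t)]

    # Combine entries; the victim comes from level L, so it is more recent
    # than level L+1 and wins on duplicate keys.
    merged = {}
    for table in overlapping:
        for key, value in table.entries:
            merged[key] = value
    for key, value in victim.entries:
        merged[key] = value

    # Re-cut the merged, sorted run into new fixed-size SSTables.
    items = sorted(merged.items())
    new_tables = []
    for i in range(0, len(items), max_entries_per_sstable):
        chunk = items[i:i + max_entries_per_sstable]
        new_tables.append(SSTable(chunk[0][0], chunk[-1][0], chunk))

    return sorted(untouched + new_tables, key=lambda t: t.min_key)
```

In the Figure 4 example, the victim labeled 0-30 at level 1 overlaps the SSTables labeled 0-15 and 16-32 at level 2, and the merged run is re-cut into the new SSTables labeled 0-10, 11-19, and 20-32.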
The partitioning optimization can also be applied to the tiering merge policy. However, one major issue in doing so is that each level can contain multiple SSTables with overlapping key ranges. These SSTables must be ordered properly based on their recency to ensure correctness. Two possible schemes can be used to organize the SSTables at each level, namely vertical grouping and horizontal grouping. In both schemes, the SSTables at each level are organized into groups. The vertical grouping scheme groups SSTables with overlapping key ranges together so that the groups have disjoint key ranges. Thus, it can be viewed as an extension
⁴ RocksDB supports a limited form of a partitioned tiering merge policy to bound the maximum size of each SSTable due to its internal restrictions. However, the disk space may still be doubled temporarily during large merges.