ing Ceph’s client operation. The Ceph client runs on
each host executing application code and exposes a file
system interface to applications. In the Ceph prototype,
the client code runs entirely in user space and can be accessed either by linking to it directly or as a mounted file system via FUSE [25] (a user-space file system interface). Each client maintains its own file data cache, independent of the kernel page or buffer caches, making it accessible to applications that link to the client directly.
3.1 File I/O and Capabilities
When a process opens a file, the client sends a request
to the MDS cluster. An MDS traverses the file system
hierarchy to translate the file name into the file inode,
which includes a unique inode number, the file owner,
mode, size, and other per-file metadata. If the file exists
and access is granted, the MDS returns the inode number, file size, and information about the striping strategy used to map file data into objects. The MDS may also issue the client a capability (if it does not already have one) specifying which operations are permitted. Capabilities currently include four bits controlling the client's ability to read, cache reads, write, and buffer writes. In the future, capabilities will include security keys allowing clients to prove to OSDs that they are authorized to read or write data [13, 19] (the prototype currently trusts all clients). Subsequent MDS involvement in file I/O is limited to managing capabilities to preserve file consistency and achieve proper semantics.
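To make the capability model concrete, the four bits could be represented as a simple bitmask, as in the sketch below; the flag names and bit positions are illustrative assumptions, not Ceph's actual definitions.

    // Hypothetical encoding of the four per-file capability bits
    // described above (names and values are illustrative only).
    enum ceph_cap_bits {
        CAP_READ     = 1 << 0,  // read file data from OSDs
        CAP_RDCACHE  = 1 << 1,  // cache data that has been read
        CAP_WRITE    = 1 << 2,  // write file data to OSDs
        CAP_WRBUFFER = 1 << 3   // buffer writes locally before flushing
    };

A client holding CAP_READ | CAP_RDCACHE in this encoding may serve reads from its local cache; the MDS revokes the caching and buffering bits when sharing makes them unsafe (Section 3.2).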
Ceph generalizes a range of striping strategies to map
file data onto a sequence of objects. To avoid any need
for file allocation metadata, object names simply combine the file inode number and the stripe number. Object replicas are then assigned to OSDs using CRUSH, a globally known mapping function (described in Section 5.1). For example, if one or more clients open a file for read access, an MDS grants them the capability to read and cache file content. Armed with the inode number, layout, and file size, the clients can name and locate all objects containing file data and read directly from the OSD cluster. Any objects or byte ranges that don’t exist are defined to be file “holes,” or zeros. Similarly, if a client opens a file for writing, it is granted the capability to write with buffering, and any data it generates at any offset in the file is simply written to the appropriate object on the appropriate OSD. The client relinquishes the capability on file close and provides the MDS with the new file size (the largest offset written), which redefines the set of objects that (may) exist and contain file data.
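As an illustration of this allocation-free naming, the sketch below locates the object holding a given file byte under a simple fixed-size stripe unit; the name format, structure fields, and helper are assumptions for exposition, not Ceph's exact layout encoding.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Illustrative only: find the object holding byte 'off' of the file
    // with inode number 'ino', assuming fixed-size stripe units.
    struct ObjectLocation {
        std::string name;     // e.g. "<inode number>.<stripe number>"
        uint64_t offset;      // byte offset within that object
    };

    ObjectLocation locate(uint64_t ino, uint64_t off, uint64_t stripe_unit) {
        uint64_t stripe_no = off / stripe_unit;   // which object in the file
        char name[64];
        std::snprintf(name, sizeof(name), "%llx.%08llx",
                      (unsigned long long)ino, (unsigned long long)stripe_no);
        return { name, off % stripe_unit };       // no allocation table consulted
    }

CRUSH (Section 5.1) then maps each such object name to a list of OSDs, so any client can locate any byte of any file from the inode number and layout alone.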
3.2 Client Synchronization
POSIX semantics sensibly require that reads reflect any
data previously written, and that writes are atomic (i.e., the result of overlapping, concurrent writes will reflect a particular order of occurrence). When a file is opened by multiple clients with either multiple writers or a mix of readers and writers, the MDS will revoke any previously issued read caching and write buffering capabilities, forcing client I/O for that file to be synchronous. That is, each application read or write operation will block until it is acknowledged by the OSD, effectively placing the burden of update serialization and synchronization with the OSD storing each object. When writes span object boundaries, clients acquire exclusive locks on the affected objects (granted by their respective OSDs), and immediately submit the write and unlock operations to achieve the desired serialization. Object locks are similarly used to mask latency for large writes by acquiring locks and flushing data asynchronously.
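The sketch below summarizes this client-side write path; Extent and the osd_* helpers are hypothetical stand-ins for the client's messaging layer, shown only to make the locking sequence explicit.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Extent { std::string object; uint64_t off; uint64_t len; };

    // Hypothetical messaging calls; each blocks until the OSD replies.
    void osd_lock(const std::string& object);
    void osd_write(const std::string& object, uint64_t off, const char* p, uint64_t len);
    void osd_write_and_unlock(const std::string& object, uint64_t off, const char* p, uint64_t len);

    // Synchronous write of a buffer whose file range maps to 'extents'.
    void sync_write(const std::vector<Extent>& extents, const char* data) {
        if (extents.size() > 1) {
            // The write spans object boundaries: lock every affected object,
            // then submit each write together with its unlock so the
            // multi-object update is serialized without extra round trips.
            for (const Extent& e : extents) osd_lock(e.object);
            uint64_t done = 0;
            for (const Extent& e : extents) {
                osd_write_and_unlock(e.object, e.off, data + done, e.len);
                done += e.len;
            }
        } else {
            // Single object: the OSD storing it serializes concurrent
            // updates; the application blocks until it acknowledges.
            osd_write(extents[0].object, extents[0].off, data, extents[0].len);
        }
    }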
Not surprisingly, synchronous I/O can be a performance killer for applications, particularly those doing small reads or writes, due to the latency penalty of at least one round-trip to the OSD. Although read-write sharing is relatively rare in general-purpose workloads [22], it is more common in scientific computing applications [27], where performance is often critical. For this reason, it is often desirable to relax consistency at the expense of strict standards conformance in situations where applications do not rely on it. Although Ceph supports such relaxation via a global switch, and many other distributed file systems punt on this issue [20], this is an imprecise and unsatisfying solution: either performance suffers, or consistency is lost system-wide.
For precisely this reason, a set of high-performance computing (HPC) extensions to the POSIX I/O interface has been proposed by the HPC community [31], a subset of which is implemented by Ceph. Most notably, these include an O_LAZY flag for open that allows applications to explicitly relax the usual coherency requirements for a shared-write file. Performance-conscious applications which manage their own consistency (e.g., by writing to different parts of the same file, a common pattern in HPC workloads [27]) are then allowed to buffer writes or cache reads when I/O would otherwise be performed synchronously. If desired, applications can then explicitly synchronize with two additional calls: lazyio_propagate will flush a given byte range to the object store, while lazyio_synchronize will ensure that the effects of previous propagations are reflected in any subsequent reads. The Ceph synchronization model thus retains its simplicity by providing correct read-write and shared-write semantics between clients via synchronous I/O, and extending the application interface to relax consistency for performance-conscious distributed applications.
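The fragment below sketches how an HPC application might use these extensions; O_LAZY, lazyio_propagate, and lazyio_synchronize are the interfaces named above, but the flag value and exact signatures shown are assumptions based on the proposed extensions rather than a verbatim Ceph header.

    #include <fcntl.h>
    #include <unistd.h>

    // Assumed declarations for the proposed extensions (illustrative only).
    #ifndef O_LAZY
    #define O_LAZY 0x400000   /* placeholder flag value for illustration */
    #endif
    extern "C" int lazyio_propagate(int fd, off_t offset, size_t count);
    extern "C" int lazyio_synchronize(int fd, off_t offset, size_t count);

    // Each process writes its own disjoint region of a shared file, then
    // reads a peer's region after the peers have propagated their writes.
    void exchange(const char* path, const char* buf, size_t len, off_t my_off,
                  char* peer_buf, size_t peer_len, off_t peer_off) {
        int fd = open(path, O_RDWR | O_LAZY);       // relax shared-write coherence
        pwrite(fd, buf, len, my_off);               // may be buffered client-side
        lazyio_propagate(fd, my_off, len);          // flush this range to the OSDs
        /* ...barrier with the other writers (e.g. MPI_Barrier)... */
        lazyio_synchronize(fd, peer_off, peer_len); // peers' propagated writes are
        pread(fd, peer_buf, peer_len, peer_off);    // now visible to this read
        close(fd);
    }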
3.3 Namespace Operations
Client interaction with the file system namespace is managed by the metadata server cluster. Both read operations