ing Ceph’s client operation. The Ceph client runs on
each host executing application code and exposes a file
system interface to applications. In the Ceph prototype,
the client code runs entirely in user space and can be accessed either by linking to it directly or as a mounted file system via FUSE [25] (a user-space file system interface). Each client maintains its own file data cache, independent of the kernel page or buffer caches, making it accessible to applications that link to the client directly.
3.1 File I/O and Capabilities
When a process opens a file, the client sends a request
to the MDS cluster. An MDS traverses the file system
hierarchy to translate the file name into the file inode,
which includes a unique inode number, the file owner,
mode, size, and other per-file metadata. If the file exists
and access is granted, the MDS returns the inode number, file size, and information about the striping strategy used to map file data into objects. The MDS may also issue the client a capability (if it does not already have one) specifying which operations are permitted. Capabilities currently include four bits controlling the client's ability to read, cache reads, write, and buffer writes. In the future, capabilities will include security keys allowing clients to prove to OSDs that they are authorized to read or write data [13, 19] (the prototype currently trusts all clients). Subsequent MDS involvement in file I/O is limited to managing capabilities to preserve file consistency and achieve proper semantics.
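To make the capability model concrete, the four bits could be represented as a simple bitmask, as in the sketch below; the flag names and bit positions are illustrative assumptions, not Ceph's actual definitions.

    // Hypothetical encoding of the four per-file capability bits
    // described above (names and values are illustrative only).
    enum ceph_cap_bits {
        CAP_READ     = 1 << 0,  // read file data from OSDs
        CAP_RDCACHE  = 1 << 1,  // cache data that has been read
        CAP_WRITE    = 1 << 2,  // write file data to OSDs
        CAP_WRBUFFER = 1 << 3   // buffer writes locally before flushing
    };

A client holding CAP_READ | CAP_RDCACHE in this encoding may serve reads from its local cache; the MDS revokes the caching and buffering bits when sharing makes them unsafe (Section 3.2).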
Ceph generalizes a range of striping strategies to map
file data onto a sequence of objects. To avoid any need
for file allocation metadata, object names simply combine the file inode number and the stripe number. Object replicas are then assigned to OSDs using CRUSH, a globally known mapping function (described in Section 5.1). For example, if one or more clients open a file for read access, an MDS grants them the capability to read and cache file content. Armed with the inode number, layout, and file size, the clients can name and locate all objects containing file data and read directly from the OSD cluster. Any objects or byte ranges that don’t exist are defined to be file “holes,” or zeros. Similarly, if a client opens a file for writing, it is granted the capability to write with buffering, and any data it generates at any offset in the file is simply written to the appropriate object on the appropriate OSD. The client relinquishes the capability on file close and provides the MDS with the new file size (the largest offset written), which redefines the set of objects that (may) exist and contain file data.
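As an illustration of this allocation-free naming, the sketch below locates the object holding a given file byte under a simple fixed-size stripe unit; the name format, structure fields, and helper are assumptions for exposition, not Ceph's exact layout encoding.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Illustrative only: find the object holding byte 'off' of the file
    // with inode number 'ino', assuming fixed-size stripe units.
    struct ObjectLocation {
        std::string name;     // e.g. "<inode number>.<stripe number>"
        uint64_t offset;      // byte offset within that object
    };

    ObjectLocation locate(uint64_t ino, uint64_t off, uint64_t stripe_unit) {
        uint64_t stripe_no = off / stripe_unit;   // which object in the file
        char name[64];
        std::snprintf(name, sizeof(name), "%llx.%08llx",
                      (unsigned long long)ino, (unsigned long long)stripe_no);
        return { name, off % stripe_unit };       // no allocation table consulted
    }

CRUSH (Section 5.1) then maps each such object name to a list of OSDs, so any client can locate any byte of any file from the inode number and layout alone.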
3.2 Client Synchronization
POSIX semantics sensibly require that reads reflect any
data previously written, and that writes are atomic (i.e., the result of overlapping, concurrent writes will reflect a particular order of occurrence). When a file is opened by multiple clients with either multiple writers or a mix of readers and writers, the MDS will revoke any previously issued read caching and write buffering capabilities, forcing client I/O for that file to be synchronous. That is, each application read or write operation will block until it is acknowledged by the OSD, effectively placing the burden of update serialization and synchronization with the OSD storing each object. When writes span object boundaries, clients acquire exclusive locks on the affected objects (granted by their respective OSDs), and immediately submit the write and unlock operations to achieve the desired serialization. Object locks are similarly used to mask latency for large writes by acquiring locks and flushing data asynchronously.
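The sketch below summarizes this client-side write path; Extent and the osd_* helpers are hypothetical stand-ins for the client's messaging layer, shown only to make the locking sequence explicit.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Extent { std::string object; uint64_t off; uint64_t len; };

    // Hypothetical messaging calls; each blocks until the OSD replies.
    void osd_lock(const std::string& object);
    void osd_write(const std::string& object, uint64_t off, const char* p, uint64_t len);
    void osd_write_and_unlock(const std::string& object, uint64_t off, const char* p, uint64_t len);

    // Synchronous write of a buffer whose file range maps to 'extents'.
    void sync_write(const std::vector<Extent>& extents, const char* data) {
        if (extents.size() > 1) {
            // The write spans object boundaries: lock every affected object,
            // then submit each write together with its unlock so the
            // multi-object update is serialized without extra round trips.
            for (const Extent& e : extents) osd_lock(e.object);
            uint64_t done = 0;
            for (const Extent& e : extents) {
                osd_write_and_unlock(e.object, e.off, data + done, e.len);
                done += e.len;
            }
        } else {
            // Single object: the OSD storing it serializes concurrent
            // updates; the application blocks until it acknowledges.
            osd_write(extents[0].object, extents[0].off, data, extents[0].len);
        }
    }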
Not surprisingly, synchronous I/O can be a performance killer for applications, particularly those doing small reads or writes, due to the latency penalty of at least one round-trip to the OSD. Although read-write sharing is relatively rare in general-purpose workloads [22], it is more common in scientific computing applications [27], where performance is often critical. For this reason, it is often desirable to relax consistency at the expense of strict standards conformance in situations where applications do not rely on it. Although Ceph supports such relaxation via a global switch, and many other distributed file systems punt on this issue [20], this is an imprecise and unsatisfying solution: either performance suffers, or consistency is lost system-wide.
For precisely this reason, a set of high-performance computing (HPC) extensions to the POSIX I/O interface has been proposed by the HPC community [31], a subset of which is implemented by Ceph. Most notably, these include an O_LAZY flag for open that allows applications to explicitly relax the usual coherency requirements for a shared-write file. Performance-conscious applications which manage their own consistency (e.g., by writing to different parts of the same file, a common pattern in HPC workloads [27]) are then allowed to buffer writes or cache reads when I/O would otherwise be performed synchronously. If desired, applications can then explicitly synchronize with two additional calls: lazyio_propagate will flush a given byte range to the object store, while lazyio_synchronize will ensure that the effects of previous propagations are reflected in any subsequent reads. The Ceph synchronization model thus retains its simplicity by providing correct read-write and shared-write semantics between clients via synchronous I/O, and extending the application interface to relax consistency for performance-conscious distributed applications.
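The fragment below sketches how an HPC application might use these extensions; O_LAZY, lazyio_propagate, and lazyio_synchronize are the interfaces named above, but the flag value and exact signatures shown are assumptions based on the proposed extensions rather than a verbatim Ceph header.

    #include <fcntl.h>
    #include <unistd.h>

    // Assumed declarations for the proposed extensions (illustrative only).
    #ifndef O_LAZY
    #define O_LAZY 0x400000   /* placeholder flag value for illustration */
    #endif
    extern "C" int lazyio_propagate(int fd, off_t offset, size_t count);
    extern "C" int lazyio_synchronize(int fd, off_t offset, size_t count);

    // Each process writes its own disjoint region of a shared file, then
    // reads a peer's region after the peers have propagated their writes.
    void exchange(const char* path, const char* buf, size_t len, off_t my_off,
                  char* peer_buf, size_t peer_len, off_t peer_off) {
        int fd = open(path, O_RDWR | O_LAZY);       // relax shared-write coherence
        pwrite(fd, buf, len, my_off);               // may be buffered client-side
        lazyio_propagate(fd, my_off, len);          // flush this range to the OSDs
        /* ...barrier with the other writers (e.g. MPI_Barrier)... */
        lazyio_synchronize(fd, peer_off, peer_len); // peers' propagated writes are
        pread(fd, peer_buf, peer_len, peer_off);    // now visible to this read
        close(fd);
    }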
3.3 Namespace Operations
Client interaction with the file system namespace is managed by the metadata server cluster. Both read operations