systems [21, 71, 73] bypass the DRAM page cache and access
NVMM directly using a technique called Direct Access (DAX)
or eXecute In Place (XIP), avoiding extra copies between
NVMM and DRAM in the storage stack. NOVA is a DAX
file system and we expect that all NVMM file systems will
provide these (or similar) features. We describe currently
available DAX file systems in Section 2.4.
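
To make the direct access path concrete, the sketch below shows a user-space
program mapping a file on a DAX-mounted NVMM file system and updating it with
ordinary loads and stores; the mount point and file name are only illustrative,
and the file is assumed to already span at least one page.

#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Path is illustrative: any file on a DAX-mounted NVMM file system. */
    int fd = open("/mnt/pmem/file", O_RDWR);
    if (fd < 0)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* With DAX, this store reaches NVMM directly; no DRAM page-cache copy
     * sits between the application and the persistent media. */
    memcpy(p, "hello", 5);

    munmap(p, 4096);
    close(fd);
    return 0;
}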
Write reordering Modern processors and their caching
hierarchies may reorder store operations to improve perfor-
mance. The CPU’s memory consistency protocol makes guar-
antees about the ordering of memory updates, but existing
models (with the exception of research proposals [20, 46]) do
not provide guarantees on when updates will reach NVMMs.
As a result, a power failure may leave the data in an inconsis-
tent state.
NVMM-aware software can avoid this by explicitly flush-
ing caches and issuing memory barriers to enforce write
ordering. The x86 architecture provides the clflush in-
struction to flush a CPU cacheline, but clflush is strictly
ordered and needlessly invalidates the cacheline, incurring a
significant performance penalty [6, 76]. Also, clflush only
sends data to the memory controller; it does not guarantee
the data will reach memory. Memory barriers such as Intel’s
mfence instruction enforce order on memory operations be-
fore and after the barrier, but mfence only guarantees that all
CPUs have the same view of memory. It does not impose
any constraints on the order of data writebacks to NVMM.
Intel has proposed new instructions that fix these prob-
lems, including clflushopt (a more efficient version of
clflush), clwb (to explicitly write back a cache line with-
out invalidating it) and PCOMMIT (to force stores out to
NVMM) [26, 79]. NOVA is built with these instructions
in mind. In our evaluation we use a hardware NVMM emu-
lation system that approximates the performance impacts of
these instructions.
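
The sketch below shows how NVMM-aware software might combine these instructions
to persist a range of NVMM addresses. It assumes the clwb intrinsic from
<immintrin.h> (compile with -mclwb); the persist_range helper is hypothetical,
not a NOVA interface.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* persist_range is a hypothetical helper, not part of NOVA. */
static void persist_range(const void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end   = (uintptr_t)addr + len;

    for (uintptr_t p = start; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write the line back without invalidating it */

    _mm_sfence();              /* order the write-backs before later stores */

    /* In the model the paper assumes, a PCOMMIT (or equivalent) would follow
     * here to force the flushed data from the memory controller to NVMM. */
}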
Atomicity POSIX-style file system semantics require
many operations to be atomic (i.e., to execute in an “all or
nothing” fashion). For example, the POSIX rename re-
quires that if the operation fails, neither the file with the old
name nor the file with the new name shall be changed or
created [53]. Renaming a file is a metadata-only operation,
but some atomic updates apply to both file system metadata
and data. For instance, appending to a file atomically updates
the file data and changes the file’s length and modification
time. Many applications rely on atomic file system operations
for their own correctness.
Storage devices typically provide only rudimentary guaran-
tees about atomicity. Disks provide atomic sector writes and
processors guarantee only that 8-byte (or smaller) aligned
stores are atomic. To build the more complex atomic updates
that file systems require, programmers must combine these
primitives using more sophisticated techniques.
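
For illustration, the following sketch commits a larger update by publishing it
with a single aligned 8-byte store once the new data is durable; the inode and
entry layouts are illustrative rather than NOVA's on-media format, and
persist_range is the hypothetical helper sketched above.

#include <stdint.h>
#include <stddef.h>

void persist_range(const void *addr, size_t len);   /* flush + fence, as sketched above */

struct entry {
    char data[48];               /* new metadata or file data, written first   */
};

struct inode {
    uint64_t log_tail;           /* 8-byte aligned; the final store commits all */
};

/* Write the new entry, make it durable, then publish it with one atomic,
 * aligned 8-byte store; a crash exposes either the whole entry or none of it. */
void append_and_commit(struct inode *ino, struct entry *new_entry)
{
    persist_range(new_entry, sizeof(*new_entry));
    ino->log_tail = (uint64_t)(uintptr_t)new_entry;
    persist_range(&ino->log_tail, sizeof(ino->log_tail));
}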
2.3. Building complex atomic operations
Existing file systems use a variety of techniques like journal-
ing, shadow paging, or log-structuring to provide atomicity
guarantees. These work in different ways and incur different
types of overheads.
Journaling Journaling (or write-ahead logging) is widely
used in journaling file systems [24, 27, 32, 71] and
databases [39, 43] to ensure atomicity. A journaling system
records all updates to a journal before applying them and, in
case of power failure, replays the journal to restore the system
to a consistent state. Journaling requires writing data twice:
once to the log and once to the target location. To improve
performance, journaling file systems usually journal only
metadata. Recent work has proposed back pointers [17]
and decoupling ordering from durability [16] to reduce the
overhead of journaling.
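
The following sketch outlines the basic write-ahead logging protocol for a
single metadata word, assuming the hypothetical persist_range helper from
above; real journals batch variable-size records, but the ordering of the
steps is the essential point.

#include <stdint.h>
#include <stddef.h>

void persist_range(const void *addr, size_t len);   /* flush + fence, as sketched above */

/* Journal entry layout is illustrative; real journals hold variable-size records. */
struct journal_entry {
    uint64_t dest;       /* address of the metadata word being updated */
    uint64_t value;      /* new value for that word                     */
    uint64_t committed;  /* set to 1 (with one 8-byte store) to commit  */
};

void journaled_update(struct journal_entry *je, uint64_t *dest, uint64_t value)
{
    /* 1. Record the intended update in the journal and make it durable. */
    je->dest  = (uint64_t)(uintptr_t)dest;
    je->value = value;
    je->committed = 0;
    persist_range(je, sizeof(*je));

    /* 2. Commit the entry; after this point recovery will replay it. */
    je->committed = 1;
    persist_range(&je->committed, sizeof(je->committed));

    /* 3. Apply the update in place (the "second write" of journaling). */
    *dest = value;
    persist_range(dest, sizeof(*dest));

    /* 4. Retire the entry so recovery does not replay it again. */
    je->committed = 0;
    persist_range(&je->committed, sizeof(je->committed));
}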
Shadow paging Several file systems use a copy-on-write
mechanism called shadow paging [8, 20, 25, 54]. Shadow
paging file systems rely heavily on their tree structure to
provide atomicity. Rather than modifying data in-place during
a write, shadow paging writes a new copy of the affected
page(s) to an empty portion of the storage device. Then, it
splices the new pages into the file system tree by updating
the nodes between the pages and root. The resulting cascade
of updates is potentially expensive.
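
The sketch below illustrates this cascade for a simple tree whose interior
nodes are pages of child pointers; alloc_page and persist_range are
hypothetical helpers, and the caller would finally splice in the returned
root with a single atomic update.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

void *alloc_page(void);                             /* hypothetical NVMM/disk page allocator */
void persist_range(const void *addr, size_t len);   /* flush + fence, as sketched above      */

/* Interior nodes are pages of child pointers; depth 0 nodes hold data.
 * path[d] gives the child index to follow at depth d. */
void *cow_update(void *node, int depth, const int *path,
                 const void *data, size_t len)
{
    void *copy = alloc_page();
    memcpy(copy, node, PAGE_SIZE);                  /* shadow copy; old page stays untouched */

    if (depth == 0) {
        memcpy(copy, data, len);                    /* apply the write to the copy */
    } else {
        void **children = copy;
        children[path[depth]] =                     /* splice the new child into the copy */
            cow_update(children[path[depth]], depth - 1, path, data, len);
    }

    persist_range(copy, PAGE_SIZE);
    return copy;   /* one new page per level: the cascade of updates up to the root */
}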
Log-structuring Log-structured file systems (LFSs) [55,
60] were originally designed to exploit hard disk drives’ high
performance on sequential accesses. LFSs buffer random
writes in memory and convert them into larger, sequential
writes to the disk, making the best of hard disks’ strengths.
Although LFS is an elegant idea, implementing it effi-
ciently is complex, because LFSs rely on writing sequentially
to contiguous free regions of the disk. To ensure a consistent
supply of such regions, LFSs constantly clean and compact
the log to reclaim space occupied by stale data.
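
A minimal sketch of the core LFS write path follows: random block writes are
buffered into a segment, the block map is redirected to the new locations, and
a full segment is written out with one sequential I/O. All names and sizes
are illustrative.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define BLOCK_SIZE  4096
#define SEG_BLOCKS  256                  /* blocks per log segment (illustrative) */

void write_segment(uint64_t disk_addr, const void *buf, size_t len);  /* one sequential write */

struct lfs {
    uint64_t *block_map;                 /* logical block number -> newest on-disk address */
    uint8_t   seg_buf[SEG_BLOCKS * BLOCK_SIZE];
    uint32_t  seg_fill;                  /* blocks buffered in the current segment */
    uint64_t  log_head;                  /* on-disk address where the next segment goes */
};

void lfs_write_block(struct lfs *fs, uint64_t lblock, const void *data)
{
    /* Buffer the (possibly random) write and remap the block to its new home. */
    memcpy(fs->seg_buf + (size_t)fs->seg_fill * BLOCK_SIZE, data, BLOCK_SIZE);
    fs->block_map[lblock] = fs->log_head + (uint64_t)fs->seg_fill * BLOCK_SIZE;
    fs->seg_fill++;

    if (fs->seg_fill == SEG_BLOCKS) {
        /* Full segment: one large sequential write to the disk. */
        write_segment(fs->log_head, fs->seg_buf, sizeof(fs->seg_buf));
        fs->log_head += sizeof(fs->seg_buf);
        fs->seg_fill = 0;
        /* Overwritten blocks leave stale copies behind in older segments;
         * the cleaner must eventually compact those segments to reclaim space. */
    }
}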
Log cleaning adds overhead and degrades the performance
of LFSs [3, 61]. To reduce cleaning overhead, some LFS
designs separate hot and cold data and apply different clean-
ing policies to each [69, 70]. SSDs also perform best under
sequential workloads [9, 14], so LFS techniques have been
applied to SSD file systems as well. SFS [38] classifies file
blocks based on their update likelihood, and writes blocks
with similar “hotness” into the same log segment to reduce
cleaning overhead. F2FS [30] uses multi-head logging, writes
metadata and data to separate logs, and writes new data di-
rectly to free space in dirty segments at high disk utilization
to avoid frequent garbage collection.