2) Secondary Deduplication
Secondary deduplication systems adopt standard
deduplication methods. A secondary storage system, which
facilitates data backup and restore operations, requires high
I/O throughput.
Both Sparse Indexing [5] and SiLo [14] efficiently alleviate
the disk bottleneck by exploiting similarity among data
segments: deduplication is performed only when two data
segments are similar to each other. Nam et al. introduced a
Chunk Fragmentation Level (CFL) monitor [31]. When the CFL
monitor indicates that chunk fragmentation is worsening, it
selectively rewrites some chunks to reduce fragmentation,
thereby achieving high restore performance.
Dedupv1 [18] and ChunkStash [6] store the chunk-index
metadata in flash memory, exploiting the high random-read
performance of SSDs to accelerate index lookups. SAR [7]
stores unique data chunks with high reference counts in SSDs,
again exploiting the high random-read performance of SSDs to
greatly improve restore performance.
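The chunk-index lookup that these systems accelerate can be illustrated with a minimal sketch. The fixed-size chunking, SHA-1 fingerprints, and in-memory dictionary index below are simplifying assumptions for illustration, not the actual Dedupv1 or ChunkStash design (which keeps the index on flash):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; real systems often use content-defined chunking


def dedup_write(data: bytes, index: dict, store: list) -> list:
    """Split data into chunks and store only chunks whose fingerprint is new.

    Returns a 'recipe': the list of store positions needed to rebuild the data.
    """
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()  # chunk fingerprint
        if fp not in index:            # index lookup: the random-read hot spot
            index[fp] = len(store)     # new unique chunk
            store.append(chunk)
        recipe.append(index[fp])
    return recipe


index, store = {}, []
r1 = dedup_write(b"A" * 8192, index, store)                 # two identical chunks
r2 = dedup_write(b"A" * 4096 + b"B" * 4096, index, store)   # one old, one new chunk
```

Every write triggers one index lookup per chunk, which is why placing the index on a medium with fast random reads (an SSD) directly accelerates the deduplication path.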
3) Primary Deduplication
Primary deduplication systems store data such as emails,
multimedia, and databases, which users frequently access in a
random fashion. In contrast to data restore in secondary
systems, primary storage systems simply read the files
demanded by users each time rather than restoring the entire
dataset. As such, read latency can be significantly
reduced.
iDedup [4] exploits spatial locality by deduplicating only data
sequences that are stored contiguously on disk, which reduces
disk fragmentation and thereby improves read speed. Moreover,
iDedup takes advantage of temporal locality by caching
metadata in a small memory; it therefore dramatically reduces
disk I/Os and improves write speed.
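The contiguity requirement can be sketched as follows. This is an illustration in the spirit of iDedup's design, not its actual implementation; the threshold value, the fingerprint-to-address index, and the run-detection loop are all assumptions made for the example:

```python
def contiguous_dup_runs(fingerprints, index, threshold=4):
    """Return runs of incoming blocks eligible for deduplication.

    A run qualifies only if its existing duplicates occupy consecutive
    on-disk addresses AND the run is at least `threshold` blocks long,
    which limits fragmentation on later sequential reads.
    """
    runs, run = [], []
    prev_lba = None
    for i, fp in enumerate(fingerprints):
        lba = index.get(fp)  # on-disk address of the existing copy, if any
        if lba is not None and (prev_lba is None or lba == prev_lba + 1):
            run.append(i)    # extends a disk-contiguous duplicate sequence
        else:
            if len(run) >= threshold:
                runs.append(run)         # long enough: deduplicate this run
            run = [i] if lba is not None else []
        prev_lba = lba
    if len(run) >= threshold:
        runs.append(run)
    return runs
```

Short or scattered duplicate runs are deliberately written as new data: the capacity loss is traded for sequential read performance.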
More importantly, data deduplication has been integrated into
practical storage products (e.g., NetApp ASIS [35] and EMC
Celerra [36]) and file systems (e.g., ZFS [37], SDFS [34],
LessFS [32], and LBFS [11]).
B. SSD Endurance
SSDs are favored by IT companies and end users thanks to
their low access latency, high energy efficiency, and high
storage density. However, the most serious challenge for SSDs
lies in their limited write endurance (i.e., an SSD can only
sustain a limited volume of written bytes) [27]. This problem
has two causes:
1) Storage units inside flash chips must be erased before
any re-write operation. To pursue high storage density and low
cost, mainstream SSDs have adopted multi-level cell (MLC)
flash instead of single-level cell (SLC) flash. An SLC flash
memory array typically supports approximately 100,000
erase cycles for each basic unit. For MLC flash, which can
store more than one bit in each unit, this value drops as low as
5,000 ~ 10,000 cycles or even lower [16, 28].
2) Write amplification. Each erase unit inside a flash chip
generally contains hundreds of pages, the basic read/write unit
of flash [28]. When an erase unit containing a mix of valid
pages and invalid pages (old versions of updated data) is
about to be erased, the valid pages must first be written
elsewhere, thereby imposing extra writes. Therefore, the amount
of data written to the flash chips of an SSD is much larger than
the amount of data issued to the SSD.
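The extra writes are commonly quantified by the write amplification factor (WAF): the ratio of data written to flash to data issued by the host. A back-of-the-envelope sketch under a simple garbage-collection model (the model is a standard approximation chosen for illustration, not taken from the cited works):

```python
def write_amplification(valid_fraction: float) -> float:
    """Approximate WAF when reclaimed erase units still hold valid pages.

    Reclaiming one erase unit frees (1 - valid_fraction) of its pages for
    host data but forces valid_fraction of its pages to be copied elsewhere,
    so per host page the device writes 1 + valid_fraction/(1 - valid_fraction)
    pages, i.e. WAF = 1 / (1 - valid_fraction).
    """
    assert 0.0 <= valid_fraction < 1.0
    return 1.0 / (1.0 - valid_fraction)


# If erase units average 50% valid pages when reclaimed, every host page
# written costs one extra copy, so the flash absorbs twice the host traffic.
```

The sketch makes the endurance link concrete: halving the average valid fraction at reclaim time (e.g., by reducing duplicate or overwritten data) directly reduces the bytes worn into the flash.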
In recent years, a considerable amount of research has
been done to enhance SSD endurance. For example, Griffin
employed hard disk drives or HDDs as a write cache for SSDs
to coalesce overwrites, aiming to significantly reduce write
traffic to the SSDs [16]. Chen et al. proposed a hetero-buffer,
which consists of DRAM and a reorder area [17]: the DRAM is
devoted to improving the hit ratio, while the reorder area aims
at mitigating write amplification. Both of these techniques
allow SSDs to reduce the number of erased physical blocks.
I-CASH arranges SSDs to store seldom-changed and mostly-read
reference data, whereas an HDD stores a log of the changed
deltas of the SSD data [30]. In the I-CASH case, SSDs obviate
the need to handle random writes. Kim et al. applied data
deduplication technology inside SSDs to reduce write
amplification effects [10].
C. Cache Management for SSDs
Traditional cache replacement algorithms (e.g., FIFO,
LRU, LFU, LIRS [15], and ARC [9]) strive for high cache hit
rates by frequently updating cached contents without any
restriction. The write endurance of SSD products is inadequate
to sustain such excessive write traffic, which inevitably leads
to a short lifetime for the SSD cache (see an example in
Table 1). To overcome this problem, vendors and researchers
have proposed various solutions that exploit application
characteristics to limit the number of writes to the SSD cache.
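The underlying issue can be made concrete with a sketch: under unrestricted admission, every cache miss inserts a block, so the SSD absorbs one write per miss. The simulation below is purely illustrative (trace, capacity, and the one-write-per-admission cost model are assumptions):

```python
from collections import OrderedDict


def lru_ssd_writes(trace, capacity):
    """Simulate an SSD cache under plain LRU and count device writes.

    With unrestricted admission, every miss writes the block into the
    SSD cache, so the write count equals the miss count.
    """
    cache = OrderedDict()
    writes = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency, no SSD write
        else:
            writes += 1                    # miss: block written into the SSD
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return writes
```

A scan over many one-touch blocks writes every one of them to the SSD even though none is ever reused, which is exactly the wasted wear that admission-limiting schemes try to avoid.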
For example, Oracle Exadata's Smart Flash Cache [24]
combines cache management and database logic, skipping
unimportant data in the flash cache to reduce SSD writes.
NetApp [33], employing SSDs as a read cache, categorizes data
into different priorities, namely metadata, normal data, and
low-priority data. SSDs can be configured to cache only data
with specified priorities, with the result of reducing the write
load. LARC [19] manages a virtual LRU queue of limited
length in front of the SSD cache; hitting this LRU queue is the
only condition for entering the SSD cache, which reduces the
amount of written data. EMC's FAST Cache [20], Intel's
Fi