1
Z codes: General Systematic Erasure Codes with Optimal Repair Bandwidth and Storage for
Distributed Storage Systems
Qing Liu
∗
, Dan Feng
∗
, Hong Jiang
†
, Yuchong Hu
∗
, Tianfeng Jiao
∗
∗
Wuhan National Laboratory for Optoelectronics (WNLO),
∗
School of Computer, Huazhong University of Science and Technology (HUST), China
†
University of Nebraska-Lincoln, USA
Email: {qing, dfeng}@hust.edu.cn, jiang@cse.unl.edu, {yuchonghu, tfjiao}@hust.edu.cn
Abstract—Erasure codes are widely used in distributed storage
systems to prevent data loss. Traditional erasure codes suffer
from a typical repair-bandwidth problem in which the amount
of data required to reconstruct the lost data, referred to as
the repair bandwidth, is often far more than the theoretical
minimum. While many novel erasure codes have been proposed
in recent years to reduce the repair bandwidth, these codes either
require extra storage capacity and computation overhead or are
only applicable to some special cases.
To address the weaknesses of the existing solutions to the
repair-bandwidth problem, we propose Z Codes, a general family
of codes capable of achieving the theoretical lower bound of
repair bandwidth for a single data node failure. To the best
of our knowledge, the Z codes are the first general systematic
erasure codes that achieve optimal repair bandwidth under the
minimum storage. Our in-memory performance evaluations of
a 1-GB file indicate that Z codes have encoding and repairing
speeds that are approximately equal to those of the Reed-Solomon
(RS) codes, and their speed on the order of GB/s practically
removes computation as a performance bottleneck.
Index Terms—Erasure Codes; Repair Bandwidth; Distributed
Storage System; Failure Tolerance
I. INTRODUCTION
Erasure codes are widely used in distributed storage systems
to recover from data loss in the event of server breakdown.
These codes incorporate data redundancy in a space-efficient
manner to tolerate data loss by reconstructing the lost data and
are systematic in that the original data is kept unchanged after
encoding and can be accessed without decoding. Typical sys-
tematic codes include Reed-Solomon (RS) codes and Cauchy
Reed-Solomon (CRS) codes.
However, such traditional erasure codes face a known
repair-bandwidth problem [1] that becomes increasingly more
important in a distributed environment where bandwidth is
typically expensive in terms of both performance and power
consumption. That is, in a storage system of data size M with
k data nodes and m parity (i.e., redundant) nodes that are
interconnected by a network of limited bandwidth, each node
stores data of size
M
k
and the repair of one node’s failure
requires a disk-I/O or network bandwidth of size M, which is
k times the size of the lost data (
M
k
). In this paper, we define
repair bandwidth as the amount of the data accessed by the
disk I/O and transferred over the network.
The minimum storage for an (m, k) code is
M
k
, so k
nodes of data can retain the original data. However, Dimakis
et al. pointed out that the theoretical minimum storage and
Storage overhead
Repair bandwidth
Z codes
RS codes
Minimum
storage
Minimum repair
bandwidth
Optimal
repair
Fig. 1: Theoretical lower-bound trade-off curve of storage
overhead and repair bandwidth.
minimum repair bandwidth cannot be achieved at the same
time and there is a lower-bound trade-off curve between the
two [1], as plotted in Fig. 1. Although codes with the minimum
storage cannot achieve the minimum repair bandwidth, their
theoretical repair bandwidth lower bound, which is called
optimal repair bandwidth [2], can be calculated as:
(m + k − 1)M/(mk) (1)
The repair bandwidth mentioned above refers particularly to a
single node failure, which is the most common case in practice.
Recently, many novel repair-bandwidth-efficient codes have
been proposed to reduce the repair bandwidth, but at the ex-
penses of (1) extra storage capacity, (2) additional computation
overhead or (3) being applicable only to some special cases.
The Simple Regenerating Codes (SRC) [3] and Local Recon-
struction Codes (LRC) [4] need additional storage resources to
store the extra parity information. The Functional Minimum
Storage Regenerating (FMSR) codes are not systematic and
only store parity information after encoding, thereby resulting
in a high computation cost [5]. The Rotate Reed-Solomon
(RRS) codes [6] also require additional computation for re-
pairing the failure of a single data node. Under the burden
of not having a general construction mechanism, the Zigzag
codes [7] are unsuitable for general storage systems. The
Product-matrix-MSR (PMSR) codes [2] are only applicable
when the code rate (the ratio of the data size and size of data
after encoding) is less than
1
2
, namely, m > k, which greatly
limits their applicability.
To address the above weaknesses in the existing codes, we
present in this paper a family of novel erasure codes, called
the Z codes. The Z codes not only can achieve the theoretical
optimal repair bandwidth under the minimum storage for a
single data node’s failure, but also have the following desirable
properties that make them suitable for distributed storage
systems. (1) The minimum storage property: the Z codes
consume exactly the same storage capacity as the RS and