X. Pei et al. / Future Generation Computer Systems 69 (2017) 24–40 27
Table 1
Parameter notations.

  Symbol      Representation                                    Range
  (n, k, r)   Coding parameters                                 1 ≤ r ≤ n − k
  m           Number of modified data blocks                    1 ≤ m ≤ k
  g           Number of groups                                  g ∈ [1, k]
  D_i         The ith original data block                       i ∈ [0, k − 1]
  D′_i        The ith updated data block                        i ∈ [0, k − 1]
  P_i         The ith original parity block                     i ∈ [0, r − 1]
  P*_{i,j}    The jth partial parity block from the ith group   i ∈ [1, g], j ∈ [0, r − 1]
  P′_i        The ith updated parity block                      i ∈ [0, r − 1]
  α_{i,j,l}   Coefficient of the lth data block for the jth     i ∈ [1, g], j ∈ [0, r − 1]
              parity block in the ith group
reconstruction schemes proposed in [26] exploit the data locality
of multiple data blocks and organize the data transmission
along the reconstruction path in a pipelined way to reduce
the reconstruction time. However, these reconstruction schemes
are sensitive to slow nodes: the reconstruction time is
limited by the slowest node along the ‘‘line’’ structure.
Li et al. [27] propose RCTREE to minimize the reconstruction
traffic by combining the advantages of regenerating codes
with a tree-structured regeneration topology. However, the
tree is constructed according to the available bandwidth,
which is hard to measure in real distributed systems. To
deal with multiple failures, researchers [28,29] propose
tree-based reconstruction schemes that improve the reconstruction
performance for multiple failures through parallel or cooperative
reconstruction. However, the constructed trees are independent of
each other, which may consume more network traffic. Moreover,
researchers in [30,31] propose cooperative reconstruction schemes
that reduce the network traffic for multiple failures by
exchanging data between the new nodes. However, the nodes are
organized in a star structure, which offers lower transmission
efficiency than a tree structure.
In this paper, we focus on optimizing the update efficiency
of erasure codes and propose Group-U, a grouped and pipelined
update scheme based on erasure codes. Group-U partitions the
data nodes into groups and elects one data node in each group
as the relayer to organize the data flow from the data nodes
to the parity nodes. Furthermore, Group-U distributes the
computation among multiple nodes to improve the update efficiency.
With a comprehensive threshold and a data cache technique, Group-U
ensures data reliability and reconstructs failed data with
the least overhead.
4. Grouped and pipelined data transmission
4.1. Architecture of framework
In this section, we propose a general three-layer update
framework for Group-U that supports both single and multiple
updates, as illustrated in Fig. 1. The parameters used in this
section are shown in Table 1.
At a high level, the framework consists of three layers: the
data layer, the relay layer, and the parity layer. The data layer
is responsible for grouping the data nodes and transmitting
the delta data of each data block to the relay layer. There are
three steps at this layer. First, the m data nodes (denoted
as node_0, ..., node_{m−1}, with data blocks D_0, ..., D_{m−1}) are
divided into g groups, with m/g data blocks in each group, where
g is the number of groups. Thus, we obtain g groups of data blocks
(D_0, ..., D_{m/g−1}), ..., (D_{m(g−1)/g}, ..., D_{m−1}). In each group, we
elect one of the data nodes to be the relayer, which is responsible
for organizing the data flow from the other data nodes within the
same group to the parity nodes. In this way, the data transmission
and the data computation are restricted within each group,
which improves the update efficiency. Then, each data node node_i
calculates its delta data, represented by D′_i − D_i, and sends
the delta to the relayer of its group. Finally, each node_i
completes the update by replacing the original data D_i with D′_i.
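The three data-layer steps above can be sketched in a few lines of code. The helper names (`make_groups`, `delta`) are illustrative rather than from the paper, and the delta is computed as a bytewise XOR, which equals subtraction in the GF(2^w) arithmetic that erasure codes use; we also assume, as the paper does, that g divides m evenly.

```python
# Sketch of the data-layer steps: split m data nodes into g groups,
# elect the first node of each group as the relayer, and compute deltas.
# Helper names are illustrative; assumes g evenly divides m.

def make_groups(m: int, g: int) -> list[list[int]]:
    """Group node indices 0..m-1 into g contiguous groups of m // g nodes."""
    size = m // g
    return [list(range(j * size, (j + 1) * size)) for j in range(g)]

def delta(old: bytes, new: bytes) -> bytes:
    """Bytewise XOR: subtraction in GF(2^w), so D'_i - D_i == D'_i XOR D_i."""
    return bytes(a ^ b for a, b in zip(old, new))

m, g = 6, 2
groups = make_groups(m, g)             # [[0, 1, 2], [3, 4, 5]]
relayers = [grp[0] for grp in groups]  # first node of each group relays
```

Electing the first node of each group as the relayer matches the relayer indices node_0, node_{m/g}, ..., node_{m(g−1)/g} used in the framework.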
At the relay layer, there are g relayers (node_0, node_{m/g}, ...,
node_{m(g−1)/g}), with node_{m(j−1)/g} as the relayer of the jth group,
where 1 ≤ j ≤ g. There are two steps for each relayer at
this layer. First, each relayer node_i calculates its own delta
D′_i − D_i and receives the m/g − 1 deltas of the other m/g − 1
data blocks from the data nodes within its group. Thus, each
relayer node_i holds m/g deltas
((D′_i − D_i), ..., (D′_{i+m/g−1} − D_{i+m/g−1})).
Then, the relayer node_i encodes the m/g − 1 received deltas and
its own delta into r partial parity parts P*_{i,j} with Eq. (5), where
P*_{i,j} is the jth partial parity part of the ith group. Finally,
these encoded partial parity parts are sent to the parity layer,
with P*_{i,j} delivered to parity node parity_j by the relayer node_i.
P*_{i,j} = Σ_{l = m(i−1)/g}^{mi/g − 1} α_{i,j,l} · (D′_l − D_l),
1 ≤ i ≤ g, 0 ≤ j ≤ r − 1, 1 ≤ m ≤ k.  (5)
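Eq. (5) is a linear combination over a Galois field, evaluated bytewise. The sketch below computes one partial parity part in GF(2^8); the coefficients and block contents are made-up examples, and `gf_mul` uses the common 0x11B reduction polynomial, which the paper does not specify.

```python
# Sketch of Eq. (5): one partial parity part P*_{i,j} as a GF(2^8)
# linear combination of the group's deltas. Values are examples only.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the polynomial 0x11B."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def partial_parity(alphas, old_blocks, new_blocks):
    """P*_{i,j} = sum_l alpha_{i,j,l} * (D'_l - D_l); sum and minus are XOR."""
    size = len(old_blocks[0])
    part = bytearray(size)
    for a, old, new in zip(alphas, old_blocks, new_blocks):
        for pos in range(size):
            part[pos] ^= gf_mul(a, old[pos] ^ new[pos])
    return bytes(part)
```

Since subtraction and addition in GF(2^8) are both XOR, an unmodified block contributes a zero delta and drops out of the sum, which is why only the modified blocks need to be transmitted.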
At the parity layer, there are r parity nodes (denoted as
parity_0, ..., parity_{r−1}) with r parity blocks (P_0, ..., P_{r−1}),
and the r updated parity blocks are represented as (P′_0, ..., P′_{r−1}).
Each parity node completes the update in two steps. First, each
parity node parity_i, 0 ≤ i ≤ r − 1, receives g partial parity parts
(P*_{1,i}, ..., P*_{g,i}), with one part P*_{j,i} from the jth group.
Then, each parity node parity_i combines the g received partial parity
parts with the stored parity data according to Eq. (6), where P*_{j,i}
is the partial parity part received from the jth group, P_i is the
original parity block, and P′_i is the updated parity block.
P′_i = Σ_{j=1}^{g} P*_{j,i} + P_i,  0 ≤ i ≤ r − 1.  (6)
The nodes in all three layers cooperatively complete the update
process. When multiple data nodes are to be updated (m > 1), they
are separated into g groups and the nodes in each group complete
the update operations independently. When only one data node is to
be updated (i.e., m = 1), the data node sends the delta to all the
parity nodes after updating its stored data. In both cases, the
parity nodes update the parity data with Eq. (6). In the following
sections, we specify how to improve the update efficiency by
dynamically adjusting the group size, distributing the data
computation, and pipelining the data transmission.
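The case split between a single update and a grouped multi-node update can be expressed as a small transmission plan; the node and function names below are illustrative, not from the paper, and g is assumed to divide m evenly.

```python
# Sketch of the case split: m == 1 sends the delta straight to every
# parity node, while m > 1 routes deltas through g group relayers.
# Names are illustrative, not from the paper.

def plan_update(m: int, g: int, r: int):
    """Return a list of (sender, receivers) transmission hops."""
    if m == 1:
        # The lone data node sends its delta to all r parity nodes.
        return [("node_0", [f"parity_{j}" for j in range(r)])]
    hops = []
    size = m // g
    for grp in range(g):
        relayer = f"node_{grp * size}"
        members = [f"node_{grp * size + t}" for t in range(1, size)]
        # Group members send their deltas to the group's relayer ...
        hops += [(node, [relayer]) for node in members]
        # ... and the relayer forwards r partial parity parts onward.
        hops.append((relayer, [f"parity_{j}" for j in range(r)]))
    return hops
```

The plan makes the bandwidth contrast visible: with grouping, each parity node receives g partial parts instead of m individual deltas.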
4.2. Grouping the data nodes
In this section, we discuss why the data nodes to be updated
should be grouped and how to group them.
When there are m data nodes to be updated, a bottleneck may
arise if all the data nodes connect to one relayer to complete