Multi-bit Transient Fault Control for NoC Links
Using 2D Fault Coding Method
Xiaowen Chen†,‡, Zhonghai Lu‡, Yuanwu Lei†, Yaohua Wang†, Shenggang Chen†
†College of Computer, National University of Defense Technology, 410073, Changsha, China
‡Department of Electronic Systems, KTH - Royal Institute of Technology, 16440 Kista, Stockholm, Sweden
‡{xiaowenc,zhonghai}@kth.se
Abstract—At deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when detecting and correcting multiple transient faults. Therefore, a cost-effective ECC technique, named the 2D fault coding method, is adopted to address the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to locate the faulty bits and then correct them by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal largely reduces the ECC hardware cost, achieves much higher fault detection coverage, maintains almost zero silent fault percentages, and achieves higher fault correction percentages when normalized to the same area, demonstrating that it is cost-effective and suitable for multi-bit transient fault control of NoC links.¹
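The row-and-column parity idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' hardware implementation. The function names, the 4x4 arrangement, and the assumption that the parity bits themselves arrive fault-free are all hypothetical. Only the simplest case, a single flipped bit, is corrected; any other parity mismatch is only reported as detected.

```python
from typing import List, Tuple


def parities(bits: List[int], cols: int) -> Tuple[List[int], List[int]]:
    """Even parity of every row and column of a row-major bit matrix.

    Hypothetical helper; assumes len(bits) is a multiple of cols.
    """
    rows = [bits[i:i + cols] for i in range(0, len(bits), cols)]
    row_par = [sum(row) % 2 for row in rows]
    col_par = [sum(row[c] for row in rows) % 2 for c in range(cols)]
    return row_par, col_par


def check_and_correct(
    received: List[int], cols: int, row_par: List[int], col_par: List[int]
) -> Tuple[List[int], str]:
    """Compare recomputed parities with the transmitted ones.

    Assumption: row_par and col_par were received without faults.
    Returns the (possibly corrected) bits and a status string.
    """
    new_rows, new_cols = parities(received, cols)
    bad_rows = [r for r, (a, b) in enumerate(zip(new_rows, row_par)) if a != b]
    bad_cols = [c for c, (a, b) in enumerate(zip(new_cols, col_par)) if a != b]
    fixed = list(received)
    if not bad_rows and not bad_cols:
        # No mismatch: fault-free, or a fault pattern that parity cannot see.
        return fixed, "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One failing row and one failing column intersect at the faulty bit.
        fixed[bad_rows[0] * cols + bad_cols[0]] ^= 1
        return fixed, "corrected"
    # Any other mismatch pattern is flagged but left uncorrected in this sketch.
    return fixed, "detected"


data = [1, 0, 1, 1,
        0, 1, 0, 0,
        1, 1, 1, 0,
        0, 0, 0, 1]
row_par, col_par = parities(data, 4)
faulty = list(data)
faulty[6] ^= 1  # inject a single transient fault at row 1, column 2
print(check_and_correct(faulty, 4, row_par, col_par)[1])  # prints "corrected"
```

Two faults in the same row cancel in that row's parity but still show up in two column parities, so the sketch reports them as detected rather than corrected.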
I. INTRODUCTION
As chip technology enters the deep nanometer era, i.e., transistor feature sizes shrink to 45nm, 40nm, 28nm, and even smaller, integrated circuits characterized by high frequency and low voltage become increasingly susceptible to transient faults and permanent faults. The occurrence of transient faults is considered to be roughly 80%[1]. Reliability of links challenges large-scale Network-on-Chip (NoC) design. At deep nanometer scale, transient fault tolerance of NoC links faces new phenomena: (I) The fault probability of a link wire becomes bigger. The fault probability (ε) of a link wire can be characterized by the classic fault model[2][3] with a Gaussian distribution as
$$\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right) = \int_{V_{dd}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy \tag{1}$$
where $V_{dd}$ is the supply voltage and $\sigma_N$ is the noise voltage. Fig. 1 depicts the trends of supply voltage and the ratio of noise voltage to supply voltage, according to real technology data from the TSMC foundry[4].

[Fig. 1. Trends of supply voltage and the ratio of noise voltage from TSMC. Supply voltage (V) / ratio of noise voltage per TSMC technology node: 250nm: 2.5 / 6%; 180nm: 1.8 / 8.3%; 130nm: 1.2 / 12.5%; 90nm: 1.2 / 12.5%; 65nm: 1 / 15%; 40nm: 0.9 / 16.7%; 28nm: 0.85 / 17.6%.]

¹ The research is partially supported by the National Natural Science Foundation of China (No. 61502508), the Hunan Natural Science Foundation of China (No. 2015JJ3017), and the Doctoral Program of the Ministry of Education in China (No. 20134307120034).

As the technology shrinks, the
supply voltage decreases, mainly to reduce chip power consumption. However, the proportion of noise voltage in the supply voltage becomes bigger. Therefore, according to Equation (1), the increase of the ratio of noise voltage to supply voltage results in an increase of the fault probability (ε) of a link wire. (II) The fault probability of a link becomes bigger. Because technology shrinking leads to narrower wires and smaller spacing between adjacent wires, and the width of an on-chip link is not subject to the limited IO resources of a chip, on-chip links can usually be designed to be 256-bit, 512-bit, or even wider in order to improve bandwidth. Equation (2) shows that, as the link width (denoted as w) becomes bigger, multiple wires in a link may have transient faults concurrently, resulting in an increase of the fault probability (η) of a link[5]. Multiple faults on links have become more important[6][7].
$$\eta = 1 - (1 - \varepsilon)^{w} \tag{2}$$
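Equations (1) and (2) can be evaluated numerically to see the trend described above. The following Python sketch is illustrative only. It assumes that the noise-to-supply ratios listed with Fig. 1 can be used directly as σ_N/V_dd in Equation (1), and the function names are hypothetical. The Q-function is computed from the standard identity Q(x) = erfc(x/√2)/2.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def wire_fault_probability(noise_ratio: float) -> float:
    """Equation (1), with noise_ratio = sigma_N / V_dd (assumed mapping)."""
    return q_function(1.0 / (2.0 * noise_ratio))


def link_fault_probability(eps: float, width: int) -> float:
    """Equation (2): 1 - (1 - eps)^w, written to stay accurate for tiny eps."""
    return -math.expm1(width * math.log1p(-eps))


# Ratios of noise voltage to supply voltage, as listed with Fig. 1.
NOISE_RATIO = {
    "250nm": 0.06, "180nm": 0.083, "130nm": 0.125, "90nm": 0.125,
    "65nm": 0.15, "40nm": 0.167, "28nm": 0.176,
}

for node, ratio in NOISE_RATIO.items():
    eps = wire_fault_probability(ratio)
    eta_256 = link_fault_probability(eps, 256)
    eta_512 = link_fault_probability(eps, 512)
    print(f"{node:>6}: eps={eps:.3e}  eta(w=256)={eta_256:.3e}  eta(w=512)={eta_512:.3e}")
```

Under these assumptions, ε grows as the noise ratio grows, and for a fixed ε the link fault probability η grows with the link width w, which matches the two phenomena (I) and (II).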
NoC links are more prone to multi-bit transient faults than ever before at deep nanometer scale, and it is necessary to study multi-bit transient fault control for NoC links. Typically, fault tolerance is achieved by redundancy: using redundant components to cope with failing ones (spatial redundancy), re-executing a data transmission with the same component (temporal redundancy), or adding information for fault detection and correction (information redundancy)[8]. In this paper, our scope is multi-bit transient fault control for on-chip communication links of large-scale NoCs via information redundancy.
In information redundancy, ECC (Error Correcting Codes)
[9][10] is a commonly used and effective protection technique.
978-1-4673-9030-9/16/$31.00 © 2016 IEEE