2D Fault Coding Method Improves the Efficiency of Multi-bit Transient Fault Control for NoC Links


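This paper examines the challenge that multi-bit transient faults pose to Network-on-Chip (NoC) design at deep nanometer scale. Conventional Error Correcting Code (ECC) techniques incur significant area, power, and timing overheads when handling such faults. As a more efficient and economical alternative, the researchers propose the "2D fault coding method", a new ECC strategy for NoC links.

The core idea of the 2D fault coding method is to treat the wires of a link as a matrix and to apply lightweight Parity Check Coding (PCC) along two dimensions: the horizontal matrix rows and the vertical matrix columns. Horizontal PCCs and vertical PCCs work together to pinpoint the position of a transient fault. Once a fault is located, it is repaired by simply inverting the signal, which avoids complex correction logic and reduces hardware cost. The authors describe how the method is implemented, including how the PCCs are designed and applied and how the detection and correction capability is analyzed.

Experimental results show that, compared with conventional methods, the new ECC method substantially improves fault detection coverage, greatly reduces the occurrence of silent faults, and achieves a higher fault correction rate for the same area, which demonstrates its cost-effectiveness. The paper offers an effective solution for multi-bit transient fault control on NoC links, which matters for the reliability of integrated circuits at deep nanometer scale. With the 2D fault coding method, designers can reduce system complexity and power consumption without affecting system performance, which supports the use of NoC in modern computer architectures.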

Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method

Xiaowen Chen†,‡, Zhonghai Lu‡, Yuanwu Lei†, Yaohua Wang†, Shenggang Chen†

†College of Computer, National University of Defense Technology, 410073, Changsha, China

‡Department of Electronic Systems, KTH - Royal Institute of Technology, 16440 Kista, Stockholm, Sweden

‡{xiaowenc,zhonghai}@kth.se

Abstract—In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named the 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' positions and then correct the faulty bits by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, achieve much higher fault detection coverage, maintain almost zero silent fault percentages, and achieve higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable for multi-bit transient fault control of NoC links.
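The row-and-column parity idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' procedure or hardware design: the function names `encode` and `check_and_correct`, the row-major bit layout, the use of even parity, and the assumption that the parity bits themselves arrive fault-free are all choices made here for illustration. The sketch corrects only the simplest case (one mismatching row and one mismatching column) and reports every other mismatch pattern as detected but uncorrected.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def encode(bits: List[int], cols: int) -> Tuple[Matrix, List[int], List[int]]:
    """Lay the link bits out row-major as a matrix and compute even parity
    for every row (horizontal PCC) and every column (vertical PCC)."""
    if len(bits) % cols != 0:
        raise ValueError("link width must be a multiple of the column count")
    matrix = [bits[i:i + cols] for i in range(0, len(bits), cols)]
    row_parity = [sum(row) % 2 for row in matrix]
    col_parity = [sum(col) % 2 for col in zip(*matrix)]
    return matrix, row_parity, col_parity


def check_and_correct(received: Matrix, row_parity: List[int],
                      col_parity: List[int]) -> Tuple[Matrix, str]:
    """Recompute both parity sets on the received bits and compare.

    Assumption (illustrative only): the transmitted parity bits are fault-free.
    Returns a copy of the matrix plus a status string:
      "clean"     - no parity mismatch (no fault, or a silent fault pattern)
      "corrected" - exactly one row and one column mismatch; that bit is inverted
                    (some three-bit patterns would be miscorrected here)
      "detected"  - any other mismatch pattern; left uncorrected in this sketch
    """
    bad_rows = [r for r, row in enumerate(received)
                if sum(row) % 2 != row_parity[r]]
    bad_cols = [c for c, col in enumerate(zip(*received))
                if sum(col) % 2 != col_parity[c]]
    fixed = [row[:] for row in received]
    if not bad_rows and not bad_cols:
        return fixed, "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # The intersection of the failing row and column locates the fault;
        # correction is a simple inversion of that bit.
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
        return fixed, "corrected"
    return fixed, "detected"


if __name__ == "__main__":
    sent, rp, cp = encode([1, 0, 1, 1, 0, 0, 1, 0,
                           1, 1, 0, 1, 0, 1, 1, 0], cols=4)
    received = [row[:] for row in sent]
    received[2][1] ^= 1  # inject a single transient fault
    repaired, status = check_and_correct(received, rp, cp)
    print(status, repaired == sent)
```

Two faults in the same row leave that row's parity unchanged but flip two column parities, so a single-dimension row check would miss them while the column check still flags them. That is the intuition behind combining both dimensions.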

I. INTRODUCTION

As chip technology enters the deep nanometer era, i.e., its transistor feature size is reduced to 45nm, 40nm, 28nm, and even smaller, integrated circuits characterized by high frequency and low voltage will be increasingly susceptible to transient faults and permanent faults. The occurrence of transient faults is considered to be roughly 80%[1]. Reliability of links challenges large-scale Network-on-Chip (NoC) design. In deep nanometer scale, transient fault tolerance of NoC links faces new phenomena: (I) The fault probability of a link wire becomes bigger. The fault probability (ε) of a link wire can be characterized by the classic fault model[2][3] with a Gaussian distribution as

$$\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right) = \int_{V_{dd}/2\sigma_N}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}\, dy \qquad (1)$$

where $V_{dd}$ is supply voltage and $\sigma_N$ is noise voltage. Fig. 1 depicts the trends of supply voltage and the ratio of noise voltage to supply voltage, according to the real technology data from TSMC foundry[4]. As the technology shrinks, the

The research is partially supported by the National Natural Science Foundation of China (No. 61502508), the Hunan Natural Science Foundation of China (No. 2015JJ3017), and the Doctoral Program of the Ministry of Education in China (No. 20134307120034).

[Fig. 1. Trends of supply voltage and the ratio of noise voltage from TSMC. Axes: Supply Voltage (V) and Ratio of Noise Voltage, plotted over TSMC Technology nodes 250nm, 180nm, 130nm, 90nm, 65nm, 40nm, and 28nm. Data labels legible in the extract: supply voltages 2.5, 1.8, 1.2, 1.2, 0.9, 0.85 V; ratios 8.3%, 12.5%, 12.5%, 15%, 16.7%, 17.6%.]

supply voltage decreases for the main purpose of reducing the chip power consumption. However, the proportion of voltage noise in supply voltage becomes bigger. Therefore, according to Equation (1), the increase of the ratio of noise voltage to supply voltage results in the increase of the fault probability (ε) of a link wire. (II) The fault probability of a link becomes bigger. Because technology shrinking leads to narrower wires and smaller distances between adjacent wires, and the width of an on-chip link is not subject to the limited IO resources of a chip, on-chip links can usually be designed to be 256-bit, 512-bit, and even wider in order to improve the bandwidth performance. Equation (2) shows that, as the link width (notated as w) becomes bigger, multiple wires in a link may have transient faults concurrently, resulting in the increase of the fault probability (η) of a link[5]. Multiple faults existing on the links have become more important[6][7].

$$\eta = 1 - (1 - \varepsilon)^{w} \qquad (2)$$
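To get a feel for the two trends described above, the following Python sketch evaluates the wire fault probability and the link fault probability numerically. It rests on stated assumptions: Equation (1) is read as ε = Q(1 / (2r)), where r is the ratio of noise voltage to supply voltage; Equation (2) is read as η = 1 − (1 − ε)^w with independent wire faults; and the ratios 8.3% to 17.6% are the labels legible in Fig. 1. The function names are hypothetical, and the numbers it prints are illustrative rather than results from the paper.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Y > x) for a standard normal Y."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def wire_fault_probability(noise_ratio: float) -> float:
    """Equation (1) under the assumption noise_ratio = noise voltage / supply
    voltage, so that epsilon = Q(1 / (2 * noise_ratio))."""
    if noise_ratio <= 0.0:
        raise ValueError("noise_ratio must be positive")
    return q_function(1.0 / (2.0 * noise_ratio))


def link_fault_probability(eps: float, width: int) -> float:
    """Equation (2): eta = 1 - (1 - eps)**width, assuming independent wires.
    expm1/log1p keep precision when eps is very small."""
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must be in [0, 1)")
    return -math.expm1(width * math.log1p(-eps))


if __name__ == "__main__":
    # Ratios taken from the labels legible in Fig. 1; widths from the text.
    for ratio in (0.083, 0.125, 0.15, 0.167, 0.176):
        eps = wire_fault_probability(ratio)
        etas = ", ".join(
            f"w={w}: {link_fault_probability(eps, w):.3e}" for w in (256, 512)
        )
        print(f"ratio={ratio:.3f}  eps={eps:.3e}  eta -> {etas}")
```

Under these assumptions, both a larger noise ratio and a wider link push the link fault probability up, which matches the qualitative argument in phenomena (I) and (II).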

NoC links are more prone to multi-bit transient faults than ever before in deep nanometer scale, and it is necessary to study multi-bit transient fault control for NoC links.

Typically, fault tolerance can be achieved by redundancy. Redundancy is achieved by redundant components to cope with failing ones (spatial redundancy), by re-execution of a data transmission with the same component (temporal redundancy), and by adding information for fault detection and correction (information redundancy)[8]. In this paper, our scope is multi-bit transient fault control for on-chip communication links of large-scale NoCs via information redundancy.

In information redundancy, ECC (Error Correcting Codes)[9][10] is a commonly used and effective protection technique.

978-1-4673-9030-9/16/$31.00 ©2016 IEEE

