Fault-Tolerant On-Chip Network Routing Design Without Virtual Channels


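This is a research paper on a fault-tolerant routing algorithm for networks-on-chip that uses no virtual channels, written by Pengju Ren, Qingxin Meng, Xiaowei Ren, and Nanning Zheng of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. Targeting future reliable, massively parallel many-core systems, the authors propose a fault-tolerant adaptive routing strategy that needs no virtual channels. Avoiding virtual channels lowers design complexity, power consumption, and service time, and the smaller area of the resulting lightweight router further reduces the probability of hardware failure. Because virtual channels add complexity and power consumption in current many-core systems, building a fault-tolerant network without them has become an attractive direction. The core contribution is the construction of an acyclic channel dependency graph that breaks all cycles while preserving network connectivity. Because the graph is acyclic, routing is deadlock-free without relying on virtual channels, which improves system reliability. Experiments on an 8x8 2D-mesh network under uniform random traffic show 99.73% reliability when 10% of the links have failed and 97.56% reliability when 20% have failed, which indicates that the method stays robust at high fault rates. The paper is classified under EDA2.1, On-chip Communication and Networks-on-chip (NoC): on-chip communication network modeling and analysis, with a focus on fault-tolerant and adaptive routing. The work has practical value for chip-level reliability and offers a theoretical basis and practical guidance for designing future high-performance, low-power, highly available many-core systems.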

Fault-tolerant Routing for On-chip Network Without Using Virtual Channels

Pengju Ren, Qingxin Meng, Xiaowei Ren, and Nanning Zheng

Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi, 710049, China

(pengjuren, qxmeng, renxiaowei66)@gmail.com, nnzheng@mail.xjtu.edu.cn

ABSTRACT

Thanks to its lower design complexity, power consumption, and service time, avoiding virtual channels has become a very attractive approach to building future reliable and massively parallel many-core systems. Furthermore, the smaller area of a light-weight router decreases the probability of failure. To this end, by constructing an acyclic channel dependency graph that breaks all cycles and preserves the connectivity of the network, we propose a new deadlock-free, fault-tolerant adaptive routing algorithm that uses no virtual channels. Extensive experiments on an 8x8 2D-mesh network demonstrate 99.73% and 97.56% reliability under uniform random traffic when 10% and 20% of the links have failed, respectively.
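To make the role of the acyclic channel dependency graph concrete, the following is a minimal sketch, not the algorithm proposed in this paper. It builds the channel dependency graph of a fault-free mesh under two turn rules and checks each for cycles. Plain dimension-order (XY) routing stands in as a well-known rule whose graph is acyclic; all function names are hypothetical and chosen for illustration only.

```python
from itertools import product


def mesh_channels(n):
    """Directed channels (src, dst) between adjacent nodes of an n x n mesh."""
    chans = []
    for x, y in product(range(n), repeat=2):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n:
                chans.append(((x, y), (nx, ny)))
    return chans


def direction(ch):
    """Unit vector of travel along a channel."""
    (x0, y0), (x1, y1) = ch
    return (x1 - x0, y1 - y0)


def build_cdg(chans, turn_allowed):
    """Add edge c1 -> c2 when a packet holding c1 may request c2 next."""
    by_src = {}
    for c in chans:
        by_src.setdefault(c[0], []).append(c)
    cdg = {c: [] for c in chans}
    for c1 in chans:
        for c2 in by_src.get(c1[1], []):
            if turn_allowed(direction(c1), direction(c2)):
                cdg[c1].append(c2)
    return cdg


def is_acyclic(graph):
    """Kahn's algorithm: acyclic iff every vertex can be topologically removed."""
    indeg = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indeg[w] += 1
    ready = [v for v, d in indeg.items() if d == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in graph[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return removed == len(graph)


def no_u_turn(d1, d2):
    """Assumed baseline rule: every turn is allowed except a 180-degree turn."""
    return d2 != (-d1[0], -d1[1])


def xy_turns(d1, d2):
    """Dimension-order rule: once a packet moves in Y it never returns to X."""
    return no_u_turn(d1, d2) and not (d1[0] == 0 and d2[0] != 0)


channels = mesh_channels(4)
print(is_acyclic(build_cdg(channels, no_u_turn)))  # unrestricted turns
print(is_acyclic(build_cdg(channels, xy_turns)))   # dimension-order turns
```

With unrestricted turns, the four channels around any unit square of the mesh depend on one another in a ring, so the check fails. Forbidding Y-to-X turns removes every such ring. A fault-tolerant scheme has the harder job of removing the cycles while keeping the surviving nodes mutually reachable.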

Categories and Subject Descriptors

EDA2.1 [On-chip Communication and Networks-on-chip]: On-chip communication network modeling and analysis

General Terms

Algorithm, Design, Performance

Keywords

Fault tolerance, Networks-on-Chip, Without virtual channels, Reliability

1. INTRODUCTION

The ongoing miniaturization of semiconductor manufacturing technologies enables the integration of hundreds to thousands of processing cores on a single chip [3]. On the other hand, the barriers to SoC scaling have risen with each successive node shrink. One of the most important obstacles is that transistors are approaching the limits of scaling: gate widths are nearing the molecular scale, and the need to ensure ever-higher levels of control over dopant distribution and voltage characteristics is running up against the fundamental limits of physical laws [2]. This comes with the downside of the components’ increased susceptibility to failure. A typical case is Intel Corp's recent announcement that Broadwell’s 14nm deployment would be somewhat delayed because of “defect density issues”.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. DAC ’14, June 01 - 05 5/14/06$15.00. http://dx.doi.org/10.1145/2593069.2593141

As is well known, radiation, electromagnetic interference, electrostatic discharge, aging, process variability, and dynamic temperature variation are the major causes of failures in MOSFET-based circuits. The survey articles [2] [14] [5] provide further details. The combination of these factors will soon make long-term product reliability extremely difficult to achieve in complex modern many-core systems. Therefore, in order to maintain connectivity and correct operation, a fault-tolerant approach must be taken into account in the communication fabric.

We are thus facing two salient problems: how to efficiently connect the increasing number of on-chip computation and storage resources, and how to effectively manage decreasing transistor reliability. Network-on-Chip (NoC) has emerged as an attractive solution that transmits messages through a distributed system of programmable routers connected by links. It can potentially achieve fault tolerance by providing alternative paths when messages encounter faulty regions. In this way, it permits a more efficient and flexible utilization of communication resources than traditional point-to-point links and buses.
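As a small illustration of how a mesh NoC can offer an alternative path around a faulty link, the sketch below searches for a detour with breadth-first search. This is an assumption made for illustration, not the routing algorithm of this paper: BFS finds a shortest surviving path using global knowledge and ignores deadlock entirely. The names `detour_path` and `neighbors` are hypothetical.

```python
from collections import deque


def neighbors(node, n, failed):
    """Yield mesh neighbors of node that are reachable over a healthy link."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        in_mesh = 0 <= nxt[0] < n and 0 <= nxt[1] < n
        if in_mesh and frozenset((node, nxt)) not in failed:
            yield nxt


def detour_path(src, dst, n, failed_links=()):
    """Shortest path from src to dst in an n x n mesh, or None if cut off.

    failed_links is an iterable of (node_a, node_b) pairs; a failed link is
    assumed unusable in both directions.
    """
    failed = {frozenset(link) for link in failed_links}
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in neighbors(node, n, failed):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


healthy = detour_path((0, 0), (2, 0), 3)
detour = detour_path((0, 0), (2, 0), 3, failed_links=[((0, 0), (1, 0))])
print(healthy)
print(detour)
```

In the 3x3 example, the failed link forces the message out of its row and back, so the detour visits more routers than the fault-free path. If every link of the source router fails, the function returns `None`, which corresponds to a node that no routing algorithm can reach.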

Fault control is composed of two procedures: fault diagnosis and fault containment. We use a dedicated built-in self-test (BIST) module, as described by Cota et al. in [7], to pinpoint the location of faulty components. Error-correcting codes (ECC) and other coding schemes provide another, in-operation detection method and have been widely used to check for single-bit or multi-bit errors. These techniques are interesting topics that we leave undiscussed. In this paper, we focus on the fault-tolerant routing itself. It is worth mentioning that, in practice, a combination of techniques has to be used to provide complete protection against various types of faults.

It is foreseeable that an unpredictable fault distribution leads to an irregular topology, and that the use of alternative paths to avoid faults runs the risk of deadlock. Many fault-tolerant routing algorithms apply adaptive routing and use multiple virtual channels to route around faulty regions while ensuring the absence of deadlock [18] [22] [19]. However, dynamic routing and multiple virtual channels increased the complex-
