Fault-tolerant Routing for On-chip Network Without Using
Virtual Channels
Pengju Ren, Qingxin Meng, Xiaowei Ren, and Nanning Zheng
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi, 710049, China
(pengjuren, qxmeng, renxiaowei66)@gmail.com, nnzheng@mail.xjtu.edu.cn
ABSTRACT
Because it lowers design complexity, power consumption, and
service time, avoiding virtual channels has become a very
attractive approach to building future reliable and massively
parallel many-core systems. Furthermore, the smaller area of
a light-weight router decreases its probability of failure.
To this end, by constructing an acyclic channel dependency
graph that breaks all cycles while preserving the connectivity
of the network, we propose a new deadlock-free, fault-tolerant
adaptive routing algorithm that uses no virtual channels.
Extensive experiments on an 8x8 2D-mesh network demonstrate
99.73% and 97.56% reliability under uniform random traffic
when 10% and 20% of the links have failed, respectively.
Categories and Subject Descriptors
EDA2.1 [On-chip Communication and Networks-on-
chip]: On-chip communication network modeling and anal-
ysis
General Terms
Algorithms, Design, Performance
Keywords
Fault tolerance, Networks-on-Chip, Without virtual chan-
nels, Reliability
1. INTRODUCTION
The ongoing miniaturization of semiconductor manufacturing
technologies enables the integration of hundreds to thousands
of processing cores on a single chip [3]. On the other hand,
the barriers to SoC scaling have risen with each successive
node shrink. One of the most important obstacles is that
transistors are approaching the limits of scaling: gate widths
are nearing the molecular scale, and the need to ensure
ever-higher levels of control over dopant distribution and
voltage characteristics is running up against the fundamental
limits of physical laws [2]. This comes with the downside of
the components’ increased susceptibility to failure. A typical
case is Intel's recent announcement that Broadwell’s 14nm
deployment would be somewhat delayed because of “defect
density issues”.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers
or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions@acm.org. DAC ’14, June 01 - 05
2014, San Francisco, CA, USA Copyright 2014 ACM 978-1-4503-2730-
5/14/06$15.00. http://dx.doi.org/10.1145/2593069.2593141
As is well known, radiation, electromagnetic interference,
electrostatic discharge, aging, process variability, and
dynamic temperature variation are the major causes of
failures in MOSFET-based circuits. The survey articles
[2] [14] [5] provide further details. The combination of these
factors will soon make long-term product reliability extremely
difficult to achieve in complex modern many-core systems.
Therefore, in order to maintain connectivity and correct
operation, fault-tolerant approaches must be taken into
account in the communication fabric.
We thus face two salient problems: how to efficiently
connect the increasing number of on-chip computation and
storage resources, and how to effectively manage decreasing
transistor reliability. Network-on-Chip (NoC) has emerged as
an attractive solution that transmits messages through a
distributed system of programmable routers connected by
links. It can potentially achieve fault tolerance by providing
alternative paths when messages encounter faulty regions. In
this way, it permits a more efficient and flexible utilization
of communication resources than traditional point-to-point
links and buses.
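As a rough illustration of how a mesh NoC can offer alternative paths around faulty links, the sketch below runs a breadth-first search over the healthy links of a small 2D mesh. It is not the routing algorithm proposed in this paper: it assumes global knowledge of the fault set and ignores deadlock entirely. The names `mesh_links` and `find_path` and the example fault sets are hypothetical.

```python
from collections import deque


def mesh_links(n):
    """Return the set of bidirectional links of an n x n 2D mesh.

    Each link is a frozenset of its two endpoint coordinates (x, y).
    """
    links = set()
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < n:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def find_path(n, failed, src, dst):
    """Breadth-first search over healthy links only.

    Returns a shortest list of nodes from src to dst, or None if the
    failed links disconnect the two nodes.
    """
    healthy = mesh_links(n) - failed
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to src, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            # Out-of-range neighbours are never in `healthy`.
            if frozenset({node, nxt}) in healthy and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

For example, on a 3x3 mesh with the link between (0, 0) and (1, 0) marked as failed, `find_path(3, {frozenset({(0, 0), (1, 0)})}, (0, 0), (2, 0))` returns a four-hop detour instead of the two-hop fault-free route. If both links of node (0, 0) fail, it returns `None`.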
Fault control is composed of two procedures: fault diagnosis
and fault containment. We use a dedicated built-in self-test
(BIST) module, as described by Cota et al. in [7], to pinpoint
the location of faulty components. Error-correcting codes
(ECC) and other coding schemes provide another in-operation
detection method, which has been widely used to check for
single- or multi-bit errors. These techniques are interesting
topics in their own right, but we do not discuss them further.
In this paper, we focus on the fault-tolerant routing itself.
It is worth mentioning that, in practice, several techniques
have to be combined to provide complete protection against
various types of faults.
It is foreseeable that unpredictable fault distributions lead
to irregular topologies, and that the use of alternative paths
to avoid faults runs the risk of deadlock. Many fault-tolerant
routing algorithms apply adaptive routing and use multiple
virtual channels to route around faulty regions while ensuring
the absence of deadlock [18] [22] [19]. However, dynamic
routing and multiple virtual channels increased the complex-