A General Fault-Tolerant Minimal Routing
for Mesh Architectures
Hongzhi Zhao, Member, IEEE, Nader Bagherzadeh,
Fellow, IEEE, and Jie Wu, Member, IEEE
Abstract—Fault-tolerant minimal routing algorithms aim at finding a Manhattan
path between the source and destination nodes and route around all faulty nodes.
Additionally, some non-faulty nodes that are helpless to make up of a fault-tolerant
minimal path should also be routed around. How to label such non-faulty nodes
efficiently is a major challenge. State-of-the-art solutions could not address it very
well. We propose a path-counter method. It can label every node that are helpless
to make up of a fault-tolerant minimal path with low time complexity. By counter the
number of fault-tolerant minimal paths, it can: support arbitrary fault distribution,
check the existence of fault-tolerant minimal paths, not sacrifice any available fault-
tolerant minimal paths.
Index Terms—Allowed-path-counter method, fault-tolerant minimal routing,
network-on-chip, mesh architectures
Ç
1INTRODUCTION
THE mesh-connected topology is one of the most thoroughly inves-
tigated network topologies for massively parallel computer net-
works [1] and Network-on-Chip (NoC) [2]. 2/3-dimensional (2D/
3D) mesh are lower dimensional meshes that have been commonly
discussed due to structural regularity for ease of construction and
high potential for handling various algorithms [3]. In mesh archi-
tectures with a large number of computing nodes (i.e., processors,
cores or processing elements), the network performance depends
heavily on the efficiency of routing algorithms.
As the scale of mesh architecture increases, the chance of failure
also increases [2]. The complex nature of network also makes them
vulnerable to disturbances. Therefore, the ability of tolerating fail-
ure for routing algorithms is becoming increasingly important [3].
To find one fault-free path for reliable communication, fault-toler-
ant routing is usually used and has been studied extensively [3],
[4]. According to the length of path achieved, fault-tolerant routing
can be classified as two types: one is the fault-tolerant minimal
routing algorithm which always routes the message along a Man-
hattan distance path in mesh [3], [4]; the other is the non-minimal
fault-tolerant routing algorithm. Because minimal routing (or Man-
hattan routing) algorithms can always route the packet to the desti-
nation in the quickest way, it is meaningful to study the minimal
(or Manhattan) fault-tolerant algorithms for mesh networks with
faulty nodes [3], [4], [5]. This paper focuses on the fault-tolerant
ability, so link weight is not taken into account.
Major challenges of fault-tolerant minimal routing algorithms
include: tolerating arbitrary fault distribution [2], checking the exis-
tence of fault-tolerant minimal paths [4], not sacrificing any avail-
able fault-tolerant minimal paths, and low time-complexity [5].
This paper will examine these challenges. Section 2 reviews related
work. Section 3 presents an allowed-path-counter method. Section 4
introduces how to apply the allowed-path-counter method to
design fault-tolerant minimal routing algorithms. Section 5 con-
cludes the paper.
2RELATED WORK
There are many algorithms supporting fault-tolerant minimal rout-
ing in mesh architectures. In earlier years, only a certain number of
faulty nodes were tolerated. In 1996, Glass and Ni modified the
negative-first routing algorithm to make it (n-1)-fault tolerant for
n-dimensional meshes [1]. It can only tolerate one faulty node in
the 2D mesh network, or two faulty nodes in 3D mesh network.
The routing algorithm presented by Shih can tolerate any pattern
of faulty nodes as long as the number of faulty nodes is no more
than certain number [6]. Although it also supports minimal rout-
ing, the certain number of faulty nodes tolerated limits its applica-
tion. To tolerate more faulty nodes for routing algorithms, Wu
proposed a fault-tolerant extended XY-routing protocol [7]. It can
tolerate more faulty nodes than the aforementioned work. But his
approach does not support failures of edge nodes.
After that, the concept of “fault block model” was presented to
address above problems. Fault block model need label nodes that
are helpless to make up of a fault-tolerant Manhattan path because
of the blocking from faulty nodes. Except faulty nodes, Wang pre-
sented the concepts “useless node” and “unreachable node” to
describe such helpless nodes in 2003 [4]. For example, Fu et al.
used coarse-grained rectangle block model for cost-sensitive NoC
that can include all the faulty nodes regardless of the fault distribu-
tion [2], but their routing algorithm needs to deactivate healthy
nodes. Especially, it is possible for a healthy destination node to be
deactivated and then included in a fault block or fault ring. For this
condition, all the available fault-tolerant minimal paths from a
source node to the destination node are sacrificed. Fukushima et al.
proposed a routing algorithm named Overlapped-Ring-Chain-
Route for NoC to reduce the number of deactivated nodes in fine-
grained rectangle regions, but still sacrificed healthy nodes [8].
Especially, these works could not decide the existence of fault-tol-
erant paths. So even if the available fault-tolerant paths do not
exist, the packets will still be injected into the network so that the
network is congested pointlessly.
To decrease the number of sacrificed healthy nodes, Tse et al.
used minimal rectangle fault block to surround faulty nodes and
then designed a centralized fault-tolerant routing algorithm [9].
The rectangle region did not include healthy nodes, but many rect-
angle regions need to be preprocessed with high time-complexity
according to the global information of mesh networks. Although
the XY routing algorithm that was used can avoid deadlock prob-
lem, XY routing algorithm itself will sacrifice many available fault-
tolerant minimal paths for the sake of its low adaptivity. But, it can-
not check the existence of fault-tolerant paths.
To our knowledge, Minimal-Connected-Component (MCC)
fault block model proposed by Wang [4] and its variants [3] can
identify the existence of fault-tolerant minimal paths online. So
traffic that cannot reach the destination along available minimal
paths will not be injected into the network. However, constructing
all MCC fault blocks needs to collect and distribute the MCC infor-
mation via information exchanges among neighbors. Its time-com-
plexity is up to Oðn
3
Þ where n is the number of faulty nodes.
To overcome above drawbacks existed in fault block models, we
propose a Allowed-Path-Counter method for fault-tolerant mini-
mal routing algorithms. It can label all useless, all unreachable
nodes and prohibited nodes with low time-complexity by counting
every nodes fault-tolerant Manhattan paths to the source or desti-
nation node. Based on the result of labeling, it is easy to find fault-
tolerant minimal paths.
H. Zhao is with the School of Computer and Information Technology, Beijing Jiao
Tong University, Beijing 100044, China. E-mail: hzzhao@m.bjtu.edu.cn.
N. Bagherzadeh and J. Wu are with the Department of EECS, University of
California, Irvine, CA 92697. E-mail: {nader, wuj8}@uci.edu.
Manuscript received 19 Oct. 2016; revised 3 Jan. 2017; accepted 8 Jan. 2017. Date of pub-
lication 10 Jan. 2017; date of current version 15 June 2017.
Recommended for acceptance by J.D. Bruguera.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.
org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2017.2651828
1240 IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 7, JULY 2017
0018-9340 ß 2017 IEEE. Personal use is per mitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.