In Algorithm 2, the operand Y (multiplicand) is scanned
word-by-word, and the operand X is scanned bit-by-bit.
The operand length is n bits, and the wordlength is w bits.
$e = \lceil (n+1)/w \rceil$ words are required to store $S$, since its range is $[0, 2M-1]$. The original $M$ and $Y$ are extended by one extra bit of 0 as the most significant bit. Presented as vectors,
$$M = \left(0, M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}\right),$$
$$Y = \left(0, Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}\right),$$
$$S = \left(0, S^{(e-1)}, \ldots, S^{(1)}, S^{(0)}\right),$$
and $X = (x_{n-1}, \ldots, x_1, x_0)$.
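For illustration only, the following Python sketch (the function names and toy parameters are our own, not part of the original design) shows one way to store an operand as $e = \lceil (n+1)/w \rceil$ words of $w$ bits and to form the vector notation used above.

```python
def to_words(value, n, w):
    """Split an operand of at most n+1 bits into e = ceil((n+1)/w) words of
    w bits each, least significant word first."""
    e = -(-(n + 1) // w)                  # ceiling division
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(e)]

# Toy example: n = 8, w = 4, so e = 3 words per operand.
M = 0b10110101
words = to_words(M, n=8, w=4)             # [M^(0), M^(1), M^(2)] = [0b0101, 0b1011, 0b0000]
vector = [0] + list(reversed(words))      # (0, M^(2), M^(1), M^(0)), as written in the text
print(words, vector)
```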
The carry variable $C^{(j)}$ has two bits, as explained below. Assuming $C^{(0)} = 0$, each subsequent value of $C^{(j+1)}$ is given by
$$\left(C^{(j+1)}, S^{(j)}\right) = C^{(j)} + x_i Y^{(j)} + q_i M^{(j)} + S^{(j)}.$$
Assuming that $C^{(j)} \le 3$, we obtain
$$\left(C^{(j+1)}, S^{(j)}\right) = C^{(j)} + x_i Y^{(j)} + q_i M^{(j)} + S^{(j)} \le 3 + 3\,(2^w - 1) = 3 \cdot 2^w. \qquad (5)$$
From (5), we have $C^{(j+1)} \le 3$. By induction, $C^{(j)} \le 3$ is ensured for any $0 \le j \le e-1$. Additionally, based on the fact that $S < 2M$, we have $C^{(e)} \le 1$.
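As a sanity check on this bound, the following Python sketch (our own illustration; the variable names are ours) exhaustively evaluates the word-level step (5) for a small word size and confirms that a carry of at most 3 can never produce a carry larger than 3.

```python
def word_recurrence(C, x_i, Y_j, q_i, M_j, S_j, w):
    """One word-level step: (C_next, S_j_new) = C + x_i*Y_j + q_i*M_j + S_j,
    where S_j_new is the low w bits and C_next is the carry."""
    total = C + x_i * Y_j + q_i * M_j + S_j
    return total >> w, total & ((1 << w) - 1)

# Exhaustive check for a small word size: starting from C <= 3,
# the next carry is again at most 3, as argued in (5).
w = 4
top = (1 << w) - 1
max_carry = max(word_recurrence(C, x, Y, q, M, S, w)[0]
                for C in range(4)
                for x in (0, 1) for q in (0, 1)
                for Y in (0, top) for M in (0, top) for S in (0, top))
print(max_carry)   # 3
```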
The data dependency graph of the hardware implemen-
tation for the MWR2MM algorithm by Tenca and Koç is
shown in Fig. 1. Each circle in the graph represents an
atomic computation and is labeled according to the type of
action performed. Task A consists of computing lines 2.3
and 2.4 in Algorithm 2. Task B corresponds to computing
lines 2.6 and 2.7 in Algorithm 2.
The data dependencies among the operations within the j loop make it impossible to execute the steps of a single iteration of the j loop in parallel. However, parallelism is possible among executions of different iterations of the i loop.
In [4], Tenca and Koç suggested that each column in the
graph may be computed by a separate processing element
(PE), and the data generated from one PE may be passed
into another PE in a pipelined fashion. Following this
method, all atomic computations represented by circles in
the same row can be processed concurrently. The processing of each column takes $e + 1$ clock cycles (one clock cycle for Task A, $e$ clock cycles for Task B). Because there is a delay of two clock cycles between the processing of a column for $x_i$ and the processing of a column for $x_{i+1}$, the minimum computation time $T$ (in clock cycles) is $T = 2n + e - 1$, given that $P_{max} = \lceil (e+1)/2 \rceil$ PEs are implemented to work in parallel. In this configuration, after $e + 1$ clock cycles, PE #0 switches from executing column 0 to executing column $P_{max}$. After another two clock cycles, PE #1 switches from executing column 1 to executing column $P_{max} + 1$, and so on.
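To make these formulas concrete, a short Python sketch (our own illustration, with hypothetical parameter values) computes the word count, PE count, and latency of this organization.

```python
import math

def mwr2mm_schedule(n, w):
    """Latency and PE count of the original pipelined MWR2MM organization:
    each column takes e+1 cycles, and columns start two cycles apart."""
    e = math.ceil((n + 1) / w)          # words per operand
    p_max = math.ceil((e + 1) / 2)      # PEs needed for full utilization
    t = 2 * n + e - 1                   # minimum computation time in clock cycles
    return e, p_max, t

# Example: 1,024-bit operands with 32-bit words.
e, p_max, t = mwr2mm_schedule(1024, 32)
print(e, p_max, t)   # 33 words, 17 PEs, 2080 clock cycles
```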
The opportunity to improve the performance of an implementation of Algorithm 2 lies in reducing the delay between the processing of two subsequent iterations of the i loop from two clock cycles to one. The two-clock-cycle delay comes from the right shift (division by two) in both Algorithms 1 and 2. Take the first two PEs in Fig. 1 as an example. These two PEs compute the S words in the first two columns. Starting from clock #0, PE #1 has to wait for two clock cycles before it starts the computation of $S^{(0)}$ $(i = 1)$ in clock cycle #2.
In order to reduce the two-clock-cycle delay by half, we propose an approach that precomputes the partial results under two possible assumptions regarding the most significant bit of the previous word. As shown in Fig. 2, PE #1 can take the $w-1$ most significant bits of $S^{(0)}$ $(i = 0)$ from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of $S^{(0)}$ $(i = 1)$, based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of clock cycle #2, the previously missing bit becomes available as the least significant bit of $S^{(1)}$ $(i = 0)$. This bit can be used to choose between the two precomputed versions of $S^{(0)}$ $(i = 1)$. Similarly, in clock cycle #2, two different versions of $S^{(0)}$ $(i = 2)$ and $S^{(1)}$ $(i = 1)$ are computed by PE #2 and PE #1, respectively, based on two different assumptions about the most significant bits of these words at the start of computations. At the beginning of clock cycle #3, the previously missing bits become available as the least significant bits of $S^{(1)}$ $(i = 1)$ and $S^{(2)}$ $(i = 0)$, respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles. Furthermore, since $e$ words are enough to represent the values in $S$, $S^{(e)}$ is discarded in our designs. Therefore, $e$ clock cycles are required to compute one iteration of $S$.
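The following Python sketch (our own illustration; the function names are hypothetical, and a hardware PE would realize the final selection with a multiplexer) captures the speculate-and-select idea: both candidate results are computed from the $w-1$ available bits, and the late-arriving bit picks the correct one.

```python
def speculative_word_step(C, x_i, q_i, Y_j, M_j, S_j_high, w):
    """Compute both candidate results of one word-level step before the most
    significant bit of the shifted partial-sum word is known.  S_j_high holds
    the w-1 most significant bits of S^(j) from the previous i iteration,
    already shifted right by one position."""
    candidates = {}
    for msb in (0, 1):                          # speculate on the missing bit
        S_j = (msb << (w - 1)) | S_j_high       # reconstructed shifted word
        total = C + x_i * Y_j + q_i * M_j + S_j
        candidates[msb] = (total >> w, total & ((1 << w) - 1))
    return candidates

# One clock cycle later, the least significant bit of S^(j+1) from the previous
# iteration arrives and selects the matching precomputed result.
cands = speculative_word_step(C=0, x_i=1, q_i=1, Y_j=0b1010, M_j=0b0111,
                              S_j_high=0b011, w=4)
carry, S_word = cands[1]                        # suppose the late bit is 1
print(carry, S_word)
```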
The proposed optimization technique can be applied to both the nonredundant and the redundant representation of the partial sum $S$, as demonstrated in Fig. 3. It is logically
Fig. 1. Data dependency graph of the original architecture of the MWR2MM algorithm [4].