In Algorithm 2, the operand Y (multiplicand) is scanned
word-by-word, and the operand X is scanned bit-by-bit.
The operand length is n bits, and the wordlength is w bits.
$e = \lceil (n+1)/w \rceil$ words are required to store $S$, since its range is $[0, 2M-1]$. The original $M$ and $Y$ are extended by one extra bit of 0 as the most significant bit. Presented as vectors,
$$M = \left(0, M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}\right),$$
$$Y = \left(0, Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}\right),$$
$$S = \left(0, S^{(e-1)}, \ldots, S^{(1)}, S^{(0)}\right),$$
and $X = (x_{n-1}, \ldots, x_1, x_0)$.
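For illustration only, the following Python sketch (the function names and toy parameters are our own, not part of the original design) shows one way to store an operand as $e = \lceil (n+1)/w \rceil$ words of $w$ bits and to form the vector notation used above.

```python
def to_words(value, n, w):
    """Split an operand of at most n+1 bits into e = ceil((n+1)/w) words of
    w bits each, least significant word first."""
    e = -(-(n + 1) // w)                  # ceiling division
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(e)]

# Toy example: n = 8, w = 4, so e = 3 words per operand.
M = 0b10110101
words = to_words(M, n=8, w=4)             # [M^(0), M^(1), M^(2)] = [0b0101, 0b1011, 0b0000]
vector = [0] + list(reversed(words))      # (0, M^(2), M^(1), M^(0)), as written in the text
print(words, vector)
```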
The carry variable $C^{(j)}$ has two bits, as explained below. Assuming $C^{(0)} = 0$, each subsequent value of $C^{(j+1)}$ is given by
$$\left(C^{(j+1)}, S^{(j)}\right) = C^{(j)} + x_i Y^{(j)} + q_i M^{(j)} + S^{(j)}.$$
Assuming that $C^{(j)} \le 3$, we obtain
$$\left(C^{(j+1)}, S^{(j)}\right) = C^{(j)} + x_i Y^{(j)} + q_i M^{(j)} + S^{(j)} \le 3 + 3\,(2^w - 1) = 3 \cdot 2^w. \qquad (5)$$
From (5), we have $C^{(j+1)} \le 3$. By induction, $C^{(j)} \le 3$ is ensured for any $0 \le j \le e-1$. Additionally, based on the fact that $S < 2M$, we have $C^{(e)} \le 1$.
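As a sanity check on this bound, the following Python sketch (our own illustration; the variable names are ours) exhaustively evaluates the word-level step (5) for a small word size and confirms that a carry of at most 3 can never produce a carry larger than 3.

```python
def word_recurrence(C, x_i, Y_j, q_i, M_j, S_j, w):
    """One word-level step: (C_next, S_j_new) = C + x_i*Y_j + q_i*M_j + S_j,
    where S_j_new is the low w bits and C_next is the carry."""
    total = C + x_i * Y_j + q_i * M_j + S_j
    return total >> w, total & ((1 << w) - 1)

# Exhaustive check for a small word size: starting from C <= 3,
# the next carry is again at most 3, as argued in (5).
w = 4
top = (1 << w) - 1
max_carry = max(word_recurrence(C, x, Y, q, M, S, w)[0]
                for C in range(4)
                for x in (0, 1) for q in (0, 1)
                for Y in (0, top) for M in (0, top) for S in (0, top))
print(max_carry)   # 3
```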
The data dependency graph of the hardware implemen-
tation for the MWR2MM algorithm by Tenca and Koç is
shown in Fig. 1. Each circle in the graph represents an
atomic computation and is labeled according to the type of
action performed. Task A consists of computing lines 2.3
and 2.4 in Algorithm 2. Task B corresponds to computing
lines 2.6 and 2.7 in Algorithm 2.
The data dependencies among the operations within the j loop make it impossible to execute the steps of a single iteration of the j loop in parallel. However, parallelism is possible among executions of different iterations of the i loop.
In [4], Tenca and Koç suggested that each column in the
graph may be computed by a separate processing element
(PE), and the data generated from one PE may be passed
into another PE in a pipelined fashion. Following this
method, all atomic computations represented by circles in
the same row can be processed concurrently. The processing of each column takes $e + 1$ clock cycles (one clock cycle for Task A, $e$ clock cycles for Task B). Because there is a delay of two clock cycles between the processing of a column for $x_i$ and the processing of a column for $x_{i+1}$, the minimum computation time $T$ (in clock cycles) is $T = 2n + e - 1$, given that $P_{max} = \lceil (e+1)/2 \rceil$ PEs are implemented to work in parallel. In this configuration, after $e + 1$ clock cycles, PE #0 switches from executing column 0 to executing column $P_{max}$. After another two clock cycles, PE #1 switches from executing column 1 to executing column $P_{max} + 1$, and so on.
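To make these formulas concrete, a short Python sketch (our own illustration, with hypothetical parameter values) computes the word count, PE count, and latency of this organization.

```python
import math

def mwr2mm_schedule(n, w):
    """Latency and PE count of the original pipelined MWR2MM organization:
    each column takes e+1 cycles, and columns start two cycles apart."""
    e = math.ceil((n + 1) / w)          # words per operand
    p_max = math.ceil((e + 1) / 2)      # PEs needed for full utilization
    t = 2 * n + e - 1                   # minimum computation time in clock cycles
    return e, p_max, t

# Example: 1,024-bit operands with 32-bit words.
e, p_max, t = mwr2mm_schedule(1024, 32)
print(e, p_max, t)   # 33 words, 17 PEs, 2080 clock cycles
```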
The opportunity to improve the performance of an implementation of Algorithm 2 lies in reducing the delay between the processing of two subsequent iterations of the i loop from two clock cycles to one. The two-clock-cycle delay comes from the right shift (division by two) in both Algorithms 1 and 2. Take the first two PEs in Fig. 1 as an example. These two PEs compute the S words in the first two columns. Starting from clock #0, PE #1 has to wait for two clock cycles before it starts the computation of $S^{(0)}$ $(i = 1)$ in clock cycle #2.
In order to reduce the two-clock-cycle delay by half, we propose an approach that precomputes the partial results under two possible assumptions regarding the most significant bit of the previous word. As shown in Fig. 2, PE #1 can take the $w-1$ most significant bits of $S^{(0)}$ $(i = 0)$ from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of $S^{(0)}$ $(i = 1)$, based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of clock cycle #2, the previously missing bit becomes available as the least significant bit of $S^{(1)}$ $(i = 0)$. This bit can be used to choose between the two precomputed versions of $S^{(0)}$ $(i = 1)$. Similarly, in clock cycle #2, two different versions of $S^{(0)}$ $(i = 2)$ and $S^{(1)}$ $(i = 1)$ are computed by PE #2 and PE #1, respectively, based on two different assumptions about the most significant bits of these words at the start of computations. At the beginning of clock cycle #3, the previously missing bits become available as the least significant bits of $S^{(1)}$ $(i = 1)$ and $S^{(2)}$ $(i = 0)$, respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles. Furthermore, since $e$ words are enough to represent the values in $S$, $S^{(e)}$ is discarded in our designs. Therefore, $e$ clock cycles are required to compute one iteration of $S$.
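The following Python sketch (our own illustration; the function names are hypothetical, and a hardware PE would realize the final selection with a multiplexer) captures the speculate-and-select idea: both candidate results are computed from the $w-1$ available bits, and the late-arriving bit picks the correct one.

```python
def speculative_word_step(C, x_i, q_i, Y_j, M_j, S_j_high, w):
    """Compute both candidate results of one word-level step before the most
    significant bit of the shifted partial-sum word is known.  S_j_high holds
    the w-1 most significant bits of S^(j) from the previous i iteration,
    already shifted right by one position."""
    candidates = {}
    for msb in (0, 1):                          # speculate on the missing bit
        S_j = (msb << (w - 1)) | S_j_high       # reconstructed shifted word
        total = C + x_i * Y_j + q_i * M_j + S_j
        candidates[msb] = (total >> w, total & ((1 << w) - 1))
    return candidates

# One clock cycle later, the least significant bit of S^(j+1) from the previous
# iteration arrives and selects the matching precomputed result.
cands = speculative_word_step(C=0, x_i=1, q_i=1, Y_j=0b1010, M_j=0b0111,
                              S_j_high=0b011, w=4)
carry, S_word = cands[1]                        # suppose the late bit is 1
print(carry, S_word)
```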
The proposed optimization technique can be applied to both the nonredundant and the redundant representation of the partial sum $S$, as demonstrated in Fig. 3. It is logically
Fig. 1. Data dependency graph of the original architecture of the MWR2MM algorithm [4].