Hardware Security and Trust: Design and Deployment of Integrated Circuits in a Threat Environment


1 AES Datapaths on FPGAs: A State of the Art Analysis 7

Fig. 1.4 The SRL16 (previous Xilinx FPGAs) and SRL32 (current Xilinx FPGAs) LUT modes

typically not requiring any additional functional logic components. This specific routing is performed when mapping, placing, and routing the structure onto the FPGA. However, ShiftRows and InvShiftRows (used in encryption and decryption, respectively) have opposite shifting directions. Thus the routing path of each operation cannot be shared.
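As a rough illustration of why the two routing patterns cannot be shared, the sketch below models both operations as fixed byte wirings. It assumes the column-major State indexing of FIPS 197 (index = 4 × column + row); the function names are hypothetical and not taken from the chapter.

```python
# State bytes are indexed column-major, as in FIPS 197: index = 4 * column + row.

def shift_rows_wiring():
    """Source index feeding each output byte for ShiftRows (row r rotates left by r)."""
    return [4 * ((c + r) % 4) + r for c in range(4) for r in range(4)]

def inv_shift_rows_wiring():
    """Source index feeding each output byte for InvShiftRows (row r rotates right by r)."""
    return [4 * ((c - r) % 4) + r for c in range(4) for r in range(4)]

def route(state, wiring):
    """Apply a fixed wiring: no logic, only a re-ordering of the 16 bytes."""
    return [state[src] for src in wiring]

state = list(range(16))
shifted = route(state, shift_rows_wiring())
restored = route(shifted, inv_shift_rows_wiring())
```

Since the two wiring lists differ for rows 1 and 3, a design that supports both modes has to select between them, for example with a multiplexer per affected byte.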

Performing the (Inv)ShiftRows operation through routing is often the preferred choice in proposed 128-bit datapaths, such as those of Bulens et al. [2] and Liu et al. [17]. However, this implies that a particular implementation can only handle one ciphering mode. With this approach, two AES cores need to be deployed to support both encryption and decryption, as is done in the HELION Standard and HELION Fast AES cores [13]. In order to support both encryption and decryption on a single AES design, both routing options need to coexist. If properly designed, and given the similarity of the remaining computations, only minimal multiplexing logic is needed, as presented in Chaves et al. [4].

In smaller datapaths, of 32-bit and 8-bit widths, performing the (Inv)ShiftRows through routing is not viable, since the 16 bytes of the State are not available at the same time. The predominant state of the art solution for the (Inv)ShiftRows in compact FPGA structures is to use addressable memory, as introduced by Chodowiec and Gaj [5]. These authors show how a RAM can be used to temporarily store the State matrix between rounds, and to perform either the ShiftRows or the InvShiftRows by properly addressing the write and read operations of the consecutive 32-bit columns, or 8-bit cells, of the State [8, 11]. The authors further optimize this byte shift operation by eliminating the need to specify the write address. This approach is optimized on Xilinx FPGAs using particular LUTs. On these devices, several LUTs have an operational mode called SRL32 (SRL16 in older versions). This mode allows a single LUT to work as a 32-bit deep shift register with an addressable read port, resulting in improved resource usage efficiency, as depicted in Fig. 1.4. This approach can be found in 32-bit [5, 20, 23] and 8-bit [6, 25] AES designs.
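The addressing idea can be sketched with a small behavioral model, assuming an 8-bit datapath in which the State bytes arrive in column-major order and are written sequentially. The class and function names are hypothetical; a real SRL-based design addresses entries relative to the most recently shifted-in value and interleaves reads with writes, which this sketch does not attempt to model.

```python
def read_addresses(decrypt=False):
    """Read-address sequence for a 16-byte State buffer written in arrival order."""
    sign = -1 if decrypt else 1
    return [4 * ((c + sign * r) % 4) + r for c in range(4) for r in range(4)]

class StateBuffer:
    """Behavioral model of a small addressable memory holding one State."""

    def __init__(self):
        self.cells = []

    def write(self, byte):
        # Sequential write: no write address has to be supplied.
        self.cells.append(byte)

    def read(self, address):
        return self.cells[address]

def shift_through_memory(state_bytes, decrypt=False):
    """Write the State in order, then read it back in (Inv)ShiftRows order."""
    buf = StateBuffer()
    for b in state_bytes:
        buf.write(b)
    return [buf.read(a) for a in read_addresses(decrypt)]

encrypted_order = shift_through_memory(list(range(16)))
decrypted_order = shift_through_memory(encrypted_order, decrypt=True)
```

Only the read-address generator changes between the two ciphering modes, which is what makes the memory-based option attractive for compact dual-mode designs.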

8 J.C. Resende and R. Chaves

1.3.3 (Inv)SubBytes Implementations: Logic Versus Memory

Another major point of differentiation among state of the art implementations is the byte substitution operation. Implementations vary from fine-grained byte substitution (logic-based) [6, 14, 26] to more coarse-grained ones using lookup tables (memory-based) [2, 17].

Logic-based structures implement the byte substitution operations by hard-wiring their actual mathematical definition (Sect. 1.2.1) through logic components. If one recalls Eq. (1.1), the SubBytes substitution requires five XOR operations for each bit, but first the multiplicative inverse of the input byte, in the GF(2^8) finite field, needs to be calculated. The problem with the multiplicative inverse is that there is no direct function to calculate it. It can be calculated through the Extended Euclidean Algorithm, but this solution is better suited to software than to hardware [7]. Another approach to computing this multiplicative inverse, more oriented to hardware implementations, is to use Composite Fields [24, 26]. Within logic-based SubBytes implementations, different subsets of Composite Fields can be faster, more compact, or allow for additional security features, when compared with other subsets [3, 18, 22, 26]. The logic-based solution for the InvSubBytes computation is similar to that of SubBytes, but modifications are still needed.
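To make the two steps concrete, here is a minimal software sketch of SubBytes computed from its definition: a multiplicative inverse in GF(2^8) followed by the affine transformation of FIPS 197. The inverse is obtained here by plain exponentiation (a^254), which is neither the Extended Euclidean Algorithm nor a Composite Field decomposition; it is chosen only because it is short. All function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inverse(a):
    """Multiplicative inverse as a^254; the input 0 maps to 0, as AES requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def affine(x):
    """Affine step: each output bit is the XOR of five input bits and one constant bit."""
    out = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

def sub_byte(x):
    """SubBytes for one byte, computed from the mathematical definition."""
    return affine(gf_inverse(x))
```

The inverse dominates the cost: the affine step is a handful of XORs per bit, whereas the exponentiation above needs hundreds of field multiplications, which is why hardware designs look for cheaper inversion structures.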

Overall, logic-based SubBytes implementations are the most area efficient but also the slowest approaches, when compared to memory-based solutions. In a memory-based SubBytes, byte substitution is implemented using a 256-byte lookup SBox table [5, 7, 19]. On FPGAs this can be implemented through the use of multiple FPGA LUTs [2, 17], or even BRAMs [5, 10]. Memory-based approaches can lead to faster circuits at the cost of memory blocks.

On ASIC technology, the decision of using either logic-based or memory-based SubBytes should be carefully analyzed [15]. However, on FPGAs, logic-based implementations have been losing relevance in comparison to their memory-based counterparts, mainly due to technology improvements. On older or more economical FPGAs, one FPGA LUT can only be configured as a 4-input arbitrary function, with two LUTs per FPGA Slice. On more high-end FPGAs, such as the Xilinx Virtex 5 and later technologies, each Slice contains four 6-input LUTs that can be easily combined into a single 8-input lookup table (the exact specification of the AES SBox) with a relatively low latency. If both SubBytes and InvSubBytes operations need to be deployed, either a 9-bit lookup table needs to be considered, or two 8-bit lookup tables need to be multiplexed.
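The two dual-mode options can be compared with a small sketch. It assumes the extra ninth address bit selects the ciphering mode, which is one natural reading of the "9-bit lookup table"; the table names and the `substitute` helpers are hypothetical. The S-box contents are generated from the AES definition rather than typed in.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def sbox_entry(x):
    """One S-box entry: brute-force inverse, then the AES affine transformation."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x

# Option 1: a single 512-entry table; address bit 8 selects decryption (assumed encoding).
COMBINED = SBOX + INV_SBOX

def substitute(byte, decrypt):
    return COMBINED[(int(decrypt) << 8) | byte]

# Option 2: two 256-entry tables followed by a 2:1 multiplexer.
def substitute_muxed(byte, decrypt):
    return INV_SBOX[byte] if decrypt else SBOX[byte]
```

Both options store the same 512 bytes; they differ only in whether the mode selection is folded into the table address or applied after the lookup.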

Another easily accessible solution is the use of the embedded dual-port memory blocks, BRAMs, that exist within the FPGA. These memory blocks can easily store the 2 Kbit (256 × 8 bits) needed for each byte substitution operation.

Implementations that only allow for one ciphering mode often consider the use of LUT-based SBoxes, for shorter clock latency (512 LUTs for 128-bit datapaths [2, 17] and 32 LUTs for 8-bit datapaths [25]). Architectures that allow for both ciphering modes often incorporate pipelined BRAM-based implementations, since they can easily store all tables in their larger memories (8 BRAMs for 128-bit datapaths [10] and two BRAMs for 32-bit datapaths [5]).
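The LUT figures quoted above follow from simple arithmetic, sketched below under two assumptions: one SBox per datapath byte, and 6-input LUTs whose combining multiplexers are not counted as LUTs. The helper names are hypothetical. The sketch also shows how an 8-input, one-bit function splits into four 64-entry LUT contents selected by the two upper address bits.

```python
def split_into_lut6(truth_table):
    """Split a 256-entry one-bit truth table into four 64-entry LUT6 contents."""
    return [truth_table[64 * k: 64 * (k + 1)] for k in range(4)]

def eval_lut8(luts, address):
    """Low six address bits index every LUT6; the top two bits drive a 4:1 multiplexer."""
    return luts[address >> 6][address & 0x3F]

def sbox_lut_count(datapath_bits, lut_inputs=6):
    """LUTs needed for the SBoxes of a datapath, one 8-bit SBox per datapath byte."""
    luts_per_output_bit = 2 ** 8 // 2 ** lut_inputs   # 4 for 6-input LUTs
    sboxes = datapath_bits // 8
    return sboxes * 8 * luts_per_output_bit

# Any 8-input function works for the decomposition; parity is used as a stand-in.
parity = [bin(x).count("1") & 1 for x in range(256)]
parity_luts = split_into_lut6(parity)
```

With these assumptions, `sbox_lut_count(128)` gives 512 and `sbox_lut_count(8)` gives 32, matching the counts cited for the 128-bit and 8-bit datapaths.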

1.3.4 Implementing the MixColumns: Logic

After the SubBytes and ShiftRows operations, in the encryption mode, the MixColumns operation is computed by performing a matrix multiplication in GF(2^8). In this operation each 32-bit State column is multiplied by the left matrix of Eq. (1.2), depicting the multiplication coefficients. Similarly to the SubBytes operation, the MixColumns can also be implemented using logic or lookup tables.

In the MixColumns operation each byte is multiplied by a set of four constants ({03}, {02}, {01}, and {01} in the case of encryption). As described in Sect. 1.2.3, the multiplication by 2, in GF(2^8), can be computed by shifting the input value once to the left. If the resulting 9th bit is ‘1’, the entire result has to be bitwise XORed (subtraction in GF(2^8)) with ‘0x11B’, in order to perform the modular reduction. The multiplication by 3 can be achieved by adding the multiplications by 1 (the input value itself) and by 2 (with the addition in GF(2^8) being performed by a bitwise XOR).
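A minimal sketch of these two multiplications follows; the names `xtime` and `times3` are illustrative (the former is borrowed from FIPS 197 terminology, not from this chapter).

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8)."""
    shifted = a << 1          # 9-bit intermediate value
    if shifted & 0x100:       # 9th bit set: reduce modulo 0x11B
        shifted ^= 0x11B
    return shifted

def times3(a):
    """Multiply a byte by {03}: ({02} x a) XOR ({01} x a)."""
    return xtime(a) ^ a
```

For example, `xtime(0x57)` yields `0xAE` without any reduction, while `xtime(0xAE)` overflows into the 9th bit and reduces to `0x47`.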

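To conclude the MixColumns matrix multiplication, the multiplied values are added in GF(2^8) by a XOR tree, as

$$
\begin{aligned}
&02 \times a_0 \oplus 03 \times a_1 \oplus 01 \times a_2 \oplus 01 \times a_3 \\
&01 \times a_0 \oplus 02 \times a_1 \oplus 03 \times a_2 \oplus 01 \times a_3 \\
&01 \times a_0 \oplus 01 \times a_1 \oplus 02 \times a_2 \oplus 03 \times a_3 \\
&03 \times a_0 \oplus 01 \times a_1 \oplus 01 \times a_2 \oplus 02 \times a_3
\end{aligned}
\tag{1.3}
$$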

Overall, in a logic-based MixColumns operation, the matrix coefficient multiplications are relatively simple: they require, for each byte, one 1-bit shift, one 8-bit conditional XOR with the constant ‘0x1B’ to perform the modular reduction (computing ×02), and one 8-bit wide XOR to compute the addition (e.g., ×03 = ×02 ⊕ ×01). Figure 1.5 illustrates the multiplication of the four coefficients, given one input byte.
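The complete logic-based computation for one State column can be sketched as below, using only the shift, conditional XOR, and XOR-tree steps described above. The function names are hypothetical, and the input bytes are assumed to be ordered top to bottom within the column.

```python
def xtime(a):
    """x02: one 1-bit shift plus a conditional 8-bit XOR with 0x1B."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted

def times3(a):
    """x03 = x02 XOR x01."""
    return xtime(a) ^ a

def mix_column(column):
    """Encryption MixColumns for one 4-byte State column."""
    a0, a1, a2, a3 = column
    return [
        xtime(a0) ^ times3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ times3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ times3(a3),
        times3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte, column-major State."""
    out = []
    for c in range(4):
        out.extend(mix_column(state[4 * c: 4 * c + 4]))
    return out
```

A commonly used check is the column `[0xDB, 0x13, 0x53, 0x45]`, which should map to `[0x8E, 0x4D, 0xA1, 0xBC]`.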

Fig. 1.5 Circuit example for the GF(2^8) encryption multiplication


On a 128-bit datapath, the MixColumns requires a total of 128 7-input functions, or

256 6-input FPGA LUTs. On FPGAs this operation can be performed with relatively

low latency, in comparison with the SubBytes stage, as suggested by [2, 5, 10, 17].

On 8-bit datapaths, a single State byte is provided in each clock cycle. As such, the resulting bytes cannot be completed in a single cycle, since each byte resulting from the MixColumns operation depends on four State bytes. Given this, for 8-bit datapaths, registered accumulation can be used. One such approach was first introduced by Hämäläinen et al. [12] for ASIC technology, and later adapted for FPGA by Chu and Benaissa [6]. The resulting structure is depicted in Fig. 1.6.

In this design, the input byte is shifted and XORed in order to obtain the 4 coefficient multiplications ({03; 01; 01; 02}). The resulting values are then XORed with zero in the first iteration and temporarily stored in four 8-bit registers. In the following cycles, each new input byte undergoes the same transformations but is XORed with the previously stored 4 bytes. After 4+1 cycles, one matrix multiplication for one State column is performed. After 16+1 cycles, the entirety of the MixColumns operation can be completed. The issue with this approach [6, 12] is that it requires a 32-bit parallel-to-serial converter, given the 8-bit datapath, as depicted at the bottom of Fig. 1.6.
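The accumulation can be illustrated with a behavioral sketch. The register arrangement below (a rotation of the four partial sums on every cycle) is an assumption chosen to be consistent with the description; it is not taken from Fig. 1.6, and the extra output cycle of the "4+1" count is not modeled. Function names are hypothetical.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8)."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted

def accumulate_column(column_bytes):
    """Byte-serial MixColumns for one column using four 8-bit accumulation registers."""
    regs = [0, 0, 0, 0]                    # cleared, so the first byte is XORed with zero
    for a in column_bytes:                 # one State byte per clock cycle
        products = [xtime(a), a, a, xtime(a) ^ a]   # x02, x01, x01, x03 (assumed order)
        # Each register takes its neighbour's partial sum plus one fixed product.
        regs = [regs[(k + 1) % 4] ^ products[k] for k in range(4)]
    # After four bytes, register k holds output byte (k + 3) mod 4.
    return regs[1:] + regs[:1]

result = accumulate_column([0xDB, 0x13, 0x53, 0x45])
```

Because the four results only become valid together after the fourth byte, they have to be serialized again for an 8-bit datapath, which is the role of the 32-bit parallel-to-serial converter mentioned above.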

Instead of performing the 4 coefficient multiplications in parallel, Sasdrich and Güneysu [25] proposed an 8-bit-only accumulative implementation that performs one coefficient multiplication per iteration, as illustrated in Fig. 1.7.

With this approach, a significant area reduction can be achieved by further folding the matrix multiplication and by not needing the parallel-to-serial converter. Additional resources can be saved by preloading a Round Key byte into the register, thus

Fig. 1.6 Chu and Benaissa [6] Accumulative MixColumns 8-by-32-by-8 bits
