Towards Efficient Kyber on FPGAs: A Processor for Vector of Polynomials
Zhaohui Chen
∗
Yuan Ma
†
Tianyu Chen
School of Computer Science State Key Laboratory of State Key Laboratory of
and Technology Information Security Information Security
University of Chinese Institute of Information Institute of Information
Academy of Sciences Engineering, CAS Engineering, CAS
Beijing, China 100049 Beijing, China 100095 Beijing, China 100095
chenzhaohui17@mails.ucas.ac.cn mayuan@iie.ac.cn chentianyu@iie.ac.cn
Jingqiang Lin Jiwu Jing
State Key Laboratory of Information Security School of Computer Science and Technology
Institute of Information Engineering, CAS University of Chinese Academy of Sciences
Beijing, China 100095 Beijing, China 100049
linjingqiang@iie.ac.cn jwjing@ucas.ac.cn
Abstract—Kyber is a promising candidate in post-quantum
cryptography standardization process. In this paper, we propose
a targeted optimization strategy and implement a processor for
Kyber on FPGAs. By merging the operations, we cut off 29.4%
clock cycles for Kyber512 and 33.3% for Kyber1024 compared
with the textbook implementations. We utilize Gentlemen-Sande
(GS) butterfly to optimize the Number-Theoretic Transform
(NTT) implementation. The bottleneck of memory access is
broken taking advantage of a dual-column sequential scheme.
We further propose a pipeline architecture for better perfor-
mance. The optimizations help the processor achieve 31684 NTT
operations per second using only 477 LUTs, 237 FFs and 1 DSP.
Our strategy is at least 3x more efficient than the state-of-the-art
module for NTT with a similar security level.
I. INTRODUCTION
Public key cryptography based on large integer factoring
and discrete logarithm problem is widely used in digital sig-
nature, electronic authentication and TLS/SSL key exchange,
etc. Quantum computers would completely break these cryp-
tosystems with Shor’s algorithm [1]. To seek for appropriate
substitutes, the National Institute of Standards and Technology
(NIST) called for post-quantum public-key encryption, key
encapsulation mechanism and digital signature schemes in
2017. Interest in lattice-based cryptography has increased due
to the quantum-resistant properties and the potential for high-
speed implementation with relatively small key and ciphertext
size [2], [3].
Regev [4], [5] introduced Learning With Errors (LWE)
problem supported by a theoretical proof of security. However,
a large parameter matrix A limits its efficiency. Lyuba-
shevsky [6] et al. proposed Ring-Learning With Errors (Ring-
LWE) over polynomial ring Z
q
[X]/ (X
n
+1) to avoid the
large parameter. Although Ring-LWE is more practical than
the standard LWE, its algebraic structure might enable threat-
ening attacks [7]. Module-Learning With Errors (Module-
∗
Also with State Key Laboratory of Information Security, Institute of
Information Engineering, CAS, Beijing, China 100095.
†
This author is the corresponding author.
LWE) hardness assumption proposed in [8] provides a trade-
off between security and efficiency with a scalable vec-
tor of polynomials (polyvec) structure. As a competitive
instance, Kyber [9] algorithm fixes the polynomial ring
as Z
7681
[X]/
X
256
+1
. Thus, as the most computation-
ally intensive operation, multiplication over k-dimensional
polyvec can be optimized with a linear algorithm named NTT,
which can reduce computational complexity from O(n
2
) to
O(n log n) [10], [11].
There has been increased interest in implementing Ring-
LWE on FPGAs due to the potential towards high-performance
and compact application scenarios [12]–[17]. A relatively new
Ring-LWE-based scheme, known as NewHope [18], together
with its variant NewHope − Simple [19] were implemented
on Artix7 FPGAs with targeted optimization in [16], [17].
However, we have not found any detailed optimization for
Module-LWE-based schemes on FPGAs so far. Among the
existing works, the high-efficiency implementations like [14]
initialize several processing elements in parallel thus more
arithmetic and memory instances are required. On the other
hand, the NTT algorithm and memory access takes a lot of
clock cycles for compact processors like [12], [13]. Thus,
implementing a both time and area efficient processor is still
a hard work.
In this paper, we design and implement an FPGA-based
processor for operations over polyvec with a good trade-off
between area and performance. The efficiency of lattice-based
schemes makes a significant improvement compared with the
previous hardware implementations. Our contributions are as
follows:
1) We optimize the NTT algorithm with GS butterfly. The
GS butterfly is used both in forward and inverse NTT in
order to utilize the internal DSP adders. The optimization
reduces a total of 29.4% clock cycles for Kyber512
and 33.3% for Kyber1024 compared with the textbook
implementations.
2) We develop dual-column sequential storage and bit-
reversed address accessing. These techniques keep the
datapath free of bubble and avoid redundant latency
978-1-7281-4123-7/20/$31.00
c
2020 IEEE