A Hybrid CPU/GPU Scheme for Optimizing
ChaCha20 Stream Cipher
Ziheng Wang, Heng Chen*, Weiling Cai
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
wzh009888@stu.xjtu.edu.cn, hengchen@xjtu.edu.cn, withinmiaov@stu.xjtu.edu.cn
Abstract—The secure transmission of large-scale data has
attracted growing attention. In the widely recognized
security protocol TLSv1.3, the only algorithms that support
large-scale data en-/decryption are ChaCha20 and the Advanced
Encryption Standard (AES). Although AES has a higher usage
rate, ChaCha20 still has the advantage of speed and security
on many platforms, and performs better against post-quantum
attacks. However, unlike for AES, no work has fully described
an application scheme for ChaCha on a CPU/GPU platform.
This paper proposes a scheme to optimize the performance of
the ChaCha20 algorithm on a CPU/GPU platform. On the CPU,
we provide a parallel implementation that outperforms that of
OpenSSL. On
a single GPU, our implementation of ChaCha20 achieves a peak
throughput of 211.41 GB/s, which is higher than that of any
previous GPU implementation of ChaCha20 or AES. More
importantly, we are the first to detail the optimization of ChaCha
on GPU. When the interconnection between CPU and GPU is taken
into account, we reach 87.76% of the peak bidirectional bandwidth
of a PCIe channel. Finally, we also provide a scheme for applying
ChaCha20 on a CPU/GPU platform.
Index Terms—ChaCha20, GPU, MPI
I. INTRODUCTION
With the development of the Internet, network security has
attracted growing attention. At the same time, the era of big
data is rapidly increasing the amount of data to be handled.
Whether on storage systems [1], web servers, or federated
learning systems [2], data volumes are growing ever faster
and the demand for security is rising. It is becoming more
common to transmit large amounts of encrypted data, so
improving the speed of en-/decryption is urgent.
In the widely recognized Transport Layer Security (TLS)
v1.3 protocol [3], the only algorithms that support large-scale
data encryption are ChaCha20 and the Advanced Encryption
Standard (AES) [4]. Although AES has a higher usage rate,
ChaCha20 still has the advantage of speed and security on
many platforms [5], [6], and performs better against
post-quantum attacks [7], [8]. However, unlike for AES, the
application scheme of ChaCha on a CPU/GPU platform has
not been described in depth. No existing work even details
the optimization of ChaCha on GPU.
ChaCha20 [9], a variant of Salsa20 [10], is a stream cipher
with a 256-bit key [11]. According to the report in [12], the
ChaCha stream cipher is expected to remain secure over a
10-50 year lifetime. Salsa-related ciphers such as ChaCha can
be used to replace RC4, which has been shown in theory to be
insecure [13]. Besides, in recent years, ChaCha20 has been
used as an alternative to the AES block cipher to increase
en-/decryption speed [5]. From the perspective of
en-/decryption efficiency, reference [14] describes in detail
whether ChaCha20 or AES should be used in different scenarios.
* Corresponding author
This research has been supported by the China National Key R&D Program
during the 13th Five-year Plan Period (Grant No. 2018YFB1700405).
This paper proposes several methods to optimize the perfor-
mance of the ChaCha stream cipher on a CPU/GPU platform.
To support practical use of the ChaCha algorithm, we also
provide a scheme for using it more effectively on
some platforms. Specifically, this paper makes the following
contributions:
• On multi-core CPUs, we use the Message Passing Interface
(MPI) to provide a parallel implementation of
ChaCha20 that outperforms that of OpenSSL and any
previous implementation of ChaCha20.
• We are the first to optimize ChaCha by utilizing its
intrinsic characteristics. Specifically, ChaCha’s block
function is accelerated with a combination of encryption
granularity, coalesced memory access, branch operations
(rather than an index method), and inline Parallel Thread
Execution (PTX) techniques. On a single GPU, our
implementation of ChaCha20 achieves a peak throughput
of 211.41 GB/s, which is higher than that of any previous
GPU implementation of ChaCha20 or AES.
• When the interconnection between CPU and GPU is taken
into account, we use a multi-copy technique and achieve
87.56% of the peak bidirectional bandwidth of a PCIe
channel. Similarly, the throughput in this case is higher
than that of any previous effort.
• We provide a scheme for using ChaCha20 on storage
systems, web servers, and federated learning systems, so
that the ChaCha20 stream cipher can achieve higher
en-/decryption efficiency on CPU/GPU platforms.
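The CPU and GPU parallelization listed above relies on a property of ChaCha20: each 64-byte keystream block depends only on the key, the nonce, and a block counter, so disjoint ranges of blocks can be processed independently. The following is a minimal Python sketch of that property, assuming the RFC 8439 state layout (32-bit block counter, 96-bit nonce). It is not the MPI or CUDA implementation of this paper; the helper names (`chacha20_block`, `chacha20_xor`, `chacha20_xor_chunked`) and the `workers` parameter are hypothetical and chosen only for illustration.

```python
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """ChaCha quarter round on four words of state s (add, xor, rotate)."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """One 64-byte keystream block (assumed RFC 8439 layout)."""
    state = (list(struct.unpack("<4I", b"expand 32-byte k"))
             + list(struct.unpack("<8I", key))
             + [counter & MASK32]
             + list(struct.unpack("<3I", nonce)))
    w = state[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(w, 0, 4, 8, 12)   # column rounds
        quarter_round(w, 1, 5, 9, 13)
        quarter_round(w, 2, 6, 10, 14)
        quarter_round(w, 3, 7, 11, 15)
        quarter_round(w, 0, 5, 10, 15)  # diagonal rounds
        quarter_round(w, 1, 6, 11, 12)
        quarter_round(w, 2, 7, 8, 13)
        quarter_round(w, 3, 4, 9, 14)
    out = [(w[i] + state[i]) & MASK32 for i in range(16)]
    return struct.pack("<16I", *out)


def chacha20_xor(key, nonce, data, first_counter=1):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for i in range(0, len(data), 64):
        ks = chacha20_block(key, first_counter + i // 64, nonce)
        out += bytes(x ^ y for x, y in zip(data[i:i + 64], ks))
    return bytes(out)


def chacha20_xor_chunked(key, nonce, data, workers, first_counter=1):
    """Split data into block-aligned chunks, one per (hypothetical) worker.

    Each chunk starts at its own counter value, so chunks do not depend
    on each other. They are processed sequentially here for simplicity.
    """
    n_blocks = (len(data) + 63) // 64
    per = (n_blocks + workers - 1) // workers  # blocks per worker
    parts = []
    for w in range(workers):
        start = w * per * 64
        chunk = data[start:start + per * 64]
        parts.append(chacha20_xor(key, nonce, chunk, first_counter + w * per))
    return b"".join(parts)
```

In an MPI or GPU setting, each rank or thread would handle one chunk of the loop in `chacha20_xor_chunked`; the sketch runs the chunks one after another only to show that the chunked result matches single-stream encryption.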
The rest of this paper is organized as follows. Section II
introduces the ChaCha stream cipher and reviews the related
work. Section III describes our scheme for optimizing ChaCha20
on a CPU/GPU platform. Section IV evaluates our methods, and
Section V concludes this paper.
2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable
Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
978-0-7381-2646-3/21/$31.00 ©2021 IEEE
DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00161