GPU Acceleration Optimization: Parallel Implementation Strategies for the ChaCha20 Stream Cipher

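"A hybrid CPU/GPU Scheme for Optimizing ChaCha20 Stream Cipher" presents a hybrid CPU/GPU scheme for optimizing ChaCha20, a stream cipher used for the secure transmission of large-scale data. In the widely recognized TLSv1.3 security protocol, the only algorithms that support large-scale data encryption and decryption are ChaCha20 and the Advanced Encryption Standard (AES). Although AES is more widely used, ChaCha20 has speed and security advantages on many platforms and, according to the paper, performs better against post-quantum attacks. For CPU/GPU platforms, however, no prior work has fully described how to apply ChaCha20 effectively.

The paper's contribution is an optimization strategy that improves the performance of ChaCha20 on CPU/GPU platforms. On the CPU, it provides a parallel implementation that is more efficient than the OpenSSL implementation. On a single GPU, the authors' ChaCha20 implementation reaches a peak throughput of 211.41 GB/s.

The paper is suited to the following readers:
1. GPU programming learners: the paper shows how to exploit the GPU's parallel computing capability, including key techniques such as the PTX instruction set and memory access optimization, and serves as a practical case study.
2. MPI/OpenMP programmers: the paper covers multi-core and multi-node programming on the CPU, and how to coordinate parallel computation between CPU and GPU.
3. Cryptography enthusiasts: the paper gives background on ChaCha20, its working principle, and its security properties.
4. Symmetric encryption researchers: ChaCha20 is an emerging and widely deployed cipher, so the paper is a useful reference for anyone seeking high performance on platforms without AES hardware acceleration.

Beyond academic research, the paper can also serve as study material for GPU programming, cryptography, parallel computing, and the ChaCha20 algorithm. For further information, readers can contact the paper's first author, either by private message or by email.

When reading the paper, pay attention to the following points:
1. The design and implementation details of the CPU and GPU parallelization schemes.
2. The performance comparison of ChaCha20 across different hardware environments.
3. How GPU acceleration is used to optimize encryption and decryption.
4. How parallel programming techniques such as PTX and OpenMP are applied to cryptography.
5. How the efficiency and security of a parallelization scheme are evaluated.

The paper offers useful insight into understanding and optimizing ChaCha20 on modern computing platforms, and is a valuable reference for researchers and students in cryptography and parallel computing.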

A hybrid CPU/GPU Scheme for Optimizing ChaCha20 Stream Cipher

Ziheng Wang, Heng Chen*, Weiling Cai

School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China

wzh009888@stu.xjtu.edu.cn, hengchen@xjtu.edu.cn, withinmiaov@stu.xjtu.edu.cn

Abstract—The secure transmission of large-scale data has attracted more and more attention. In the widely recognized security protocol TLSv1.3, the only algorithms that support large-scale data en-/decryption are ChaCha20 and the Advanced Encryption Standard (AES). Although AES has a higher usage rate, ChaCha20 still has the advantage of speed and security on many platforms, and performs better against post-quantum attacks. However, for a CPU/GPU platform, in contrast to AES, no work has fully described an application scheme for ChaCha. This paper proposes a scheme to optimize the performance of the ChaCha20 algorithm on a CPU/GPU platform. On a CPU platform, we provide a parallel implementation that outperforms that of OpenSSL. On a single GPU, our implementation of ChaCha20 achieves a peak throughput of 211.41 GB/s, which is better than any previous implementation of the ChaCha20 and AES algorithms on GPU. More importantly, we are the first to detail the optimization of ChaCha on GPU. When considering the interconnection between CPU and GPU, we reach 87.76% of the peak bidirectional bandwidth of a PCIe channel. Finally, we also provide a scheme for the application of ChaCha20 on a CPU/GPU platform.

Index Terms—ChaCha20, GPU, MPI

I. INTRODUCTION

With the development of the Internet, network security has attracted more and more attention. At the same time, the era of big data has made the amount of data increase rapidly. Whether on storage systems [1], web servers, or federated learning systems [2], the amount of data is growing at an accelerating rate and the demand for security is higher. It is becoming more common to transmit large amounts of encrypted data. In this case, it is urgent to improve the speed of en-/decryption.

In the widely recognized Transport Layer Security (TLS) v1.3 protocol [3], the only algorithms that support large-scale data encryption are ChaCha20 and the Advanced Encryption Standard (AES) [4]. Although AES has a higher usage rate, ChaCha20 still has the advantage of speed and security on many platforms [5], [6], and performs better against post-quantum attacks [7], [8]. However, on a CPU/GPU platform, in contrast to AES, application schemes for ChaCha have not been studied in depth. There is not even a work detailing the optimization of ChaCha on GPU.

* Corresponding author

This research has been supported by the China National Key R&D Program during the 13th Five-year Plan Period (Grant No. 2018YFB1700405).

ChaCha20 [9], a variant of Salsa20 [10], is a 256-bit stream cipher [11]. According to the report in [12], the ChaCha stream cipher is expected to remain secure over a 10-50 year lifetime. Salsa-related ciphers such as ChaCha can be used to replace RC4, which has been shown theoretically to be unsafe [13]. Besides, in recent years, ChaCha20 has been used as an alternative to the AES block cipher algorithm to increase en-/decryption speed [5]. From the perspective of en-/decryption efficiency, reference [14] describes in detail whether ChaCha20 or AES should be used in different scenarios.
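As a rough illustration of the add-rotate-XOR structure that ChaCha inherits from Salsa20, the following Python sketch implements the ChaCha quarter round as specified in RFC 8439 and checks it against the test vector given there. It is not taken from the paper; the helper names `rotl32` and `quarter_round` are chosen here purely for illustration.

```python
MASK32 = 0xFFFFFFFF  # ChaCha operates on 32-bit unsigned words


def rotl32(x: int, n: int) -> int:
    """Rotate the 32-bit word x left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int):
    """One ChaCha quarter round: four add / XOR / rotate steps (RFC 8439, sec. 2.1)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Test vector from RFC 8439, section 2.1.1.
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(w) for w in result])
# -> ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

The round uses only 32-bit additions, XORs, and fixed-distance rotations, with no table lookups, which is one reason the cipher maps well onto platforms without dedicated AES instructions.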

This paper proposes several methods to optimize the performance of the ChaCha stream cipher on a CPU/GPU platform. In order to apply the ChaCha algorithm in practice, we also provide a scheme on how to use it more effectively on some platforms. Specifically, this paper makes the following contributions:

• On multi-core CPUs, we use the Message Passing Interface (MPI) and provide a parallel implementation of ChaCha20 that outperforms that of OpenSSL and any previous implementation of ChaCha20.

• We are the first to optimize ChaCha by utilizing its intrinsic characteristics. Specifically, ChaCha’s block function is accelerated with a combination of encryption granularity, coalesced memory access, branch operations (rather than the index method), and inline Parallel Thread Execution (PTX) techniques. On a single GPU, our implementation of ChaCha20 achieves a peak throughput of 211.41 GB/s, which is better than any previous implementation of ChaCha20 and AES on GPU.

• When considering the interconnection between CPU and GPU, we use a multi-copy technique and achieve 87.56% of the peak bidirectional bandwidth of a PCIe channel. Similarly, the throughput in this case is better than that of any previous effort.

• We provide a scheme for using ChaCha20 on storage systems, web servers, and federated learning systems, so that the ChaCha20 stream cipher can achieve higher en-/decryption efficiency on CPU/GPU platforms.
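The CPU and GPU parallelization listed in the contributions rests on a property of ChaCha20: each 64-byte keystream block depends only on the key, the nonce, and a block counter, so blocks can be computed independently and in any order. The Python sketch below illustrates that decomposition with a thread pool. It assumes the RFC 8439 state layout (32-bit counter, 96-bit nonce), uses hypothetical helper names, and is not the authors' MPI or CUDA implementation. CPython threads will not show a real speedup here; the point is only that chunked output matches sequential output.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

MASK32 = 0xFFFFFFFF
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)  # "expand 32-byte k"


def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """In-place ChaCha quarter round on state words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """Return one 64-byte keystream block (assumed RFC 8439 layout)."""
    init = (list(CONSTANTS) + list(struct.unpack("<8I", key))
            + [counter & MASK32] + list(struct.unpack("<3I", nonce)))
    s = init[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12); quarter_round(s, 1, 5, 9, 13)    # columns
        quarter_round(s, 2, 6, 10, 14); quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15); quarter_round(s, 1, 6, 11, 12)  # diagonals
        quarter_round(s, 2, 7, 8, 13); quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *[(x + y) & MASK32 for x, y in zip(s, init)])


def xor_chunk(key, nonce, first_counter, chunk):
    """Sequentially XOR a chunk with keystream blocks starting at first_counter."""
    out = bytearray()
    for off in range(0, len(chunk), 64):
        ks = chacha20_block(key, first_counter + off // 64, nonce)
        out.extend(x ^ k for x, k in zip(chunk[off:off + 64], ks))
    return bytes(out)


def chacha20_xor_parallel(key, nonce, data, workers=4, blocks_per_chunk=16, counter=1):
    """Split data into chunks of whole blocks; each worker gets its own counter range."""
    step = 64 * blocks_per_chunk
    jobs = [(counter + off // 64, data[off:off + step])
            for off in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda j: xor_chunk(key, nonce, j[0], j[1]), jobs))
    return b"".join(parts)


key, nonce = bytes(range(32)), bytes(12)
message = b"large-scale data " * 200
ciphertext = chacha20_xor_parallel(key, nonce, message)
# Chunked and sequential processing agree, and XORing again decrypts.
print(ciphertext == xor_chunk(key, nonce, 1, message))
print(chacha20_xor_parallel(key, nonce, ciphertext) == message)
```

The chunk size (`blocks_per_chunk`) plays the role of a work granularity knob: larger chunks mean fewer, heavier tasks per worker. How the paper actually chooses its encryption granularity on GPU is described in its later sections, not here.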

The rest of this paper is organized as follows. Section II introduces the ChaCha stream cipher and reviews related work. Section III describes our scheme for optimizing ChaCha20 on a CPU/GPU platform. Section IV evaluates our methods, and Section V concludes this paper.

2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00161
