GPU-ArraySort: A Parallel, In-Place Algorithm for Sorting Large Numbers of Arrays


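GPU-ArraySort is a technical report by Muaaz Awan and Fahad Saeed, published in fall 2016 in the Parallel Computing and Data Science Lab Technical Reports series at Western Michigan University. The report develops a parallel, in-place sorting algorithm for efficiently sorting a large number of arrays. "In-place" means the algorithm needs little additional storage while it runs, which matters when memory is limited or high throughput is the goal.

The core contribution is a sorting strategy that uses a graphics processing unit (GPU) for massively parallel computation. A GPU typically has thousands of processing cores and can execute many tasks concurrently, so distributing the data across those cores can substantially shorten sorting time. This is relevant to large-scale data processing such as machine learning, data analysis, and high-performance computing, where sorting is a common preprocessing step.

The algorithm's design takes GPU characteristics such as parallelism, locality, and scalability into account. The authors likely implemented it with a GPU programming model such as CUDA or OpenCL, which let developers write efficient code that runs on the GPU. To evaluate performance, the report may include experiments comparing GPU-ArraySort with traditional CPU sorting algorithms (such as quicksort and merge sort) on data of different sizes, along with the speedups obtained on different hardware. It may also analyze the algorithm's time and space complexity and discuss how to tune the algorithm for different hardware configurations. Potential application scenarios, such as distributed computing tasks in cloud environments or real-time data stream processing that requires fast sorting, may be covered as well.

GPU-ArraySort is a practical and competitive solution that shows how a GPU's parallel computing capability can improve the efficiency of sorting large numbers of arrays, which matters for current and future high-performance computing and data-intensive applications. Reading the report gives insight into how to use GPU resources effectively in practice and how to design algorithms that take full advantage of modern hardware.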

with maximum exploitation of shared memory. Their design utilizes an efficient GPU-based algorithm for calculating prefix sums. A GPU-based version of sample sort was presented in [6]; it is the first GPU-based study of the sample sort technique. The design consists of more involved techniques, such as the use of predication to avoid branch divergence and thereby exploit fine-grained parallelism. This, along with the introduction of a binary search tree structure for traversing the splitter elements, makes the algorithm a highly optimized GPU-based sorting technique.
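The idea of traversing splitters through a binary search tree without data-dependent branches can be sketched on the CPU as follows. This is only an illustration in Python, not the implementation from [6]: the function names `build_splitter_tree` and `bucket_of`, the implicit-array tree layout, and the power-of-two bucket count are all assumptions made for this sketch.

```python
def build_splitter_tree(splitters):
    """Lay out sorted splitters as an implicit binary search tree.

    Node j (1-based) has children 2*j and 2*j + 1; index 0 is unused.
    Assumption for this sketch: len(splitters) + 1 is a power of two.
    """
    k = len(splitters) + 1  # number of buckets
    assert k & (k - 1) == 0, "sketch assumes a power-of-two bucket count"
    tree = [None] * k

    def fill(node, lo, hi):
        # Place the median of splitters[lo:hi] at this node, then recurse.
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, len(splitters))
    return tree


def bucket_of(tree, x):
    """Find the bucket of x by walking the tree.

    The comparison result (0 or 1) is added to the index instead of
    choosing a path with if/else, which mirrors how predication avoids
    branch divergence between threads.
    """
    k = len(tree)
    j = 1
    while j < k:
        j = 2 * j + (x > tree[j])
    return j - k


splitters = [10, 20, 30]
tree = build_splitter_tree(splitters)
buckets = [bucket_of(tree, x) for x in (5, 10, 15, 25, 35)]
```

Every element performs the same number of steps regardless of its value, so threads that process different elements would follow the same instruction sequence.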

Also in 2009, a GPU-accelerated version of quicksort was introduced [19]. The design of GPU quicksort follows the conventional breakdown of a large list into smaller lists, such that each small list is about the size of the GPU's shared memory. This lets the algorithm make the most of the fastest memory on the GPU. The design also ensures that consecutive threads always access adjacent memory locations, so that global memory accesses are coalesced. The authors claim that their design was the fastest sorting algorithm available at that time.
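The breakdown of a large list into pieces that fit a fixed-size fast memory can be sketched sequentially as below. This is an assumed, simplified illustration in Python rather than the design from [19]: `capacity` stands in for the shared-memory size, the pivot choice is arbitrary, and the function names are hypothetical. Thread-level details such as coalesced access are not modeled.

```python
def partition_until_fits(values, capacity):
    """Split values around pivots until every chunk fits in `capacity`.

    Returns chunks in ascending order of their value ranges. A chunk of
    elements equal to a pivot may exceed `capacity`, but it is already
    sorted because all of its elements are identical.
    """
    if len(values) <= capacity:
        return [list(values)]
    pivot = values[len(values) // 2]  # arbitrary pivot choice for the sketch
    less = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    return (partition_until_fits(less, capacity)
            + [equal]
            + partition_until_fits(greater, capacity))


def chunked_quicksort(values, capacity=4):
    """Sort each small chunk independently and concatenate the results."""
    result = []
    for chunk in partition_until_fits(values, capacity):
        # On a GPU, each chunk could be sorted by one thread block
        # working entirely in shared memory.
        result.extend(sorted(chunk))
    return result


data = [5, 3, 8, 1, 9, 2, 7, 3, 3, 6]
sorted_data = chunked_quicksort(data, capacity=3)
```

Because the chunks cover disjoint, ordered value ranges, sorting them independently and concatenating yields a fully sorted list.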

Recently, an improvement to the Odd-Even Sorting algorithm [20] was introduced. It focuses on improving the efficiency of GPU-based Odd-Even Sorting, especially on making it more feasible for the CUDA programming model. Their results show considerable improvement over existing versions of the algorithm.
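For reference, the classic odd-even transposition sort that such work builds on can be sketched as follows. This is the textbook baseline written sequentially in Python, not the improved variant from [20]; the function name `odd_even_sort` is chosen here for illustration.

```python
def odd_even_sort(values):
    """Classic odd-even transposition sort (sequential sketch).

    Phases alternate between pairs starting at even indices and pairs
    starting at odd indices. Within one phase the pairs are disjoint,
    so on a GPU each pair could be handled by its own thread.
    """
    a = list(values)
    n = len(a)
    for phase in range(n):  # n phases are enough to sort n elements
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


result = odd_even_sort([4, 1, 3, 9, 7, 2])
```

The inner loop is the part that parallelizes: all compare-and-swap operations of a phase are independent of one another.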

All of the algorithms discussed above are dedicated sorting algorithms for a 1-dimensional list containing a large number of elements. To sort several thousand smaller arrays using existing algorithms, each array would have to be sorted one after the other, making the process sequential in nature. The 1-dimensional sorting algorithms that offer the option of performing a stable sort on the elements with respect to an array of keys can be employed to sort multiple arrays, using a makeshift methodology. NVIDIA's Thrust library offers one such option of sorting a given array with respect to an array of keys in a stable manner. Sorting a large number of arrays using this methodology is discussed in Section VII. We call this technique Sorting using Tagged Approach (STA). However, this technique is very inefficient: it performs a lot of redundant work and uses about three times more memory than is actually required. We present an algorithm dedicated to sorting a large number of arrays in parallel, utilizing the full potential of a GPU. Our algorithm is capable of sorting a much larger number of arrays in a much shorter time than the STA approach.
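One plausible reading of the tagged approach is sketched below in Python. The excerpt does not spell out the steps, so the details are assumptions: every element is tagged with the index of its array, the flattened data is sorted by value, and a second stable sort by tag regroups the elements per array. In Thrust, the corresponding stable key-value sort is `thrust::stable_sort_by_key`; here Python's built-in `sorted`, which is stable, plays that role. The function name `sort_arrays_tagged` is hypothetical.

```python
def sort_arrays_tagged(arrays):
    """Sort many arrays with two stable key-value sorts (assumed STA sketch)."""
    # Flatten all arrays and tag each element with its array index.
    values = [v for arr in arrays for v in arr]
    tags = [i for i, arr in enumerate(arrays) for _ in arr]

    # Pass 1: sort everything by value, carrying the tags along.
    order = sorted(range(len(values)), key=values.__getitem__)
    values = [values[j] for j in order]
    tags = [tags[j] for j in order]

    # Pass 2: stable sort by tag. Stability preserves the value order
    # from pass 1 inside each tag group, so every array ends up sorted.
    order = sorted(range(len(tags)), key=tags.__getitem__)
    values = [values[j] for j in order]
    tags = [tags[j] for j in order]

    # Split the flat result back into one list per input array.
    out = [[] for _ in arrays]
    for t, v in zip(tags, values):
        out[t].append(v)
    return out


sorted_arrays = sort_arrays_tagged([[3, 1, 2], [9, 7], [5, 5, 4, 0]])
```

The sketch makes the redundancy visible: all elements are sorted globally even though only the order within each array matters, and a tag array as large as the data has to be stored and moved alongside it.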

3 Graphics Processing Units and CUDA programming model

Graphics processing units were first introduced as dedicated graphics computing units [21], capable of performing transforms and lighting with hardware-accelerated support. This revolutionized the gaming and graphics industry. The highly parallel architecture of a GPU provided far more resources for compute-intensive problems, especially those related to graphics generation.

A graphics processing unit consists of several Streaming Multiprocessors (SMs). Initially
