CUDA Persistent Threads (CuPer) for Better Real-Time Performance on the Jetson TX2

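The white paper "Improving Real-Time Performance with CUDA Persistent Threads (CuPer) on the Jetson TX2" was written by Todd Allen, a consulting software engineer, and published in March 2018. It examines how CUDA persistent threads can improve the performance of real-time software on the Jetson TX2 platform, and it describes the role of the GPU (graphics processing unit) in real-time applications, in particular the challenge posed by sub-millisecond frame-duration constraints.

Real-time software developers increasingly rely on the parallel computing capability of GPUs, through programming models such as CUDA, to run complex parallel workloads. Historically, however, a major weakness of GPUs has been poor determinism, which limited their use in real-time applications with strict frame-duration requirements. The situation has improved in recent years, but applications with very short frame intervals (possibly as low as 100 microseconds) remain a challenge.

CUDA persistent threads (CuPer) are a technique that can significantly improve determinism, making medium-sized workloads viable for this class of real-time application. The paper presents a simple CUDA-based API designed to support this programming style and shows timing results obtained with it. With persistent threads, the worker threads stay alive on the GPU, which reduces the overhead of repeatedly creating and destroying threads and improves both response time and determinism.

The paper first explains the concept of CUDA persistent threads and how it relates to the traditional CUDA execution model (such as streams and job queues). The author may then analyze how CuPer makes effective use of hardware resources on an embedded platform such as the Jetson TX2 to lower latency and raise throughput. The paper may also cover how to manage and schedule persistent threads in a real-time system so that strict timing constraints are met.

In its experimental section, the paper may report benchmarks and case studies showing the performance gains of CuPer in practice. The results may cover different workload sizes, numbers of concurrent threads, and real-time metrics such as maximum frame rate, jitter, and response time.

For developers who want to use a GPU for real-time computation on embedded devices, the paper is a useful reference. It provides theoretical background along with practical examples and performance evaluation, helping readers understand and apply CUDA persistent threads to systems that are sensitive to sub-millisecond response times.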

Improving Real-Time Performance with CUDA Persistent Threads (CuPer) on the Jetson TX2 Page 6

closely mirrors that of the launch/synchronization method. It is intentionally simple to provide the best determinism.

Without an explicit launch to start a workload and a synchronization to detect when the workload has been completed, other synchronization primitives are needed. These can be implemented manually, but the RedHawk CUDA Persistent Threads (CuPer) API provides a simple interface that abstracts the primitive operations.

RedHawk CUDA Persistent Threads (CuPer) API

The standard interfaces are provided in the <cuper.h> header file. All elements are declared within the Cuper::Std namespace. There are three classes defined therein: Cpu, Cuda1Block, and CudaMultiBlock. An object of the Cpu class is created in CPU source code. A typical usage would have a form similar to this:

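    void cpuFunction (…)
    {
        Cuper::Std::Cpu p;

        // Obtain the device-side pointer for the mapped host buffer h_A.
        // The CUDA runtime requires the third (flags) argument, which must be 0.
        cudaHostGetDevicePointer(&d_A, h_A, 0);
        // Launch the persistent kernel once, before entering the main loop.
        Persistent<<<blocksPerGrid, threadsPerBlock>>>(p.token(), d_A);
        for (…) {
            … initialize h_A …
            p.startCuda();      // analogous to a CUDA kernel launch
            … possibly do unrelated CPU work …
            p.waitForCuda();    // analogous to a CUDA synchronize
            … use results in h_A …
        }
        p.terminateCuda();
    }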

The CUDA kernel in this example is called Persistent. It is launched only once before entering the main loop. Any buffer(s) that will be used to pass user data back and forth during normal operation must be specified at that time. In addition, the Cuper::Std::Cpu object provides a token(), the value of which also must be passed.

The object, p, of the Cuper::Std::Cpu class is used to control the execution of workloads within the CUDA kernel. Within the main loop, p.startCuda informs the CUDA GPU that the input buffers are prepared and that it should begin performing its workload. This is analogous to a CUDA kernel launch. p.waitForCuda causes the CPU to wait for the work on the GPU to be completed. This is analogous to a CUDA synchronize.

If the main loop is ever expected to exit, p.terminateCuda may be called, as it is after the loop in the example, to request that the persistent kernel terminate.
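The start/wait/terminate handshake described above can be illustrated without a GPU. The following is a minimal sketch, not the CuPer implementation: a std::thread stands in for the persistent CUDA kernel, and the names PersistentWorker, Command, and runFrames are hypothetical, introduced only for this example. It assumes the handshake can be modeled with two atomic flags, one written by the host to issue commands and one written by the worker to report completion.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical command values the host writes for the worker to read.
enum class Command : int { Idle, Start, Terminate };

// CPU-only stand-in for a persistent kernel: the worker thread is created
// once and then polls a flag instead of being relaunched for every frame.
class PersistentWorker {
public:
    explicit PersistentWorker(std::vector<int>& buffer)
        : buffer_(buffer), worker_([this] { loop(); }) {}

    ~PersistentWorker() { terminate(); }

    // Analogous to a kernel launch: the input buffer is ready, begin work.
    void start() {
        done_.store(false, std::memory_order_relaxed);
        command_.store(Command::Start, std::memory_order_release);
    }

    // Analogous to a synchronize: block until the worker reports completion.
    void wait() const {
        while (!done_.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }

    // Ask the worker loop to exit and join it; safe to call more than once.
    void terminate() {
        command_.store(Command::Terminate, std::memory_order_release);
        if (worker_.joinable()) {
            worker_.join();
        }
    }

private:
    void loop() {
        for (;;) {
            Command c = command_.load(std::memory_order_acquire);
            if (c == Command::Terminate) {
                return;
            }
            if (c == Command::Start) {
                for (int& v : buffer_) {
                    v *= 2;  // placeholder workload: double every element
                }
                // Return to Idle only if the host has not requested
                // termination in the meantime.
                Command expected = Command::Start;
                command_.compare_exchange_strong(expected, Command::Idle);
                done_.store(true, std::memory_order_release);
            } else {
                std::this_thread::yield();
            }
        }
    }

    // Declaration order matters: the flags must exist before worker_ starts.
    std::vector<int>& buffer_;
    std::atomic<Command> command_{Command::Idle};
    std::atomic<bool> done_{true};
    std::thread worker_;
};

// Mirrors the shape of the main loop above: start, overlap, wait, reuse.
std::vector<int> runFrames(int frames) {
    assert(frames >= 0);
    std::vector<int> data{1, 2, 3};
    PersistentWorker p(data);
    for (int i = 0; i < frames; ++i) {
        p.start();
        // ... unrelated host work could overlap with the worker here ...
        p.wait();
    }
    p.terminate();
    return data;
}
```

Calling runFrames(3) doubles the buffer three times, so {1, 2, 3} becomes {8, 16, 24}. On a real GPU the flags would presumably live in memory visible to both CPU and GPU and the polling loop would run inside the kernel; those details are outside what this sketch shows.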
