in which an additional lower-power, lower-performance
quad-core ARM A53 is provided on chip, but is not directly
accessible to software and is only activated in low-power
modes. The ARM CPUs and the GPU share 4 GB of
1600-MHz DRAM memory partitioned into 32 banks.
The TX1 features an integrated GPU. Such a GPU tightly
shares DRAM memory with CPU cores, typically draws
between 5 and 15 watts, and requires minimal cooling and
little additional space. The alternative to an integrated GPU is a
discrete GPU. Discrete GPUs are packaged on adapter cards
that plug into a host computer bus, have their own local
DRAM memory that is completely independent from that
used by CPU cores, typically draw between 150 and 250
watts, need active cooling, and occupy substantial space.
B. CUDA Programming Fundamentals
The following is a high-level description of CUDA, the
API for GPU programming provided by NVIDIA.
A GPU is fundamentally a co-processor that performs
operations requested by CPU programs. CUDA programs
use a set of C or C++ library routines to request GPU
operations that are implemented by a combination of
hardware and device-driver software. The typical structure
of a CUDA program is as follows:
(i) allocate GPU-local (device) memory for data; (ii) use the GPU to copy data from host memory to GPU device memory; (iii) launch a program, called a kernel, to run on the GPU cores to compute some function on the data; (iv) use the GPU to copy output data from device memory back to host memory; (v) free the device memory. When invoking a CUDA kernel,
the programmer specifies the number of GPU threads to
use during the kernel’s execution and how the threads
are organized into groups called thread blocks. Having
multiple threads executing the kernel enables the significant
parallelism afforded by GPUs to be exploited. Kernel
launches are always asynchronous, requiring the invoking
CPU process to explicitly wait for them to complete.
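To make this structure concrete, the following is a minimal sketch of the five-step pattern, including an explicit wait after the asynchronous kernel launch; the kernel, data sizes, and launch geometry are our own illustrative choices rather than anything prescribed by CUDA.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *host = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) host[i] = 1.0f;

  float *dev;
  cudaMalloc(&dev, bytes);                              // (i) allocate device memory
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); // (ii) copy host to device
  scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);        // (iii) launch kernel in 256-thread blocks
  cudaDeviceSynchronize();                              // launches are asynchronous; wait here
  cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // (iv) copy device back to host
  cudaFree(dev);                                        // (v) free device memory
  printf("host[0] = %f\n", host[0]);
  free(host);
  return 0;
}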
On integrated GPUs, CUDA provides a zero-copy option
whereby programs can simply pass a pointer to the shared
memory where a kernel's data is located; that is, explicit
copying from CPU-local memory to GPU-local memory is
avoided. CUDA also supports a different memory-access
mechanism, called unified memory, on both discrete and
integrated GPUs. Unified memory is similar to zero-copy
memory, as a single memory pointer can be used in both
CPU and GPU code. The difference between unified and
zero-copy memory appears during kernel execution, where,
in the case of unified memory, the GPU driver transparently
transfers data on demand between CPU-local memory and
GPU-local memory.
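As a rough illustration, the following sketch allocates one buffer each way and passes both to the same kernel without any explicit copies; the kernel and sizes are illustrative, though the allocation calls (cudaHostAlloc, cudaHostGetDevicePointer, cudaMallocManaged) are the standard CUDA runtime routines for these mechanisms.

#include <cuda_runtime.h>

// Illustrative kernel: increments each element in place.
__global__ void increment(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main(void) {
  cudaSetDeviceFlags(cudaDeviceMapHost); // enable mapped (zero-copy) host memory
  const int n = 1 << 16;
  size_t bytes = n * sizeof(float);

  // Zero-copy: mapped, pinned host memory that the GPU accesses in place.
  float *zc_host, *zc_dev;
  cudaHostAlloc(&zc_host, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(&zc_dev, zc_host, 0);

  // Unified memory: a single pointer valid in both CPU and GPU code; the
  // driver transfers data on demand during kernel execution.
  float *um;
  cudaMallocManaged(&um, bytes);

  for (int i = 0; i < n; i++) { zc_host[i] = 0.0f; um[i] = 0.0f; }

  increment<<<(n + 255) / 256, 256>>>(zc_dev, n); // no explicit copy needed
  increment<<<(n + 255) / 256, 256>>>(um, n);     // driver migrates data on demand
  cudaDeviceSynchronize();

  cudaFreeHost(zc_host);
  cudaFree(um);
  return 0;
}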
CUDA operations pertaining to a given GPU are ordered
by associating them with a stream. By default, there is
a single stream for all programs that share a GPU, but
multiple streams can be optionally created. Operations in
a given stream are executed in FIFO order, but the order
of execution across different streams is determined by the
GPU scheduling in the device driver. Tasks from different
streams may even execute concurrently or out of request
order.
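The following sketch, with an illustrative kernel and sizes, issues one kernel in each of two newly created streams; FIFO ordering holds within each stream, while the relative order of the two launches is left to the driver.

#include <cuda_runtime.h>

// Illustrative kernel standing in for arbitrary per-stream work.
__global__ void work(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *d_a, *d_b;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMemset(d_a, 0, bytes);
  cudaMemset(d_b, 0, bytes);

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Within each stream, operations execute in FIFO order; across s1 and
  // s2, the driver's scheduler decides, so these kernels may run
  // concurrently or complete out of request order.
  work<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
  work<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}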
Programmers can think of a GPU as being abstractly
composed of one or more copy engines (CEs) that implement
transfers of data between host memory and device memory,
and an execution engine (EE) (consisting of many parallel
processors) that executes GPU kernels. The TX1 has a
single CE. EEs and CEs operate concurrently. When
there are multiple streams, kernels and copy operations from
different streams can also operate concurrently, depending
on the GPU hardware. To the best of our knowledge,
complete details of the kernel attributes and policies used by
NVIDIA to schedule kernels and copy operations are not
available.
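As a sketch of how a CE and the EE might be exercised concurrently, the following (with illustrative names and sizes) issues an asynchronous copy in one stream and a kernel in another; whether the two actually overlap depends on the GPU hardware, as noted above.

#include <cuda_runtime.h>

// Illustrative kernel that keeps the EE busy for a while.
__global__ void busy(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    for (int k = 0; k < 1000; k++)
      data[i] = data[i] * 1.0001f + 0.0001f;
}

int main(void) {
  const int n = 1 << 22;
  size_t bytes = n * sizeof(float);
  float *h_in, *d_in, *d_work;
  cudaHostAlloc(&h_in, bytes, cudaHostAllocDefault); // pinned host buffer for async copies
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_work, bytes);
  cudaMemset(d_work, 0, bytes);
  for (int i = 0; i < n; i++) h_in[i] = 1.0f;

  cudaStream_t copy_s, exec_s;
  cudaStreamCreate(&copy_s);
  cudaStreamCreate(&exec_s);

  // The CE can service this copy while the EE runs the kernel below,
  // because the two operations are issued in different streams.
  cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copy_s);
  busy<<<(n + 255) / 256, 256, 0, exec_s>>>(d_work, n);

  cudaDeviceSynchronize();
  cudaStreamDestroy(copy_s);
  cudaStreamDestroy(exec_s);
  cudaFreeHost(h_in);
  cudaFree(d_in);
  cudaFree(d_work);
  return 0;
}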
C. Related Work
The black-box nature of GPU programming has limited
both the scheduling and analysis techniques available for
real-time GPU usage. As a result, much prior work treats
a single GPU as an atomic entity—a real-time task locks
an entire GPU, or individual EEs or CEs, for the duration
of any GPU computation. Such an approach is taken in
TimeGraph [14], RGEM [13], GPUSync [9], and several
other frameworks [28, 29, 30, 33]. The viewpoint taken in
all of this work is that GPU co-scheduling must be avoided
because concurrently executing kernels might adversely
interfere with each other. However, we are aware of no work
directed at real-time systems in which such interference is
actually demonstrated or its effects quantified.
In a precursor to this paper, our group conducted an
investigation of the high-level effects of uncontrolled co-
scheduling on the execution times of a variety of image-
processing benchmarks [26]. We conducted this work
using both the NVIDIA TX1 and TK1 (a similar, but
weaker, single-board computer). This work found that
unmanaged co-scheduling can lead to improved average-
case performance. However, we did not examine in depth
how this benefit is achieved or what its limitations are.
Work has also been directed at splitting GPU tasks into
smaller sub-tasks to approximate preemptive execution or
improve utilization [3, 13, 19, 35]. A framework called
Kernelet [34] falls into this category, but is of particular
interest to us because it considers GPU co-scheduling
as a means to improve utilization. Kernelet,
however, requires heavy instrumentation and does not
consider co-scheduling unmodified workloads. Additionally,
the developers of Kernelet do not provide an in-depth
investigation into the GPU’s actual behavior or interference
effects during co-scheduling, which, in fairness, was not
one of their main objectives. Others have published further