An Efficient, Low-Cost HEVC Encoder on the TI Keystone Multicore DSP


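This article presents the design and implementation of a highly parallel, low-cost HEVC (High Efficiency Video Coding) video encoder for the TI Keystone multicore TMS320C6678 DSP. HEVC offers higher coding efficiency than H.264/AVC, but its increased computational complexity makes real-time operation on embedded processors challenging. By redesigning the encoder structure for CTU-level parallelism, developing a low-latency multicore data transfer mechanism, and optimizing the encoding bottlenecks with the SIMD (Single Instruction Multiple Data) instructions of the C6000 series, the authors achieve a substantial improvement in real-time processing capability on the TI TMS320C6678 DSP. The main points are as follows:

1. **Parallel HEVC encoder design**: To meet the high computational demand of HEVC encoding, the authors redesign the encoder structure around CTU (Coding Tree Unit)-level parallelism, so that the encoding process can make full use of the multicore processor and run faster.
2. **Low-latency multicore data transfer**: To reduce the latency of moving data between the internal L2 memory and the external DDR3 memory, the article proposes a mechanism that streamlines the data flow, relieves system bottlenecks, and keeps communication efficient in a multicore environment.
3. **SIMD instruction optimization**: TI's C6000 series DSPs support SIMD instructions, which let a single instruction process multiple data elements at once. The article uses this feature to identify and optimize the key bottlenecks in the encoding process, especially the compute-intensive modules, which further increases encoding speed.
4. **Experiments and performance evaluation**: Experimental results show that, compared with the CPU-based HM reference software, the HEVC encoder runs 465.50 times faster on the TMS320C6678 DSP with a performance loss of only 0.93 dB. The solution therefore improves real-time processing capability considerably while maintaining good coding quality, which makes it particularly suitable for power-constrained real-time video applications.

The work offers an efficient, parallel, and low-cost HEVC encoding solution for embedded systems and helps to promote the use of HEVC in embedded devices. By exploiting the multicore architecture and the hardware features, it improves both coding efficiency and real-time performance, and it provides a useful reference for future video coding work.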

1051-8215 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2018.2826074, IEEE Transactions on Circuits and Systems for Video Technology

IEEE T-CSVT Draft

the limited on-chip memory inside the TMS320C6678 DSP, only one CTU can be processed at a time. Assuming that there are N CTUs in each frame, the four encoding steps are as follows.

First, due to the spatial-temporal referencing mechanism, compressing each CTU requires the encoding frame, the reference frame, and the reconstructed frame. Consequently, the compression of each frame involves 3N video data transmissions and 3N CTU info transmissions. The 3N video data transmissions comprise: 1) 2N transmissions of the encoding frame and the reference frame from DDR3 to L2 memory; 2) N transmissions of the reconstructed frame from L2 memory to DDR3. The 3N CTU info transmissions comprise: 1) 2N transmissions of the CTU info of the current frame and the reference frame from DDR3 to L2 memory; 2) N transmissions of the CTU info of the current frame from L2 memory to DDR3.

Second, a deblocking filter is used to reduce visible artifacts at block boundaries. All vertical block boundaries are first filtered by the horizontal filter, and then the horizontal block boundaries are filtered by the vertical filter. The deblocking of each frame involves 4N video data transmissions and 2N CTU info transmissions: 2N video data transmissions and N CTU info transmissions between DDR3 and L2 memory are necessary for the horizontal filter, and another 2N video data transmissions and N CTU info transmissions are needed by the vertical filter.
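The two-pass order described above can be illustrated with a minimal Python sketch. The edge filter used here (`smooth_pair`) is a toy stand-in and not the normative HEVC deblocking filter, and all function names are hypothetical. The point is only the ordering: the second pass works on the output of the first, which is consistent with each pass needing its own round of transfers between DDR3 and L2 memory.

```python
GRID = 8  # HEVC deblocking works on an 8x8 luma sample grid


def smooth_pair(p0, q0):
    """Toy edge filter (assumption): pull both samples a quarter step together."""
    delta = int((q0 - p0) / 4)  # truncates toward zero
    return p0 + delta, q0 - delta


def filter_vertical_edges(frame):
    """Pass 1: horizontal filtering across vertical edges (x = 8, 16, ...)."""
    for row in frame:
        for x in range(GRID, len(row), GRID):
            row[x - 1], row[x] = smooth_pair(row[x - 1], row[x])


def filter_horizontal_edges(frame):
    """Pass 2: vertical filtering across horizontal edges (y = 8, 16, ...)."""
    for y in range(GRID, len(frame), GRID):
        for x in range(len(frame[y])):
            frame[y - 1][x], frame[y][x] = smooth_pair(frame[y - 1][x], frame[y][x])


def deblock(frame):
    """Run both passes on a copy; pass 2 reads the samples written by pass 1."""
    out = [row[:] for row in frame]
    filter_vertical_edges(out)
    filter_horizontal_edges(out)
    return out


# 16x16 test frame made of four flat 8x8 blocks with different levels.
frame = [[100 + 40 * (x >= GRID) + 20 * (y >= GRID) for x in range(16)]
         for y in range(16)]
out = deblock(frame)
print("step across vertical edge:", frame[0][8] - frame[0][7], "->", out[0][8] - out[0][7])
print("step across horizontal edge:", frame[8][0] - frame[7][0], "->", out[8][0] - out[7][0])
```

In this toy frame the step across each block boundary is halved by the filter, while samples away from the boundaries are left untouched.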

Third, HEVC employs sample adaptive offset (SAO) to reduce the difference between the encoding frame and the reconstructed frame. The SAO process of each frame involves 3N video data transmissions and N CTU info transmissions. The 3N video data transmissions comprise: 1) 2N transmissions of the encoding frame and the reconstructed frame from DDR3 to L2 memory, which are the inputs to SAO; 2) N transmissions of the reconstructed frame from L2 memory to DDR3, which are the outputs of SAO. In addition, the SAO of the encoding frame requires N CTU info transmissions from DDR3 to L2 memory.
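To make the data flow of SAO concrete (two input frames, one output frame), the following Python sketch implements a simplified band-offset variant. It is an illustration under stated assumptions, not the encoder described in the paper: it derives one offset for every one of the 32 intensity bands from the mean error, whereas a real HEVC encoder signals offsets for only four consecutive bands, also supports an edge-offset mode, and chooses parameters by rate-distortion optimization. All function names are hypothetical.

```python
def band_index(sample, bit_depth=8):
    """Map a sample to one of 32 equal-width intensity bands."""
    return sample >> (bit_depth - 5)


def derive_band_offsets(orig, recon, bit_depth=8):
    """Per-band offset = rounded mean of (original - reconstructed)."""
    sums = [0] * 32
    counts = [0] * 32
    for o, r in zip(orig, recon):
        band = band_index(r, bit_depth)
        sums[band] += o - r
        counts[band] += 1
    return [round(s / c) if c else 0 for s, c in zip(sums, counts)]


def apply_band_offsets(recon, offsets, bit_depth=8):
    """Add the offset of each sample's band and clip to the valid range."""
    max_val = (1 << bit_depth) - 1
    return [min(max_val, max(0, r + offsets[band_index(r, bit_depth)]))
            for r in recon]


def sse(a, b):
    """Sum of squared errors between two sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Inputs: samples of the encoding (original) frame and the reconstructed frame.
orig = [100, 102, 101, 103, 200, 202, 201, 203]
recon = [98, 100, 99, 101, 203, 205, 204, 206]

offsets = derive_band_offsets(orig, recon)
filtered = apply_band_offsets(recon, offsets)  # output: corrected reconstruction
print("SSE before:", sse(orig, recon), "after:", sse(orig, filtered))
```

Here the dark samples are systematically 2 too low and the bright samples 3 too high, so a single offset per band removes the error entirely; on real content the correction is only partial.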

[Figure 2. Labels recoverable from the diagram: Core0 (Master); Core1 to Core7 (Slave); DDR3; Shared Memory (4MB); Video Data; Control Command; Encoder Configuration; Coding Info.; Shared Data; Reconstructed Video; Feedback Info.; Bitstream.]

Fig. 2. Diagram of TI TMS320C6678-based HEVC encoder.

Lastly, the CTU info of the current frame is transferred into the TMS320C6678 DSP to generate the bitstream with entropy coding, which involves N CTU info transmissions.

Each of the above four steps involves a large amount of data transmission between internal and external memory, and these transmissions have a tremendous impact on the encoding speed on the DSP. Besides the amount of data, the frequency, the mode, and the parallelism of data transmission are also very important to the encoding speed. These problems will be carefully analyzed and addressed later.

In all, encoding a frame composed of N CTUs involves 10N video data transmissions and 7N CTU info transmissions, that is, 17 data transmissions for each CTU. For the optimization of an HEVC encoder on a DSP, this is summarized as Problem 1: The frequency of data transmission between the internal and external memory is too high, since the internal memory of the hardware is limited.
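The per-step counts given in the text can be tallied with a few lines of Python. The per-CTU figures come directly from the four steps above; the 64x64 CTU size used to estimate N for a 1080p frame is an assumption (the excerpt does not state the CTU size), and the names are illustrative.

```python
import math

# (video data, CTU info) transmissions per CTU for each encoding step,
# as stated in the text.
STEPS = {
    "compression": (3, 3),
    "deblocking": (4, 2),
    "sao": (3, 1),
    "entropy_coding": (0, 1),
}


def per_ctu_totals(steps=STEPS):
    """Return (video, info, total) transmissions needed for one CTU."""
    video = sum(v for v, _ in steps.values())
    info = sum(i for _, i in steps.values())
    return video, info, video + info


def ctus_per_frame(width, height, ctu_size=64):
    """Number of CTUs covering a frame; ctu_size=64 is an assumption."""
    return math.ceil(width / ctu_size) * math.ceil(height / ctu_size)


video, info, total = per_ctu_totals()
n = ctus_per_frame(1920, 1080)
print(f"per CTU: {video} video + {info} info = {total} transmissions")
print(f"per 1080p frame (N = {n}): {total * n} transmissions")
```

Under the 64x64 assumption a 1080p frame has 510 CTUs, so the 17 transmissions per CTU add up to several thousand transfers per frame, which is the scale behind Problem 1.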

Furthermore, Table I gives the detailed time consumption of reading/writing DDR3 memory on the TMS320C6678. The data access efficiency differs dramatically among direct access, memcpy, and EDMA, so it is critical to select an appropriate data transmission method when optimizing an HEVC encoder on a DSP. The second key problem is therefore summarized as Problem 2: The time consumption of data transmission between the internal and external memory is significant and cannot be neglected.
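The size of the gap reported in Table I is easier to see as a ratio. The short Python sketch below uses only the timings listed in Table I (in μs); the dictionary layout and function names are illustrative.

```python
# Timings from Table I, in microseconds.
TABLE_I_US = {
    "direct access": {"YUV": 293990.75, "PCCU": 1351.07},
    "memcpy": {"YUV": 180876.68, "PCCU": 829.36},
    "EDMA": {"YUV": 939.15, "PCCU": 4.34},
}


def total_us(method):
    """Total access time for one method (YUV + PCCU)."""
    return sum(TABLE_I_US[method].values())


def speedup_of_edma_over(baseline):
    """How many times faster EDMA is than the given baseline method."""
    return total_us(baseline) / total_us("EDMA")


for baseline in ("direct access", "memcpy"):
    print(f"EDMA vs {baseline}: {speedup_of_edma_over(baseline):.1f}x faster")
```

With these figures EDMA is roughly 313 times faster than direct access and roughly 193 times faster than memcpy, which is why the choice of transfer method matters so much for Problem 2.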

In order to identify the most time-consuming modules in the DSP-based HEVC encoder, we analyze the execution time of our HM10.0 [43]-based embedded HEVC encoder on the TI TMS320C6678 DSP. The first 100 frames of the video sequences in Class B (1080p) are encoded with a QP of 32 under the Low Delay P main configuration. The rest of the configurations are kept unchanged [44]. Table II lists the average percentage of encoding time consumed by the key HEVC encoding modules on the TMS320C6678. According to the results, modules such as interpolation, transform, quantization, and SAO account for the vast majority of the encoding complexity. Therefore, in order to speed up HEVC encoding and achieve real-time coding, Problem 3 is summarized as: The calculation of these key modules in the HEVC standard is serialized, which is too time-consuming for real-time systems.

In summary, there are three major problems in developing the TI Keystone multicore DSP-based highly parallel, low-cost, fast HEVC encoding solution, each of which will be carefully studied and addressed with a dedicated design and implementation. More specifically, to address Problem 1, a hardware-friendly memory-efficient encoding structure will be developed in

TABLE I. DATA ACCESS EFFICIENCY (μs) ON TMS320C6678.

| | direct access | memcpy | EDMA |
|---|---|---|---|
| YUV | 293990.75 | 180876.68 | 939.15 |
| PCCU | 1351.07 | 829.36 | 4.34 |
| Total | 295341.82 | 181706.04 | 943.49 |
