多核SIMD优化实现H.264/AVC编码器

需积分: 10 191 浏览量更新于2024-07-21 收藏 1.85MB PDF 举报

"An Efficient Multi-Core SIMD Implementation for H.264AVC Encoder." 本文主要探讨了在不同架构上优化H.264/AVC编码器的过程，包括多核和单核架构，以及具有不同向量寄存器大小的SIMD（Single Instruction Multiple Data）指令集。针对高清分辨率下实时编码的挑战，代码优化变得至关重要。作者将编码器分解为功能模块，以便更好地理解哪些部分最需要优化，并详细评估性能提升。文章指出了并行化视频编码器和SIMD优化时常见的问题。对于所有架构，作者提出了他们的解决方案。除了展示高效的视频编码器实现，论文的主要目的是讨论不同架构和SIMD指令集的特性如何影响目标应用的性能。作者提供了实现的速度提升结果，以便比较不同实现并评估适用于当前和下一代视频编码算法的最佳方案。 H.264/AVC是一种高效能的视频压缩标准，广泛应用于高清视频编码，其复杂性意味着在实时处理中需要高性能计算。SIMD技术允许一次操作处理多个数据，大大提高了处理效率，尤其在处理大量像素数据时，如视频编码中的运动补偿和变换等步骤。在多核架构中，通过任务分配和数据同步，可以实现编码过程的并行化。每个核心可以独立处理一部分编码任务，如宏块的编码或熵编码。然而，多核环境下的通信开销和负载均衡是需要解决的关键问题。 SIMD指令集则提供了一种在单个时钟周期内处理多个数据元素的方法，这对于处理图像和视频数据非常有效。不同架构的SIMD指令集可能有不同的向量寄存器大小，这会影响它们处理数据的宽度和效率。例如，更宽的寄存器可以一次处理更多的像素，从而提高编码速度。在论文中，作者对优化策略进行了深入分析，包括代码重构、循环展开、数据预取等技术，以利用SIMD指令集的优势。他们还可能讨论了如何在多核系统中有效地分割编码任务，以最大化并行处理能力。最后，通过比较不同实现的性能，作者得出结论，指出在特定硬件配置下，哪种方法最适合实现H.264/AVC编码的实时性和效率。这为未来视频编码器设计提供了有价值的指导，特别是对于服务可伸缩视频编码（SVC）等新兴技术，需要在保持质量的同时，适应各种网络条件和设备性能。

VLSI Design 3

The MediaBreeze SIMD processor was proposed to

reduce the bottlenecks in SIMD implementations [15]. The

Breeze SIMD ISA uses a multidimensional vector able to

speed up nested loops but at the cost of a very complicated

instruction structure requiring a dedicated instruction mem-

ory. In [16], a speciﬁc SIMD ISA named VS-ISA was pro-

posed in order to improve performance in video coding. The

authors adopted speciﬁc solutions for sum of absolute diﬀer-

ence (SAD), not aligned load applied to ME, interpolation,

DCT-IDCT, and quantization dequantization.

Another typical approach to reduce the SIMD overhead

is the usage of multibank vector memory where data is stored

interleaved. The drawback is the increase of hardware cost for

supporting the addresses generation.

An alternative to SIMD implementation on program-

mable processor architectures is the hardwired processor.

Usually, it is only used when performance and low power

consumption are essential requirements [7, 14, 17]. In fact,

the lack of ﬂexibility typical of hardwired processors reduces

their applicability to a narrow segment of the market, where

the programmability is either not required or considerably

reduced.

3. SIMD ISA Description

In order to optimize the H.264 encoder, we chose three diﬀer-

ent ISAs. The adopted architectures are ST240, xSTream, and

P2012, all developed by STMicroelectronics. The former is

a single-processor archite cture, and the others are multicore

platforms. In the following, the three architectures will

be brieﬂy described, giving special attention to the SIMD

instruction set.

We chose these architectures for their novelty and for the

possibility to have a complete toolchain (code generation,

simulation, proﬁling, etc.) for developing an application in

an optimal way. Each toolchain allowed a complete observ-

ability of the system. In this way, it was possible to evaluate

the eﬀectiveness of every author’s solution. Observability is

a very important characteristic when developing/optimizing

an application. Using a real system it is not always possible to

reach the degree of observabilit y you have using a simulator

and a suitable toolchain. Moreover, in an architecture under

development as P2012 we had the possibility to contribute to

the SIMD instruction set and, more important, to evaluate

the contr ibution of each particular SIMD to the performance

of the target video codec application. The three instruction

sets present suitable characteristics for our research; they

are generic instruction set, but ST240 includes a few video-

speciﬁc instructions; we can analyse the impact of diﬀerent

vector register sizes; even if xSTream and P2012 share many

characteristics, only xSTream supports horizontal SIMD

(this is a special feature; e.g., other SIMD extensions as Intel

SSE and ARM NEON do not have the same support); in

P2012 platform, we were able to deﬁne and insert new SIMD

instructions.

Besides the type of instructions, the SIMD extensions

diﬀer in both size and precision. These diﬀerences allow

analyzing the impact of diﬀerent architecture solutions on

the global performance.

Source 1Source 2

CD BA GH FE

Absubu.pb result

Sadu.pb result

|−| |−| |−| |−|

−

Figure 1: SAD operation.

3.1. ST240. The ST240 is a processor of STMicroelectronics

ST200 family based on LX technology jointly developed with

Hewlett Packard [18, 19]. The main ST240’s features are the

following:

(i) 4-issue Very Long Instru ction Word (VLIW)

(ii) 64-32-bit general purpose registers

(iii) 32KB D-Cache and 32KB I-Cache

(iv) 450 MHz clock frequency

(v) 8-bit/16-bit arithmetic SIMD.

In the H.264 encoder SIMD optimization, the most sig-

niﬁcant instructions of the ST240 ISA are the following: the

SIMD add.ph and sub.ph which perform, respectively, the

packed 16-bit addition or subtraction; the perm.pb instruc-

tion which performs byte permutations and the mulad-

dus.pb w hich multiplies an unsigned byte by a signed byte

in each of the byte lanes and then sums across the four

lanes to pro duce a single result. Furthermore, several data

manipulation instructions are deﬁned: pack.pb packs 16-bit

values to byte elements ignoring the upper half; shuﬀeve.pb

and shuﬀodd.pb, respectively, perform 8-bit shuﬄeofeven

and odd lanes. Two averaging operations (avg4u.pb and

avgu.pb) are also deﬁned in the instruction set.

One important operation in video-coding algorithms,

the absolute value of the diﬀerence, abs (a-b), can be

performed with the absubu.pb instruction (Figure 1)which

works on each byte lane (treating each byte lane as an

unsigned value) and returns the result in the corresponding

byte lane of the destination register. The sadu.pb (Figure 1)

performs the same operation and then sums the byte lanes

value and returns the result.

3.2. xSTream. xSTream is a multiprocessor dataﬂow archi-

tecture for high-performance embedded multimedia stream-

ing applications designed at STMicroelectronics [20, 21].

xSTream is constituted by a parallel distributed and

shared memory architecture. It is an array of processing

elements connected by a Network on Chip (NoC) with

speciﬁc hardware for management of communication [22],

as depicted in Figure 2.

剩余14页未读，继续阅读

limjk83215

粉丝: 0
资源: 6

多核SIMD优化实现H.264/AVC编码器

jdk-8u241-linux-arm64-vfp-hflt.tar.gz

基于龙芯SIMD技术的H.264视频解码优化.pdf

Pillow_SIMD-9.0.0.post0-cp39-cp39-win32.whl.rar

Pillow_SIMD-9.0.0.post0-cp310-cp310-win32.whl.rar

Pillow_SIMD-9.0.0.post0-cp38-cp38-win32.whl.rar

如何在多核架构下通过SIMD优化提升H.264/AVC编码器在实时高清视频编码中的性能表现？

单片机C语言实例--303-内部函数intrins.h应用举例.zip

Pillow_SIMD-9.0.0.post0-cp37-cp37m-win32.whl.rar

Pillow_SIMD-5.3.0.post0-cp34-cp34m-win32.whl.rar

Pillow_SIMD-7.0.0.post3-cp36-cp36m-win32.whl.rar

最新资源