VLSI Design 3
The MediaBreeze SIMD processor was proposed to
reduce the bottlenecks in SIMD implementations [15]. The
Breeze SIMD ISA uses a multidimensional vector able to
speed up nested loops, but at the cost of a very complicated
instruction structure requiring a dedicated instruction memory.
In [16], a specific SIMD ISA named VS-ISA was proposed
in order to improve performance in video coding. The
authors adopted specific solutions for the sum of absolute
differences (SAD), unaligned loads applied to ME, interpolation,
DCT-IDCT, and quantization-dequantization.
Another typical approach to reduce the SIMD overhead
is the use of a multibank vector memory in which data are
stored interleaved. The drawback is the increased hardware
cost of address generation.
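As an illustration of the interleaving idea, the following minimal sketch maps element indices to banks in a low-order interleaved memory. The bank count, type, and function names are hypothetical and are not taken from any of the cited designs; the sketch only shows why each bank may need its own row address for a single vector access.

```c
#include <stdint.h>

#define NUM_BANKS 4u /* hypothetical bank count, chosen for illustration */

/* Location of one element inside the banked memory. */
typedef struct {
    uint32_t bank; /* which bank holds the element */
    uint32_t row;  /* row address inside that bank */
} bank_addr_t;

/* Low-order interleaving: consecutive elements go to consecutive banks. */
bank_addr_t map_interleaved(uint32_t element_index)
{
    bank_addr_t loc;
    loc.bank = element_index % NUM_BANKS;
    loc.row = element_index / NUM_BANKS;
    return loc;
}

/* Returns 1 if a vector access of NUM_BANKS elements, starting at `start`
 * and separated by `stride`, touches every bank at most once. */
int is_conflict_free(uint32_t start, uint32_t stride)
{
    uint32_t used = 0;
    for (uint32_t i = 0; i < NUM_BANKS; i++) {
        uint32_t bank = map_interleaved(start + i * stride).bank;
        if (used & (1u << bank)) {
            return 0; /* two elements fall into the same bank */
        }
        used |= 1u << bank;
    }
    return 1;
}
```

With these assumptions, a unit-stride access starting at element 3 is conflict free, but it reads row 0 of bank 3 and row 1 of banks 0 to 2, so every bank needs a separately generated address. A stride of 2 starting at element 0 hits bank 0 twice and is not conflict free.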
An alternative to SIMD implementation on programmable
processor architectures is the hardwired processor.
Usually, it is used only when performance and low power
consumption are essential requirements [7, 14, 17]. In fact,
the lack of flexibility typical of hardwired processors reduces
their applicability to a narrow segment of the market, where
programmability is either not required or considerably
reduced.
3. SIMD ISA Description
In order to optimize the H.264 encoder, we chose three different
ISAs. The adopted architectures are ST240, xSTream, and
P2012, all developed by STMicroelectronics. The first is
a single-processor architecture, and the other two are multicore
platforms. In the following, the three architectures are
briefly described, giving special attention to the SIMD
instruction set.
We chose these architectures for their novelty and because
each comes with a complete toolchain (code generation,
simulation, profiling, etc.) for developing an application in
an optimal way. Each toolchain allowed complete observability
of the system, which made it possible to evaluate
the effectiveness of each of the authors' solutions. Observability
is a very important characteristic when developing or optimizing
an application: on a real system it is not always possible to
reach the degree of observability offered by a simulator
and a suitable toolchain. Moreover, in an architecture under
development such as P2012, we had the possibility to contribute
to the SIMD instruction set and, more importantly, to evaluate
the contribution of each particular SIMD instruction to the
performance of the target video codec application. The three
instruction sets present suitable characteristics for our research:
they are generic instruction sets, but ST240 includes a few
video-specific instructions; we can analyse the impact of different
vector register sizes; even if xSTream and P2012 share many
characteristics, only xSTream supports horizontal SIMD
(this is a special feature; e.g., other SIMD extensions such as
Intel SSE and ARM NEON do not have the same support); in
the P2012 platform, we were able to define and insert new SIMD
instructions.
Besides the type of instructions, the SIMD extensions
differ in both size and precision. These differences make it
possible to analyze the impact of different architectural
solutions on the global performance.
[Figure 1 placeholder: Source 1 holds bytes A to D and Source 2 holds
bytes E to H. The absubu.pb result holds the four byte-lane values
|D − H|, |C − G|, |B − F|, and |A − E|; the sadu.pb result is their sum.]
Figure 1: SAD operation.
3.1. ST240. The ST240 is a processor of the STMicroelectronics
ST200 family, based on the LX technology jointly developed with
Hewlett Packard [18, 19]. The main features of the ST240 are the
following:
(i) 4-issue Very Long Instruction Word (VLIW)
(ii) 64 32-bit general-purpose registers
(iii) 32KB D-Cache and 32KB I-Cache
(iv) 450 MHz clock frequency
(v) 8-bit/16-bit arithmetic SIMD.
In the H.264 encoder SIMD optimization, the most significant
instructions of the ST240 ISA are the following: add.ph and
sub.ph, which perform packed 16-bit addition and subtraction,
respectively; perm.pb, which performs byte permutations; and
muladdus.pb, which multiplies an unsigned byte by a signed byte
in each of the byte lanes and then sums across the four
lanes to produce a single result. Furthermore, several data
manipulation instructions are defined: pack.pb packs 16-bit
values into byte elements, ignoring the upper half; shuffeve.pb
and shuffodd.pb perform 8-bit shuffles of the even and odd
lanes, respectively. Two averaging operations (avg4u.pb and
avgu.pb) are also defined in the instruction set.
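The lane-wise behaviour described above can be sketched as scalar C code. This is only an emulation under stated assumptions, not the ST240 intrinsics or their exact semantics: the function names are hypothetical, the packed 16-bit addition is assumed to wrap around (the text does not say whether it saturates), and the first operand of the multiply-accumulate is assumed to be the unsigned one.

```c
#include <stdint.h>

/* Packed 16-bit addition on two lanes of a 32-bit word, in the spirit of
 * add.ph. ASSUMPTION: each lane wraps around modulo 2^16. */
uint32_t add_16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;         /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16; /* high lane, carry shifted out */
    return hi | lo;
}

/* Multiply unsigned bytes of `u` by signed bytes of `s` lane by lane and sum
 * the four products, in the spirit of muladdus.pb.
 * ASSUMPTION: `u` carries the unsigned operand and `s` the signed one. */
int32_t muladd_u8s8x4(uint32_t u, uint32_t s)
{
    int32_t acc = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t x = (int32_t)((u >> (8 * lane)) & 0xFFu);
        int32_t raw = (int32_t)((s >> (8 * lane)) & 0xFFu);
        int32_t y = (raw & 0x80) ? raw - 256 : raw; /* sign-extend the byte */
        acc += x * y;
    }
    return acc;
}
```

For example, under these assumptions `muladd_u8s8x4(0x01020304u, 0xFF010203u)` accumulates 4·3 + 3·2 + 2·1 + 1·(−1) = 19, which is the kind of four-tap dot product used in interpolation filters.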
One important operation in video-coding algorithms,
the absolute value of the difference, abs(a - b), can be
performed with the absubu.pb instruction (Figure 1), which
works on each byte lane (treating each byte lane as an
unsigned value) and returns the result in the corresponding
byte lane of the destination register. The sadu.pb instruction
(Figure 1) performs the same operation, then sums the values
of the byte lanes and returns the result.
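The two operations of Figure 1 can be emulated in portable C as follows. This is a minimal sketch of the behaviour described in the text, assuming a 32-bit word with four unsigned byte lanes; the function names are hypothetical and do not correspond to actual ST240 compiler intrinsics.

```c
#include <stdint.h>

/* Per-lane absolute difference of four unsigned bytes, in the spirit of
 * absubu.pb: lane i of the result holds |a_i - b_i|. */
uint32_t absdiff_u8x4(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t d = (x > y) ? (x - y) : (y - x); /* never exceeds 255 */
        result |= d << (8 * lane);
    }
    return result;
}

/* Sum of absolute differences over the four byte lanes, in the spirit of
 * sadu.pb: the same per-lane operation followed by a horizontal sum. */
uint32_t sad_u8x4(uint32_t a, uint32_t b)
{
    uint32_t d = absdiff_u8x4(a, b);
    return (d & 0xFFu) + ((d >> 8) & 0xFFu) + ((d >> 16) & 0xFFu) + (d >> 24);
}
```

A 4x4 block SAD, as used in motion estimation, would then be the sum of four calls to `sad_u8x4`, one per row of packed pixels. On the real processor each call corresponds to a single instruction, whereas the scalar loop above needs several operations per lane.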
3.2. xSTream. xSTream is a multiprocessor dataflow architecture
for high-performance embedded multimedia streaming
applications designed at STMicroelectronics [20, 21].
xSTream consists of a parallel distributed and
shared memory architecture. It is an array of processing
elements connected by a Network on Chip (NoC) with
specific hardware for the management of communication [22],
as depicted in Figure 2.