Efficient SIMD Acceleration of DCT and IDCT for High Efficiency Video Coding
Lingyu Li, Xiaoyun Zhang, Zhiyong Gao
Institute of Image Communication and Network Engineering
Shanghai Jiao Tong University
Shanghai, China
{lilingyu, xiaoyun.zhang, zhiyong.gao}@sjtu.edu.cn
Abstract—The promising High Efficiency Video Coding
(HEVC) standard aims at much higher coding efficiency, but
at the cost of greatly increased computational complexity.
Among all the coding modules, DCT and IDCT are called
frequently and account for a large share of this complexity.
Meanwhile, the Single Instruction Multiple Data (SIMD)
technique has been widely used to speed up media data
processing. In this paper, we focus on SIMD acceleration of
the HEVC DCT and IDCT, implemented on a 64-bit multicore
platform. For DCT optimization, intermediate variables are
processed at a reduced bit width, which significantly raises
the level of parallel processing with little loss of
compression efficiency. Experimental results on a 64-bit
Tilera multicore platform show that the proposed SIMD
implementation reduces the computational complexity of DCT
and IDCT by about 40%-70% with negligible loss of
compression performance.
Keywords- SIMD; DCT; IDCT; HEVC; Tilera
I. INTRODUCTION
The Joint Collaborative Team on Video Coding (JCT-VC)
has published the next generation video coding standard
referred to as High Efficiency Video Coding (HEVC).
HEVC is expected to roughly double the compression
efficiency of the current standard, H.264. However, this high
compression efficiency is achieved at the cost of much higher
computational complexity, which has become a serious
problem for real-time video codecs [1].
In a video codec, the Discrete Cosine Transform (DCT) plays
a vital role in compression. The transform tool in
HEVC is far more complicated than that of the H.264 standard.
HEVC supports transform sizes ranging from 4x4
to 32x32. In addition, for the 4x4 intra TU
(transform unit) of the luma component, an approximation to the
discrete sine transform (DST) is applied [2]. Compared with
H.264, the DCT coefficients are more complicated, so the
transform requires multiplications rather than only shifts and
additions. Moreover, the intermediate variables need a larger
bit width, which reduces the parallel processing level.
In recent years, most modern processors have provided media
instructions to improve computational performance.
Single instruction multiple data (SIMD) is a very efficient
tool for processing media data. By exploiting data-level
parallelism, SIMD technologies provide a series of effective
approaches for fast algorithm implementation. On the Intel
platform, MMX/SSE technology is a typical example of
SIMD [10]. On the popular Intel and ARM platforms, there are
some SIMD-based algorithms for HEVC [5-6]. The
TILE-Gx36 is a 36-core system-on-chip processor of the Tilera
family [12]. Each of the 36 processor cores is a full-fledged
64-bit processor. The processor instruction set architecture
(ISA) includes a rich set of SIMD instructions. Because of the
platform's low power consumption and effective parallel
processing ability, some HEVC codec implementation work has
been done on the Tilera platform [7-9].
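To make the link between bit width and parallelism concrete, the following is a minimal sketch in portable C. It is not Tilera code and uses no Tilera intrinsics: it emulates a four-lane 16-bit addition inside one 64-bit word with plain integer operations. All function names (lanes_per_register, pack4x16, lane16, add4x16) are hypothetical helpers introduced for illustration only. The sketch shows that a 64-bit register holds four 16-bit lanes but only two 32-bit lanes, which is why narrower intermediate variables allow more values to be processed per instruction.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_BITS = 64 };  /* width of one general-purpose register */

/* Number of lanes of the given width that fit in one 64-bit register. */
static int lanes_per_register(int lane_bits)
{
    assert(lane_bits > 0 && REG_BITS % lane_bits == 0);
    return REG_BITS / lane_bits;
}

/* Pack four 16-bit values into one 64-bit word; a0 is the lowest lane. */
static uint64_t pack4x16(uint16_t a0, uint16_t a1, uint16_t a2, uint16_t a3)
{
    return (uint64_t)a0 | ((uint64_t)a1 << 16) |
           ((uint64_t)a2 << 32) | ((uint64_t)a3 << 48);
}

/* Extract one 16-bit lane (0..3) from a packed word. */
static uint16_t lane16(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Lane-wise 16-bit addition with wraparound, emulated in software.
 * The top bit of every lane is masked off before the add so that no
 * carry can cross a lane boundary, then restored with an XOR. */
static uint64_t add4x16(uint64_t a, uint64_t b)
{
    const uint64_t msb = 0x8000800080008000ULL;
    uint64_t low = (a & ~msb) + (b & ~msb);
    return low ^ ((a ^ b) & msb);
}
```

A hardware SIMD instruction performs the same lane-wise operation in a single step; the masking above is only needed because this sketch uses ordinary scalar arithmetic.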
We focus our work on the SIMD acceleration of the HEVC
DCT and IDCT modules on the Tilera multicore platform. The
rest of this paper is organized as follows. Section II
introduces the HEVC DCT and IDCT. Section III
presents our proposed SIMD acceleration method for the HEVC
DCT and IDCT on the Tilera platform. Section IV provides the
acceleration results. Finally, Section V concludes the paper.
II. INTRODUCTION OF HEVC DCT AND IDCT
As in previous video coding standards, the DCT and
IDCT modules in HEVC transform the prediction
residuals to remove spatial redundancy. The two-dimensional
DCT and IDCT are computed by applying one-dimensional
transforms in the horizontal and vertical directions. The
HEVC DCT and IDCT support four TB (transform block)
sizes: 4x4, 8x8, 16x16 and 32x32.
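The row-column decomposition described above can be sketched in scalar C for the smallest block size. This is an illustrative reference, not the optimized implementation proposed in this paper. It assumes the 4x4 HEVC core transform matrix and the stage shifts conventionally used for 8-bit video (1 after the first pass, 8 after the second); these values are not stated in the text above and are labeled as assumptions in the code. The names kDct4, dct4_pass and dct4x4_2d are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4x4 HEVC core transform matrix (integer DCT approximation). */
static const int16_t kDct4[4][4] = {
    {64,  64,  64,  64},
    {83,  36, -36, -83},
    {64, -64, -64,  64},
    {36, -83,  83, -36},
};

/* One 1-D pass over a row-major 4x4 block: every input row is
 * multiplied by the transform matrix, rounded, shifted, and written
 * transposed. Applying the pass twice therefore yields M * X * M^T.
 * Note: relies on arithmetic right shift for negative sums, as
 * provided by mainstream compilers. */
static void dct4_pass(const int16_t *src, int16_t *dst, int shift)
{
    const int32_t round = 1 << (shift - 1);
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < 4; k++) {
            int32_t sum = 0;
            for (int j = 0; j < 4; j++)
                sum += kDct4[k][j] * src[i * 4 + j];
            dst[k * 4 + i] = (int16_t)((sum + round) >> shift);
        }
    }
}

/* 2-D forward transform of a 4x4 residual block as two 1-D passes.
 * The shifts 1 and 8 are assumptions for 8-bit video. */
static void dct4x4_2d(const int16_t *residual, int16_t *coeff)
{
    int16_t tmp[16];
    assert(residual != NULL && coeff != NULL);
    dct4_pass(residual, tmp, 1);  /* first 1-D stage  */
    dct4_pass(tmp, coeff, 8);     /* second 1-D stage */
}
```

For a flat residual block, all energy should land in the top-left (DC) coefficient and every other coefficient should be zero, which gives a quick sanity check for the two-pass structure.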
Figure 1. Coefficient matrix of 16x16 DCT
In H.264, the one-dimensional transform can be
implemented by matrix multiplication and dot product. First,
the product of the input matrix and the kernel
transform matrix is calculated. The coefficients of the kernel
transform matrix are +1, -1, +2 and -2, so the
matrix multiplication can be computed using only
addition and shift operations. Then, compute the dot product
of the kernel transform result and a scaling matrix and this