HEVC DECODER ACCELERATION ON MULTI-CORE X86 PLATFORM
Bingjie Han, Ronggang Wang, Zhenyu Wang, Shengfu Dong, Wenmin Wang, Wen Gao
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School
ABSTRACT
In this paper, we propose a hybrid parallel decoding strategy
for HEVC which combines task-level parallelism and data-
level parallelism based on CTUs. The data-level parallelism
makes the execution time distribution of different decoding
stages more balanced, and makes the task-level parallelism
more efficient. Our approach does not require bit streams to
be generated with optional parallel coding tools such as
tiles or WPP, so it can be applied to all kinds of HEVC bit
streams. Furthermore, SSE, a typical SIMD instruction set on
the X86 platform, is utilized to accelerate time-consuming
modules, which shortens the execution time gaps between
different stages and makes them better suited to parallel
processing. We have implemented these acceleration
strategies in the HM-10.0 decoder and achieved a substantial
speed-up ratio.
Index Terms— HEVC, video decoder, parallel
processing, SIMD
1. INTRODUCTION
High Efficiency Video Coding (HEVC) is the latest joint
video coding standardization project of the Joint
Collaborative Team on Video Coding (JCT-VC), which was
established by the ITU-T Video Coding Experts Group and the
ISO/IEC Moving Picture Experts Group. The first edition of
the HEVC standard was finalized in January 2013, and it
achieves about 50% lower bit rate than H.264/AVC for the
same subjective quality [1]. The HEVC Test Model (HM)
decoder is an example implementation following the HEVC
decoding standard. Aimed at correctness, completeness and
readability, it does not use any parallelization techniques.
Nowadays a PC commonly has a dual-core or quad-core CPU
that also supports Simultaneous Multithreading (SMT), so a
suitable parallel decoding strategy is expected to achieve
a significant performance improvement on PCs. In addition,
since Intel introduced the Streaming SIMD Extensions (SSE)
on the Pentium III, SIMD instructions have been widely
supported on PCs.
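As a rough illustration of what SSE-style vectorization looks like in a decoder, the sketch below reconstructs 8-bit pixels as clip(prediction + residual), eight pixels per iteration. This section does not say which modules were vectorized, so the choice of operation and the names reconstruct_scalar and reconstruct_sse2 are assumptions made purely for illustration. Only SSE2 intrinsics from <emmintrin.h> are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: recon = clip(pred + resid, 0, 255).
void reconstruct_scalar(const std::uint8_t* pred, const std::int16_t* resid,
                        std::uint8_t* recon, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int v = pred[i] + resid[i];
        recon[i] = static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

// SSE2 version (hypothetical helper): 8 pixels per iteration.
// n must be a multiple of 8.
void reconstruct_sse2(const std::uint8_t* pred, const std::int16_t* resid,
                      std::uint8_t* recon, std::size_t n) {
    assert(n % 8 == 0);
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 8) {
        // Load 8 prediction bytes and widen them to 16-bit lanes.
        __m128i p8 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(pred + i));
        __m128i p16 = _mm_unpacklo_epi8(p8, zero);
        // Load 8 signed 16-bit residuals (unaligned load).
        __m128i r16 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(resid + i));
        // Saturating 16-bit add, then pack to unsigned 8-bit with clipping to [0, 255].
        __m128i sum = _mm_adds_epi16(p16, r16);
        __m128i out = _mm_packus_epi16(sum, sum);
        // Store the low 8 bytes of the packed result.
        _mm_storel_epi64(reinterpret_cast<__m128i*>(recon + i), out);
    }
}
```

The clipping comes for free from the saturating pack instruction, which is one reason such pixel-wise operations map well onto SIMD.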
Parallel decoding strategies can be classified into two
categories: task-level parallelism and data-level parallelism.
Task-level parallelism divides a decoder into several
sub-tasks and assigns each sub-task to a separate thread. To
maximize the degree of parallelism, the execution times of
the sub-tasks should be as close as possible. A task-level
parallelism strategy proposed in [2] narrows the execution
time gap between sub-tasks by adjusting the size of the
blocks that each sub-task processes, but it does not resolve
the problem that the second sub-task always consumes more
time than the other sub-tasks. Data-level parallelism
processes multiple data units in parallel by assigning each
data unit to a separate thread. The data unit can be a group
of pictures (GOP), frame, slice, slice segment, tile, coding
tree unit (CTU) and so on. The granularities of GOP and frame
are so large that parallelism based on them will lead to a
long delay. Slices, slice segments and tiles may be suitable
parallelism granularities, but their boundaries break the
continuity of context models in entropy decoding and may
also cut prediction dependencies, which decreases coding
efficiency. Besides slice segments and tiles,
Wavefront Parallel Processing (WPP) is also adopted in
HEVC, and it achieves a better balance between parallel
granularity and coding performance loss than slice- and tile-
level parallelism. Several approaches have been proposed to
decode HEVC bit streams in parallel. For example, [3]
proposes a parallelization strategy based on entropy slices,
which are similar to slice segments, and [4] introduces a
parallelization approach called Overlapped Wavefront based
on WPP. However, these approaches can only be applied to
bit streams encoded with the corresponding parallel coding
tools. Other approaches utilize data-level parallelism
based on self-defined blocks. In [5], a data-level
parallelism strategy based on inverted Z-shaped blocks is
used in H.264/AVC decoders. This method simplifies
dependencies between threads, but it does not make full use
of the parallelism between different stages.
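The requirement that sub-task execution times be close can be made concrete with a simple timing model. The sketch below assumes an idealized linear pipeline with one thread per stage, a fixed time per block in each stage, and buffers that never fill up. The function names and the stage times used in the note that follows are hypothetical and are not measurements from this paper.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Time to decode n_blocks with all stages run back to back on one thread.
double sequential_time(const std::vector<double>& stage_time, int n_blocks) {
    assert(!stage_time.empty() && n_blocks > 0);
    double per_block = std::accumulate(stage_time.begin(), stage_time.end(), 0.0);
    return n_blocks * per_block;
}

// Time to push n_blocks through the idealized pipeline: the first block
// passes through every stage, after which one block completes every
// `bottleneck` time units.
double pipelined_time(const std::vector<double>& stage_time, int n_blocks) {
    assert(!stage_time.empty() && n_blocks > 0);
    double fill = std::accumulate(stage_time.begin(), stage_time.end(), 0.0);
    double bottleneck = *std::max_element(stage_time.begin(), stage_time.end());
    return fill + (n_blocks - 1) * bottleneck;
}

// Speed-up of the pipeline over the single-threaded decoder.
double pipeline_speedup(const std::vector<double>& stage_time, int n_blocks) {
    return sequential_time(stage_time, n_blocks) /
           pipelined_time(stage_time, n_blocks);
}
```

Under this model, three stages taking 1, 4 and 1 time units per block give a speed-up of about 1.5 over 1000 blocks, whereas three balanced stages of 2 units each give about 3.0 for the same total work. The slowest stage sets the throughput, which is why a persistently slow sub-task limits pure task-level parallelism.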
In this paper, we propose a new parallel decoding
strategy for HEVC which combines task-level parallelism
and data-level parallelism based on CTUs. Data-level
parallelism brings the execution times of different stages
closer together, and task-level parallelism makes full use of
the parallelism between stages. This strategy can be
applied to all kinds of HEVC bit streams without any
constraint on coding tools. In addition, SSE optimization is
applied to time-consuming modules, which further balances
the execution time of the different stages.
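The general shape of such a hybrid scheme can be sketched as a two-stage pipeline in which a serial first stage feeds CTU indices to several worker threads in the second stage. This is a minimal sketch under strong assumptions, not the decoder described in this paper: the stage split, the CtuQueue class and the decode_frame function are hypothetical, the per-CTU work is a placeholder, and dependencies between neighbouring CTUs are ignored entirely.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal blocking queue connecting the two pipeline stages (hypothetical).
class CtuQueue {
public:
    void push(int ctu) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(ctu); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns false once the queue
    // has been closed and drained.
    bool pop(int& ctu) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        ctu = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool closed_ = false;
};

// Task-level parallelism: stage 1 and stage 2 run on different threads.
// Data-level parallelism: stage 2 is spread over num_workers threads,
// each handling whole CTUs.
std::vector<int> decode_frame(int num_ctus, int num_workers) {
    assert(num_ctus >= 0 && num_workers > 0);
    CtuQueue queue;
    std::vector<int> recon(num_ctus, 0);  // one slot per CTU, written once

    std::thread parser([&] {
        // Placeholder for a serial stage such as entropy decoding.
        for (int ctu = 0; ctu < num_ctus; ++ctu) queue.push(ctu);
        queue.close();
    });

    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            int ctu;
            // Placeholder for the heavier per-CTU stage.
            while (queue.pop(ctu)) recon[ctu] = ctu * 2 + 1;
        });
    }

    parser.join();
    for (auto& t : workers) t.join();
    return recon;
}
```

Calling decode_frame(64, 3) processes every CTU exactly once regardless of which worker picks it up. Giving the slower stage more workers is the mechanism by which data-level parallelism can even out stage times, although a real decoder must also honour the dependencies that this sketch leaves out.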
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
978-1-4799-2893-4/14/$31.00 ©2014 IEEE