1051-8215 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2018.2826074, IEEE
Transactions on Circuits and Systems for Video Technology
IEEE T-CSVT Draft
the limited on-chip memory inside the TMS320C6678 DSP, only
one CTU can be processed at a time. Assuming that there are
N CTUs in each frame, the four encoding steps are as follows.
First, due to the spatial-temporal referencing mechanism,
the compression process of each CTU requires the encoding
frame, the reference frame, and the reconstructed frame.
Consequently, 3N video data transmissions and 3N CTU info
transmissions are involved in the compression process of each
frame. The 3N video data transmissions include: 1) 2N video
data transmissions of the encoding frame and the reference
frame from DDR3 to L2 memory; 2) N video data transmissions
of the reconstructed frame from L2 memory to DDR3. The 3N
CTU info transmissions are: 1) 2N info transmissions for the
CTU info of the current frame and the reference frame from
DDR3 to L2 memory; 2) N info transmissions for the CTU info
of the current frame from L2 memory to DDR3.
Second, the deblocking filter is used to reduce visible
artifacts at block boundaries. All the vertical block boundaries
are filtered by the horizontal filter first, and then the
horizontal block boundaries are filtered by the vertical filter.
In total, 4N video data transmissions and 2N CTU info
transmissions are involved in the deblocking filter of each
frame: 2N video data transmissions and N CTU info
transmissions between DDR3 and L2 are necessary for the
horizontal filter, while the other 2N video data transmissions
and N CTU info transmissions are needed by the vertical filter.
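The deblocking transfer counts above can be reproduced with a small counting sketch. It assumes, as an illustration only, that each pass loads one block of CTU samples and one block of CTU info from DDR3 to L2 and stores the filtered samples back to DDR3; the text gives only the totals per pass, not this split. The function name `count_deblocking_transfers` is hypothetical and the filtering itself is omitted.

```python
from collections import Counter


def count_deblocking_transfers(num_ctus):
    """Count DDR3 <-> L2 transfers for two-pass deblocking, one CTU at a time.

    Assumption (not stated in the text): per CTU and per pass there is one
    video load, one CTU info load, and one video store.
    """
    counts = Counter()
    # Pass 1: horizontal filter on vertical block boundaries.
    # Pass 2: vertical filter on horizontal block boundaries.
    for _pass_name in ("horizontal_filter", "vertical_filter"):
        for _ctu in range(num_ctus):
            counts["video"] += 1  # load CTU samples, DDR3 -> L2
            counts["info"] += 1   # load CTU info, DDR3 -> L2
            # ... filtering inside L2 would happen here (omitted) ...
            counts["video"] += 1  # store filtered samples, L2 -> DDR3
    return counts


if __name__ == "__main__":
    n = 100  # arbitrary example value for N
    result = count_deblocking_transfers(n)
    print(result["video"] // n, result["info"] // n)
```

Under these assumptions the counts come out to 4N video data transmissions and 2N CTU info transmissions per frame, matching the totals stated for the deblocking step.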
Third, HEVC employs SAO to reduce the difference
between the encoding frame and the reconstructed frame. 3N
video data transmissions and N CTU info transmissions are
involved in the SAO process of each frame. The 3N video
data transmissions are: 1) 2N video data transmissions of the
encoding frame and the reconstructed frame from DDR3 to L2
memory, which are the inputs to SAO; 2) N video data
transmissions of the reconstructed frame from L2 memory to
DDR3, which are the outputs of SAO. In addition, N CTU info
transmissions from DDR3 to L2 memory are required for the
SAO of the encoding frame.
[Figure 2 labels: Core0 (Master); Core1 to Core7 (Slaver); DDR3; Shared Memory (4MB); Video Data; Control Command; Encoder Configuration; Coding Info.; Shared Data; Reconstructed Video; Feedback Info.; Bitstream.]
Fig. 2. Diagram of TI TMS320C6678-based HEVC encoder.
Lastly, the CTU info of the current frame is transferred into
the TMS320C6678 DSP to generate the bitstream with entropy
coding, which requires N CTU info transmissions.
Each of the above four steps involves a large amount of
data transmission between internal and external memory, and
these data transmissions have a tremendous impact on the
encoding speed on the DSP. In addition to the amount of data,
other aspects of data transmission, such as its frequency, its
mode, and its parallelism, are also very important for
improving the encoding speed. These problems will be
carefully analyzed and addressed later.
In all, there are 10N video data transmissions and 7N CTU
info transmissions in the encoding process of each frame,
which is composed of N CTUs. This means that 17 data
transmissions are required to encode each CTU. For the
optimization of the HEVC encoder on DSP, Problem 1 can be
summarized as follows: The frequency of data transmission
between the internal and external memory is too high, since
the internal memory of the hardware is limited.
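The totals above follow directly from summing the per-step counts, which the following sketch tallies. The step coefficients are taken from the four steps described in the text; the 64x64 CTU size used to turn a 1080p frame into a value of N is an assumption for illustration and is not stated in this section. The names `STEPS`, `transfers_per_ctu`, and `ctus_per_frame` are hypothetical.

```python
import math

# Per-frame transfer counts in units of N (CTUs per frame),
# as enumerated in the four encoding steps.
STEPS = {
    "compress":   {"video": 3, "info": 3},
    "deblocking": {"video": 4, "info": 2},
    "sao":        {"video": 3, "info": 1},
    "entropy":    {"video": 0, "info": 1},
}


def transfers_per_ctu(steps=STEPS):
    """Return (video, info, total) transfers needed per CTU."""
    video = sum(step["video"] for step in steps.values())
    info = sum(step["info"] for step in steps.values())
    return video, info, video + info


def ctus_per_frame(width, height, ctu_size=64):
    """Number of CTUs covering a frame; ctu_size=64 is an assumption."""
    return math.ceil(width / ctu_size) * math.ceil(height / ctu_size)


if __name__ == "__main__":
    video, info, total = transfers_per_ctu()
    n = ctus_per_frame(1920, 1080)
    print(video, info, total)  # coefficients of N: 10, 7, 17
    print(n, total * n)        # N and transfers per frame for 1080p
```

With the assumed 64x64 CTUs, a 1920x1080 frame is covered by 30 x 17 = 510 CTUs, so the 17 transmissions per CTU amount to 8670 transmissions per frame.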
Furthermore, Table I gives the detailed time consumption of
reading/writing DDR3 memory on TMS320C6678. It can be
seen that data access efficiency differs dramatically among
the access methods, namely direct access, memcpy, and EDMA.
It is therefore critical to select a suitable data transmission
method for the optimization of the HEVC encoder on DSP.
Accordingly, the second key problem can be summarized as
Problem 2: The time consumption of data transmission
between the internal and external memory is significant and
cannot be neglected.
In order to identify the most time-consuming modules in the
DSP-based HEVC encoder, we analyze the execution time of
our HM10.0 [43]-based embedded HEVC encoder on the TI
TMS320C6678 DSP. The first 100 frames of the video
sequences in Class B (1080p) are encoded with a QP of 32
under the Low Delay P main configuration. The rest of the
configurations are kept unchanged [44]. Table II lists the
average time-consumption percentage of the key HEVC
encoding modules on TMS320C6678. According to the result,
encoding modules such as interpolation, transform,
quantization, and SAO account for the vast majority of the
encoding complexity. Therefore, in order to speed up HEVC
encoding and achieve real-time coding, Problem 3 can be
summarized as: The calculation of these key modules in the
HEVC standard is serialized, which is too time-consuming for
real-time systems.
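To give a rough sense of why serialized key modules limit the achievable speed, the sketch below applies Amdahl's law, which is not part of the paper's analysis and is used here only as a generic bound. The parallel fraction of 0.8 is a hypothetical value, not a figure from Table II; the worker count of 7 corresponds to the seven slave cores shown in Fig. 2.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when a fraction of the work runs on `workers` cores.

    parallel_fraction: share of execution time that can be parallelized (0..1).
    workers: number of cores sharing the parallel part (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical: 80% of the encoding time is parallelizable over 7 cores.
    print(round(amdahl_speedup(0.8, 7), 2))
```

Under this hypothetical split the bound is about 3.18x, well below 7x, which illustrates why the remaining serial part and the data transmission overhead also have to be addressed.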
In summary, there are three major problems in developing
a highly parallel, low-cost, fast HEVC encoding solution
based on the TI Keystone multicore DSP, each of which will be
carefully studied and addressed in the design and
implementation. More specifically, to address Problem 1, a
hardware-friendly memory-efficient encoding structure will be
developed in
TABLE I. DATA ACCESS EFFICIENCY (μs) ON TMS320C6678.