PRE-PUBLICATION DRAFT, TO APPEAR IN IEEE TRANS. ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, DEC. 2012 3
This design strategy does not easily extend to larger transform
sizes, such as 16- and 32-point. HEVC thus takes a different
approach and simply defines transforms (of size 4×4, 8×8,
16×16, and 32×32) as straightforward fixed-point matrix
multiplications. The matrix multiplications for the vertical and
horizontal component of the inverse transform are shown in
(1) and (2), respectively.
Y = s(C^T · T)    (1)

R = Y^T · T    (2)
where s() is a scaling and saturating function that guarantees
that values of Y can be represented using 16 bits. Each factor
in the transform matrix T is represented using signed 8-
bit numbers. Operations are defined such that 16-bit signed
coefficients C are multiplied with the factors and hence greater
than 16-bit accumulation is required. As the transforms are
integer approximations of a discrete cosine transform (DCT),
they retain the symmetry properties thereof, thereby enabling
a “partial butterfly” implementation. For the 4-point transform,
an alternative transform approximating a discrete sine trans-
form (DST) is also defined.
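As an illustration, one column of the 4-point inverse transform can be sketched in C using the partial-butterfly structure; the matrix entries (64, 83, 36) and the first-stage shift of 7 follow the HEVC 4-point DCT approximation, but the function names and layout here are ours:

```c
#include <stdint.h>

/* Clip to the signed 16-bit range: plays the role of the scaling and
   saturating function s() in (1). */
static int16_t clip16(int v) {
    return (int16_t)(v < -32768 ? -32768 : v > 32767 ? 32767 : v);
}

/* One column of the 4-point inverse transform, written as a "partial
   butterfly": the even/odd decomposition below exploits the symmetry
   of the DCT approximation. */
static void inv_transform4_col(const int16_t src[4], int16_t dst[4]) {
    const int shift = 7, add = 1 << (shift - 1);
    int e0 = 64 * src[0] + 64 * src[2];   /* even part */
    int e1 = 64 * src[0] - 64 * src[2];
    int o0 = 83 * src[1] + 36 * src[3];   /* odd part */
    int o1 = 36 * src[1] - 83 * src[3];
    /* the accumulators are wider than 16 bits, as required */
    dst[0] = clip16((e0 + o0 + add) >> shift);
    dst[1] = clip16((e1 + o1 + add) >> shift);
    dst[2] = clip16((e1 - o1 + add) >> shift);
    dst[3] = clip16((e0 - o0 + add) >> shift);
}
```

Only six distinct products are needed instead of the sixteen of a direct matrix multiplication; the same even/odd split recurses for the 8-, 16-, and 32-point cases.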
Although there has been some concern about the imple-
mentation complexity of the 32-point transform, data given in
[7] indicates 158 cycles for an 8×8 inverse transform, 861
cycles for a 16×16 inverse transform, and 4696 cycles for a
32×32 inverse transform on an Intel processor. Normalizing
these values by the associated block sizes yields 2.47,
3.36, and 4.59 cycles per sample, respectively. The time cost
per sample of a 32×32 inverse transform is thus less than
twice that of an 8×8 inverse transform. Furthermore, the cycle
count for larger transforms may often be reduced by taking
advantage of the fact that most high-frequency coefficients
are typically zero. Determining which bounding subblock of
coefficients is nonzero is facilitated by using a 4×4 coding
structure for the entropy coding of transform coefficients. The
bounding subblock may thus be determined at a reasonable
granularity (4×4) without having to consider the position of
each nonzero coefficient.
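A minimal sketch of how such a bounding region might be derived from per-subblock information is given below; the flag array and function names are hypothetical, not taken from the standard:

```c
#include <stdint.h>

/* Hypothetical helper: given per-4x4-subblock "coded" flags for an
   N x N transform block (as produced by the entropy decoder), compute
   the width and height, in samples, of the bounding region containing
   all nonzero coefficients, at 4x4 granularity.  n_sub is the number
   of subblocks per row/column (e.g., 8 for a 32x32 block). */
static void bounding_region(const uint8_t *coded_sub, int n_sub,
                            int *out_w, int *out_h) {
    int w = 0, h = 0;
    for (int y = 0; y < n_sub; y++)
        for (int x = 0; x < n_sub; x++)
            if (coded_sub[y * n_sub + x]) {
                if (x + 1 > w) w = x + 1;
                if (y + 1 > h) h = y + 1;
            }
    *out_w = 4 * w;
    *out_h = 4 * h;
}
```

The inverse transform then only needs to process the rows and columns that fall inside the returned region.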
It should also be noted that the transform order is changed
with respect to H.264/AVC. HEVC defines a column-row order
for the inverse transform. Due to the regular uniform structure
of the matrix multiplication and partial butterfly designs, this
approach may be preferred in both hardware and software. In
software it is preferable to transform rows, as one entire row
of coefficients may easily be held in registers (a row of thirty-
two 32-bit accumulators requires eight 128-bit registers, which
is feasible on several architectures without register
spilling). This property is not necessarily maintained with
more irregular but fully decomposed transform designs, which
look attractive in terms of primitive operation counts, but
require a greater number of registers and software operations
to implement. As can be seen from (1), applying the transpose
to the coefficients C allows implementations to transform
rows only. Note that the transpose can be integrated in the
inverse scan without adding complexity.
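The row-only structure implied by (1) and (2) can be sketched as follows; the row kernel, matrix, and shift value here are illustrative placeholders rather than the actual HEVC transform definitions:

```c
#include <stdint.h>

#define N 4  /* transform size; the same structure applies to 8, 16, 32 */

/* Illustrative row kernel: out = in_row · T, with rounding and a
   right-shift standing in for the scaling function s(). */
static void row_times_T(const int16_t in[N], int16_t out[N],
                        const int16_t T[N][N], int shift) {
    int add = 1 << (shift - 1);
    for (int j = 0; j < N; j++) {
        int acc = add;                 /* wider-than-16-bit accumulation */
        for (int k = 0; k < N; k++)
            acc += in[k] * T[k][j];
        out[j] = (int16_t)(acc >> shift);
    }
}

static void transpose(int16_t m[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            int16_t t = m[i][j]; m[i][j] = m[j][i]; m[j][i] = t;
        }
}

/* Column-row inverse transform built from the row kernel only; blk is
   assumed to hold C^T, i.e., the transpose has already been folded into
   the inverse scan as described in the text. */
static void inv_transform_2d(int16_t blk[N][N],
                             const int16_t T[N][N], int shift) {
    int16_t tmp[N];
    for (int i = 0; i < N; i++) {        /* Y = s(C^T · T) */
        row_times_T(blk[i], tmp, T, shift);
        for (int j = 0; j < N; j++) blk[i][j] = tmp[j];
    }
    transpose(blk);                      /* Y -> Y^T */
    for (int i = 0; i < N; i++) {        /* R = Y^T · T */
        row_times_T(blk[i], tmp, T, shift);
        for (int j = 0; j < N; j++) blk[i][j] = tmp[j];
    }
}
```

Both passes reuse the same row kernel, which is the property that makes an entire row easy to hold in registers.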
E. Entropy coding
Unlike the H.264/AVC specification, which features both the
CAVLC and CABAC [8] entropy coders, HEVC defines CABAC as
the single entropy coding method. CABAC incorporates three
stages: binarization of syntax elements, context modeling, and
binary arithmetic coding. While the acronym and the core
arithmetic coding engine remain the same as in H.264/AVC,
there are a number of differences in context modeling and
binarization as described below.
In the development of HEVC, a substantial amount of effort
has been devoted to reducing the number of contexts. While
version 1.0 of the HM featured in excess of 700 contexts,
version 8.0 has only 172. This number compares favorably to
H.264/AVC, where 299 contexts are used, assuming support
for frame coding in the 4:2:0 color format (Progressive High
profile). Of these 299 contexts, 237 are involved in residual
signal coding whereas HEVC uses 127 of the 172 for this
purpose. Comparing the 46% reduction for residual coding
with the 27% reduction for the remaining syntax elements
makes clear that most effort has been put into
reducing the number of contexts associated with the residual
syntax. This reduction in the number of contexts helps lower
the amount of memory required by the entropy decoder
and the cost of initializing the engine. Initialization values of
the states are defined with 8 bits per context, reduced from 16
in H.264/AVC, thereby further reducing memory requirements.
One widely used method for determining contexts in
H.264/AVC is to use spatial neighborhood relationships. For
example, the values above and to the left of the current block
may be used to derive a context. In HEVC such spatial dependencies
have been mostly avoided so as to reduce the number of
line buffers.
Substantial effort has also been devoted to enabling parallel
context processing, where a decoder has the ability
to derive multiple context indices in parallel. These tech-
niques apply mostly to transform coefficient coding, which
becomes the entropy decoding bottleneck at high bit rates.
One example is the modification of the significance map
coding. In H.264/AVC, two interleaved flags are used to signal
whether the current coefficient has a non-zero value
(significant_coeff_flag) and whether it is the last one in coding
order (last_significant_coeff_flag). This makes it impossible
to derive the significant_coeff_flag and last_significant_coeff_flag
contexts in parallel. HEVC breaks this dependency by ex-
plicitly signaling the horizontal and vertical offset of the last
significant coefficient in the current block before parsing the
significant_coeff_flags [9].
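The effect on parsing can be illustrated with a toy routine; the scan tables and function name below are hypothetical, intended only to show that the number of flags to parse, and hence their context indices, is known as soon as the last position has been decoded:

```c
#include <stdint.h>

/* Illustrative only: given the explicitly signaled (last_x, last_y)
   position of the last significant coefficient, return how many
   significant_coeff_flags precede it in scan order.  Because this
   count is known before any flag is parsed, the context indices for
   several flags can be derived in parallel; no interleaved
   last_significant_coeff_flag is needed.  scan_x/scan_y form a
   hypothetical scan table (scan position -> coordinates). */
static int flags_to_parse(const uint8_t scan_x[], const uint8_t scan_y[],
                          int n, int last_x, int last_y) {
    for (int i = 0; i < n; i++)
        if (scan_x[i] == last_x && scan_y[i] == last_y)
            return i;  /* flags are parsed for scan positions 0..i-1 */
    return 0;          /* not found: degenerate case in this sketch */
}
```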
The burden of entropy decoding with context modeling
grows with bit rate as more bins need to be processed. There-
fore, the bin strings of large syntax elements are divided into
a prefix and a suffix. All prefix bins are coded in regular mode
(i.e., using context modeling), whereas all suffix bins are coded
in a bypass mode. The cost of decoding a bin in bypass mode
is lower than in regular mode. Furthermore, the ratio of bins
to bits is fixed at 1:1 for bypass mode, whereas it is generally
higher for the regular mode. In H.264/AVC, motion vector
differences and transform coefficient levels are binarized using