1672 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012
differences from older standards is its increased flexibility for
inter coding. For the purpose of motion-compensated predic-
tion, an MB can be partitioned into square and rectangular
block shapes with sizes ranging from 4 × 4 to 16 × 16
luma samples. H.264/MPEG-4 AVC also supports multiple
reference pictures. Similarly to annex U of H.263, motion
vectors are associated with a reference picture index for
specifying the employed reference picture. The motion vectors
are transmitted using quarter-sample precision relative to the
luma sampling grid. Luma prediction values at half-sample
locations are generated using a 6-tap interpolation filter and
prediction values at quarter-sample locations are obtained by
averaging two values at integer- and half-sample positions.
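As a rough illustration, the half- and quarter-sample generation can be sketched in one dimension as follows. This is a simplified sketch: the standard applies the 6-tap filter separably in two dimensions with higher intermediate precision for certain positions, and the sample values and function names here are illustrative only.

```python
# Sketch of H.264-style luma sub-sample interpolation (1-D,
# illustrative; the standard operates separably in 2-D with
# intermediate precision and final clipping).

def half_sample(samples, i):
    """6-tap filter (1, -5, 20, 20, -5, 1) / 32 at the half-sample
    position between samples[i] and samples[i + 1]."""
    a, b, c, d, e, f = samples[i - 2:i + 4]
    val = (a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5
    return min(255, max(0, val))  # clip to the 8-bit sample range

def quarter_sample(samples, i):
    """Quarter-sample position: rounded average of the integer-sample
    value and the adjacent half-sample value."""
    return (samples[i] + half_sample(samples, i) + 1) >> 1

row = [10, 20, 40, 80, 120, 160, 200, 220]
print(half_sample(row, 3))     # lies between row[3] = 80 and row[4] = 120
print(quarter_sample(row, 3))  # lies between row[3] and the half sample
```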
Weighted prediction can be applied using a scaling and offset
for the prediction signal. For the chroma components, a
bilinear interpolation is applied. In general, motion vectors
are predicted by the component-wise median of the motion
vectors of three neighboring previously decoded blocks. For
16 × 8 and 8 × 16 blocks, the predictor is given by the motion
vector of a single already decoded neighboring block, where
the chosen neighboring block depends on the location of the
block inside an MB. In contrast to prior coding standards, the
concept of B pictures is generalized and the picture coding
type is decoupled from the coding order and the usage as a
reference picture. Instead of I, P, and B pictures, the standard
actually specifies I, P, and B slices. A picture can contain slices
of different types and a picture can be used as a reference
for inter prediction of subsequent pictures independently of
its slice coding types. This generalization allowed the usage
of prediction structures such as hierarchical B pictures [17]
that show improved coding efficiency compared to the IBBP
coding structure typically used for H.262/MPEG-2 Video.
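The component-wise median prediction of motion vectors described above can be sketched as follows; neighbor-availability rules and the 16 × 8 / 8 × 16 special cases are omitted, and the function names are illustrative.

```python
# Sketch of the component-wise median motion-vector predictor.
# Each vector component is predicted independently from the motion
# vectors of three neighboring, previously decoded blocks.

def median3(a, b, c):
    """Median of three integers."""
    return sorted((a, b, c))[1]

def mv_predictor(mv_left, mv_above, mv_above_right):
    """Predict the horizontal and vertical MV components separately."""
    return (median3(mv_left[0], mv_above[0], mv_above_right[0]),
            median3(mv_left[1], mv_above[1], mv_above_right[1]))

print(mv_predictor((4, -2), (6, 0), (3, 1)))  # (4, 0)
```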
H.264/MPEG-4 AVC also includes a modified design for
intra coding. While in previous standards some of the DCT
coefficients can be predicted from neighboring intra blocks, the
intra prediction in H.264/MPEG-4 AVC is done in the spatial
domain by referring to neighboring samples of previously
decoded blocks. The luma signal of an MB can be either
predicted as a single 16 × 16 block or it can be partitioned
into 4 × 4 or 8 × 8 blocks with each block being predicted
separately. For 4 × 4 and 8 × 8 blocks, nine prediction modes
specifying different prediction directions are supported. In the
intra 16 × 16 mode and for the chroma components, four intra
prediction modes are specified.
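Three of the nine 4 × 4 luma modes can be sketched as follows; the remaining directional modes, the reference-sample availability rules, and boundary smoothing are omitted, and the mode names and sample values are illustrative.

```python
# Sketch of spatial intra prediction for a 4x4 block from the row of
# reconstructed samples above it and the column of samples to its left.

def intra_4x4(mode, above, left):
    """above, left: lists of 4 neighboring reconstructed samples."""
    if mode == "vertical":    # each column copies the sample above it
        return [above[:] for _ in range(4)]
    if mode == "horizontal":  # each row copies the sample to its left
        return [[left[y]] * 4 for y in range(4)]
    if mode == "dc":          # flat prediction from the rounded mean
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("directional modes omitted in this sketch")

above = [100, 110, 120, 130]
left = [90, 95, 100, 105]
print(intra_4x4("dc", above, left)[0][0])  # rounded mean of the neighbors
```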
For transform coding, H.264/MPEG-4 AVC specifies a 4×4
and an 8×8 transform. While chroma blocks are always coded
using the 4 × 4 transform, the transform size for the luma
component can be selected on an MB basis. For intra MBs,
the transform size is coupled to the employed intra prediction
block size. An additional 2×2 Hadamard transform is applied
to the four DC coefficients of each chroma component. For the
intra 16×16 mode, a similar second-level Hadamard transform
is also applied to the 4 × 4 DC coefficients of the luma
signal. In contrast to previous standards, the inverse transforms
are specified by exact integer operations, so that, in error-
free environments, the reconstructed pictures in the encoder
and decoder are always exactly the same. The transform
coefficients are represented using a uniform reconstruction
quantizer, that is, without the extra-wide dead-zone that is
found in older standards. Similar to H.262/MPEG-2 Video and
MPEG-4 Visual, H.264/MPEG-4 AVC also supports the usage
of quantization weighting matrices. The transform coefficient
levels of a block are generally scanned in a zigzag fashion.
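The exact-integer property of the 4 × 4 core transform can be sketched as follows; the quantization scaling that the standard folds into the transform stages is omitted, so the coefficients here are unnormalized.

```python
# Sketch of the H.264/MPEG-4 AVC 4x4 core transform, an integer
# approximation of the DCT computed as Y = C * X * C^T.

C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    """4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

def transpose(m):
    return [list(r) for r in zip(*m)]

def core_transform(block):
    """Integer-only forward core transform (normalization omitted)."""
    return matmul(matmul(C, block), transpose(C))

flat = [[10] * 4 for _ in range(4)]      # a constant residual block
coeffs = core_transform(flat)
print(coeffs[0][0])  # only the DC coefficient is nonzero: 16 * 10 = 160
```

Because every step uses only integer additions, subtractions, and small multiplications, an encoder and a decoder that implement the specified inverse transform produce bit-identical reconstructions in error-free conditions.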
For entropy coding of all MB syntax elements, H.264/
MPEG-4 AVC specifies two methods. The first entropy coding
method, which is known as context-adaptive variable-length
coding (CAVLC), uses a single codeword set for all syntax
elements except the transform coefficient levels. The approach
for coding the transform coefficients basically uses the concept
of run-level coding as in prior standards. However, the effi-
ciency is improved by switching between VLC tables depend-
ing on the values of previously transmitted syntax elements.
The second entropy coding method specifies context-adaptive
binary arithmetic coding (CABAC) by which the coding
efficiency is improved relative to CAVLC. The statistics of
previously coded symbols are used for estimating conditional
probabilities for binary symbols, which are transmitted using
arithmetic coding. Inter-symbol dependencies are exploited
by switching between several estimated probability models
based on previously decoded symbols in neighboring blocks.
Similar to annex J of H.263, H.264/MPEG-4 AVC includes
a deblocking filter inside the motion compensation loop. The
strength of the filtering is adaptively controlled by the values
of several syntax elements.
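The adaptive control of the filter strength can be sketched as follows; this is simplified from the boundary-strength derivation of the standard (the block descriptors and threshold layout are illustrative).

```python
# Sketch of a deblocking boundary-strength decision: stronger filtering
# where blocking artifacts are likely (intra edges, coded residuals),
# none where the two sides are predicted consistently.

def boundary_strength(p, q, mb_edge):
    """p, q: dicts describing the two blocks adjacent to an edge."""
    if p["intra"] or q["intra"]:
        return 4 if mb_edge else 3      # strongest filtering at intra edges
    if p["nonzero_coeffs"] or q["nonzero_coeffs"]:
        return 2                        # coded residual on either side
    mv_differs = (abs(p["mv"][0] - q["mv"][0]) >= 4 or   # quarter-sample units
                  abs(p["mv"][1] - q["mv"][1]) >= 4 or
                  p["ref_idx"] != q["ref_idx"])
    return 1 if mv_differs else 0       # 0: the edge is not filtered

p = {"intra": False, "nonzero_coeffs": False, "mv": (8, 0), "ref_idx": 0}
q = {"intra": False, "nonzero_coeffs": False, "mv": (2, 0), "ref_idx": 0}
print(boundary_strength(p, q, mb_edge=True))  # MV difference -> strength 1
```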
The High profile (HP) of H.264/MPEG-4 AVC includes
all tools that contribute to the coding efficiency for 8-bit-per-
sample video in 4:2:0 format, and is used for the comparison
in this paper. Because of its limited benefit for typical video
test sequences and the difficulty of optimizing its parameters,
the weighted prediction feature is not applied in the testing.
E. HEVC (Draft 9 of October 2012)
High Efficiency Video Coding (HEVC) [4] is the name of
the joint standardization project of ITU-T VCEG and ISO/IEC
MPEG, under development in a collaboration known as the
Joint Collaborative Team on Video Coding (JCT-VC). The
standard is planned to be finalized in early 2013.
In the following, a brief overview of the main changes relative
to H.264/MPEG-4 AVC is provided. For a more detailed
description, the reader is referred to the overview in [2].
In HEVC, a picture is partitioned into coding tree blocks
(CTBs). The size of the CTBs can be chosen by the encoder
according to its architectural characteristics and the needs of
its application environment, which may impose limitations
such as encoder/decoder delay constraints and memory re-
quirements. A luma CTB covers a rectangular picture area of
N × N samples of the luma component, and the corresponding
chroma CTBs each cover (N/2) × (N/2) samples of one of
the two chroma components. The value of N is signaled inside
the bitstream, and can be 16, 32, or 64. The luma CTB and the
two chroma CTBs, together with the associated syntax, form a
coding tree unit (CTU). The CTU is the basic processing unit
of the standard to specify the decoding process (conceptually
corresponding to an MB in prior standards).
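The recursive subdivision of a CTB into smaller blocks can be sketched as follows; the split decisions, the function names, and the minimum block size used here are illustrative, not those mandated by the draft.

```python
# Sketch of quadtree partitioning of an N x N coding tree block into
# leaf blocks, in the style introduced by HEVC.

def partition(x, y, size, split_decider, min_size=8):
    """Recursively split a block into four quadrants while the
    (hypothetical) split_decider asks for a split; return leaf blocks
    as (x, y, size) tuples."""
    if size > min_size and split_decider(x, y, size):
        half = size // 2
        blocks = []
        for dy in (0, half):
            for dx in (0, half):
                blocks += partition(x + dx, y + dy, half,
                                    split_decider, min_size)
        return blocks
    return [(x, y, size)]

# Example: split a 64x64 CTB once, then split only its top-left quadrant.
decide = lambda x, y, size: size == 64 or (size == 32 and x == 0 and y == 0)
cbs = partition(0, 0, 64, decide)
print(len(cbs))  # 4 leaves of 16x16 plus 3 leaves of 32x32 -> 7 blocks
```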
The blocks specified as luma and chroma CTBs can be
further partitioned into multiple coding blocks (CBs). The