To appear in IEEE Transactions on Circuits and Systems for Video Technology, September 2007.
B. Video Coding Layer (VCL)
The VCL of H.264/AVC follows the so-called block-based
hybrid video coding approach. Although its basic design is
very similar to that of prior video coding standards such as
H.261, MPEG-1 Video, H.262 | MPEG-2 Video, H.263, or
MPEG-4 Visual, H.264/AVC includes new features that en-
able it to achieve a significant improvement in compression
efficiency relative to any prior video coding standard [14]. The
main difference from previous standards is the greatly increased
flexibility and adaptability of H.264/AVC.
The way pictures are partitioned into smaller coding units in
H.264/AVC, however, follows the rather traditional concept of
subdivision into macroblocks and slices. Each picture is partitioned
into macroblocks, each of which covers a rectangular picture
area of 16×16 luma samples and, in the case of video in 4:2:0
chroma sampling format, 8×8 samples of each of the two
chroma components. The samples of a macroblock are either
spatially or temporally predicted, and the resulting prediction
residual signal is represented using transform coding. The
macroblocks of a picture are organized in slices, each of which
can be parsed independently of other slices in a picture. De-
pending on the degree of freedom for generating the prediction
signal, H.264/AVC supports three basic slice coding types:
– I slice: intra-picture predictive coding using spatial
prediction from neighboring regions,
– P slice: intra-picture predictive coding and inter-picture
predictive coding with one prediction signal for each
predicted region,
– B slice: intra-picture predictive coding, inter-picture
predictive coding, and inter-picture bi-predictive cod-
ing with two prediction signals that are combined with a
weighted average to form the region prediction.
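The macroblock partitioning described above can be illustrated with a small calculation. The following is a minimal sketch, not part of the standard text: the function names are hypothetical, and it assumes that a picture whose dimensions are not multiples of 16 is covered by rounding the macroblock count up.

```python
MB_SIZE = 16  # luma samples per macroblock side, as stated in the text


def macroblock_grid(width, height):
    """Return (columns, rows) of macroblocks covering a width x height luma picture.

    Assumption: dimensions that are not multiples of 16 are rounded up.
    """
    cols = (width + MB_SIZE - 1) // MB_SIZE
    rows = (height + MB_SIZE - 1) // MB_SIZE
    return cols, rows


def samples_per_macroblock_420():
    """Samples in one 4:2:0 macroblock: 16x16 luma plus two 8x8 chroma blocks."""
    luma = MB_SIZE * MB_SIZE
    chroma = 2 * (MB_SIZE // 2) * (MB_SIZE // 2)
    return luma + chroma


cols, rows = macroblock_grid(352, 288)  # CIF resolution
print(cols, rows, cols * rows)          # 22 18 396
print(samples_per_macroblock_420())     # 384
```

For a 1920×1080 picture the same sketch yields a grid of 120×68 macroblocks, since 1080 is not a multiple of 16.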
For I slices, H.264/AVC provides several directional spatial
intra prediction modes, in which the prediction signal is gener-
ated by using neighboring samples of blocks that precede the
block to be predicted in coding order. For the luma component,
intra prediction is applied to 4×4, 8×8, or 16×16 blocks, whereas
for the chroma components, it is always applied on a macroblock
basis¹.
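To make the idea of spatial intra prediction from neighboring samples concrete, the sketch below builds a 4×4 prediction block in three simple ways. It is an illustration only: the mode names and function signature are hypothetical, it covers only a vertical, a horizontal, and a DC mode rather than the full set of directional modes, and it assumes that all neighboring samples are available and that the DC value is rounded to the nearest integer.

```python
def predict_4x4(mode, above, left):
    """Build a 4x4 prediction block from previously reconstructed neighbors.

    above: the 4 samples of the row directly above the block
    left:  the 4 samples of the column directly to the left of the block
    mode:  "vertical", "horizontal", or "dc" (hypothetical names)
    """
    if mode == "vertical":
        # Each column repeats the sample above it.
        return [list(above) for _ in range(4)]
    if mode == "horizontal":
        # Each row repeats the sample to its left.
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":
        # All samples take the rounded mean of the 8 neighbors (assumed rounding).
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unknown mode: {mode}")


above = [100, 102, 104, 106]
left = [98, 97, 96, 95]
print(predict_4x4("vertical", above, left)[3])    # [100, 102, 104, 106]
print(predict_4x4("horizontal", above, left)[1])  # [97, 97, 97, 97]
print(predict_4x4("dc", above, left)[0])          # [100, 100, 100, 100]
```

An encoder would typically compare the candidate predictions against the original block and signal the mode that gives the best trade-off; that selection step is outside this sketch.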
For P and B slices, H.264/AVC additionally permits vari-
able block size motion-compensated prediction with multiple
reference pictures [27]. The macroblock type signals the parti-
tioning of a macroblock into blocks of 16×16, 16×8, 8×16, or
8×8 luma samples. When a macroblock type specifies parti-
tioning into four 8×8 blocks, each of these so-called sub-
macroblocks can be further split into 8×4, 4×8, or 4×4 blocks,
which is indicated through the sub-macroblock type. For P
slices, one motion vector is transmitted for each block. In addition,
the reference picture used can be chosen independently
for each 16×16, 16×8, or 8×16 macroblock partition or 8×8
sub-macroblock. It is signaled via a reference index parameter,
which is an index into a list of reference pictures that is replicated
at the decoder.
¹ Some details of the profiles of H.264/AVC that were designed primarily
to serve the needs of professional application environments are neglected in
this description, particularly in relation to chroma processing and range of
step sizes.
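The partitioning rules for P slices imply how much motion data a single macroblock can carry. The sketch below counts motion vectors and reference indices for a given partitioning; the dictionary and function names are hypothetical, and the counts follow directly from the description above (one motion vector per block, one reference index per macroblock partition or 8×8 sub-macroblock).

```python
# Number of blocks produced by each macroblock and sub-macroblock type.
MB_PARTITIONS = {"16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4}
SUB_MB_PARTITIONS = {"8x8": 1, "8x4": 2, "4x8": 2, "4x4": 4}


def count_motion_data(mb_type, sub_mb_types=()):
    """Return (motion_vectors, reference_indices) for one P macroblock.

    sub_mb_types is only used when mb_type is "8x8" and must then list
    the type of each of the four sub-macroblocks.
    """
    if mb_type != "8x8":
        n = MB_PARTITIONS[mb_type]
        return n, n
    if len(sub_mb_types) != 4:
        raise ValueError("an 8x8 macroblock type needs four sub-macroblock types")
    motion_vectors = sum(SUB_MB_PARTITIONS[t] for t in sub_mb_types)
    # The reference picture is chosen per 8x8 sub-macroblock, not per sub-block.
    return motion_vectors, 4


print(count_motion_data("16x16"))                              # (1, 1)
print(count_motion_data("16x8"))                               # (2, 2)
print(count_motion_data("8x8", ["4x4"] * 4))                   # (16, 4)
print(count_motion_data("8x8", ["8x8", "8x4", "4x8", "4x4"]))  # (9, 4)
```

The finest partitioning therefore carries 16 motion vectors but only 4 reference indices per macroblock.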
In B slices, two distinct reference picture lists are utilized,
and for each 16×16, 16×8, or 8×16 macroblock partition or
8×8 sub-macroblock, the prediction method can be selected
between list 0, list 1, or bi-prediction. While list 0 and list 1
prediction refer to unidirectional prediction using a reference
picture of reference picture list 0 or 1, respectively, in the bi-
predictive mode, the prediction signal is formed by a weighted
sum of a list 0 and list 1 prediction signal. In addition, special
modes, such as the so-called direct modes in B slices and skip modes
in P and B slices, are provided, in which data such as motion vectors
and reference indices are derived from previously transmitted
information.
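The weighted sum used in the bi-predictive mode can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter names (`w0`, `w1`, `log2_denom`) are hypothetical, samples are assumed to be 8-bit, additive offsets are omitted, and the default arguments reduce the weighted sum to a rounded average of the two prediction signals.

```python
def bi_predict(pred0, pred1, w0=1, w1=1, log2_denom=1):
    """Combine list 0 and list 1 prediction samples by a weighted sum.

    pred0, pred1: equally long sequences of prediction samples
    w0, w1:       integer weights (hypothetical parameter names)
    log2_denom:   the weighted sum is divided by 2**log2_denom with rounding
    """
    rounding = (1 << log2_denom) >> 1
    combined = []
    for a, b in zip(pred0, pred1):
        value = (w0 * a + w1 * b + rounding) >> log2_denom
        combined.append(max(0, min(255, value)))  # clip to the assumed 8-bit range
    return combined


list0_pred = [100, 50, 200]
list1_pred = [110, 51, 100]
print(bi_predict(list0_pred, list1_pred))                              # [105, 51, 150]
print(bi_predict(list0_pred, list1_pred, w0=3, w1=1, log2_denom=2))    # [103, 50, 175]
```

The second call weights the list 0 signal three times as heavily as the list 1 signal, which illustrates how unequal weights shift the result toward one reference.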
For transform coding, H.264/AVC specifies a set of integer
transforms of different block sizes. While for intra macro-
blocks the transform size is directly coupled to the intra pre-
diction block size, the luma signal of motion-compensated
macroblocks that do not contain blocks smaller than 8×8 can
be coded by using either a 4×4 or 8×8 transform. For the
chroma components a two-stage transform, consisting of 4×4
transforms and a Hadamard transform of the resulting DC coefficients,
is employed¹. A similar hierarchical transform is also
used for the luma component of macroblocks coded in intra
16×16 mode. All inverse transforms are specified by exact
integer operations, so that inverse-transform mismatches are
avoided. H.264/AVC uses uniform reconstruction quantizers.
One of 52 quantization step sizes¹ can be selected for each
macroblock by the quantization parameter QP. The scaling
operations for the quantization step sizes are arranged with
logarithmic step size increments, such that an increment of the
QP by 6 corresponds to a doubling of quantization step size.
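The logarithmic relation between QP and the quantization step size can be written down compactly. The sketch below is an illustration only: the six base step sizes are the values commonly cited in the literature for QP 0 to 5 and are an assumption here, not taken from this text; the function name is hypothetical. Only the doubling every 6 QP steps and the range of 52 values come from the description above.

```python
# Commonly cited step sizes for QP 0..5 (assumption, not stated in this text).
BASE_STEPS = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def quant_step(qp):
    """Return the quantization step size for a QP in the range 0..51.

    The step size doubles for every increment of QP by 6.
    """
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_STEPS[qp % 6] * (1 << (qp // 6))


print(quant_step(4))                    # 1.0
print(quant_step(10))                   # 2.0
print(quant_step(28) / quant_step(22))  # 2.0
print(quant_step(51))                   # 224.0
```

Under these assumed base values, the 52 step sizes span a ratio of roughly 358 to 1 between the coarsest and the finest quantizer.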
For reducing blocking artifacts, which are typically the most
disturbing artifacts in block-based coding, H.264/AVC speci-
fies an adaptive deblocking filter, which operates within the
motion-compensated prediction loop.
H.264/AVC supports two methods of entropy coding, which
both use context-based adaptivity to improve performance
relative to prior standards. While CAVLC (context-based adap-
tive variable-length coding) uses variable-length codes and its
adaptivity is restricted to the coding of transform coefficient
levels, CABAC (context-based adaptive binary arithmetic cod-
ing) utilizes arithmetic coding and a more sophisticated
mechanism for employing statistical dependencies, which
leads to typical bit rate savings of 10-15% relative to CAVLC.
In addition to the increased flexibility on the macroblock
level, H.264/AVC also allows much more flexibility on a pic-
ture and sequence level compared to prior video coding stan-
dards. Here we mainly refer to reference picture memory con-
trol. In H.264/AVC, the coding and display order of pictures is
completely decoupled. Furthermore, any picture can be
marked as reference picture for use in motion-compensated
prediction of following pictures, independent of the slice cod-
ing types. The behavior of the decoded picture buffer (DPB),
which can hold up to 16 frames (depending on the used con-