ITU-T Rec. H.264 (03/2005) – Prepublished version 16
This Recommendation | International Standard specifies a syntax and decoding process for video that originated in either
progressive-scan or interlaced-scan form, which may be mixed together in the same sequence. The two fields of an
interlaced frame are separated in capture time, whereas the two fields of a progressive frame share the same capture
time. Each field may be coded separately, or the two fields may be coded together as a frame. Progressive frames are
typically coded as frames. For interlaced video, the encoder can choose between frame coding and field coding. Frame coding or
field coding can be adaptively selected on a picture-by-picture basis and also on a more localized basis within a coded
frame. Frame coding is typically preferred when the video scene contains significant detail with limited motion. Field
coding typically works better when there is fast picture-to-picture motion.
0.6.3 Picture partitioning into macroblocks and smaller partitions
This subclause does not form an integral part of this Recommendation | International Standard.
As in previous video coding Recommendations and International Standards, a macroblock, consisting of a 16x16 block
of luma samples and two corresponding blocks of chroma samples, is used as the basic processing unit of the video
decoding process.
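As an illustration of this basic processing unit, the sketch below declares a macroblock for the common 4:2:0 chroma format, in which each 16x16 luma block is accompanied by one 8x8 Cb block and one 8x8 Cr block. The type and field names are illustrative and are not taken from this Recommendation | International Standard; other chroma formats use larger chroma blocks.

```c
#include <stdint.h>

/* Illustrative sketch only: a macroblock in the 4:2:0 chroma format.
 * The chroma block dimensions depend on the chroma sampling format;
 * 8x8 applies to 4:2:0 sampling. */
typedef struct {
    uint8_t luma[16][16]; /* Y (luma) samples */
    uint8_t cb[8][8];     /* Cb (blue-difference chroma) samples */
    uint8_t cr[8][8];     /* Cr (red-difference chroma) samples */
} Macroblock420;
```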
A macroblock can be further partitioned for inter prediction. The selection of the size of inter prediction partitions is a
result of a trade-off between the coding gain provided by using motion compensation with smaller blocks and the
quantity of data needed to represent the motion compensation data. In this Recommendation | International Standard
the inter prediction process can form segmentations for motion representation as small as 4x4 luma samples in size, using
motion vector accuracy of one-quarter of the luma sample grid spacing displacement. The process for inter prediction of
a sample block can also involve the selection of the picture to be used as the reference picture from a number of stored
previously-decoded pictures. Motion vectors are encoded differentially with respect to predicted values formed from
nearby encoded motion vectors.
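The quarter-sample accuracy and differential coding described above can be sketched as follows. Motion vectors are held in quarter-luma-sample units, so a stored value of 4 corresponds to a displacement of one full luma sample, and the transmitted value is the difference between the motion vector and a prediction formed from neighbouring vectors. In the most common case the predictor is the component-wise median of the left, above, and above-right neighbours; this sketch omits the neighbour-availability rules and special cases, and its names are illustrative rather than normative.

```c
/* Motion vectors in quarter-luma-sample units (x = 4 means a
 * horizontal displacement of one full luma sample). */
typedef struct { int x, y; } MotionVec;

/* Median of three integers. */
static int median3(int a, int b, int c) {
    if (a > b) { int t = a; a = b; b = t; } /* now a <= b */
    if (b > c) { b = c; }                   /* b = min(max(a0,b0), c) */
    return a > b ? a : b;
}

/* Component-wise median prediction from three neighbouring vectors
 * (availability rules and special cases omitted). */
MotionVec mv_predict(MotionVec left, MotionVec above, MotionVec above_right) {
    MotionVec p;
    p.x = median3(left.x, above.x, above_right.x);
    p.y = median3(left.y, above.y, above_right.y);
    return p;
}

/* The difference actually encoded in the bitstream. */
MotionVec mv_diff(MotionVec mv, MotionVec pred) {
    MotionVec d = { mv.x - pred.x, mv.y - pred.y };
    return d;
}
```

A decoder inverts the last step by adding the decoded difference back to the same prediction.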
Typically, the encoder calculates appropriate motion vectors and other data elements represented in the video data
stream. The motion estimation process in the encoder and the selection of whether to use inter prediction for the
representation of each region of the video content are not specified in this Recommendation | International Standard.
0.6.4 Spatial redundancy reduction
This subclause does not form an integral part of this Recommendation | International Standard.
Both source pictures and prediction residuals have high spatial redundancy. This
Recommendation | International Standard is based on the use of a block-based transform method for spatial redundancy
removal. After inter prediction from previously-decoded samples in other pictures or spatial-based prediction from
previously-decoded samples within the current picture, the resulting prediction residual is split into 4x4 blocks. These
are converted into the transform domain, where they are quantised. After quantisation, many of the transform coefficients
are zero or have low amplitude and can thus be represented with a small amount of encoded data. The processes of
transformation and quantisation in the encoder are not specified in this Recommendation | International Standard.
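The 4x4 block-based transform mentioned above can be illustrated with the integer transform matrix on which H.264's core transform is based. The sketch below computes the unscaled forward transform Y = C·X·Cᵀ; note that the Recommendation normatively specifies only the inverse transform and its scaling, so this forward form is an encoder-side sketch, not a normative process.

```c
/* Integer transform matrix underlying H.264's core 4x4 transform. */
static const int C[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* Unscaled forward transform of a 4x4 residual block: y = C * x * C^T.
 * Scaling and quantisation (not shown) would follow in an encoder. */
void transform4x4(int x[4][4], int y[4][4]) {
    int t[4][4];
    for (int i = 0; i < 4; i++)          /* t = C * x */
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++)
                t[i][j] += C[i][k] * x[k][j];
        }
    for (int i = 0; i < 4; i++)          /* y = t * C^T */
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += t[i][k] * C[j][k];
        }
}
```

For a flat residual block all AC coefficients come out zero, which is why quantised transform coefficients are so often zero or small for well-predicted regions.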
0.7 How to read this specification
This subclause does not form an integral part of this Recommendation | International Standard.
It is suggested that the reader starts with clause 1 (Scope) and moves on to clause 3 (Definitions). Clause 6 should be
read for the geometrical relationship of the source, input, and output of the decoder. Clause 7 (Syntax and semantics)
specifies the order to parse syntax elements from the bitstream. See subclauses 7.1-7.3 for syntactical order and see
subclause 7.4 for semantics; i.e., the scope, restrictions, and conditions that are imposed on the syntax elements. The
actual parsing for most syntax elements is specified in clause 9 (Parsing process). Finally, clause 8 (Decoding process)
specifies how the syntax elements are mapped into decoded samples. Throughout reading this specification, the reader
should refer to clauses 2 (Normative references), 4 (Abbreviations), and 5 (Conventions) as needed. Annexes A through
E also form an integral part of this Recommendation | International Standard.
Annex A specifies seven profiles (Baseline, Main, Extended, High, High 10, High 4:2:2 and High 4:4:4), each being
tailored to certain application domains, and defines the so-called levels of the profiles. Annex B specifies syntax and
semantics of a byte stream format for delivery of coded video as an ordered stream of bytes. Annex C specifies the
hypothetical reference decoder and its use to check bitstream and decoder conformance. Annex D specifies syntax and
semantics for supplemental enhancement information message payloads. Finally, Annex E specifies syntax and
semantics of the video usability information parameters of the sequence parameter set.
Throughout this specification, statements appearing with the preamble "NOTE -" are informative and are not an integral
part of this Recommendation | International Standard.