AUDIO COMPRESSION
EBU TECHNICAL REVIEW – January 2006
S. Meltzer and G. Moser
HE-AAC-encoded audio data can exist in a variety of file formats with different extensions,
depending on the implementation and the usage scenario. The most commonly used file formats are the
MPEG-4 file formats MP4 and M4A, carrying the respective extensions .mp4 and .m4a. The “.m4a”
extension is used to emphasize the fact that a file contains audio only. The 3GP file format supports
all HE-AAC features for mono and stereo files up to a 48 kHz sampling rate. Other file formats,
such as MPEG-2 and MPEG-4 ADTS, are also available.
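As an illustration of the ADTS format mentioned above, the following Python sketch reads the fixed fields of a 7-byte ADTS frame header and reports whether the frame is flagged as MPEG-2 or MPEG-4. The function name parse_adts_header and the dictionary keys are assumptions made for this example. The bit layout follows the commonly documented ADTS header; the optional CRC and the audio payload are ignored.

```python
# Sampling rates selected by the 4-bit sampling_frequency_index field.
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                24000, 22050, 16000, 12000, 11025, 8000, 7350]


def parse_adts_header(header: bytes) -> dict:
    """Decode the main fields of one ADTS frame header (illustrative sketch)."""
    if len(header) < 7:
        raise ValueError("an ADTS header is at least 7 bytes long")
    # The header starts with a 12-bit syncword of all ones.
    if header[0] != 0xFF or (header[1] & 0xF0) != 0xF0:
        raise ValueError("ADTS syncword 0xFFF not found")
    # ID bit: 1 flags MPEG-2 AAC, 0 flags MPEG-4 AAC.
    mpeg_version = "MPEG-2" if (header[1] >> 3) & 0x1 else "MPEG-4"
    sf_index = (header[2] >> 2) & 0xF
    if sf_index >= len(SAMPLE_RATES):
        raise ValueError("reserved sampling_frequency_index")
    # channel_configuration is 3 bits, split across bytes 2 and 3.
    channel_config = ((header[2] & 0x1) << 2) | (header[3] >> 6)
    # aac_frame_length is 13 bits, spread over bytes 3, 4 and 5.
    frame_length = ((header[3] & 0x3) << 11) | (header[4] << 3) | (header[5] >> 5)
    return {
        "mpeg_version": mpeg_version,
        "sample_rate": SAMPLE_RATES[sf_index],
        "channel_config": channel_config,
        "frame_length": frame_length,
    }


# Hand-built example header: MPEG-4, 44.1 kHz, two channels.
example = bytes([0xFF, 0xF1, 0x50, 0x80, 0x02, 0x1F, 0xFC])
info = parse_adts_header(example)
```

The header describes the AAC frame as signalled in the stream; a real parser would also have to step through consecutive frames using the frame length.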
MPEG AAC
Research on perceptual audio codecs started about twenty years ago. Earlier research on the
human auditory system had revealed that hearing is mainly based on a short-term spectral analysis
of the audio signal. The so-called masking effect was observed: the human auditory system is not
able to perceive distortions that are masked by a stronger signal in the spectral neighbourhood.
Thus, when looking at the short-term spectrum, a so-called masking threshold can be calculated
for this spectrum. Distortions below this threshold are inaudible in the ideal case.
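The idea of a masking threshold derived from the short-term spectrum can be sketched with a toy model. This is not the psychoacoustic model of any standard: the band layout, the 15 dB offset and the two slopes are hypothetical values chosen only to show the principle that a strong component raises the threshold in its spectral neighbourhood.

```python
def masking_threshold_db(band_levels_db, offset_db=15.0,
                         slope_down_db=25.0, slope_up_db=10.0):
    """Toy masking threshold per band, in dB (all parameters are assumptions).

    Each band acts as a masker. Its masking effect starts offset_db below its
    own level and decays by slope_up_db per band towards higher bands and by
    slope_down_db per band towards lower bands. The threshold in a band is the
    strongest masking effect that reaches it.
    """
    thresholds = []
    for k in range(len(band_levels_db)):
        best = float("-inf")
        for j, level in enumerate(band_levels_db):
            distance = k - j  # positive: band k lies above masker j
            slope = slope_up_db if distance >= 0 else slope_down_db
            best = max(best, level - offset_db - slope * abs(distance))
        thresholds.append(best)
    return thresholds


def is_masked(distortion_db, threshold_db):
    """True if a distortion of the given level stays below the threshold."""
    return distortion_db < threshold_db


# One strong component (80 dB) in band 1 dominates the short-term spectrum.
levels = [20, 80, 30, 10]
thresholds = masking_threshold_db(levels)
```

With these assumed parameters, a distortion of 50 dB in band 2 lies below the threshold raised by the strong neighbour in band 1, while the same distortion in band 0 does not.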
The goal is to calculate the masking threshold using a psychoacoustic model and to process the
audio signal so that only audible information remains in it. Ideally, the distortion
introduced stays just below the masking threshold and thus remains inaudible. Fig. 2 illustrates the
quantization noise produced by an ideal perceptual coding process.
If the compression rate is further increased, the distortion introduced by the codec violates the
masking threshold and produces audible artefacts (Fig. 3).
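The relationship between quantization noise and the masking threshold can be made concrete with a small numerical sketch. It assumes a plain uniform quantizer whose noise power is approximately step²/12, which is a textbook approximation and not the quantizer actually used in AAC. The function names and the 1 dB safety margin are assumptions for illustration.

```python
import math


def step_for_threshold(threshold_power, margin_db=1.0):
    """Largest uniform step whose noise stays margin_db below the threshold."""
    target_noise = threshold_power * 10 ** (-margin_db / 10.0)
    return math.sqrt(12.0 * target_noise)


def noise_to_mask_ratio_db(step, threshold_power):
    """Noise-to-mask ratio in dB; values above 0 dB mean audible noise."""
    noise_power = step ** 2 / 12.0
    return 10.0 * math.log10(noise_power / threshold_power)


threshold = 0.01  # hypothetical masking threshold (linear power) in one band
ideal_step = step_for_threshold(threshold)
ideal_nmr = noise_to_mask_ratio_db(ideal_step, threshold)

# Pushing the compression further: doubling the step saves roughly one bit
# per sample but raises the noise power by about 6 dB.
coarse_nmr = noise_to_mask_ratio_db(2.0 * ideal_step, threshold)
```

In this sketch the ideal case keeps the noise about 1 dB below the threshold, whereas the coarser quantizer ends up about 5 dB above it, which corresponds to the situation shown in Fig. 3.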
The main method of overcoming this problem in traditional perceptual waveform codecs is to limit
the audio bandwidth. As a consequence, more bits are available for the remainder of the
spectrum, resulting in a clean but dull-sounding signal. Another method, called intensity stereo, can
only be used for stereo signals. In intensity stereo, only one channel and some panning information
are transmitted, instead of a left and a right channel. However, this is only of limited use in increasing
the compression efficiency as, in many cases, the stereo image of the audio signal is destroyed.
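The intensity stereo principle described above can be sketched for a single frequency band as follows. This is a simplified illustration, not the AAC intensity stereo tool: the downmix rule and the representation of the panning information as two gains are assumptions made for this example.

```python
import math


def intensity_encode(left, right):
    """Reduce one band of a stereo pair to a mono carrier plus two gains."""
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    e_mono = sum(m * m for m in mono)
    if e_mono == 0.0:
        # Nothing survives the downmix (e.g. out-of-phase channels).
        return mono, 0.0, 0.0
    # Gains that restore the original energy of each channel.
    gain_left = math.sqrt(sum(l * l for l in left) / e_mono)
    gain_right = math.sqrt(sum(r * r for r in right) / e_mono)
    return mono, gain_left, gain_right


def intensity_decode(mono, gain_left, gain_right):
    """Rebuild both channels as scaled copies of the mono carrier."""
    return [gain_left * m for m in mono], [gain_right * m for m in mono]


# A source panned hard left is reconstructed faithfully.
panned_left, panned_right = intensity_decode(
    *intensity_encode([1.0, -0.5], [0.0, 0.0]))

# Two channels with different content collapse to the same waveform.
wide_left, wide_right = intensity_decode(
    *intensity_encode([1.0, 0.0], [0.0, 1.0]))
```

The second case shows why the technique is of limited use: the decoded channels are identical apart from their level, so any width or ambience in the original stereo image is lost.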
At this stage, research on classical perceptual audio coding had reached its limits, as the methods
known at the time did not seem to offer further potential for increasing coding efficiency. Hence,
a shift in paradigm was needed, represented by the idea that different elements of an audio signal,
such as spectral components or the stereo image, deserve different tools if they are to be coded
more efficiently. This idea led to the development of the enhancement tools Spectral Band
Replication and Parametric Stereo.
Spectral Band Replication
In traditional audio coding, a significant amount of information is spent in coding the high
frequencies, although the psychoacoustic importance of the last one or two octaves is relatively low. This
Figure 2
Inaudible quantization noise produced by an ideal perceptual coding process
(energy versus frequency, showing the signal energy, the masking threshold and the quantization noise introduced by an ideal perceptual codec)
Figure 3
Waveform coding going beyond its limits: audible artefacts appear above the masking threshold
(energy versus frequency, showing quantization noise violating the masking threshold)