696 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 6, JUNE 2001
The MPEG-7 Visual Standard for Content
Description—An Overview
Thomas Sikora, Senior Member, IEEE
Abstract—The MPEG-7 Visual Standard under development
specifies content-based descriptors that allow users or agents (or
search engines) to measure similarity in images or video based
on visual criteria, and can be used to efficiently identify, filter, or
browse images or video based on visual content. More specifically,
MPEG-7 specifies color, texture, object shape, global motion, or
object motion features for this purpose. This paper outlines the
aim, methodologies, and broad details of the MPEG-7 Standard
development for visual content description.
Index Terms—Coding, descriptors, MPEG-7, similarity-based
retrieval, standardization, visual information.
I. INTRODUCTION
RECENT years have seen a rapid increase in the volume of
image and video collections. A huge amount of information is available, and every day, gigabytes of new visual information are being generated, stored, and transmitted. However, it
is difficult to access this visual information unless it is orga-
nized in a suitable way—to allow efficient browsing, searching,
and retrieval. Image retrieval has been a very active research and development domain since the early 1970s. During the early
1990s—with the advent of digital video—research on video re-
trieval became of equal importance. A very popular means for
image or video retrieval is to annotate images or video with text,
and to use text-based database management systems to perform
image retrieval. However, text-based annotation has significant
drawbacks when confronted with large volumes of images: annotation can in these circumstances become significantly labor intensive, and since images are rich in content, text may in many applications not be expressive enough to describe them.
To overcome these difficulties, content-based image retrieval emerged in the early 1990s as a promising means for describing and retrieving images. Content-based image retrieval systems describe images by their own visual content, such as color, texture, and object shape, rather than by text.
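As a toy illustration of this idea (not part of the MPEG-7 specification, and with all image data and bin sizes invented for the example), the sketch below quantizes pixel colors into a coarse histogram and compares two such histograms with an L1 distance:

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    """Quantize each RGB pixel into one of bins^3 color cells and return
    a normalized histogram -- a toy stand-in for a color descriptor."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for (r, g, b) in pixels)
    total = len(pixels)
    return {cell: n / total for cell, n in counts.items()}

def l1_distance(h1, h2):
    """L1 distance between two histograms: 0 means identical color
    distributions; larger values mean less similar images."""
    cells = set(h1) | set(h2)
    return sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in cells)

# Three tiny synthetic "images": two mostly red, one mostly blue.
red_a = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
red_b = [(240, 20, 20)] * 85 + [(20, 240, 20)] * 15
blue  = [(10, 10, 250)] * 95 + [(250, 10, 10)] * 5

ha, hb, hc = map(color_histogram, (red_a, red_b, blue))
# The two red images are closer to each other than to the blue one.
assert l1_distance(ha, hb) < l1_distance(ha, hc)
```

A real system would of course extract far richer features (texture, shape, motion), but the principle of comparing compact numeric descriptions instead of text annotations is the same.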
In the late 1990s—with the large scale introduction of
digital images and video to the market [1]—the necessity
for interworking between image/video retrieval systems of
different vendors arose. For this purpose, in 1997 the ISO
MPEG Group initiated the “MPEG-7 Multimedia Description
Language” work item. The target of this activity was to issue
an international MPEG-7 Standard, defining standardized
descriptions and description systems that allow users or agents
Manuscript received January 2, 2001; revised March 12, 2001.
The author is with the Heinrich-Hertz-Institute for Communication Tech-
nology (HHI), Interactive Media—Human Factors, D-10587 Berlin, Germany
(e-mail: sikora@hhi.de).
Publisher Item Identifier S 1051-8215(01)04986-2.
to search, identify, filter, and browse audiovisual content
[2], [3]. MPEG-7 is currently still under definition and will become an international standard in July 2001. Besides support for meta-data and text descriptions of the audiovisual content, much of the focus in the development of MPEG-7 has been on the definition of efficient content-based description and retrieval specifications.
The purpose of this paper is to provide a broad overview of
the MPEG-7 content-based visual descriptors. For an overall overview of the MPEG-7 Standard and for more detailed descriptions of the MPEG-7 content-based visual, audio, and speech descriptors, the reader is referred to the literature in [2], [3] and to the remaining papers of this Special Issue on MPEG-7 in [5]–[12].
II. SCOPE OF MPEG-7 VISUAL STANDARD
The ultimate goal of the MPEG-7 Visual Standard
is to provide standardized descriptions of streamed or stored im-
ages or video—standardized header bits (visual low-level De-
scriptors) that help users or applications to identify, categorize
or filter images or video. These low-level descriptors can be
used to compare, filter, or browse image or video purely based
on nontext visual descriptions of the content, or if required, in
combination with common text-based queries. The challenge in developing such MPEG-7 Visual nontext descriptors is that they must be meaningful in the context of various applications: they will be used differently in different user domains and different application environments.
Selected application examples include digital libraries (image
and video catalogue), broadcast media selection (TV channels),
and multimedia editing (personalised electronic news service,
media authoring). Across this diversity of possible applications, the MPEG-7 Visual feature descriptors allow users or agents to perform tasks such as the following.
1) Graphics: Draw a few lines on a screen and get, in return,
a set of images containing similar graphics or logos.
2) Images: Define objects, including color patches or tex-
tures, and get, in return, examples among which you se-
lect the ones of interest.
3) Video: On a given set of video objects, describe object
movements, camera motion, or relations between objects
and get, in return, a list of videos with similar or dissimilar
temporal and spatial relations.
4) Video Activity: On a given video content, describe actions
and get a list of videos where similar actions happen.
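At matching time, each of the query-by-example tasks above reduces to ranking database items by the distance between descriptor values. A minimal sketch of that ranking step is given below; the feature vectors and clip ids are invented for illustration, and MPEG-7 itself standardizes the descriptors, not any particular matching method:

```python
import math

def rank_by_similarity(query, database, k=3):
    """Return the ids of the k database items whose feature vectors are
    closest (Euclidean distance) to the query descriptor."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    ranked = sorted(database.items(), key=lambda item: dist(item[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical database: id -> low-level feature vector (e.g., a few
# color/texture/motion values extracted offline by an indexing tool).
videos = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.8, 0.2, 0.1],
    "clip_c": [0.0, 0.1, 0.9],
}

print(rank_by_similarity([1.0, 0.0, 0.0], videos, k=2))
# → ['clip_a', 'clip_b']
```

Whether the query originates from a drawn sketch, an example image, or a described motion pattern, only the feature-extraction step differs; the similarity ranking itself stays the same.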
The MPEG-7 Visual Descriptors describe the basic content of media based on visual information. For images and
video, the content may be described, for example by the shape
1051–8215/01$10.00 ©2001 IEEE