A Brief History of Audio/Video Networking
The idea of using packet networks—such as the Internet—to transport voice and video is not new. Experiments with
voice over packet networks stretch back to the early 1970s. The first RFC on this subject—the Network Voice
Protocol (NVP)1—dates from 1977. Video came later, but even so there is more than a decade of experience with audio/video conferencing and streaming on the Internet.
Early Packet Voice and Video Experiments
The initial developers of NVP were researchers transmitting packet voice over the ARPANET, the predecessor to
the Internet. The ARPANET provided a reliable-stream service (analogous to TCP/IP), but this introduced too much
delay, so an "uncontrolled packet" service was developed, akin to the modern UDP/IP datagrams used with RTP.
The NVP was layered directly over this uncontrolled packet service. Later the experiments were extended beyond
the ARPANET to interoperate with the Packet Radio Network and the Atlantic Satellite Network (SATNET),
running NVP over those networks.
All of these early experiments were limited to one or two voice channels at a time by the low bandwidth of the early
networks. In the 1980s, the creation of the 3-Mbps Wideband Satellite Network enabled not only a larger number of
voice channels but also the development of packet video. To access the one-hop, reserved-bandwidth, multicast
service of the satellite network, a connection-oriented inter-network protocol called the Stream Protocol (ST) was
developed. Both a second version of NVP, called NVP-II, and a companion Packet Video Protocol were
transported over ST to provide a prototype packet-switched video teleconferencing service.
In 1989–1990, the satellite network was replaced with the Terrestrial Wideband Network and a research network
called DARTnet, while ST evolved into ST-II. The packet video conferencing system was put into scheduled
production to support geographically distributed meetings of network researchers and others at up to five sites
simultaneously.
ST and ST-II were operated in parallel with IP at the inter-network layer but achieved only limited deployment on
government and research networks. As an alternative, initial deployment of conferencing using IP began on DARTnet,
enabling multiparty conferences with NVP-II transported over multicast UDP/IP. At the March 1992 meeting of the
IETF, audio was transmitted across the Internet to 20 sites on three continents through multicast "tunnels" extended from DARTnet; these tunnels were the beginning of the Mbone (the "multicast backbone"). Development of RTP began at that same meeting.
Audio and Video on the Internet
Following from these early experiments, interest in video conferencing within the Internet community took hold in the
early 1990s. At about this time, the processing power and multimedia capabilities of workstations and PCs became
sufficient to enable the simultaneous capture, compression, and playback of audio and video streams. In parallel,
development of IP multicast allowed the transmission of real-time data to any number of recipients connected to the
Internet.
Video conferencing and multimedia streaming were obvious applications for multicast, and research groups began developing tools such as vic and vat from the Lawrence Berkeley Laboratory,87 nevot from the
University of Massachusetts, the INRIA video conferencing system, nv from Xerox PARC, and rat from University
College London.77 These tools followed a new approach to conferencing, based on connectionless protocols, the
end-to-end argument, and application-level framing.65,70,76 Conferences were minimally managed, with no
admission or floor control, and the transport layer was thin and adaptive. Multicast was used both for wide-area data
transmission and as an interprocess communication mechanism between applications on the same machine (to
exchange synchronization information between audio and video tools). The resulting collaborative environment
consisted of loosely coupled applications and widely distributed participants.
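To make this use of multicast concrete, the following is a minimal sketch (not taken from any of the tools named above) of how an IPv4 application joins a multicast group and waits for a datagram; the group address 224.2.0.1 and port 5004 are illustrative values only.

/* Join a hypothetical IPv4 multicast group and receive one UDP datagram.
 * The group address and port below are examples, not values from the text. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);         /* UDP, as used beneath RTP */
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5004);                 /* example port */
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    struct ip_mreq mreq;                                /* ask to join group 224.2.0.1 */
    mreq.imr_multiaddr.s_addr = inet_addr("224.2.0.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); return 1;
    }

    char buf[1500];                                     /* one Ethernet-sized datagram */
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    printf("received %zd bytes\n", n);
    close(sock);
    return 0;
}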
The multicast conferencing (Mbone) tools had a significant impact: They led to widespread understanding of the
problems inherent in delivering real-time media over IP networks, the need for scalable solutions, and error and
congestion control. They also directly influenced the development of several key protocols and standards.
RTP was developed by the IETF in the period 1992–1996, building on NVP-II and the protocol used in the original
vat tool. The multicast conferencing tools used RTP as their sole data transfer and control protocol; accordingly, RTP
not only includes facilities for media delivery, but also supports membership management, lip synchronization, and
reception quality reporting.
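The media-delivery part of RTP is built around a 12-byte fixed packet header. As a rough illustration only, the following sketch packs that header into a buffer in network byte order; the marker, payload type, sequence number, timestamp, and SSRC values in main() are arbitrary examples (payload type 0 denotes PCMU audio), and the membership and reception quality reporting functions mentioned above are carried by separate RTCP packets not shown here. Writing the fields byte by byte keeps the wire layout explicit, which is why implementations commonly avoid C bitfields for this.

/* Pack the 12-byte fixed RTP header: version, then the marker bit and
 * payload type, sequence number, timestamp, and SSRC. */
#include <stdint.h>
#include <stdio.h>

static void rtp_write_header(uint8_t *buf, int marker, uint8_t payload_type,
                             uint16_t seq, uint32_t timestamp, uint32_t ssrc) {
    buf[0]  = (uint8_t)(2u << 6);                        /* version 2, no padding/extension/CSRCs */
    buf[1]  = (uint8_t)((marker ? 0x80 : 0x00) | (payload_type & 0x7f));
    buf[2]  = (uint8_t)(seq >> 8);
    buf[3]  = (uint8_t)(seq & 0xff);
    buf[4]  = (uint8_t)(timestamp >> 24);
    buf[5]  = (uint8_t)(timestamp >> 16);
    buf[6]  = (uint8_t)(timestamp >> 8);
    buf[7]  = (uint8_t)(timestamp & 0xff);
    buf[8]  = (uint8_t)(ssrc >> 24);
    buf[9]  = (uint8_t)(ssrc >> 16);
    buf[10] = (uint8_t)(ssrc >> 8);
    buf[11] = (uint8_t)(ssrc & 0xff);
}

int main(void) {
    uint8_t pkt[12];
    rtp_write_header(pkt, 1, 0, 1234, 160, 0x12345678);  /* example field values */
    for (int i = 0; i < 12; i++)
        printf("%02x ", pkt[i]);                          /* hex dump of the header */
    printf("\n");
    return 0;
}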
In addition to RTP for transporting real-time media, other protocols had to be developed to coordinate and control
the media streams. The Session Announcement Protocol (SAP)35 was developed to advertise the existence of
multicast data streams. Announcements of sessions were themselves multicast, and any multicast-capable host could
receive SAP announcements and learn what meetings and transmissions were happening. Within announcements, the
Session Description Protocol (SDP)15 described the transport addresses, compression, and packetization schemes
to be used by senders and receivers in multicast sessions. The limited deployment of multicast, together with the rise of the World Wide Web, has made the concept of a distributed multicast session directory largely obsolete, but SDP is still widely used today.
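As an illustration, the following sketch prints a minimal, hypothetical session description of the kind a SAP announcement might carry; the origin, addresses, ports, and session name are invented examples (payload type 0 is PCMU audio and 31 is H.261 video).

/* Print a minimal, hypothetical SDP session description, one line per entry.
 * All names, addresses, and ports are illustrative only. */
#include <stdio.h>

int main(void) {
    const char *sdp[] = {
        "v=0",                                            /* protocol version        */
        "o=alice 2890844526 2890844526 IN IP4 192.0.2.1", /* origin: user, ids, host */
        "s=Example seminar",                              /* session name            */
        "c=IN IP4 224.2.17.12/127",                       /* multicast group + TTL   */
        "t=0 0",                                          /* unbounded session time  */
        "m=audio 49170 RTP/AVP 0",                        /* audio over RTP, PCMU    */
        "m=video 51372 RTP/AVP 31",                       /* video over RTP, H.261   */
    };
    for (size_t i = 0; i < sizeof(sdp) / sizeof(sdp[0]); i++)
        printf("%s\r\n", sdp[i]);                         /* SDP lines end in CRLF   */
    return 0;
}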
Finally, the Mbone conferencing community led development of the Session Initiation Protocol (SIP).28 SIP was
intended as a lightweight means of finding participants and initiating a multicast session with a specific set of
participants. In its early incarnation, SIP included little in the way of call control and negotiation support because such
aspects were not used in the Mbone conferencing environment. It has since become a more comprehensive
protocol, including extensive negotiation and control features.
ITU Standards
In parallel with the early packet voice work was the development of the Integrated Services Digital Network
(ISDN)—the digital version of the plain old telephone system—and an associated set of video conferencing
standards. These standards, based around ITU recommendation H.320, used circuit-switched links and so are not
directly relevant to our discussion of packet audio and video. However, they did pioneer many of the compression
algorithms used today (for example, H.261 video).
The growth of the Internet and the widespread deployment of local area networking equipment in the commercial
world led the ITU to extend the H.320 series of protocols. Specifically, they sought to make the protocols suitable for "local area networks which provide a non-guaranteed quality of service," a description that IP networks clearly fit. The result was the H.323 series of recommendations.
H.323 was first published in 199762 and has undergone several revisions since. It provides a framework consisting
of media transport, call signaling, and conference control. The signaling and control functions are defined in ITU
recommendations H.225.0 and H.245. Initially the signaling protocols focused principally on interoperating with
ISDN conferencing using H.320, and as a result suffered from a cumbersome session setup process that was
simplified in later versions of the standard. For media transport, the ITU working group adopted RTP. However,
H.323 uses only the media transport functionality of RTP and makes little use of the control and reporting elements.
H.323 met with reasonable success in the marketplace, with several hardware and software products built to support
the suite of H.323 technologies. Development experience led to complaints about its complexity, in particular the
complex setup procedure of H.323 version 1 and the use of binary message formats for the signaling. Some of these
issues were addressed in later versions of H.323, but in the intervening period interest in alternatives grew.
One of those alternatives, which we have already touched on, was SIP. The initial SIP specification was published by
the IETF in 1999,28 as the outcome of an academic research project with virtually no commercial interest. It has
since come to be seen as a replacement for H.323 in many quarters, and it is being applied to more varied
applications, such as text messaging systems and voice-over-IP. In addition, it is under consideration for use in
third-generation cellular telephony systems,115 and it has gathered considerable industry backing.
The ITU has more recently produced recommendation H.332, which combines a tightly coupled H.323 conference
with a lightweight multicast conference. The result is useful for scenarios such as an online seminar, in which the H.323
part of the conference allows close interaction among a panel of speakers while a passive audience watches via
multicast.
Audio/Video Streaming
In parallel with the development of multicast conferencing and H.323, the World Wide Web revolution took place,
bringing glossy content and public acceptance to the Internet. Advances in network bandwidth and end-system
capacity made possible the inclusion of streaming audio and video along with Web pages, with systems such as
RealAudio and QuickTime leading the way. The growing market in such systems fostered a desire to devise a
standard control mechanism for streaming content. The result was the Real-Time Streaming Protocol (RTSP),14 standardized in 1998, which provides session initiation and VCR-like control of streaming presentations. RTSP builds
on existing standards: It closely resembles HTTP in operation, and it can use SDP for session description and RTP for
media transport.