i (see footnote 1). Alternatively, a peer q that was part of snapshot
(i − 1) but was not present in snapshot i must have left
during the interval of 2∆ between the start of one crawl
and the end of the next. Therefore, we can measure the
departure and arrival times of every peer with a
granularity of 2∆. We note that as the number of active peers
grows, the duration of each crawl (∆) increases, and thus
the granularity of our measurements becomes coarser,
i.e., there is a tradeoff between the size of a snapshot
and the accuracy of the measured arrival and departure
times. Therefore, it is essential to minimize the duration
of crawls (∆). Once we determine the departure and arrival
times of a peer p within a sequence of back-to-back
snapshots, we can easily determine the duration of one
appearance, which is called its session time, as follows:
SessionTime = DepartureTime − ArrivalTime.
We also use the term uptime for an active peer p to denote
the duration of time since its arrival.
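
As a rough illustration of the bookkeeping described above, the following Python sketch derives arrival and departure times from a sequence of back-to-back snapshots and applies the session-time formula. The function name, the representation of a snapshot as a set of peer IDs, and the choice to stamp each event with the start time of the crawl in which it was first observed are assumptions made for illustration. The text only establishes that each timestamp is accurate to within 2∆.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Session = Tuple[Hashable, float, float]  # (peer, arrival, departure)


def sessions_from_snapshots(
    snapshots: Sequence[Set[Hashable]],
    crawl_starts: Sequence[float],
) -> List[Session]:
    """Return sessions that both begin and end inside the snapshot sequence.

    Assumption: an arrival or departure first visible in snapshot i is
    stamped with the start time of crawl i (true time is within 2*delta).
    """
    if len(snapshots) != len(crawl_starts):
        raise ValueError("need one start time per snapshot")

    open_arrivals = {}  # peer -> arrival time of its currently open session
    sessions: List[Session] = []
    for i in range(1, len(snapshots)):
        prev, cur = snapshots[i - 1], snapshots[i]
        # Peers in snapshot i-1 but not in snapshot i have departed.
        for peer in prev - cur:
            arrival = open_arrivals.pop(peer, None)
            # Peers already present in the first snapshot have an
            # unknown arrival time, so their sessions are skipped.
            if arrival is not None:
                sessions.append((peer, arrival, crawl_starts[i]))
        # Peers in snapshot i but not in snapshot i-1 have arrived.
        for peer in cur - prev:
            open_arrivals[peer] = crawl_starts[i]
    return sessions


def session_time(arrival: float, departure: float) -> float:
    """SessionTime = DepartureTime - ArrivalTime."""
    return departure - arrival
```

Peers that are still present in the last snapshot never produce a session here, because their departure has not been observed.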
To ensure that measured session times are not biased,
we use the “create-based method” employed by Saroiu
et al. [15]: Given a sequence of back-to-back snapshots
during a window of τ minutes, we split the measurement
window into two halves. Then, we only keep the session
times of those peers that (i) arrive during the first
half, (ii) leave during either the first or second half of
the measurement window, and (iii) have a session time
no longer than τ/2. This guarantees unbiased results for
sessions shorter than τ/2, but tells us nothing about the
distribution of longer sessions. To avoid time-of-day bias in
our results, we chose τ = 2 days. Our initial measurements,
as well as previous studies [3], show fluctuations
in network size correlated with the time of day.
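
The three conditions of the create-based method can be sketched as a simple filter. This is a minimal illustration, not the authors' implementation: the function name, the (arrival, departure) tuple layout, and the handling of boundary cases (an arrival exactly at the midpoint is treated as belonging to the second half) are assumptions.

```python
from typing import Iterable, List, Tuple


def create_based_filter(
    sessions: Iterable[Tuple[float, float]],
    window_start: float,
    tau: float,
) -> List[float]:
    """Return the session times kept by the create-based method.

    sessions: (arrival, departure) pairs in the same time unit as tau.
    """
    half = tau / 2
    midpoint = window_start + half
    window_end = window_start + tau

    kept = []
    for arrival, departure in sessions:
        duration = departure - arrival
        arrived_in_first_half = window_start <= arrival < midpoint  # (i)
        left_inside_window = arrival <= departure <= window_end     # (ii)
        short_enough = duration <= half                             # (iii)
        if arrived_in_first_half and left_inside_window and short_enough:
            kept.append(duration)
    return kept
```

With τ = 2 days expressed in hours, `create_based_filter(sessions, 0, 48)` keeps only sessions that start in the first 24 hours and last at most 24 hours.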
In the following subsections, we present a brief
overview of our candidate applications, and discuss
application-specific issues in capturing accurate and
representative snapshots.
2.1 BitTorrent
BitTorrent is a popular P2P application that is often used
for the distribution of very large files from a source to
a large group of users (called a swarm). Peers form an
overlay and exchange different blocks of the content until
each peer has the entire file. Each swarm is coordinated
by a rendezvous point, called a tracker, whose address is
provided out of band. Each new peer contacts the tracker
to join the swarm, periodically sends an update of its
progress, and informs the tracker when it departs. Note
that each peer may receive the entire file across multiple
sessions, i.e., it may obtain only a subset of blocks
in one session and resume the download later. Since the
tracker logs all its interactions with group members, the
log provides detailed information about the arrival and
departure times of each peer.

Footnote 1: The interval is 2∆ rather than ∆ because there is a
possibility that the peer arrives during crawl i − 1 after the
crawler has passed its neighborhood.

We have obtained tracker logs from two long BitTorrent
swarms: distributions of Debian and Red Hat (see footnote 2).
Close examination of these tracker logs reveals that
roughly 50% of participating peers contact the tracker
at least once every 5 minutes, and 99% of them contact the
tracker at least once every 31 minutes. However, peers may
depart in an ungraceful fashion and abruptly stop contacting
the tracker. To identify these peers, we conservatively
assume that any peer that has not contacted the tracker within
35 minutes has departed ungracefully. Such sessions make up
around one third of all sessions in our dataset and were
eliminated since we cannot measure their session time.
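
The 35-minute rule above can be sketched as follows. The record layout is hypothetical: each contact is a (minute, event) pair, and the event string "stopped" stands for the peer informing the tracker of its departure. Neither the layout nor the function name comes from the actual tracker logs. The sketch also discards a session that is still open when the log ends, which is an additional assumption.

```python
from typing import List, Sequence, Tuple

TIMEOUT_MINUTES = 35  # threshold stated in the text


def graceful_session_times(
    contacts: Sequence[Tuple[float, str]],
    timeout: float = TIMEOUT_MINUTES,
) -> Tuple[List[float], int]:
    """Split one peer's time-sorted contacts into sessions.

    Returns (session times of gracefully ended sessions,
             number of discarded sessions).
    """
    session_times: List[float] = []
    discarded = 0
    start = None  # start time of the currently open session, if any
    last = None   # time of the most recent contact

    for t, event in contacts:
        if start is not None and t - last > timeout:
            # Silent for longer than the timeout: the open session
            # is assumed to have ended ungracefully and is discarded.
            discarded += 1
            start = None
        if start is None:
            start = t
        last = t
        if event == "stopped":
            # Graceful departure: the session time is measurable.
            session_times.append(t - start)
            start = None

    if start is not None:
        # Assumption: a session with no observed departure by the
        # end of the log is also discarded.
        discarded += 1
    return session_times, discarded
```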
We note that the session time for a BitTorrent client is
a combination of time spent downloading the file (the
download time) and additional time that the user leaves
the client running after the download is complete (the
lingering time). While the download time might be
influenced by the size of the file or the number of other
peers, the lingering time is directly determined by user
behavior. Furthermore, the user can directly control the
duration of each session by stopping the application
during the download and returning at a later time to
complete the file download. Since the tracker log presents
the evolution of delivered content to each peer, it allows
us to separate download time from lingering time in our
analysis and examine them separately.
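
One way to separate the two components of a session is sketched below. The input format is an assumption: each progress update is a (time, bytes_left) pair, with the first update taken as the arrival and the last as the departure. The completion time is only known to the granularity of the update interval, so both components carry that uncertainty.

```python
from typing import Sequence, Tuple


def split_session(
    updates: Sequence[Tuple[float, int]],
) -> Tuple[float, float]:
    """Return (download_time, lingering_time) for a single session.

    updates: time-sorted (time, bytes_left) pairs for the session.
    """
    if not updates:
        raise ValueError("a session needs at least one update")

    arrival = updates[0][0]
    departure = updates[-1][0]
    # First update at which the peer reports nothing left to download.
    completed_at = next((t for t, left in updates if left == 0), None)

    if completed_at is None:
        # The file was not completed in this session: no lingering time.
        return departure - arrival, 0
    return completed_at - arrival, departure - completed_at
```

A peer that arrives already holding the complete file gets a download time of zero, so its whole session counts as lingering time.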
2.2 Gnutella
Gnutella is a popular P2P file-sharing application with
more than 1.3 million concurrent peers [19]. Each peer
joins the network by connecting to a random group of
participating peers. Since Gnutella is not run as a daemon,
the arrival and departure times of each peer are triggered
by user behavior, i.e., session times are driven by
when the user opens and closes the application. There is
no central node in the Gnutella network that keeps track
of all participating peers, so the only way to discover
all peers is to crawl the overlay. Given a few participating
peers in the session, a crawler progressively
contacts peers to learn about their neighbors, until it
discovers all the peers. The large size of the Gnutella
network makes it a challenge to capture a complete crawl
quickly. To address this, previous studies have selected a
random subset of peers discovered by a partial crawl, and
periodically probed those peers to measure their session
time (e.g., [15, 3]). The key question is “Do the session
times of such a subset of peers represent the entire
population of sessions in the Gnutella network?”. With
a heavy-tailed distribution of session time [16, 5], peers
Footnote 2: We would like to thank Ernst Biersack from the Institut
Eurecom, who has kindly shared their Red Hat tracker logs with us [7]. We
obtained the Debian tracker logs directly from the Debian organization.