server content structure). The proposed policy, called
Communities identification with Betweenness Centrality
(CiBC), identifies overlapping Web page communities
using the concept of Betweenness Centrality
(BC) [5]. Specifically, Newman and Girvan [22]
used the concept of edge betweenness to select edges
to be removed from the graph so as to devise a
hierarchical agglomerative clustering procedure,
which, however, cannot provide the final
communities by itself and requires the intervention of
administrators. In contrast to that work [22], BC is used in
this paper to measure how central each node of the
Web site graph is within a community.
. Experimenting on a detailed simulation testbed. The
experimentation carried out involves numerous
experiments that evaluate the proposed scheme under
regular traffic and under flash crowd events. Current
usage of Web technologies and Web server content
performance characteristics during a flash crowd
event are highlighted, and our experimentation shows
the proposed approach to be robust and
effective in minimizing both the average response
time of users’ requests and the costs of CDNs’
providers.
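As an illustration of the node betweenness centrality mentioned in the first contribution above, the following is a minimal sketch of Brandes' algorithm. It assumes an unweighted, directed Web site graph and returns unnormalized scores; the adjacency-dictionary representation and the page names are hypothetical, and this is not the CiBC procedure itself.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized node betweenness for an unweighted directed graph.

    `graph` maps each node to an iterable of its out-neighbours
    (assumed representation: page -> pages it links to).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:              # w is discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Hypothetical four-page site: two equally short routes from "home" to "c".
site = {"home": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
scores = betweenness_centrality(site)
```

In this toy graph, pages "a" and "b" each carry half of the shortest paths from "home" to "c", so they share the only nonzero scores.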
1.2 Road Map
The rest of this paper is structured as follows: Section 2
discusses the related work. In Section 3, we formally define
the problem addressed in this paper. Section 4 presents the
proposed policy. Sections 5 and 6 present the simulation
testbed, examined policies, and performance measures.
Section 7 evaluates the proposed approach, and finally,
Section 8 concludes this paper.
2 RELEVANT WORK
2.1 Content Outsourcing Policies
As identified by earlier research efforts [9], [15], the choice of
the outsourced content has a crucial impact in terms of
CDN’s pricing [15] and CDN’s performance [9], and it is
quite complex and challenging, if we consider the dynamic
nature of the Web. A naive solution to this problem is to
outsource all the objects of the Web server content (full
mirroring) to all the surrogate servers. This may seem
feasible, since storage media and networking
technologies have greatly improved. However,
the respective demand from the market greatly surpasses
these advances. For instance, after the recent agreement
between Limelight Networks (see footnote 4) and YouTube,
under which the former is adopted as the content delivery
platform by YouTube, we can only infer, since this is
proprietary information, the huge storage requirements of
the surrogate servers. Moreover, the evolution toward
completely personalized TV (e.g., Stage6; see footnote 5)
reveals that the full content of
the origin servers cannot be outsourced as a
whole. Finally, the problem of updating such a huge
collection of Web objects is unmanageable. Thus, we have
to resort to a more “selective” outsourcing policy.
A few such content outsourcing policies have been
proposed in order to identify which objects to outsource for
replication to CDNs’ surrogate servers. These can be
categorized as follows:
. Empirical-based outsourcing. The Web server con-
tent administrators decide empirically about which
content will be outsourced [3].
. Popularity-based outsourcing. The most popular
objects are replicated to surrogate servers [37].
. Object-based outsourcing. The content is replicated
to surrogate servers in units of objects. Each object is
replicated to the surrogate server (under the storage
constraints) which gives the most performance gain
(greedy approach) [9], [37].
. Cluster-based outsourcing. The content is replicated
to surrogate servers in units of clusters [9], [14]. A
cluster is defined as a group of Web pages which
have some common characteristics with respect to
their content, the time of references, the number of
references, etc.
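To make two of the categories above concrete, the following minimal sketch contrasts popularity-based selection with a greedy object-based placement under storage constraints. All names, the request counts, and the gain table are hypothetical; in particular, the sketch assumes that the performance gain of each (object, server) pair is known in advance and stays fixed, whereas a real greedy policy would typically re-estimate gains after every placement.

```python
def popularity_based(request_counts, k):
    """Return the k most requested objects (ties broken by object name)."""
    ranked = sorted(request_counts, key=lambda obj: (-request_counts[obj], obj))
    return ranked[:k]


def greedy_object_based(gain, size, capacity):
    """Place (object, server) pairs in order of decreasing gain.

    gain:     {(obj, server): estimated performance gain}  (assumed static)
    size:     {obj: size in storage units}
    capacity: {server: free storage units}
    """
    free = dict(capacity)
    placement = {server: [] for server in capacity}
    # Visit candidate replicas from the highest gain to the lowest.
    for (obj, server), g in sorted(gain.items(), key=lambda kv: (-kv[1], kv[0])):
        # Replicate only if it helps and still fits on the surrogate server.
        if g > 0 and size[obj] <= free[server]:
            placement[server].append(obj)
            free[server] -= size[obj]
    return placement


# Hypothetical inputs for illustration only.
top = popularity_based({"x": 10, "y": 30, "z": 20}, k=2)
plan = greedy_object_based(
    gain={("a", "s1"): 5, ("b", "s1"): 4, ("c", "s1"): 3, ("a", "s2"): 1},
    size={"a": 2, "b": 2, "c": 1},
    capacity={"s1": 3, "s2": 2},
)
```

Here object "b" is skipped on server "s1" because it no longer fits once "a" is placed, while the smaller object "c" still does.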
Among the above content outsourcing policies, the object-
based one achieves high performance [9], [37]. However, as
pointed out by the authors of these policies, the huge
number of objects prevents it from being implemented in a
real application. On the other hand, the popularity-based
outsourcing policies do not select the most suitable objects
for outsourcing, since the most popular objects remain
popular for only a short time period [9]. Moreover, they require
quite a long time to collect reliable request statistics for each
object. Such a long interval may not be available
when new Web server content is published to the Internet
and should be protected from flash crowd events.
Thus, we resort to exploiting cluster-based
outsourcing policies. The cluster-based approach has also gained the
most attention in the research community [9]. In such an
approach, the clusters may be identified by using
conventional data clustering algorithms. However, due to the lack
of a uniform schema for Web documents and the dynamics of
Web data, the efficiency of these approaches is
unsatisfactory. Furthermore, most of them require administratively
tuned parameters (maximum cluster diameter, maximum
number of clusters) to decide the number of clusters, which
causes additional problems, since there is no a priori
knowledge about how many clusters of objects exist and of
what shape these clusters are.
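The dependence on an administratively tuned parameter can be illustrated with a deliberately simple sketch. It assumes a hypothetical one-dimensional feature per page (for example, a reference count) and a toy sequential clustering rule bounded by a maximum cluster diameter; it does not reproduce any of the cited algorithms.

```python
def cluster_by_diameter(values, max_diameter):
    """Group sorted 1-D values so that no cluster spans more than max_diameter."""
    clusters = []
    for v in sorted(values):
        # Extend the current cluster while its span stays within the limit.
        if clusters and v - clusters[-1][0] <= max_diameter:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


# Hypothetical feature values for six pages.
pages = [1, 2, 3, 10, 11, 20]
counts = {d: len(cluster_by_diameter(pages, d)) for d in (0.5, 2, 10)}
```

The same six values yield six, three, or two clusters depending solely on the chosen diameter, which is the difficulty described above when no a priori knowledge of the cluster structure is available.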
In contrast to the above approaches, we exploit
the Web server content structure and consider each cluster
as a Web page community, whose characteristics
reflect the dynamic and heterogeneous nature of the
Web. Specifically, our approach considers each page as a whole object,
rather than breaking down the Web page into information
pieces, and reveals mutual relationships among the
concerned Web data.
2.2 Identifying Web Page Communities
In the literature, there are several proposals for identifying
Web page communities [13], [16]. One of the key
distinguishing properties of these algorithms is the
degree of locality that is
used for assessing whether or not a page should be assigned
to a community. Regarding this feature, the methods for
identifying the communities can be summarized as follows:
KATSAROS ET AL.: CDNS CONTENT OUTSOURCING VIA GENERALIZED COMMUNITIES 3
4. http://www.limelightnetworks.com.
5. http://stage6.divx.com.