664 J. Comput. Sci. & Technol., July 2011, Vol.26, No.4
Technically speaking, the architecture of our ap-
proach is similar to [24]. The study in [24] dedicated
a large-scale clustering to ontology terms. It looked up
synonyms to establish a kernel, and extended the ker-
nel with identical terms in labels or identifiers. There
are two differences between our approach and [24]: 1)
we adopt OWL built-in vocabulary elements to build
the kernel, which have standard semantics and usually
are trustworthy, while [24] depended on thesauri, which
may be imprecise in some cases; and 2) we propose
ranking methods for the coreferent URIs, while [24]
used a uniform threshold to filter wrong URIs, which is
hard to decide across different domains.
Furthermore, our work is relevant to the problem
known as duplicate detection and coreference resolution
in the database and NLP fields, which have been exten-
sively studied in the past several decades
[3,13,15,25-26]
.
These methods are treated as similarity-based due to
lack of formal semantics to define equivalence. On the
Semantic Web, however, OWL provides well-defined se-
mantics for the equivalence relation, which must be
carefully considered. There also exist many works
that address instance matching in ontology (or schema)
matching
[27-28]
. But they have not been aware of the
characteristics of the Web, while our method uses the
dereferenceability of URIs to classify the coreference
confidence.
3 Finding Coreferent URIs
In the section, we will introduce our algorithm BOCr
for bootstrapping object coreferencing.
The overview of BOCr is illustrated in Fig.2, which
consists of two major iterations. Taking an object
URI u as input, the algorithm initializes an empty
queue Q and pushes u into Q. E is defined to
record the equivalence relations between URIs. For
building the kernel (Lines 4∼19), BOCr iteratively
picks up each unchecked URI v in Q, and starts
four parallel threads: CorefBySameAs(), CorefByIFP(),
CorefByFP() and CorefByCard(), in order to perform
coreferencing on v (see Lines 6∼13). Different equi-
valence relations with different marks (e.g., “same-as”)
are put into E through AddEquivRel() for further rank-
ing, while U
0
keeps the newly found URIs in each ite-
ration to avoid duplicate coreferencing in Q. The kernel
iteration converges when there is no URI in Q. Then,
the normalized textual descriptions of the URIs in the
kernel are extracted for extension. ExtendByDesc()
searches the objects with the same descriptions as the
ones in the kernel (Line 22). BOCr returns a set of
URIs U that denote the same object as u, and a set of
same-as, IFP, FP and cardinality relations E.
Input: A URI u that denotes an object.
Output: A coreferent URI set U , and a set of
same-as, IFP, FP and cardinality relations E.
1 U ← {u};
2 Q.Push(u); /* Q is a queue */
3 E ← ∅;
4 while Q 6= ∅ do /* Kernel */
5 v ← Q.Pop();
6 U
s
v
← CorefBySameAs(v);
7 AddEquivRel(E, v, U
s
v
, “same-as”);
8 U
i
v
← CorefByIFP(v);
9 AddEquivRel(E, v, U
i
v
, “IFP”);
10 U
f
v
← CorefByFP(v);
11 AddEquivRel(E, v, U
f
v
, “FP”);
12 U
c
v
← CorefByCard(v);
13 AddEquivRel(E, v, U
c
v
, “cardinality”);
14 U
0
← (U
s
v
∪ U
i
v
∪ U
f
v
∪ U
c
v
)\U;
15 for each v
0
∈ U
0
do
16 Q.Push(v
0
);
17 end
18 U ← U ∪ U
0
;
19 end
20 for each v ∈ U do /* Extension */
21 s ← Desc(v);
22 U
d
v
← ExtendByDesc(s);
23 U ← U ∪ U
d
v
;
24 end
25 return U, E;
Fig.2. Algorithm BOCr.
Example. Fig.3 illustrates a set of coreferent URIs
regarding Chris Bizer at the Free University Berlin,
which are represented with the solid pattern for the
kernel and the dotted one for the extension. Sup-
posing that sw:chris-bizer is the URI to start
from. By searching for the objects linking with
owl:sameAs, several coreferent URIs are discovered,
such as ontoworld:Chris Bizer. At the time, if we
have knowledge of some IFPs, which are shown with
asterisks (∗) in the figure, we find the objects hav-
ing the same values as well, e.g., bizer:chris. No-
tice that the construction of the kernel is an itera-
tive process, and the same-as or IFP relations (see the
solid lines in the figure) are recorded. For instance,
dblp:Christian Bizer is reached from bizer:chris,
and they have a same-as relation. Next, the kernel is
extended by using the descriptions of the URIs from
the kernel (e.g., “chris bizer”, “chris” and “christian
bizer”). This may cause errors, so ranking strategies to
reflect their confidences are required. We will introduce
our methods in Section 4.
3.1 Coreferencing by Same-As
Let U be a set of URI references, B be a set of blank