(iii) Erroneous tuple. For tuple t
3
, there is also no link from
Italy to Madrid in K (Fig. 2(d)). A negative answer
from the crowd to the question “Does Italy hasCapital
Madrid?” confirms that there is an error in t
3
, At this
point, however, we cannot decide which value in t
3
is wrong, Italy or Madrid. Katara will then extract
related evidences from K, such as Italy hasCapital Rome
and Spain hasCapital Madrid, and use these evidences
to generate a set of possible repairs for this tuple.
2
The pattern discovery module can be used to select the
more relevant kb for a given dataset. If the module cannot
find patterns for a table and a kb, Katara will terminate.
3. PRELIMINARIES
3.1 Knowledge Bases
We consider knowledge bases (kbs) as RDF-based data
consisting of resources, whose schema is defined using the
Resource Description Framework Schema (RDFS). A re-
source is a unique identifier for a real-word entity. For
instance, Rossi, the soccer player, and Rossi, the motorcy-
cle racer, are two different resources. Resources are rep-
resented using URIs (Uniform Resource Identifiers) in Yago
and DBPedia, and mids (machine-generated ids) in Freebase.
A literal is a string, date, or number, e.g., 1.78. A prop-
erty (a.k.a. relationship) is a binary predicate that repre-
sents a relationship between two resources or between a re-
source and a literal. We denote the property between re-
source x and resource (or literal) y by P (x, y). For instance,
locatedIn(Milan, Italy) indicates that Milan is in Italy.
An RDFS ontology distinguishes between classes and in-
stances. A class is a resource that represents a set of objects,
e.g., the class of countries. A resource that is a member of a
class is called an instance of that class. The type relationship
associates an instance to a class e.g., type(Italy) = country.
A more specific class c can be specified as a subclass of a
more general class d by using the statement subclassOf(c, d).
This means that all instances of c are also instances of d,
e.g., subclassOf(capital, location). Similarly, a property P
1
can be a sub-property of a property P
2
by the statement
subpropertyOf(P
1
, P
2
). Moreover, we assume that the prop-
erty between an entity and its readable name is labeled with
“label”, according to the RDFS schema.
Note that an RDF ontology naturally covers the case of a
kb without a class hierarchy such as IMDB. Also, a more ex-
pressive languages, such as OWL (Web Ontology Language),
can offer more reasoning opportunities at a higher computa-
tional cost. However, kbs in industry [14] as well as popular
ones, such as Yago, Freebase, and DBpedia, use RDFS.
3.2 Table Patterns
Consider a table T with attributes denoted by A
i
. There
are two basic semantic annotations on a relational table.
(1) Type of an attribute A
i
. The type of an attribute is an
annotation that represents the class of attribute values in A
i
.
For example, the type of attribute B in Fig. 1 is country.
(2) Relationship from attribute A
i
to attribute A
j
. The
relationship between two attributes is an annotation that rep-
resents how A
i
and A
j
are related through a directed binary
relationship. A
i
is called the subject of the relationship, and
A
j
is called the object of the relationship. For example, the
relationship from attribute B to C in Fig. 1 is hasCapital.
Table pattern. A table pattern (pattern for short) ϕ of a
table T is a labelled directed graph G(V, E) with nodes V
and edges E. Each node u ∈ V corresponds to an attribute
in T , possibly typed, and each edge (u, v) ∈ E from u to
v has a label P , denoting the relationship between two at-
tributes that u and v represent. For a pattern ϕ, we denote
by ϕ
u
a node u in ϕ, ϕ
(u,v)
an edge in ϕ, ϕ
V
all nodes in ϕ,
and ϕ
E
all edges in ϕ.
We assume that a table pattern is a connected graph.
When there exist multiple disconnected patterns, i.e., two
table patterns that do not share any common node, we treat
them independently. Hence, in the following, we focus on
discussing the case of a single table pattern.
Semantics. A tuple t of T matches a table pattern ϕ con-
taining m nodes {v
1
, . . . , v
m
} w.r.t. a kb K, denoted by
t |= ϕ, if there exist m distinct attributes {A
1
, . . . , A
m
} in
T and m resources {x
1
, . . . , x
m
} in K such that:
1. there is a one-to-one mapping from A
i
(and x
i
) to v
i
for i ∈ [1, m];
2. t[A
i
] ≈ x
i
and either type(x
i
) = type(v
i
) or
sub classOf(type(x
i
), type(v
i
));
3. for each edge (v
i
, v
j
) in ϕ
E
with property P , there
exists a property P
0
for the corresponding resources x
i
and x
j
in K such that P
0
= P or subpropertyOf(P
0
, P ).
Intuitively, if t matches ϕ, each corresponding attribute
value of t maps to a resource r in K under a domain-specific
similarity function (≈), and r is a (sub-)type of the type
given in ϕ (conditions 1 and 2). Moreover, for each property
P in a pattern, the property between the two corresponding
resources must be P or its sub-properties (condition 3).
Example 2: Consider tuple t
1
in Fig. 1 and pattern ϕ
s
in
Fig. 2(a). Tuple t
1
matches ϕ
s
, as in Fig. 2(b), since for each
attribute value (e.g., t
1
[A] = Rossi and t
1
[B] = Italy) there is
a resource in K that has a similar value with corresponding
type (person for Rossi and country for Italy) for conditions 1
and 2, and the property nationality holds from Rossi to Italy
in K (condition 3). Similarly, conditions 1–3 hold for other
attribute values in t
1
. Hence, t
1
|= ϕ
s
. 2
We say that a tuple t of T partially matches a table pattern
ϕ w.r.t. K, if at least one of condition 2 and condition 3
holds.
Example 3: Consider t
2
in Fig. 1 and ϕ
s
in Fig. 2(a).
We say that t
2
partially matches ϕ
s
, since the property
hasCapital from t
2
[B] = S. Africa to t
2
[C] = Pretoria does
not exist in K, i.e., condition 3 does not hold. 2
Given a table T , a kb K, and a pattern ϕ, Fig. 3 shows
how Katara works on T .
(1) Attributes covered by K. Attributes A–F in Fig. 1 are
covered by the pattern in Fig. 2(a). We consider two cases
for the tuples.
(a) Fully covered by K. We annotate such tuples as se-
mantically correct relative to ϕ and K (Fig. 2(b)).
(b) Partially covered by K. We use crowdsourcing to ver-
ify whether the non-covered data is caused by the
incompleteness of K (Fig. 2(c)) or by actual errors
(Fig. 2(d)).
(2) Attributes not covered by K. Attribute G in Fig. 1 is not