KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing


"KATARA 是一篇关于数据清洗系统的学术论文，该系统利用知识库和众包技术提高错误检测的准确性。" 正文: KATARA 是一个创新的数据清洗系统，它的设计目标是解决传统数据清洗方法在准确性和效率上的局限性。传统的数据清洗方法主要依赖于完整性约束、统计分析或机器学习算法，但这些方法在处理数据错误时可能不够精确，尤其在解决模糊性和不确定性方面。随着知识库（包括通用和企业内部的）以及众包市场的兴起，KATARA 提供了新的机会，可以在更大规模上提升数据清洗的准确度。 KATARA 的核心理念是结合知识库和众包的力量，解析表格的语义，使其与知识库保持一致，从而识别出正确和不正确的数据。具体来说，当给定一个数据表、一个知识库和一个众包群体时，KATARA 会解析表中的信息，尝试理解其背后的含义，并将数据与知识库进行对齐。这个过程有助于识别潜在的错误和不一致性，然后系统能够生成针对错误数据的 top-k 可能修复方案，以供进一步验证和修正。实验结果表明，KATARA 可以广泛应用于各种数据集和知识库，并且能够高效地执行数据清洗任务。这表明，无论是对于结构化数据还是非结构化数据，KATARA 都具有良好的适应性。通过利用知识库的权威信息和众包的智慧，KATARA 能够处理更复杂的数据问题，提高数据质量，这对于数据分析、决策支持和业务流程优化等应用至关重要。此外，KATARA 的工作流程可能包括以下步骤： 1. **数据预处理**：初步检查数据，识别可能的异常值或格式问题。 2. **知识库匹配**：将数据表中的实体与知识库中的对应项进行匹配，以确定数据的一致性。 3. **语义解析**：分析数据的含义，理解其在特定上下文中的正确表示。 4. **错误检测**：基于知识库和语义理解，发现可能的错误或不一致性。 5. **众包参与**：利用众包平台，邀请用户对错误数据进行验证和修复提议。 6. **修复建议生成**：根据众包反馈，生成修复错误数据的多种可能性。 7. **评估与确认**：对修复建议进行评估，选择最佳修复方案并应用到原始数据。 KATARA 的出现为数据清洗领域带来了新的视角，它强调了在数据管理中利用外部知识源的重要性，并展示了如何将人工智能与人类智能相结合，以提升数据清洗的效率和质量。对于数据科学家、数据库管理员以及任何依赖高质量数据进行决策的人来说，理解和掌握 KATARA 的工作原理和技术将大有裨益。

(iii) Erroneous tuple. For the tuple t that pairs Italy with Madrid, there is also no link from Italy to Madrid in K (Fig. 2(d)). A negative answer from the crowd to the question “Does Italy hasCapital Madrid?” confirms that there is an error in t. At this point, however, we cannot decide which value in t is wrong, Italy or Madrid. Katara will then extract related evidence from K, such as Italy hasCapital Rome and Spain hasCapital Madrid, and use this evidence to generate a set of possible repairs for this tuple.
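As a rough illustration of the repair step described above, the sketch below treats K as a set of (subject, property, object) triples and proposes repairs for a pair that the crowd has rejected. It looks up facts that keep one of the two values and change the other. The triple layout and the helper name `candidate_repairs` are assumptions made for this example; the excerpt does not say how Katara generates or ranks repairs.

```python
# Toy kb: a set of (subject, property, object) triples (assumed layout).
KB = {
    ("Italy", "hasCapital", "Rome"),
    ("Spain", "hasCapital", "Madrid"),
}

def candidate_repairs(subject, prop, obj, kb):
    """Return (subject, object) pairs from the kb that keep one of the two
    values and change the other. Hypothetical helper, not Katara's API."""
    repairs = []
    for s, p, o in sorted(kb):
        if p != prop:
            continue
        if s == subject and o != obj:
            repairs.append((s, o))   # keep the subject, change the object
        elif o == obj and s != subject:
            repairs.append((s, o))   # keep the object, change the subject
    return repairs

repairs = candidate_repairs("Italy", "hasCapital", "Madrid", KB)
print(repairs)  # [('Italy', 'Rome'), ('Spain', 'Madrid')]
```

Both candidates are consistent with the kb, which mirrors the point in the text that the evidence alone does not say whether Italy or Madrid is the wrong value.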

The pattern discovery module can be used to select the more relevant kb for a given dataset. If the module cannot find patterns for a table and a kb, Katara will terminate.

3. PRELIMINARIES

3.1 Knowledge Bases

We consider knowledge bases (kbs) as RDF-based data consisting of resources, whose schema is defined using the Resource Description Framework Schema (RDFS). A resource is a unique identifier for a real-world entity. For instance, Rossi, the soccer player, and Rossi, the motorcycle racer, are two different resources. Resources are represented using URIs (Uniform Resource Identifiers) in Yago and DBpedia, and mids (machine-generated ids) in Freebase. A literal is a string, date, or number, e.g., 1.78. A property (a.k.a. relationship) is a binary predicate that represents a relationship between two resources or between a resource and a literal. We denote the property between resource $x$ and resource (or literal) $y$ by $P(x, y)$. For instance, locatedIn(Milan, Italy) indicates that Milan is in Italy.
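To make the notation concrete, a kb of this kind can be pictured as a set of triples, with P(x, y) holding when the triple (x, P, y) is present. The following is a minimal sketch in plain Python. The resource identifiers, the `height` property, and the `holds` helper are invented for illustration and do not reflect how Yago, DBpedia, or Freebase store data.

```python
# Resources are identified by unique ids, so two entities that share a
# readable name stay distinct (the ids below are made up for this example).
ROSSI_PLAYER = "res:Rossi_soccer"
ROSSI_RACER = "res:Rossi_racer"

triples = {
    ("res:Milan", "locatedIn", "res:Italy"),   # resource to resource
    (ROSSI_PLAYER, "label", "Rossi"),          # resource to string literal
    (ROSSI_RACER, "label", "Rossi"),
    (ROSSI_PLAYER, "height", 1.78),            # resource to numeric literal
}

def holds(prop, x, y):
    """True when P(x, y) is asserted in the toy kb."""
    return (x, prop, y) in triples

print(holds("locatedIn", "res:Milan", "res:Italy"))  # True
print(holds("locatedIn", "res:Italy", "res:Milan"))  # False: P is directed
```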

An RDFS ontology distinguishes between classes and instances. A class is a resource that represents a set of objects, e.g., the class of countries. A resource that is a member of a class is called an instance of that class. The type relationship associates an instance with a class, e.g., type(Italy) = country. A more specific class $c$ can be specified as a subclass of a more general class $d$ by using the statement subclassOf($c$, $d$). This means that all instances of $c$ are also instances of $d$, e.g., subclassOf(capital, location). Similarly, a property $P_1$ can be a sub-property of a property $P_2$ by the statement subpropertyOf($P_1$, $P_2$). Moreover, we assume that the property between an entity and its readable name is labeled with “label”, according to the RDFS schema.
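The class hierarchy described above can be sketched as a reachability check over subclassOf statements. The example below is illustrative only: the statement subclassOf(country, location), the `TYPE` table, and the helper names are assumptions added for the demonstration, while subclassOf(capital, location) and type(Italy) = country come from the text.

```python
# Direct subclassOf statements as (subclass, superclass) pairs.
SUBCLASS_OF = {("capital", "location"), ("country", "location")}
# type(resource) for a few toy resources.
TYPE = {"Italy": "country", "Rome": "capital"}

def is_subclass(c, d, edges=SUBCLASS_OF):
    """True if c equals d or c reaches d through subclassOf statements."""
    seen, stack = set(), [c]
    while stack:
        cur = stack.pop()
        if cur == d:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(parent for child, parent in edges if child == cur)
    return False

def is_instance(resource, cls):
    """All instances of a subclass are also instances of its superclasses."""
    return resource in TYPE and is_subclass(TYPE[resource], cls)

print(is_instance("Rome", "location"))  # True
print(is_instance("Rome", "country"))   # False
```

The same traversal would work for subpropertyOf by swapping in a set of (sub-property, property) pairs.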

Note that an RDF ontology naturally covers the case of a kb without a class hierarchy such as IMDB. Also, more expressive languages, such as OWL (Web Ontology Language), can offer more reasoning opportunities at a higher computational cost. However, kbs in industry [14] as well as popular ones, such as Yago, Freebase, and DBpedia, use RDFS.

3.2 Table Patterns

Consider a table $T$ with attributes denoted by $A_i$. There are two basic semantic annotations on a relational table.

(1) Type of an attribute $A_i$. The type of an attribute is an annotation that represents the class of attribute values in $A_i$. For example, the type of attribute B in Fig. 1 is country.

(2) Relationship from attribute $A_i$ to attribute $A_j$. The relationship between two attributes is an annotation that represents how $A_i$ and $A_j$ are related through a directed binary relationship. $A_i$ is called the subject of the relationship, and $A_j$ is called the object of the relationship. For example, the relationship from attribute B to C in Fig. 1 is hasCapital.

Table pattern. A table pattern (pattern for short) $\phi$ of a table $T$ is a labelled directed graph $G(V, E)$ with nodes $V$ and edges $E$. Each node $u \in V$ corresponds to an attribute in $T$, possibly typed, and each edge $(u, v) \in E$ from $u$ to $v$ has a label $P$, denoting the relationship between the two attributes that $u$ and $v$ represent. For a pattern $\phi$, we denote by $\phi_u$ a node $u$ in $\phi$, $\phi_{(u,v)}$ an edge in $\phi$, $\phi_V$ all nodes in $\phi$, and $\phi_E$ all edges in $\phi$.

We assume that a table pattern is a connected graph. When there exist multiple disconnected patterns, i.e., two table patterns that do not share any common node, we treat them independently. Hence, in the following, we focus on discussing the case of a single table pattern.
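One way to picture a table pattern is as a small graph structure keyed by attribute names. The sketch below is an assumption-laden illustration, not the paper's data structure: the class name `TablePattern`, its fields, and the type assigned to attribute C are hypothetical, while the types person and country and the labels nationality and hasCapital are taken from the surrounding examples.

```python
from dataclasses import dataclass, field

@dataclass
class TablePattern:
    """Labelled directed graph over table attributes (illustrative layout)."""
    node_types: dict = field(default_factory=dict)  # attribute -> type or None
    edges: dict = field(default_factory=dict)       # (u, v) -> property label

    def nodes(self):
        return set(self.node_types)

    def is_connected(self):
        """Check connectivity, ignoring edge direction."""
        nodes = self.nodes()
        if not nodes:
            return True
        adj = {n: set() for n in nodes}
        for u, v in self.edges:
            adj[u].add(v)
            adj[v].add(u)
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
        return seen == nodes

pattern = TablePattern(
    node_types={"A": "person", "B": "country", "C": "capital"},
    edges={("A", "B"): "nationality", ("B", "C"): "hasCapital"},
)
print(pattern.is_connected())  # True
```

A pattern that fails `is_connected` would, following the text, be split into its connected components and each component handled on its own.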

Semantics. A tuple $t$ of $T$ matches a table pattern $\phi$ containing $m$ nodes $\{v_1, \ldots, v_m\}$ w.r.t. a kb $K$, denoted by $t \models \phi$, if there exist $m$ distinct attributes $\{A_1, \ldots, A_m\}$ in $T$ and $m$ resources $\{x_1, \ldots, x_m\}$ in $K$ such that:
1. there is a one-to-one mapping from $A_i$ (and $x_i$) to $v_i$ for $i \in [1, m]$;
2. $t[A_i] \approx x_i$ and either $\mathrm{type}(x_i) = \mathrm{type}(v_i)$ or $\mathrm{subclassOf}(\mathrm{type}(x_i), \mathrm{type}(v_i))$;
3. for each edge $(v_i, v_j)$ in $\phi_E$ with property $P$, there exists a property $P'$ for the corresponding resources $x_i$ and $x_j$ in $K$ such that $P' = P$ or $\mathrm{subpropertyOf}(P', P)$.

Intuitively, if $t$ matches $\phi$, each corresponding attribute value of $t$ maps to a resource $r$ in $K$ under a domain-specific similarity function ($\approx$), and $r$ is a (sub-)type of the type given in $\phi$ (conditions 1 and 2). Moreover, for each property $P$ in a pattern, the property between the two corresponding resources must be $P$ or its sub-properties (condition 3).
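The three conditions can be checked mechanically once a kb and a pattern are fixed. The sketch below is a simplified illustration under several assumptions: the toy kb contents are invented, the similarity function is replaced by exact matching after trimming whitespace, each value resolves to at most one resource, and condition 1 is implicit because each attribute name doubles as its pattern node. None of the names are from the paper.

```python
# Toy kb (all contents assumed for illustration).
TYPE = {"Rossi": "person", "Italy": "country", "Rome": "capital",
        "Madrid": "capital"}
SUBCLASS_OF = {("capital", "location")}   # (subclass, superclass)
SUBPROPERTY_OF = set()                    # (sub-property, property)
FACTS = {("Rossi", "nationality", "Italy"), ("Italy", "hasCapital", "Rome")}

def reaches(a, b, edges):
    """True if a == b or a reaches b by following (child, parent) edges."""
    seen, stack = set(), [a]
    while stack:
        cur = stack.pop()
        if cur == b:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(p for c, p in edges if c == cur)
    return False

def resolve(value):
    """Stand-in for the similarity function: exact match after trimming."""
    v = value.strip()
    return v if v in TYPE else None

def check(row, node_types, edges):
    """Return (condition 2 holds, condition 3 holds) for one tuple."""
    res = {a: resolve(row[a]) for a in node_types}
    cond2 = all(
        r is not None
        and (node_types[a] is None or reaches(TYPE[r], node_types[a], SUBCLASS_OF))
        for a, r in res.items()
    )
    cond3 = all(
        any(s == res[u] and o == res[v] and reaches(p, prop, SUBPROPERTY_OF)
            for s, p, o in FACTS)
        for (u, v), prop in edges.items()
    )
    return cond2, cond3

def matches(row, node_types, edges):
    return all(check(row, node_types, edges))

PATTERN_NODES = {"A": "person", "B": "country", "C": "capital"}
PATTERN_EDGES = {("A", "B"): "nationality", ("B", "C"): "hasCapital"}
row_ok = {"A": "Rossi", "B": "Italy", "C": "Rome"}
row_bad = {"A": "Rossi", "B": "Italy", "C": "Madrid"}
print(matches(row_ok, PATTERN_NODES, PATTERN_EDGES))   # True
print(check(row_bad, PATTERN_NODES, PATTERN_EDGES))    # (True, False)
```

In the second row every value has a resource of the right type, so condition 2 holds, but the kb has no hasCapital fact from Italy to Madrid, so condition 3 fails.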

Example 2: Consider the tuple $t$ in Fig. 1 whose values include Rossi and Italy, and the pattern $\phi$ in Fig. 2(a). Tuple $t$ matches $\phi$, as in Fig. 2(b), since for each attribute value (e.g., $t[A]$ = Rossi and $t[B]$ = Italy) there is a resource in $K$ that has a similar value with the corresponding type (person for Rossi and country for Italy) for conditions 1 and 2, and the property nationality holds from Rossi to Italy in $K$ (condition 3). Similarly, conditions 1–3 hold for the other attribute values in $t$. Hence, $t \models \phi$. □

We say that a tuple $t$ of $T$ partially matches a table pattern $\phi$ w.r.t. $K$, if at least one of condition 2 and condition 3 does not hold.

Example 3: Consider the tuple $t$ in Fig. 1 with $t[B]$ = S. Africa and the pattern $\phi$ in Fig. 2(a). We say that $t$ partially matches $\phi$, since the property hasCapital from $t[B]$ = S. Africa to $t[C]$ = Pretoria does not exist in $K$, i.e., condition 3 does not hold. □
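A partial match of this kind can be surfaced by listing the pattern edges that have no supporting fact in the kb. The sketch below assumes a toy kb with a single fact and ignores sub-properties and approximate matching. The helper name `missing_edges` is hypothetical.

```python
# Toy kb with one asserted fact (assumed contents).
FACTS = {("Italy", "hasCapital", "Rome")}

def missing_edges(row, edges, facts):
    """List pattern edges whose property is not asserted in the kb for this row."""
    return [
        (row[u], prop, row[v])
        for (u, v), prop in edges.items()
        if (row[u], prop, row[v]) not in facts
    ]

edges = {("B", "C"): "hasCapital"}
full = missing_edges({"B": "Italy", "C": "Rome"}, edges, FACTS)
partial = missing_edges({"B": "S. Africa", "C": "Pretoria"}, edges, FACTS)
print(full)     # []
print(partial)  # [('S. Africa', 'hasCapital', 'Pretoria')]
```

An empty list means condition 3 holds for the row. A non-empty list names the relationships that the kb cannot confirm on its own.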

Given a table $T$, a kb $K$, and a pattern $\phi$, Fig. 3 shows how Katara works on $T$.

(1) Attributes covered by K. Attributes A–F in Fig. 1 are covered by the pattern in Fig. 2(a). We consider two cases for the tuples.
(a) Fully covered by K. We annotate such tuples as semantically correct relative to $\phi$ and K (Fig. 2(b)).
(b) Partially covered by K. We use crowdsourcing to verify whether the non-covered data is caused by the incompleteness of K (Fig. 2(c)) or by actual errors (Fig. 2(d)).

(2) Attributes not covered by K. Attribute G in Fig. 1 is not
