没有合适的资源?快使用搜索试试~ 我知道了~
首页TRIP:交互检索-推断数据补全法:提升效率与召回率
TRIP:交互检索-推断数据补全法:提升效率与召回率
0 下载量 66 浏览量
更新于2024-08-26
收藏 2.25MB PDF 举报
TRIP: 交互式检索-推断数据插补方法是一项针对数据插补问题的创新解决方案,它旨在解决数据库中字符串属性值缺失值填充的问题。当前,大多数数据插补技术依赖于推断方法,这类方法往往由于仅限于数据集内部的信息,导致在处理缺失值时召回率不高。为了提升精度,研究人员提出了检索方法,通过从外部资源如万维网获取信息,这种方法虽然能提高召回率,但代价是大量的搜索查询,带来了高昂的开销。 本文的核心贡献在于研究了推断方法和检索方法之间的互补关系。作者发现,尽管检索大量缺失值能显著提升推断方法的性能,但并非所有值都需检索。因此,TRIP方法的目标是通过交互式的方式,即在检索和推断之间进行选择,找到最少的缺失值进行检索,以最大化利用推断能力。这种方法的关键在于设计一个优化策略,能够在保证召回率的同时,最大限度地减少外部资源的使用。 TRIP算法的设计着重于确定性数据插补中的检索推理调度,理论上证明了其优化方案的最优性。然而,当面临τ约束的随机数据插补(τ-SDI)这样的特定场景时,最优方案可能无法实现,但TRIP仍能找到接近最优的解决方案,确保了期望的性能。 实验证据来自对四个数据集的大量实验,结果显示,TRIP平均只需检索大约20%的缺失值,却能达到与基于检索方法相当的高召回率。这表明,TRIP不仅有效地减少了外部资源的消耗,还能保持良好的插补效果,对于实际应用中的数据完整性维护具有重要意义。TRIP为数据插补领域提供了一个有效且资源节约的解决方案,对于提升数据质量、降低维护成本具有积极的影响。
资源详情
资源推荐
IEEE TRANSACTION ON KNOWLEDGE AND DATA ENGINEERING 3
missing values. Nonetheless, our proposed approach can
achieve an “expected-optimal” scheme.
In summary, our main contributions are as follows:
1) We propose TRIP, a hybrid retrieving-inferring impu-
tation approach for missing non-quantitive string data.
It fills an incomplete table by alternately retrieving and
inferring, which reaches as high imputation recall as the
web-based retrieving approach with much less cost.
2) We formulate the problem of identifying the optimal
scheduling scheme with TRIP in DDI and SDI, and then
define the optimal schemes in both contexts.
3) We devise a heuristic algorithm to identify an optimal
scheme in DDI. We also propose a greedy algorithm to
identify an expected-optimal scheme in SDI.
Roadmap ∶ We give preliminaries on previous approaches
and then define the problem in TRIP in Sec. 2. We discuss
on finding optimal scheduling schemes for TRIP in DDI in
Sec. 3, and then generalize the situation to SDI in Sec. 4.
The experiments are reported in Sec. 5, and related work
is covered in Sec. 6. We conclude in Sec. 7.
2 PRELIMINARIES AND PROBLEMS
In this section, we briefly review inferring-based approach-
es and retrieving-based approaches for missing string data
imputation, and then define the problem in TRIP.
2.1 Inferring-based Approaches
Many inferring-based approaches are for missing quanti-
tive data, which will not be discussed in this paper. For
non-quantitive data, i.e., missing string data, the existing
Inferring-based approaches [35], [31] rely on a variety of
constraints such as Functional Dependencies (FDs) [1],
Conditional FDs [6], [11], Approximate FDs [28], [19]
and Inclusion Dependencies (INCs) [5] to infer missing
values from the complete part of the data set. For example,
given a table with four attributes [A, B, C, D], assume an
approximate CFD (A, B, C =“c
0
”→ D) hold with 0.85
confidence, which means that when the value of attribute
C equals to “c
0
”, 85% of the time that the values of A
and B will together decide the value of D in the same
tuple. Hence, if there is one tuple T
1
= [a
3
, b
4
, c
0
, d
1
, ⋯],
and another tuple T
2
= [a
3
, b
4
, c
0
, ◻], where ◻ denotes a
missing value under D, we can infer that ◻ = d
1
with a
0.85 confidence.
However, pure inferring-based imputation method often
fails to fill in some missing values as not all missing values
are inferable based on the inference rules corresponding to
those constraints. We say a missing value is inferable if
there is at least one way (i.e., one inference rule) to infer
its value from the other existing or inferable values.
2.2 Retrieving-based Approaches
Many retrieving-based approaches [23], [21], [10], [16],
[36], [17], [29] focus on retrieving missing values from
web lists or web tables only, but ignore all the plain-text
web documents. Recently, a general web-based retrieving
approach [23], [24] was proposed to retrieve missing values
from all kinds of web pages, which was proved to reach
a higher imputation recall than the other approaches. For
each blank in a data set, this general approach formulates
proper imputation search queries to retrieve relevant web
documents, from which the missing value will then be
extracted [25] to fill in the blank in the data set.
Similar to inference rules, a qualified imputation query
for a missing value can be formulated according to the
attribute dependencies holding on the data set. For example,
given Name → Email, we can formulate an imputation
query for a missing Email with the Name value in the
same instance. For instance, assume there is one tuple T
2
=
[a
1
, ◻, ⋯], where a
1
is a value under Name, and ◻ is a
blank under Email, the formulated imputation queries can
be (“a
1
’ email is”) or (“a
1
+ email + mailed to”), where
the used auxiliary information like the pattern (“xxx’s email
is”) [2], [7] or the context terms “email”, “mailed to” [22]
can be learned from complete instances, and then used for
better describing the required context.
Retrieving-based approaches can reach a much higher
imputation recall than inferring-based approaches on a wide
range of databases [23], [24], but they also cause a large
overhead for issuing a large number of queries to the Web.
2.3 TRIP: Interactive Retrieving & Inferring
As observed from our experiments, the cost of an inferring
operation (<0.001s) is negligible compared to that of a
web-based retrieving operation (nearly 1s). Based on this
observation, we propose an interactive retrieving and in-
ferring data imputation approach, TRIP, which can benefit
from the high recall of retrieving-based approach and the
efficiency of inferring-based approach. While an inferring
step in TRIP fills in all currently inferable missing values to
the greatest extent, the succeeding retrieving step retrieves
a set of selected missing values that make some unfilled
missing values become inferable for the next inferring step.
Inferring and retrieving are alternately performed until no
more missing values can be imputed.
The interaction between retrieving and inferring can be
simply represented by a sequence of missing value sets,
each of which contains values either for retrieving at a
retrieving step or for inferring at an inferring step. We call
this sequence of missing value sets as scheduling scheme.
Given an incomplete table with a set of missing values (or
blanks) O, an imputation scheduling scheme w.r.t. O can be
denoted as S = ⟨I
0
, R
1
, I
1
, R
2
, I
2
, ⋯, R
n
, I
n
⟩, where R
i
is a set of missing values for retrieving in the i-th retrieving
step and I
i
is a set of missing values for inferring in the
i-th inferring step, ∀i ≠ j, R
i
∩ I
i
= R
i
∩ I
j
= R
i
∩ R
j
=
I
i
∩ I
j
= ∅, and ∀i, R
i
⊆ O, I
i
⊆ O.
Let O
′
(S) = (
⋃
0≤i≤n
I
i
)
⋃
(
⋃
1≤i≤n
R
i
) denote the set of
filled values by TRIP following scheme S. The imputation
recall of TRIP by using S is: recall(S) =
∣O
′
(S)∣
∣O∣
, while its
cost can be roughly represented by the number of retrieved
values in S, i.e., cost(S) =
∑
1≤i≤n
∣R
i
∣, where ∣ ⋅ ∣ is the
size of a set. Our task is to identify an optimal scheduling
scheme that results in the highest imputation recall with the
minimum cost. More formally,
剩余13页未读,继续阅读
weixin_38632763
- 粉丝: 7
- 资源: 944
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- OptiX传输试题与SDH基础知识
- C++Builder函数详解与应用
- Linux shell (bash) 文件与字符串比较运算符详解
- Adam Gawne-Cain解读英文版WKT格式与常见投影标准
- dos命令详解:基础操作与网络测试必备
- Windows 蓝屏代码解析与处理指南
- PSoC CY8C24533在电动自行车控制器设计中的应用
- PHP整合FCKeditor网页编辑器教程
- Java Swing计算器源码示例:初学者入门教程
- Eclipse平台上的可视化开发:使用VEP与SWT
- 软件工程CASE工具实践指南
- AIX LVM详解:网络存储架构与管理
- 递归算法解析:文件系统、XML与树图
- 使用Struts2与MySQL构建Web登录验证教程
- PHP5 CLI模式:用PHP编写Shell脚本教程
- MyBatis与Spring完美整合:1.0.0-RC3详解
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功