TRIP：交互检索-推断数据补全法：提升效率与召回率

161 浏览量更新于2024-08-26 收藏 2.25MB PDF 举报

TRIP: 交互式检索-推断数据插补方法是一项针对数据插补问题的创新解决方案，它旨在解决数据库中字符串属性值缺失值填充的问题。当前，大多数数据插补技术依赖于推断方法，这类方法往往由于仅限于数据集内部的信息，导致在处理缺失值时召回率不高。为了提升精度，研究人员提出了检索方法，通过从外部资源如万维网获取信息，这种方法虽然能提高召回率，但代价是大量的搜索查询，带来了高昂的开销。本文的核心贡献在于研究了推断方法和检索方法之间的互补关系。作者发现，尽管检索大量缺失值能显著提升推断方法的性能，但并非所有值都需检索。因此，TRIP方法的目标是通过交互式的方式，即在检索和推断之间进行选择，找到最少的缺失值进行检索，以最大化利用推断能力。这种方法的关键在于设计一个优化策略，能够在保证召回率的同时，最大限度地减少外部资源的使用。 TRIP算法的设计着重于确定性数据插补中的检索推理调度，理论上证明了其优化方案的最优性。然而，当面临τ约束的随机数据插补（τ-SDI）这样的特定场景时，最优方案可能无法实现，但TRIP仍能找到接近最优的解决方案，确保了期望的性能。实验证据来自对四个数据集的大量实验，结果显示，TRIP平均只需检索大约20%的缺失值，却能达到与基于检索方法相当的高召回率。这表明，TRIP不仅有效地减少了外部资源的消耗，还能保持良好的插补效果，对于实际应用中的数据完整性维护具有重要意义。TRIP为数据插补领域提供了一个有效且资源节约的解决方案，对于提升数据质量、降低维护成本具有积极的影响。

IEEE TRANSACTION ON KNOWLEDGE AND DATA ENGINEERING 3

missing values. Nonetheless, our proposed approach can

achieve an “expected-optimal” scheme.

In summary, our main contributions are as follows:

1) We propose TRIP, a hybrid retrieving-inferring impu-

tation approach for missing non-quantitive string data.

It ﬁlls an incomplete table by alternately retrieving and

inferring, which reaches as high imputation recall as the

web-based retrieving approach with much less cost.

2) We formulate the problem of identifying the optimal

scheduling scheme with TRIP in DDI and SDI, and then

deﬁne the optimal schemes in both contexts.

3) We devise a heuristic algorithm to identify an optimal

scheme in DDI. We also propose a greedy algorithm to

identify an expected-optimal scheme in SDI.

Roadmap ∶ We give preliminaries on previous approaches

and then deﬁne the problem in TRIP in Sec. 2. We discuss

on ﬁnding optimal scheduling schemes for TRIP in DDI in

Sec. 3, and then generalize the situation to SDI in Sec. 4.

The experiments are reported in Sec. 5, and related work

is covered in Sec. 6. We conclude in Sec. 7.

2 PRELIMINARIES AND PROBLEMS

In this section, we brieﬂy review inferring-based approach-

es and retrieving-based approaches for missing string data

imputation, and then deﬁne the problem in TRIP.

2.1 Inferring-based Approaches

Many inferring-based approaches are for missing quanti-

tive data, which will not be discussed in this paper. For

non-quantitive data, i.e., missing string data, the existing

Inferring-based approaches [35], [31] rely on a variety of

constraints such as Functional Dependencies (FDs) [1],

Conditional FDs [6], [11], Approximate FDs [28], [19]

and Inclusion Dependencies (INCs) [5] to infer missing

values from the complete part of the data set. For example,

given a table with four attributes [A, B, C, D], assume an

approximate CFD (A, B, C =“c

”→ D) hold with 0.85

conﬁdence, which means that when the value of attribute

C equals to “c

”, 85% of the time that the values of A

and B will together decide the value of D in the same

tuple. Hence, if there is one tuple T

= [a

, b

, c

, d

, ⋯],

and another tuple T

= [a

, b

, c

, ◻], where ◻ denotes a

missing value under D, we can infer that ◻ = d

with a

0.85 conﬁdence.

However, pure inferring-based imputation method often

fails to ﬁll in some missing values as not all missing values

are inferable based on the inference rules corresponding to

those constraints. We say a missing value is inferable if

there is at least one way (i.e., one inference rule) to infer

its value from the other existing or inferable values.

2.2 Retrieving-based Approaches

Many retrieving-based approaches [23], [21], [10], [16],

[36], [17], [29] focus on retrieving missing values from

web lists or web tables only, but ignore all the plain-text

web documents. Recently, a general web-based retrieving

approach [23], [24] was proposed to retrieve missing values

from all kinds of web pages, which was proved to reach

a higher imputation recall than the other approaches. For

each blank in a data set, this general approach formulates

proper imputation search queries to retrieve relevant web

documents, from which the missing value will then be

extracted [25] to ﬁll in the blank in the data set.

Similar to inference rules, a qualiﬁed imputation query

for a missing value can be formulated according to the

attribute dependencies holding on the data set. For example,

given Name → Email, we can formulate an imputation

query for a missing Email with the Name value in the

same instance. For instance, assume there is one tuple T

, ◻, ⋯], where a

is a value under Name, and ◻ is a

blank under Email, the formulated imputation queries can

be (“a

’ email is”) or (“a

+ email + mailed to”), where

the used auxiliary information like the pattern (“xxx’s email

is”) [2], [7] or the context terms “email”, “mailed to” [22]

can be learned from complete instances, and then used for

better describing the required context.

Retrieving-based approaches can reach a much higher

imputation recall than inferring-based approaches on a wide

range of databases [23], [24], but they also cause a large

overhead for issuing a large number of queries to the Web.

2.3 TRIP: Interactive Retrieving & Inferring

As observed from our experiments, the cost of an inferring

operation (<0.001s) is negligible compared to that of a

web-based retrieving operation (nearly 1s). Based on this

observation, we propose an interactive retrieving and in-

ferring data imputation approach, TRIP, which can beneﬁt

from the high recall of retrieving-based approach and the

efﬁciency of inferring-based approach. While an inferring

step in TRIP ﬁlls in all currently inferable missing values to

the greatest extent, the succeeding retrieving step retrieves

a set of selected missing values that make some unﬁlled

missing values become inferable for the next inferring step.

Inferring and retrieving are alternately performed until no

more missing values can be imputed.

The interaction between retrieving and inferring can be

simply represented by a sequence of missing value sets,

each of which contains values either for retrieving at a

retrieving step or for inferring at an inferring step. We call

this sequence of missing value sets as scheduling scheme.

Given an incomplete table with a set of missing values (or

blanks) O, an imputation scheduling scheme w.r.t. O can be

denoted as S = ⟨I

, R

, I

, R

, I

, ⋯, R

, I

⟩, where R

is a set of missing values for retrieving in the i-th retrieving

step and I

is a set of missing values for inferring in the

i-th inferring step, ∀i ≠ j, R

∩ I

= R

∩ I

= R

∩ R

∩ I

= ∅, and ∀i, R

⊆ O, I

⊆ O.

Let O

′

(S) = (

⋃

0≤i≤n

)

⋃

(

⋃

1≤i≤n

) denote the set of

ﬁlled values by TRIP following scheme S. The imputation

recall of TRIP by using S is: recall(S) =

∣O

′

(S)∣

∣O∣

, while its

cost can be roughly represented by the number of retrieved

values in S, i.e., cost(S) =

∑

1≤i≤n

∣R

∣, where ∣ ⋅ ∣ is the

size of a set. Our task is to identify an optimal scheduling

scheme that results in the highest imputation recall with the

minimum cost. More formally,

剩余13页未读，继续阅读

weixin_38632763

粉丝: 7
资源: 944

TRIP：交互检索-推断数据补全法：提升效率与召回率

内存接口电气验证和调试:DDRA 和 DDR-LP4 产品技术资料

西门子PLC工程实例源码第124期：S7-200直线插补程序.rar

S7-200直线插补程序.zip西门子PLC编程实例程序源码下载

我有两个伺服，走的ethercat协议，用PLC的ST语言帮我写一个伺服电机直线插补算法程序，注意：没有指令库，需要自己写算法

在数据插补种，如何根据不同的数据类型、数据特征和缺失情况，选择合适的数据插补方法

缺失值处理--最近邻插补

用PLC梯形图写出直线插补程序

比较直线插补和圆弧插补的异同

python多重插补流程图

风险分区识别系统的数据收集和预处理有哪些常用的方法？

最新资源