IEEE TRANSACTION ON KNOWLEDGE AND DATA ENGINEERING 3
missing values. Nonetheless, our proposed approach can
achieve an “expected-optimal” scheme.
In summary, our main contributions are as follows:
1) We propose TRIP, a hybrid retrieving-inferring impu-
tation approach for missing non-quantitive string data.
It fills an incomplete table by alternately retrieving and
inferring, which reaches as high imputation recall as the
web-based retrieving approach with much less cost.
2) We formulate the problem of identifying the optimal
scheduling scheme with TRIP in DDI and SDI, and then
define the optimal schemes in both contexts.
3) We devise a heuristic algorithm to identify an optimal
scheme in DDI. We also propose a greedy algorithm to
identify an expected-optimal scheme in SDI.
Roadmap ∶ We give preliminaries on previous approaches
and then define the problem in TRIP in Sec. 2. We discuss
on finding optimal scheduling schemes for TRIP in DDI in
Sec. 3, and then generalize the situation to SDI in Sec. 4.
The experiments are reported in Sec. 5, and related work
is covered in Sec. 6. We conclude in Sec. 7.
2 PRELIMINARIES AND PROBLEMS
In this section, we briefly review inferring-based approach-
es and retrieving-based approaches for missing string data
imputation, and then define the problem in TRIP.
2.1 Inferring-based Approaches
Many inferring-based approaches are for missing quanti-
tive data, which will not be discussed in this paper. For
non-quantitive data, i.e., missing string data, the existing
Inferring-based approaches [35], [31] rely on a variety of
constraints such as Functional Dependencies (FDs) [1],
Conditional FDs [6], [11], Approximate FDs [28], [19]
and Inclusion Dependencies (INCs) [5] to infer missing
values from the complete part of the data set. For example,
given a table with four attributes [A, B, C, D], assume an
approximate CFD (A, B, C =“c
0
”→ D) hold with 0.85
confidence, which means that when the value of attribute
C equals to “c
0
”, 85% of the time that the values of A
and B will together decide the value of D in the same
tuple. Hence, if there is one tuple T
1
= [a
3
, b
4
, c
0
, d
1
, ⋯],
and another tuple T
2
= [a
3
, b
4
, c
0
, ◻], where ◻ denotes a
missing value under D, we can infer that ◻ = d
1
with a
0.85 confidence.
However, pure inferring-based imputation method often
fails to fill in some missing values as not all missing values
are inferable based on the inference rules corresponding to
those constraints. We say a missing value is inferable if
there is at least one way (i.e., one inference rule) to infer
its value from the other existing or inferable values.
2.2 Retrieving-based Approaches
Many retrieving-based approaches [23], [21], [10], [16],
[36], [17], [29] focus on retrieving missing values from
web lists or web tables only, but ignore all the plain-text
web documents. Recently, a general web-based retrieving
approach [23], [24] was proposed to retrieve missing values
from all kinds of web pages, which was proved to reach
a higher imputation recall than the other approaches. For
each blank in a data set, this general approach formulates
proper imputation search queries to retrieve relevant web
documents, from which the missing value will then be
extracted [25] to fill in the blank in the data set.
Similar to inference rules, a qualified imputation query
for a missing value can be formulated according to the
attribute dependencies holding on the data set. For example,
given Name → Email, we can formulate an imputation
query for a missing Email with the Name value in the
same instance. For instance, assume there is one tuple T
2
=
[a
1
, ◻, ⋯], where a
1
is a value under Name, and ◻ is a
blank under Email, the formulated imputation queries can
be (“a
1
’ email is”) or (“a
1
+ email + mailed to”), where
the used auxiliary information like the pattern (“xxx’s email
is”) [2], [7] or the context terms “email”, “mailed to” [22]
can be learned from complete instances, and then used for
better describing the required context.
Retrieving-based approaches can reach a much higher
imputation recall than inferring-based approaches on a wide
range of databases [23], [24], but they also cause a large
overhead for issuing a large number of queries to the Web.
2.3 TRIP: Interactive Retrieving & Inferring
As observed from our experiments, the cost of an inferring
operation (<0.001s) is negligible compared to that of a
web-based retrieving operation (nearly 1s). Based on this
observation, we propose an interactive retrieving and in-
ferring data imputation approach, TRIP, which can benefit
from the high recall of retrieving-based approach and the
efficiency of inferring-based approach. While an inferring
step in TRIP fills in all currently inferable missing values to
the greatest extent, the succeeding retrieving step retrieves
a set of selected missing values that make some unfilled
missing values become inferable for the next inferring step.
Inferring and retrieving are alternately performed until no
more missing values can be imputed.
The interaction between retrieving and inferring can be
simply represented by a sequence of missing value sets,
each of which contains values either for retrieving at a
retrieving step or for inferring at an inferring step. We call
this sequence of missing value sets as scheduling scheme.
Given an incomplete table with a set of missing values (or
blanks) O, an imputation scheduling scheme w.r.t. O can be
denoted as S = ⟨I
0
, R
1
, I
1
, R
2
, I
2
, ⋯, R
n
, I
n
⟩, where R
i
is a set of missing values for retrieving in the i-th retrieving
step and I
i
is a set of missing values for inferring in the
i-th inferring step, ∀i ≠ j, R
i
∩ I
i
= R
i
∩ I
j
= R
i
∩ R
j
=
I
i
∩ I
j
= ∅, and ∀i, R
i
⊆ O, I
i
⊆ O.
Let O
′
(S) = (
⋃
0≤i≤n
I
i
)
⋃
(
⋃
1≤i≤n
R
i
) denote the set of
filled values by TRIP following scheme S. The imputation
recall of TRIP by using S is: recall(S) =
∣O
′
(S)∣
∣O∣
, while its
cost can be roughly represented by the number of retrieved
values in S, i.e., cost(S) =
∑
1≤i≤n
∣R
i
∣, where ∣ ⋅ ∣ is the
size of a set. Our task is to identify an optimal scheduling
scheme that results in the highest imputation recall with the
minimum cost. More formally,