distortion of a k-anonymous view. Table 1 gives a good example to
illustrate the differences between global recoding and local recod-
ing. Fig. 1 gives domain and value generalization hierarchies of
attributes Gender and Pcode.
Many local recoding algorithms have been proposed to achieve
k-anonymity, such as K-Anonymization by Clustering in Attribute
hierarchies algorithm (KACA) [6], Top-Down algorithm [7], etc.
Among these algorithms, clustering-based generalization algo-
rithms are a kind of fine local recoding methods for k-anonymizing
microdata. The main idea of the clustering-based generalization
algorithms is to partition original dataset into some equivalence
classes based on the predefined distance, and then generalize all
tuples in each equivalence class into their common generalized
tuples. To achieve clustering, we need to define some
measurements to measure the distance between tuples. To the best
of our knowledge, the weighted hierarchical distance [6] based on
generalized tree is a reasonable measurement for generalization
distance between tuples. We define the generalization distance
which is similar to the weighted hierarchical distance in next
section.
2.2.2. Distance measurement in generalization
In order to define the distance between tuples in generalization
table, we need to define the concepts of closest common general-
ization, common generalization tuple, and distortion of generaliza-
tion of tuple. In this section, we first give the definitions of the
three concepts, and then define the generalization distance be-
tween two tuples.
Definition 3 (Closest Common Generalization). Let A be an attribute
of table T, VGHT
A
be the value generalization hierarchy tree of A, a
1
and a
2
be two values of attribute A. The closest common gener-
alization of a
1
and a
2
(denoted by CCG (a
1
,a
2
)) is defined as
CCGða
1
;a
2
Þ¼
a
1
if a
1
¼ a
2
;
the closest ancestor of a
1
and a
2
in VGHT
A
otherwise
ð1Þ
Definition 4 (Common Generalization Tuple). Let t
i
and t
j
be two
tuples of table T, QI ={A
1
,...,A
q
} be the quasi-identifier of T, t.A
i
be the value of tuple t on attribute A
i
. Common generalization tuple
(CGT) t
0
of t
i
and t
j
is defined as (2).
t
0
¼ðCCGðt
i
A
1
; t
j
A
1
Þ; ...; CCGðt
i
A
q
; t
j
A
q
ÞÞ ð2Þ
Definition 5 (Distortions of Generalization of Tuples). Let t be a
tuple of table T, t
0
be generalization tuple of t, QI ={A
1
,...,A
q
}be
quasi-identifier of T, t A
i
be the value of tuple t on attribute A
i
,
VGHT
A
1
; ...; VGHT
A
q
be the value generalization hierarchy trees of
QI. The distortion of t generalized to t
0
is defined as (3).
distortionðt; t
0
Þ¼¼
X
q
i¼1
x
A
i
le
v
elðt
0
A
i
1Þ
hðVGHT
A
i
Þ
ð3Þ
where hðVGHT
A
i
Þ denotes the height of VGHT
A
i
; le
v
elðt
0
A
i
Þ denotes
the level of t
0
A
i
on VGHT
A
i
; x
A
i
denotes weight of attribute A
i
.
For example, let attribute Gender be in hierarchy of {male/
female,}, attribute Pcode be in hierarchy of {dddd,ddd
⁄
,dd
⁄⁄
,
d
⁄⁄⁄
,
⁄
},
x
A
i
be 1. t
1
={female,4661} and t
0
1
¼f, 466
⁄
}. Then
distortion t
1
; t
0
1
¼ 1=2 þ 1=5 ¼ 0:7.
Definition 6 (Generalization Distance between Two Tuples). Let
T(A
1
,A
2
,...,A
n
) be a table and QI ={A
1
,...,A
q
} be the quasi-identifier
of table T. Given two tuples t
i
and t
j
of T and their common
generalization tuple t
0
, the generalization distance between t
i
and t
j
is defined as (4).
dist
gen
ðt
i
; t
j
Þ¼distortionðt
i
; t
0
Þþdistortionðt
j
; t
0
Þð4Þ
For example, let attribute Gender be in hierarchy of {male/
female,}, attribute Pcode be in hierarchy of {dddd, ddd
⁄
,dd
⁄⁄
,
d
⁄⁄⁄
,
⁄
},
x
A
i
be 1. t
1
= {female, 4661} and t
2
= {male,4663},
t
0
={
⁄
,466
⁄
}.
Dist
gen
(t
1
,t
2
)=distortion(t
1
,t
0
)+distortion(t
2
,t
0
) = 0.7 + 0.7 = 1.4.
Table 1
Global recoding and local recoding example.
(a) An original table (b) A 2-anonymous view by local recoding (c) A 2-anonymous view by global recoding
Gender Age Pcode Problem Gender Age Pcode Problem Gender Age Pcode Problem
Female 35 4661 Stress
⁄
[30,39] 466
⁄
Stress
⁄
[20,39] 466
⁄
Stress
Male 36 4663 Obesity
⁄
[30,39] 466
⁄
Obesity
⁄
[20,39] 466
⁄
Obesity
Female 37 4663 Obesity
⁄
[30,39] 466
⁄
Obesity
⁄
[20,39] 466
⁄
Obesity
Female 21 4354 Stress Female [20,v29] 4354 Stress
⁄
[20,39] 435
⁄
Stress
Female 25 4354 Obesity Female [20, 29] 4354 Obesity
⁄
[20,39] 435
⁄
Obesity
Female 55 4331 Stress Female [50,59] 4331 Stress
⁄
[40,59] 433
⁄
Stress
Female 57 4331 Obesity Female [50, 59] 4331 Obesity
⁄
[40,59] 433
⁄
Obesity
Female 67 4652 Stress
⁄
[60,69] 465
⁄
Stress
⁄
[60,79] 465
⁄
Stress
Female 69 4653 Obesity
⁄
[60,69] 465
⁄
Obesity
⁄
[60,79] 465
⁄
Obesity
Male 68 4653 Stress
⁄
[60,69] 465
⁄
Stress
⁄
[60,79] 465
⁄
Stress
Male 48 4354 Obesity Male [40,59] 4354 Obesity
⁄
[40,59] 435
⁄
Obesity
Male 54 4354 Stress Male [40, 59] 4354 Stress
⁄
[40,59] 435
⁄
Stress
(a)DGH
Gender
(b)VGH
Gender
(c) DGH
Pcode
(d) VGH
Pcode
g1={*}
g0={male,female}
*
male female
z4={*}
z3={4***}
z2={43**,46**}
z1={435*,433*,465*,466*}
z0={4354,4331, 4652,
4653,4661,4663}
*
4***
43** 46**
435* 433* 465* 466*
4354 4331 4652 4653 4663 4661
Fig. 1. Domain and value generalization hierarchies of Gender and Pcode.
J. Han et al. / Knowledge-Based Systems 55 (2014) 75–86
77