MAGE: A New Semantics-Preserving K-Anonymity Method for Mixed Data

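"MAGE: A semantics-preserving K-anonymity method for mixed data"

K-anonymity is a widely used way to protect personal privacy in data mining. It requires every record in a released dataset to be indistinguishable from at least k-1 other records on its quasi-identifier attributes, so that an attacker cannot tell which record belongs to a particular individual. However, traditional K-anonymization methods such as microaggregation and generalization fall short on mixed data (data containing both numerical and categorical attributes) and may lose a large amount of useful information. Microaggregation groups the records and replaces values with the group mean or median, which can discard the detail in numerical data. Generalization lifts categorical values to a more general level, for example replacing both "male" and "female" with "gender", which can discard the semantic information in categorical data.

To address these shortcomings, the researchers propose MAGE (Mean Aggregation with Generalization for Enhanced semantics). MAGE combines the mean vector of the numerical data with the generalization values of the categorical data to form a clustering centroid, which serves as the representative of the tuples in the cluster. The aim is to anonymize mixed data effectively while retaining its semantic value.

To realize MAGE, the paper introduces an algorithm called TSCKA (Two-Stage Clustering and K-Anonymization). TSCKA first performs a two-stage clustering and then applies the K-anonymity principle to each cluster. The algorithm balances data quality against running time and avoids excessive generalization and information loss.

The experimental results show that, compared with known anonymization algorithms such as Incognito and KACA, MAGE and TSCKA retain more semantic information on mixed microdata while still achieving anonymity. This makes them relevant to data mining and privacy protection in settings that handle mixed data, such as healthcare, finance and social media, where privacy protection has to be balanced against data utility.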

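The centroid idea summarized above can be sketched in a few lines of Python. This is only an illustration of "mean vector for the numerical part, generalized values for the categorical part"; it is not the TSCKA algorithm, and the parent maps, the `centroid` function and the sample cluster are all hypothetical names and data introduced here.

```python
from functools import reduce
from statistics import mean

# Hypothetical parent maps for two categorical attributes (child -> parent).
GENDER_PARENT = {"male": "*", "female": "*"}
PCODE_PARENT = {
    "4661": "466*", "4663": "466*", "4652": "465*", "4653": "465*",
    "466*": "46**", "465*": "46**", "46**": "4***", "4***": "*",
}

def chain(value, parent):
    """Return [value, parent, grandparent, ..., root]."""
    out = [value]
    while out[-1] in parent:
        out.append(parent[out[-1]])
    return out

def common_generalization(a, b, parent):
    """Lowest value in the hierarchy that covers both a and b."""
    covers_a = set(chain(a, parent))
    return next(v for v in chain(b, parent) if v in covers_a)

def centroid(cluster, hierarchies):
    """cluster: list of (numeric_values, categorical_values) pairs."""
    numeric_rows = [nums for nums, _ in cluster]
    categorical_rows = [cats for _, cats in cluster]
    # Numerical part: attribute-wise mean vector.
    mean_vector = tuple(mean(col) for col in zip(*numeric_rows))
    # Categorical part: attribute-wise common generalization.
    generalized = tuple(
        reduce(lambda a, b: common_generalization(a, b, parent), col)
        for col, parent in zip(zip(*categorical_rows), hierarchies)
    )
    return mean_vector, generalized

# Assumed sample cluster: (Age,) as numeric part, (Gender, Pcode) as categorical part.
cluster = [((35,), ("female", "4661")),
           ((36,), ("male", "4663")),
           ((37,), ("female", "4663"))]
print(centroid(cluster, [GENDER_PARENT, PCODE_PARENT]))
```

For this cluster the numerical part averages the three ages, while Gender and Pcode are lifted only as far as needed to cover all three tuples.
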
distortion of a k-anonymous view. Table 1 gives a good example to illustrate the differences between global recoding and local recoding. Fig. 1 gives the domain and value generalization hierarchies of attributes Gender and Pcode.

Many local recoding algorithms have been proposed to achieve k-anonymity, such as the K-Anonymization by Clustering in Attribute hierarchies algorithm (KACA) [6], the Top-Down algorithm [7], etc. Among these algorithms, clustering-based generalization algorithms are a fine class of local recoding methods for k-anonymizing microdata. The main idea of the clustering-based generalization algorithms is to partition the original dataset into equivalence classes based on a predefined distance, and then generalize all tuples in each equivalence class into their common generalized tuple. To achieve clustering, we need to define a measurement of the distance between tuples. To the best of our knowledge, the weighted hierarchical distance [6] based on the generalization tree is a reasonable measurement of the generalization distance between tuples. We define the generalization distance, which is similar to the weighted hierarchical distance, in the next section.

2.2.2. Distance measurement in generalization

In order to define the distance between tuples in a generalization table, we need to define the concepts of closest common generalization, common generalization tuple, and distortion of generalization of a tuple. In this section, we first give the definitions of the three concepts, and then define the generalization distance between two tuples.

Definition 3 (Closest Common Generalization). Let $A$ be an attribute of table $T$, $VGHT_A$ be the value generalization hierarchy tree of $A$, and $a_1$ and $a_2$ be two values of attribute $A$. The closest common generalization of $a_1$ and $a_2$ (denoted by $CCG(a_1, a_2)$) is defined as

$$
CCG(a_1, a_2) =
\begin{cases}
a_1 & \text{if } a_1 = a_2, \\
\text{the closest common ancestor of } a_1 \text{ and } a_2 \text{ in } VGHT_A & \text{otherwise.}
\end{cases}
\tag{1}
$$
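Definition 3 amounts to a lowest-common-ancestor lookup in the hierarchy tree. The following is a minimal sketch, assuming the tree is stored as a child-to-parent dictionary; the `PCODE_PARENT` map encodes the Pcode hierarchy of Fig. 1, and the names `ancestors` and `ccg` are illustrative, not taken from the paper.

```python
# Assumed encoding of the Pcode value generalization hierarchy of Fig. 1:
# each value maps to its parent; the root "*" has no parent.
PCODE_PARENT = {
    "4354": "435*", "4331": "433*", "4652": "465*",
    "4653": "465*", "4661": "466*", "4663": "466*",
    "435*": "43**", "433*": "43**", "465*": "46**", "466*": "46**",
    "43**": "4***", "46**": "4***",
    "4***": "*",
}

def ancestors(value, parent):
    """Return [value, parent(value), ..., root], walking up the tree."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def ccg(a1, a2, parent):
    """Closest common generalization of a1 and a2, following Definition 3."""
    if a1 == a2:
        return a1
    seen = set(ancestors(a1, parent))
    # Walking up from a2, the first node that also lies above a1 is the closest one.
    for node in ancestors(a2, parent):
        if node in seen:
            return node
    raise ValueError("values do not belong to the same hierarchy")

print(ccg("4661", "4663", PCODE_PARENT))  # 466*
print(ccg("4661", "4652", PCODE_PARENT))  # 46**
print(ccg("4354", "4661", PCODE_PARENT))  # 4***
```

A value is treated here as its own ancestor, so the function also works when one argument is already a generalization of the other.
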

Definition 4 (Common Generalization Tuple). Let $t_1$ and $t_2$ be two tuples of table $T$, $QI = \{A_1, \ldots, A_q\}$ be the quasi-identifier of $T$, and $t.A_i$ be the value of tuple $t$ on attribute $A_i$. The common generalization tuple (CGT) $t_{12}$ of $t_1$ and $t_2$ is defined as (2).

$$
t_{12} = \left(CCG(t_1.A_1, t_2.A_1), \ldots, CCG(t_1.A_q, t_2.A_q)\right)
\tag{2}
$$
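The common generalization tuple applies the closest common generalization attribute by attribute. The sketch below assumes a quasi-identifier of (Gender, Pcode) with hierarchies stored as child-to-parent dictionaries. The helper `cgt_of_class`, which folds the pairwise definition over a whole equivalence class, is an extension added here for illustration and is not stated in the definition.

```python
from functools import reduce

# Assumed child -> parent encodings of the Gender and (partial) Pcode hierarchies.
GENDER_PARENT = {"male": "*", "female": "*"}
PCODE_PARENT = {
    "4652": "465*", "4653": "465*", "4661": "466*", "4663": "466*",
    "465*": "46**", "466*": "46**", "46**": "4***", "4***": "*",
}
QI_HIERARCHIES = [GENDER_PARENT, PCODE_PARENT]

def chain(value, parent):
    """Return [value, parent, ..., root]."""
    out = [value]
    while out[-1] in parent:
        out.append(parent[out[-1]])
    return out

def ccg(a1, a2, parent):
    """Closest common generalization of two values of one attribute."""
    above_a1 = set(chain(a1, parent))
    return next(v for v in chain(a2, parent) if v in above_a1)

def cgt(t1, t2, hierarchies=QI_HIERARCHIES):
    """Common generalization tuple of t1 and t2, attribute by attribute."""
    return tuple(ccg(v1, v2, p) for v1, v2, p in zip(t1, t2, hierarchies))

def cgt_of_class(tuples, hierarchies=QI_HIERARCHIES):
    """Fold cgt over an equivalence class (illustrative extension)."""
    return reduce(lambda a, b: cgt(a, b, hierarchies), tuples)

print(cgt(("female", "4661"), ("male", "4663")))               # ('*', '466*')
print(cgt_of_class([("female", "4652"), ("female", "4653")]))  # ('female', '465*')
```
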

Definition 5 (Distortions of Generalization of Tuples). Let $t$ be a tuple of table $T$, $t'$ be a generalization tuple of $t$, $QI = \{A_1, \ldots, A_q\}$ be the quasi-identifier of $T$, $t.A_i$ be the value of tuple $t$ on attribute $A_i$, and $VGHT_{A_1}, \ldots, VGHT_{A_q}$ be the value generalization hierarchy trees of $QI$. The distortion of $t$ generalized to $t'$ is defined as (3).

$$
distortion(t, t') = \sum_{i=1}^{q} \omega_i \cdot \frac{level(t'.A_i) - 1}{h(VGHT_{A_i})}
\tag{3}
$$

where $h(VGHT_{A_i})$ denotes the height of $VGHT_{A_i}$; $level(t'.A_i)$ denotes the level of $t'.A_i$ on $VGHT_{A_i}$; $\omega_i$ denotes the weight of attribute $A_i$.

For example, let attribute Gender be in the hierarchy {male/female, *}, attribute Pcode be in the hierarchy {dddd, ddd*, dd**, d***, *}, and the weight of each attribute be 1. Let $t$ = {female, 4661} and $t'$ = {*, 466*}. Then $distortion(t, t') = 1/2 + 1/5 = 0.7$.
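The worked example can be checked with a short script. The sketch assumes a convention that reproduces the 1/2 + 1/5 result: leaves sit at level 1, the root sits at level h, and the height h counts the number of levels (2 for Gender, 5 for Pcode). It also assumes all leaves lie at the same depth. The dictionaries and function names are hypothetical, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

# Assumed child -> parent encodings; only the branch used in the example is listed for Pcode.
GENDER_PARENT = {"male": "*", "female": "*"}
PCODE_PARENT = {"4661": "466*", "4663": "466*",
                "466*": "46**", "46**": "4***", "4***": "*"}

def chain(value, parent):
    """Return [value, parent, ..., root]."""
    out = [value]
    while out[-1] in parent:
        out.append(parent[out[-1]])
    return out

def height(parent):
    """Number of levels in the hierarchy (assumed meaning of h)."""
    return max(len(chain(v, parent)) for v in parent)

def level(value, parent):
    """Level of a value, counting leaves as 1 and the root as h (assumption)."""
    return height(parent) - len(chain(value, parent)) + 1

def distortion(t, t_gen, hierarchies, weights):
    """Weighted sum of (level - 1) / h over the quasi-identifier attributes."""
    total = Fraction(0)
    for value, gen, parent, w in zip(t, t_gen, hierarchies, weights):
        if gen not in chain(value, parent):
            raise ValueError(f"{gen!r} does not generalize {value!r}")
        total += w * Fraction(level(gen, parent) - 1, height(parent))
    return total

d = distortion(("female", "4661"), ("*", "466*"),
               [GENDER_PARENT, PCODE_PARENT], [1, 1])
print(d)  # 7/10
```

Under this convention an ungeneralized tuple has distortion 0, and a tuple generalized all the way to the root contributes (h - 1)/h per attribute.
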

Definition 6 (Generalization Distance between Two Tuples). Let $T(A_1, \ldots, A_n)$ be a table and $QI = \{A_1, \ldots, A_q\}$ be the quasi-identifier of table $T$. Given two tuples $t_1$ and $t_2$ of $T$ and their common generalization tuple $t_{12}$, the generalization distance between $t_1$ and $t_2$ is defined as (4).

$$
dist_{gen}(t_1, t_2) = distortion(t_1, t_{12}) + distortion(t_2, t_{12})
\tag{4}
$$

For example, let attribute Gender be in the hierarchy {male/female, *}, attribute Pcode be in the hierarchy {dddd, ddd*, dd**, d***, *}, and the weight of each attribute be 1. Let $t_1$ = {female, 4661} and $t_2$ = {male, 4663}; then $t_{12}$ = {*, 466*} and $dist_{gen}(t_1, t_2) = distortion(t_1, t_{12}) + distortion(t_2, t_{12}) = 0.7 + 0.7 = 1.4$.
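Putting the pieces together, the generalization distance can be computed end to end. As before, this is a sketch under stated assumptions: hierarchies are child-to-parent dictionaries, leaves are at level 1, the height counts levels, and every function and variable name is illustrative rather than the paper's.

```python
from fractions import Fraction

# Assumed child -> parent encodings of the Gender and (partial) Pcode hierarchies.
GENDER_PARENT = {"male": "*", "female": "*"}
PCODE_PARENT = {
    "4652": "465*", "4653": "465*", "4661": "466*", "4663": "466*",
    "465*": "46**", "466*": "46**", "46**": "4***", "4***": "*",
}
HIERARCHIES = [GENDER_PARENT, PCODE_PARENT]

def chain(value, parent):
    """Return [value, parent, ..., root]."""
    out = [value]
    while out[-1] in parent:
        out.append(parent[out[-1]])
    return out

def ccg(a1, a2, parent):
    """Closest common generalization of two values."""
    above_a1 = set(chain(a1, parent))
    return next(v for v in chain(a2, parent) if v in above_a1)

def distortion(t, t_gen, hierarchies, weights):
    """Weighted sum of (level - 1) / h, with leaves at level 1 (assumption)."""
    total = Fraction(0)
    for value, gen, parent, w in zip(t, t_gen, hierarchies, weights):
        h = max(len(chain(v, parent)) for v in parent)  # number of levels
        lvl = h - len(chain(gen, parent)) + 1            # level of the generalized value
        total += w * Fraction(lvl - 1, h)
    return total

def dist_gen(t1, t2, hierarchies=HIERARCHIES, weights=(1, 1)):
    """Generalize both tuples to their common generalization tuple and add the distortions."""
    t12 = tuple(ccg(a, b, p) for a, b, p in zip(t1, t2, hierarchies))
    return (distortion(t1, t12, hierarchies, weights)
            + distortion(t2, t12, hierarchies, weights))

print(dist_gen(("female", "4661"), ("male", "4663")))    # 7/5
print(dist_gen(("female", "4661"), ("female", "4652")))  # 4/5
```

The second call shows how the distance behaves when Gender already agrees: only Pcode has to be lifted two levels to 46**, costing 2/5 per tuple.
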

Table 1
Global recoding and local recoding example.

(a) An original table       | (b) A 2-anonymous view by local recoding | (c) A 2-anonymous view by global recoding
Gender  Age  Pcode  Problem | Gender  Age      Pcode  Problem          | Gender  Age      Pcode  Problem
Female  35   4661   Stress  | *       [30,39]  466*   Stress           | *       [20,39]  466*   Stress
Male    36   4663   Obesity | *       [30,39]  466*   Obesity          | *       [20,39]  466*   Obesity
Female  37   4663   Obesity | *       [30,39]  466*   Obesity          | *       [20,39]  466*   Obesity
Female  21   4354   Stress  | Female  [20,29]  4354   Stress           | *       [20,39]  435*   Stress
Female  25   4354   Obesity | Female  [20,29]  4354   Obesity          | *       [20,39]  435*   Obesity
Female  55   4331   Stress  | Female  [50,59]  4331   Stress           | *       [40,59]  433*   Stress
Female  57   4331   Obesity | Female  [50,59]  4331   Obesity          | *       [40,59]  433*   Obesity
Female  67   4652   Stress  | *       [60,69]  465*   Stress           | *       [60,79]  465*   Stress
Female  69   4653   Obesity | *       [60,69]  465*   Obesity          | *       [60,79]  465*   Obesity
Male    68   4653   Stress  | *       [60,69]  465*   Stress           | *       [60,79]  465*   Stress
Male    48   4354   Obesity | Male    [40,59]  4354   Obesity          | *       [40,59]  435*   Obesity
Male    54   4354   Stress  | Male    [40,59]  4354   Stress           | *       [40,59]  435*   Stress

[Figure 1, four panels:
(a) DGH_Gender: g1 = {*}, g0 = {male, female}
(b) VGH_Gender: * is the parent of male and female
(c) DGH_Pcode: z4 = {*}, z3 = {4***}, z2 = {43**, 46**}, z1 = {435*, 433*, 465*, 466*}, z0 = {4354, 4331, 4652, 4653, 4661, 4663}
(d) VGH_Pcode: * > 4*** > {43**, 46**}; 43** > {435*, 433*}; 46** > {465*, 466*}; 435* > 4354; 433* > 4331; 465* > {4652, 4653}; 466* > {4661, 4663}]

Fig. 1. Domain and value generalization hierarchies of Gender and Pcode.

J. Han et al. / Knowledge-Based Systems 55 (2014) 75–86
