Geometric Data Perturbation: A New Approach to Privacy-Preserving Outsourced Data Mining

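"This paper is classic work by Keke Chen in the field of data mining and privacy protection. It proposes the Geometric Data Perturbation (GDP) method, which aims to keep data useful while protecting privacy. The method focuses on preserving task/model-specific information during perturbation, in particular multidimensional geometric information, which is critical to many data mining models. Using GDP, the paper shows that common data mining models retain comparable model quality even after perturbation. The paper also includes attack analysis and experimental validation."

In today's big-data era, data mining has become a key tool for gaining insight and supporting decisions. As data is collected and shared more widely, however, protecting personal privacy has become an increasingly prominent concern. Data perturbation is a privacy-protection technique that hides sensitive information by adding noise to the original data or transforming it. Excessive perturbation, though, can seriously reduce the usefulness of the data and sharply lower the accuracy of data mining results.

The paper by Keke Chen and Ling Liu proposes GDP, whose core idea is to preserve the multidimensional geometric information in the data. For example, clustering and classification models typically rely on the distances between data points and on the structure of their distribution. By preserving these geometric properties during perturbation, GDP strengthens privacy protection without sacrificing too much data utility.

The paper covers several aspects of GDP in detail, including its design principles, its implementation steps, and how it applies to different types of data mining models. The authors show how GDP hides the sensitive details of individual records while maintaining model performance. They also carry out an attack analysis that evaluates GDP's resistance to a range of potential attacks (such as reverse-engineering attacks) and supports its feasibility and security in practical use.

In the experimental part, the authors compare GDP with other perturbation techniques on real datasets, which further supports the claim that GDP strikes a good balance between privacy protection and data mining quality. These empirical results help clarify how to find a suitable trade-off between privacy protection and data utilization, and they offer a useful reference for future privacy research. Overall, the paper contributes an innovative solution: geometric data perturbation that protects user privacy while retaining as much data utility as possible.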

Geometric Data Perturbation for Privacy Preserving Outsourced Data Mining

[Fig. 2. Applying geometric data perturbation to outsourced data. Diagram labels: Data Owner (A); Service Provider (B); "Transform data before releasing it"; "Mined models/patterns"; "Reconstruct / transform back"; f(X).]

training data and then is applied to classify the unclassified data. Suppose that $X$ is a training dataset consisting of $N$ data rows (records) and $d$ columns (attributes, or dimensions). For the convenience of mathematical manipulation, we use $X_{d \times N}$ to denote the dataset, i.e., $X = [\mathbf{x}_1 \dots \mathbf{x}_N]$, where $\mathbf{x}_i$ is a data tuple, representing a vector in the real space $\mathbb{R}^d$. Each data tuple $\mathbf{x}_i$ belongs to a predefined class if the data is for classification modeling, which is indicated by the class label attribute $y_i$. The data for clustering do not have labels. The class label can be nominal (or continuous for regression), and it is public, i.e., privacy-insensitive. All other attributes contain private information and need to be protected. Unclassified datasets can also be exported or published with privacy protection if necessary.

If we consider $X$ to be a sample dataset from the $d$-dimensional random vector $[\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_d]^T$, we use bold $\mathbf{X}_i$ to represent the random variable for column $i$. In general, we will use bold lower case to represent vectors, bold upper case to represent random variables, and regular upper case to represent matrices.

3.2. Framework and Threat Model for Applying Geometric Data Perturbation

We study geometric data perturbation under the following framework (Figure 2). The data owner wants to use a data mining service provider (or a public cloud service provider). The outsourced data must first be perturbed and then sent to the service provider. The service provider then develops a model based on the perturbed data and returns it to the data owner, who can use the model either by transforming it back to the original space or by perturbing new data before applying the model. While the service provider is developing models, no additional interaction takes place between the two parties. Therefore, the major costs for the data owner are incurred in optimizing the perturbation parameters, which can be done on a sample of the data, and in perturbing the entire dataset.
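The two ways of using the returned model can be illustrated with a small sketch. This is not the authors' implementation: it assumes a pure rotation as the perturbation (no translation or noise), and it uses a hypothetical nearest-centroid classifier as a stand-in for the provider's mining service. The names `random_rotation`, `provider_train`, and `predict` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    """Random d x d orthonormal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def provider_train(X_pert, y):
    """Provider side (hypothetical service): nearest-centroid model.
    X_pert is d x N; returns the class labels and a d x C centroid matrix."""
    classes = np.unique(y)
    centroids = np.stack(
        [X_pert[:, y == c].mean(axis=1) for c in classes], axis=1
    )
    return classes, centroids

def predict(model, X):
    """Assign each column of X to the class of its nearest centroid."""
    classes, centroids = model
    # Squared distance between every column of X and every centroid: N x C.
    d2 = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
    return classes[d2.argmin(axis=1)]

# Owner side: private training data (d x N) with public labels y.
d, N = 4, 200
y = np.repeat([0, 1], N // 2)
X = rng.standard_normal((d, N)) + 3.0 * y  # class 1 is shifted by 3

R = random_rotation(d, rng)              # secret parameter, kept by the owner
model_pert = provider_train(R @ X, y)    # the provider only sees R @ X and y

X_new = rng.standard_normal((d, 10)) + 3.0

# Option 1: perturb new data and apply the model in the perturbed space.
pred_perturbed = predict(model_pert, R @ X_new)

# Option 2: transform the model back to the original space.
classes, centroids = model_pert
model_orig = (classes, R.T @ centroids)
pred_original = predict(model_orig, X_new)
```

Both options give the same predictions here because the rotation preserves the distances that the nearest-centroid rule depends on. For other model families the "transform back" step would take a different form.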

We adopt the popular and reasonable honest-but-curious service provider assumption for our threat model. That is, we assume the service provider will honestly provide the data mining services. However, we also assume that the provider might look at the data stored and processed on its platforms. Therefore, only well-protected data can be processed and stored in such an untrusted environment.

K. Chen and L. Liu

4. Definition of Geometric Data Perturbation

Geometric data perturbation consists of a sequence of random geometric transformations, including multiplicative transformation ($R$), translation transformation ($\Psi$), and distance perturbation ($\Delta$).

$$G(X) = RX + \Psi + \Delta \qquad (1)$$

We briefly define these transformations and describe their properties.
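As a rough illustration of Equation (1), the sketch below applies the three components to a small data matrix. The excerpt has not yet defined $\Psi$ and $\Delta$ in detail, so two assumptions are made here and should be read as such: $\Psi$ adds the same random translation vector `t` to every column, and $\Delta$ is a matrix of i.i.d. Gaussian noise with standard deviation `sigma`. The function name `geometric_perturbation` is hypothetical.

```python
import numpy as np

def geometric_perturbation(X, R, t, sigma, rng):
    """Apply G(X) = R X + Psi + Delta to a d x N data matrix X.

    Assumptions (not stated in the text above):
      - Psi = t 1^T, i.e. the same translation vector t is added to every column;
      - Delta has i.i.d. Gaussian entries with standard deviation sigma.
    """
    k, N = R.shape[0], X.shape[1]
    Psi = np.tile(t.reshape(-1, 1), (1, N))
    Delta = sigma * rng.standard_normal((k, N))
    return R @ X + Psi + Delta

rng = np.random.default_rng(42)
d, N = 3, 5
X = rng.standard_normal((d, N))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # an orthonormal R
t = rng.uniform(0.0, 1.0, size=d)

G_exact = geometric_perturbation(X, R, t, sigma=0.0, rng=rng)  # no noise
G_noisy = geometric_perturbation(X, R, t, sigma=0.1, rng=rng)  # with noise
```

With `sigma=0.0` and an orthonormal `R`, pairwise distances between columns are unchanged; a nonzero `sigma` perturbs them slightly, which is the trade-off the noise component introduces.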

4.1. Multiplicative Transformation

The component $R$ can be a rotation matrix (Chen and Liu, 2005) or a random projection matrix (Liu, Kargupta and Ryan, 2006). A rotation matrix preserves distances exactly, while a random projection matrix preserves them only approximately. We will compare the advantages and disadvantages of the two choices.

A rotation transformation is intuitive to understand in two-dimensional or three-dimensional (2D or 3D, for short) space. We extend it to represent all kinds of orthonormal transformations in multi-dimensional space. A rotation perturbation is defined as follows: $G(X) = RX$. The matrix $R_{d \times d}$ is an orthonormal matrix (Sadun, 2001), which has some important properties. Let $R^T$ represent the transpose of $R$, $r_{ij}$ represent the $(i, j)$ element of $R$, and $I$ be the identity matrix. Both rows and columns of $R$ are orthonormal: for any column $j$, $\sum_{i=1}^{d} r_{ij}^2 = 1$, and for any two columns $j$ and $k$, $j \neq k$, $\sum_{i=1}^{d} r_{ij} r_{ik} = 0$; a similar property holds for rows. This definition implies that $R^T R = R R^T = I$. It also implies that changing the order of the rows or columns of an orthonormal matrix yields a matrix that is still orthonormal. A random orthonormal matrix can be efficiently generated following the Haar distribution (Stewart, 1980), which preserves some important statistical properties (Jiang, 2005).
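One common way to draw a Haar-distributed orthonormal matrix is sketched below. It uses the QR factorization of a Gaussian matrix with a sign correction, which is a standard construction and not necessarily the procedure of the cited reference; the function name `haar_orthonormal` is illustrative. SciPy's `scipy.stats.ortho_group.rvs` offers a ready-made alternative.

```python
import numpy as np

def haar_orthonormal(d, rng):
    """Draw a d x d orthonormal matrix from the Haar distribution.

    QR-factorize a matrix of i.i.d. standard normals, then flip column signs
    so that the diagonal of the triangular factor is positive; without this
    step the result depends on the QR routine's sign convention.
    """
    Z = rng.standard_normal((d, d))
    Q, T = np.linalg.qr(Z)
    return Q * np.sign(np.diag(T))

rng = np.random.default_rng(7)
R = haar_orthonormal(5, rng)
I = np.eye(5)

rows_ok = np.allclose(R @ R.T, I)   # rows are orthonormal
cols_ok = np.allclose(R.T @ R, I)   # columns are orthonormal

# Reordering the rows keeps the matrix orthonormal.
perm = rng.permutation(5)
R_perm = R[perm, :]
perm_ok = np.allclose(R_perm.T @ R_perm, I)
```

Note that a matrix drawn this way can have determinant $+1$ or $-1$, which matches the text's broader use of "rotation" for any orthonormal transformation.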

A key feature of rotation transformation is that it preserves the Euclidean distance. Let $\mathbf{x}^T$ represent the transpose of vector $\mathbf{x}$, and $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T \mathbf{x}}$ represent the length of a vector $\mathbf{x}$. By the definition of a rotation matrix, we have $\|R\mathbf{x}\| = \|\mathbf{x}\|$. Similarly, the inner product is also invariant to rotation. Let $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y}$ represent the inner product of $\mathbf{x}$ and $\mathbf{y}$. We have $\langle R\mathbf{x}, R\mathbf{y} \rangle = \mathbf{x}^T R^T R \mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle$. In general, rotation transformation also completely preserves geometric shapes such as hyperplanes and manifolds in the multidimensional space. Thus, many modeling methods are "rotation-invariant", as we will see. Rotation perturbation is a key component of geometric perturbation; it provides the primary protection of the perturbed data against naive estimation attacks. The other components of geometric perturbation are used to protect rotation perturbation from more complicated attacks.
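The invariance claims follow in one line each from $R^T R = I$. The short derivation below spells them out and adds the pairwise-distance form that distance-based models rely on. The last line assumes that the translation component adds the same vector $\mathbf{t}$ to every tuple, which this excerpt does not state explicitly.

```latex
\begin{align*}
\|R\mathbf{x}\|^2
  &= (R\mathbf{x})^T (R\mathbf{x})
   = \mathbf{x}^T R^T R\, \mathbf{x}
   = \mathbf{x}^T \mathbf{x}
   = \|\mathbf{x}\|^2, \\
\langle R\mathbf{x}, R\mathbf{y} \rangle
  &= (R\mathbf{x})^T (R\mathbf{y})
   = \mathbf{x}^T R^T R\, \mathbf{y}
   = \mathbf{x}^T \mathbf{y}
   = \langle \mathbf{x}, \mathbf{y} \rangle, \\
\|R\mathbf{x} - R\mathbf{y}\|^2
  &= \|R(\mathbf{x} - \mathbf{y})\|^2
   = \|\mathbf{x} - \mathbf{y}\|^2, \\
(R\mathbf{x} + \mathbf{t}) - (R\mathbf{y} + \mathbf{t})
  &= R(\mathbf{x} - \mathbf{y})
  \quad \text{(assumed common translation } \mathbf{t} \text{ cancels)}.
\end{align*}
```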

A random projection matrix (Vempala, 2005) $R_{k \times d}$ is defined as $R = \sqrt{d/k}\, R_0$. Here $R_0$ is randomly generated and its row vectors are orthonormal (note that there is no such requirement on the column vectors). The Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984) proves that random projection can approximately preserve Euclidean distances if certain conditions are satisfied. Concretely, let $\mathbf{x}$ and $\mathbf{y}$ be any original data vectors. Given $0 < \epsilon < 1$ and $k = O(\ln(N)/\epsilon^2)$, there is a random projection $f: \mathbb{R}^d \to \mathbb{R}^k$ such that $(1 - \epsilon)\|\mathbf{x} - \mathbf{y}\| \le \|f(\mathbf{x}) - f(\mathbf{y})\| \le (1 + \epsilon)\|\mathbf{x} - \mathbf{y}\|$. The parameter $\epsilon$ defines the accuracy of distance preservation. Therefore, in order to preserve distances precisely, $k$ has to be large. For a large dataset ($N$ is large), it would be difficult to preserve distances well with a computationally acceptable $k$. We will discuss the effect of random projection and rotation transformation on the result of perturbation.
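The approximate nature of distance preservation under random projection can be observed empirically. The sketch below is illustrative only: it assumes the usual scaling $R = \sqrt{d/k}\,R_0$ with $R_0$ having orthonormal rows, builds $R_0$ from a QR factorization (one possible construction), and measures the ratio of projected to original pairwise distances on synthetic Gaussian data. The dimensions and the name `random_projection_matrix` are arbitrary choices for the example.

```python
import numpy as np
from itertools import combinations

def random_projection_matrix(k, d, rng):
    """k x d matrix R = sqrt(d/k) * R0, where R0 has orthonormal rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k, orthonormal columns
    R0 = Q.T                                          # k x d, orthonormal rows
    return np.sqrt(d / k) * R0

rng = np.random.default_rng(1)
d, k, N = 1000, 200, 20
X = rng.standard_normal((d, N))      # synthetic data, one tuple per column
R = random_projection_matrix(k, d, rng)
Y = R @ X                            # projected data, k x N

# Ratio of projected to original distance for every pair of tuples.
ratios = np.array([
    np.linalg.norm(Y[:, i] - Y[:, j]) / np.linalg.norm(X[:, i] - X[:, j])
    for i, j in combinations(range(N), 2)
])

# Smallest epsilon for which the two-sided bound holds on this sample.
eps_observed = np.abs(ratios - 1.0).max()
```

Rerunning with a smaller `k` increases `eps_observed`, which is the trade-off the lemma describes: tighter distance preservation requires more projected dimensions.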
