A Novel Dataset Correlation Attack on Machine Learning Models: Revealing Hidden Relationships Between Input Variables


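As machine learning is widely adopted in business and organizations to automate tasks and decisions, dataset privacy has drawn growing attention. Trained machine learning models can unintentionally leak information about individuals in the dataset, and even about global properties of the dataset. This paper studies a new threat against machine learning models: dataset correlation inference attacks. In traditional attribute inference attacks, the attacker tries to infer certain features of the data from the model's behavior or outputs. The authors, Ana-Maria Crețu, Florent Guépin, and Yves-Alexandre de Montjoye, instead target the correlations between input variables. They point out that the spherical parametrization of correlation matrices places bounds on the correlation coefficients, which lets the attacker make an informed guess. The attacker's goal is to infer the undisclosed relationships between variables in the dataset, using only the correlations between the input variables and the target variable together with the model. The attack therefore concerns not only individual privacy but may also reveal the structure and patterns of the dataset as a whole.

To carry out the attack, the researchers first show how mathematical tools such as linear algebra and probability and statistics can be used to quantify and manipulate these correlations. They may also examine characteristics of the model, such as the weight distribution of a neural network or the node connections of a decision tree, any of which could give an attacker clues about the correlations in the dataset. Such an attack could have serious consequences in sensitive domains such as finance, healthcare, or national security, where the data are highly confidential.

To counter the threat, the researchers suggest possible defenses, including more complex model architectures that obscure the relationships between input variables, and privacy-preserving techniques applied during training, such as differential privacy or homomorphic encryption, to limit how much useful information an attacker can obtain. Dataset correlation inference attacks highlight the challenges modern machine learning models face when handling private data, and they underline that data security and privacy must be taken seriously when these systems are designed and deployed. Researchers and practitioners need to work together on effective defenses so that machine learning can be used while the data of individuals and organizations stay protected from this kind of attack.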

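Written in closed form, the coefficients of B are equal to:

$$
b_{i,j} =
\begin{cases}
1 & \text{for } i = 1,\ j = 1 \\
\cos\theta_{i,j} & \text{for } i \ge 2,\ j = 1 \\
\cos\theta_{i,j} \prod_{k=1}^{j-1} \sin\theta_{i,k} & \text{for } 2 \le j \le i - 1 \\
\prod_{k=1}^{j-1} \sin\theta_{i,k} & \text{for } i = j,\ 2 \le i, j \le n \\
0 & \text{for } i + 1 \le j \le n
\end{cases}
\tag{2}
$$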

The spherical parametrization of B makes it possible to describe a correlation matrix using only $n(n-1)/2$ parameters, namely the angles $\theta_{i,j}$, $1 \le j < i \le n$.
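The closed form in Eq. (2) can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `spherical_to_correlation`, the 0-based storage of the angles, and the choice of angles in $[0, \pi]$ are assumptions made for illustration. It builds the lower-triangular matrix B row by row and forms $C = BB^T$.

```python
import numpy as np

def spherical_to_correlation(theta):
    """Build C = B B^T from a square array of angles.

    theta[i, j] (0-based, j < i) plays the role of theta_{i+1, j+1} in Eq. (2);
    entries on or above the diagonal are ignored.
    """
    n = theta.shape[0]
    B = np.zeros((n, n))
    B[0, 0] = 1.0
    for i in range(1, n):
        sin_prod = 1.0  # running product of sin(theta[i, k]) for k < j
        for j in range(i):
            B[i, j] = np.cos(theta[i, j]) * sin_prod
            sin_prod *= np.sin(theta[i, j])
        B[i, i] = sin_prod  # diagonal entry: product of all sines in the row
    return B @ B.T

rng = np.random.default_rng(0)
n = 4
theta = rng.uniform(0.0, np.pi, size=(n, n))  # assumed angle range [0, pi]
C = spherical_to_correlation(theta)
```

Each row of B has unit Euclidean norm by construction, so `C` has a unit diagonal, is symmetric, and is positive semi-definite, which is what makes it a valid correlation matrix. Only the $n(n-1)/2$ angles below the diagonal are read.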

Numpacharoen and Atsawarungruangkit [36] introduced an algorithm to generate random correlation matrices based on the spherical parameterization, building on prior work [35], [37]. The key insight is that the correlation coefficients $c_{i,j}$ can be expressed as sums of products between cosines and sines of $\theta_{i,j}$, by expanding the computation of $c_{i,j} = (BB^T)_{i,j}$. As a result, each $c_{i,j}$ lies within a boundary determined by the angles $\theta_{p,q}$ for $1 \le p \le i$ and $1 \le q < j$. This insight can be used to generate a valid correlation matrix by sampling the correlation coefficients one by one, uniformly within the boundaries derived from the values previously sampled.
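As a worked illustration for the smallest non-trivial case, $n = 3$, the boundaries can be written out explicitly. This derivation is our own sketch and assumes the angles lie in $[0, \pi]$, so that every sine is non-negative and $\sin\theta_{i,1} = \sqrt{1 - c_{i,1}^2}$.

```latex
\begin{aligned}
B &= \begin{pmatrix}
1 & 0 & 0 \\
\cos\theta_{2,1} & \sin\theta_{2,1} & 0 \\
\cos\theta_{3,1} & \cos\theta_{3,2}\sin\theta_{3,1} & \sin\theta_{3,1}\sin\theta_{3,2}
\end{pmatrix}, \\
c_{2,1} &= \cos\theta_{2,1}, \qquad c_{3,1} = \cos\theta_{3,1}, \\
c_{3,2} &= \cos\theta_{2,1}\cos\theta_{3,1} + \sin\theta_{2,1}\sin\theta_{3,1}\cos\theta_{3,2}, \\
c_{2,1}c_{3,1} - \sqrt{(1 - c_{2,1}^2)(1 - c_{3,1}^2)} \;&\le\; c_{3,2} \;\le\; c_{2,1}c_{3,1} + \sqrt{(1 - c_{2,1}^2)(1 - c_{3,1}^2)}.
\end{aligned}
```

The last line follows because $\cos\theta_{3,2}$ ranges over $[-1, 1]$ once $c_{2,1}$ and $c_{3,1}$ are fixed. For example, with $c_{2,1} = c_{3,1} = 0.8$ the remaining coefficient must satisfy $0.28 \le c_{3,2} \le 1$.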

Algorithm 1 provides a high-level description of the procedure to generate a random correlation matrix using the boundaries of its coefficients [36]. The correlation coefficients are sampled in order, from top to bottom and from left to right. The elements of the first column are initialized uniformly at random within their boundaries, namely the interval $[-1, 1]$: $c_{i,1} \sim U([-1, 1])$, $i = 2, \ldots, n$ (lines 4-5). When sampling the correlation $c_{i,j}$ with $i > 1$, $j \ge 2$, the previously sampled correlation coefficients restrict the values of the angles $\theta_{p,q}$, for $1 \le p \le i$ and $1 \le q < j$, which makes it possible to derive the boundaries for $c_{i,j}$. For complete details about this procedure, we refer the reader to Algorithm 4 in the Appendix. To ensure that every correlation coefficient in the correlation matrix is equally distributed (i.e., that their CDFs are almost identical), the algorithm shuffles the variables at the end (lines 9-12).
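The sample-then-shuffle idea can be sketched for $n = 3$, where the boundary of the last coefficient has a simple closed form. This is an illustrative special case only: the function name `sample_corr_3x3` is hypothetical, the stability threshold K is omitted, and the general procedure is the paper's Algorithm 4, which is not reproduced here.

```python
import numpy as np

def sample_corr_3x3(rng):
    """Sample a valid 3x3 correlation matrix coefficient by coefficient."""
    # First column: no earlier coefficients constrain it, so uniform on [-1, 1].
    c21, c31 = rng.uniform(-1.0, 1.0, size=2)
    # c32 must lie within an interval fixed by the two values already drawn.
    half_width = np.sqrt((1.0 - c21**2) * (1.0 - c31**2))
    c32 = rng.uniform(c21 * c31 - half_width, c21 * c31 + half_width)
    C = np.array([[1.0, c21, c31],
                  [c21, 1.0, c32],
                  [c31, c32, 1.0]])
    # Shuffle the variables: apply the same permutation to rows and columns.
    perm = rng.permutation(3)
    return C[np.ix_(perm, perm)]

rng = np.random.default_rng(42)
samples = [sample_corr_3x3(rng) for _ in range(1000)]
# Smallest eigenvalue over all samples; non-negative up to rounding error.
min_eig = min(np.linalg.eigvalsh(C).min() for C in samples)
```

Sampling inside the interval keeps the determinant of the matrix non-negative, so every draw is positive semi-definite. Permuting rows and columns together preserves that property while removing the special role of the first column.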

B. Copulas

We describe the Gaussian copula generative model, which can be used to generate datasets of n variables with a variety of dependencies between the variables.

Marginal distribution. Consider a random variable $X$ taking values in $\mathbb{R}$. We denote by marginal distribution its cumulative distribution function (CDF): $F : \mathbb{R} \to [0, 1]$, $F(x) = P(X \le x)$. One example is the marginal of a standard normal distribution, which we denote by $\Phi$:

$$
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt
$$
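The integral defining $\Phi$ has no elementary closed form, but it can be evaluated through the error function using the identity $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. A minimal sketch with the standard library follows; the function name `std_normal_cdf` is our own.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Phi(x) = P(X <= x) for X ~ N(0, 1), via the identity with erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

median_value = std_normal_cdf(0.0)    # CDF at the median of N(0, 1)
upper_tail = std_normal_cdf(1.96)     # CDF at the usual two-sided 95% cut-off
```

By symmetry of the standard normal density, `std_normal_cdf(-x)` equals `1 - std_normal_cdf(x)` up to rounding.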

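Gaussian multivariate distribution. We denote by $N(0, \Sigma)$ the Gaussian multivariate distribution with mean 0 and covariance matrix $\Sigma$. Its CDF is equal to:

$$
\Phi_\Sigma(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} \frac{\exp\left(-\frac{1}{2} t^T \Sigma^{-1} t\right)}{\sqrt{(2\pi)^n \det(\Sigma)}} \, dt_1 \cdots dt_n
$$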

Algorithm 1 GENERATERANDOMCORRELATIONMATRIX

1: Inputs:
     n: The number of variables.
     K: A threshold for computational stability.
2: Output:
     $C \in \mathbb{R}^{n \times n}$: A valid triangular correlation matrix.
3: // Randomly initialize C's first column.
4: for $i \in \{2, \ldots, n\}$ do
5:     $c_{\sigma(i),1} \leftarrow U([-1, 1])$ // $\cos\theta_{\sigma(i),1}$
6: end for
7: constraints ← $(c_{\sigma(2),1}, \ldots, c_{\sigma(n),1})$
8: C ← FILLCORRELATIONMATRIX(n, K, constraints)
9: // Shuffle the variables.
10: σ ← random permutation([1, . . . , n])
11: C ← reorder rows(σ)
12: C ← reorder columns(σ)

Copulas. Copulas denote the set of multivariate cumulative distribution functions $F_c : [0, 1]^n \to [0, 1]$ over continuous random vectors $(X_1, \ldots, X_n)$ such that the marginal of each variable satisfies $F_i(x) = x$, i.e., is uniformly distributed in the interval $[0, 1]$. Sklar's theorem [38], [39] states the fundamental result that for any random variables $X_1, \ldots, X_n$ with continuous marginals $F_1, \ldots, F_n$, their joint probability distribution can be described in terms of the marginals and a copula $F_c$ modeling the dependencies between the variables.

To see why, let us consider a continuous random vector $(X_1, \ldots, X_n)$ with CDF $F$ and marginals $F_i$. Using the fact that the random variable $U_i = F_i(X_i)$ is uniformly distributed in the interval $[0, 1]$, it follows that the CDF of $(U_1, \ldots, U_n)$ is equal to $P(U_1 \le u_1, \ldots, U_n \le u_n) = P(X_1 \le F_1^{-1}(u_1), \ldots, X_n \le F_n^{-1}(u_n))$. As a result, the copula of $(X_1, \ldots, X_n)$ can be written as follows:

$$
F_c(u_1, \ldots, u_n) = F(F_1^{-1}(u_1), \ldots, F_n^{-1}(u_n))
$$
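The argument rests on the probability integral transform: applying a continuous CDF to its own random variable, $U_i = F_i(X_i)$, yields a variable that is uniform on $[0, 1]$. The sketch below checks this empirically for one marginal. The exponential distribution and the value of `rate` are arbitrary choices made for illustration; they do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # hypothetical rate parameter, chosen only for this illustration
x = rng.exponential(scale=1.0 / rate, size=100_000)

# Exponential CDF: F(x) = 1 - exp(-rate * x). Applying it to X gives U = F(X).
u = 1.0 - np.exp(-rate * x)

# Empirical CDF of U at a few points; a uniform variable has P(U <= t) = t.
grid = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
empirical = np.array([(u <= t).mean() for t in grid])
max_gap = np.abs(empirical - grid).max()
```

With this many samples, `max_gap` should be small, consistent with `u` being uniformly distributed.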

This result can be used to generate samples from F when both the copula and the marginals of the variables are known. A variety of copula-based generative models, including Gaussian copulas which we use in this paper, assume that the copula belongs to a given restricted family.

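Gaussian copulas. Given an n-dimensional correlation matrix $C$, its Gaussian copula is defined as:

$$
F_c(u_1, \ldots, u_n) = \Phi_C(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_n)) \tag{3}
$$

$$
\forall u = (u_1, \ldots, u_n) \in [0, 1]^n \tag{4}
$$

where $\Phi_C$ denotes the CDF of the Gaussian multivariate distribution $N(0, C)$.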
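A common way to draw samples whose dependencies follow a Gaussian copula is to sample $Z \sim N(0, C)$, map each coordinate to $[0, 1]$ with $\Phi$, and then apply the inverse CDF of the desired marginal. The sketch below follows that standard construction. It is not necessarily identical to the paper's Alg. 2, which is not reproduced in this excerpt, and the function name, the example matrix, and the two marginals are assumptions chosen for illustration.

```python
import math
import numpy as np

def sample_gaussian_copula(C, inverse_cdfs, size, rng):
    """Draw `size` rows Y whose marginals are given by `inverse_cdfs` and whose
    dependencies are given by the Gaussian copula of correlation matrix C."""
    n = C.shape[0]
    z = rng.multivariate_normal(np.zeros(n), C, size=size)  # Z ~ N(0, C)
    phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    u = np.clip(phi(z), 1e-12, 1.0 - 1e-12)  # U_i = Phi(Z_i), kept off 0 and 1
    # Y_i = F_i^{-1}(U_i), applied column by column.
    return np.column_stack([inv(u[:, i]) for i, inv in enumerate(inverse_cdfs)])

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])
# Hypothetical marginals: uniform on [0, 1] and exponential with rate 1.
inverse_cdfs = [lambda u: u, lambda u: -np.log1p(-u)]
y = sample_gaussian_copula(C, inverse_cdfs, size=50_000, rng=rng)
pearson = np.corrcoef(y, rowvar=False)[0, 1]
```

With these non-normal marginals, the Pearson correlation of the two columns of `y` is positive but lower than the 0.8 entry of `C`, because the nonlinear inverse CDFs change linear correlation even though the copula is unchanged.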

Alg. 2 describes a procedure to generate a sample $Y = (Y_1, \ldots, Y_n)$ from a distribution satisfying the following two properties: (1) the marginals of the distribution are $F_1, \ldots, F_n$ and (2) its dependencies are given by the Gaussian copula $F_c$. Note that for arbitrary marginals, the correlations between the variables $Y_1, \ldots, Y_n$ are not necessarily equal to the correlations $C$. However, when the marginals are standard normals $F_i = \Phi$, $i = 1, \ldots, n$, the correlations are the same. Some

