8 K. Chen and L. Liu
4. Definition of Geometric Data Perturbation
Geometric data perturbation consists of a sequence of random geometric transforma-
tions, including multiplicative transformation (R) , translation transformation (Ψ), and
distance perturbation ∆.
G(X) = RX + Ψ + ∆ (1)
We briefly define these transformations and describe their properties.
4.1. Multiplicative Transformation
The component R can be rotation matrix (Chen and Liu, 2005) or random projection
matrix (Liu, Kargupta and Ryan, 2006). Rotation matrix exactly preserves distances
while random projection matrix only approximately preserve distances. We will com-
pare the advantages and disadvantages of the two choices.
It is intuitive to understand a rotation transformation in two-dimensional or three-
dimensional (2D or 3D, for short) space. We extend it to represent all kind of orthonor-
mal transformation in multi-dimensional space. A rotation perturbation is defined as
follows: G(X) = RX. The matrix R
d×d
is an orthonormal matrix (Sadun, 2001),
which has some important properties. Let R
T
represent the transpose of R, r
ij
repre-
sent the (i, j) element of R, and I be the identity matrix. Both rows and columns of R
are orthonormal: for any column j,
P
d
i=1
r
2
ij
= 1, and for any two columns j and k,
j 6= k,
P
d
i=1
r
ij
r
ik
= 0; a similar property is held for rows. This definition infers that
R
T
R = RR
T
= I. It also implies that by changing the order of the rows or columns
of an orthogonal matrix, the resulting matrix is still orthonormal. A random orthonor-
mal matrix can be efficiently generated following the Haar distribution (Stewart, 1980),
which preserves some important statistical properties (Jiang, 2005).
A key feature of rotation transformation is preserving the Euclidean distance. Let
x
T
represent the transpose of vector x, and kxk = x
T
x represent the length of a vector
x. By the definition of rotation matrix, we have kRxk = kxk. Similarly, inner product
is also invariant to rotation. Let hx, yi = x
T
y represent the inner product of x and
y. We have hRx, Ryi = x
T
R
T
Ry = hx, yi. In general, rotation transformation also
completely preserves the geometric shapes such as hyperplane and manifold in the mul-
tidimensional space. Thus, many modeling methods are “rotation-invariant” as we will
see. Rotation perturbation is a key component of geometric perturbation, which pro-
vides the primary protection to the perturbed data from naive estimation attacks. Other
components of geometric perturbation are used to protect rotation perturbation from
more complicated attacks.
A random projection matrix (Vempala, 2005) R
k×d
is defined as R =
q
d
k
R
0
. R
0
is randomly generated and its row vectors are orthonormal (note there is no such re-
quirement on column vectors). The Johnson-Lindenstrauss Lemma (Johnson and Lin-
denstrauss, 1984) proves that random projection can approximately preserve Euclidean
distances if certain conditions are satisfied. Concretely, let x and y be any original
data vectors. Given 0 < ǫ < 1 and k = O(ln(N )/ǫ
2
), there is a random projection
f : R
d
→ R
k
, so that (1 −ǫ)kx−yk ≤ kf(x)−f(y)k ≤ (1+ǫ)kx−yk. ǫ defines the
accuracy of distance preservation. Therefore, in order to precisely preserve distances,
k has to be large. For large dataset (N is large), it would be difficult to well preserve
distances with computationally acceptable k. We will discuss the effect of random pro-
jection and rotation transformation to the result of perturbation.