
where each row is an orthonormal basis vector $b_i$ with $m$ components. We can consider our naive basis as the effective starting point. All of our data has been recorded in this basis and thus it can be trivially expressed as a linear combination of $\{b_i\}$.
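For concreteness, here is a minimal NumPy sketch of the naive basis (the variable names and toy size are our own, not from the text): with B equal to the identity, each recorded sample is trivially its own coefficient vector.

```python
import numpy as np

m = 6                    # number of measurement types in the toy example
B = np.eye(m)            # naive basis: row i is the standard basis vector b_i
x = np.random.randn(m)   # one recorded sample, already expressed in this basis

# The coefficients of x in the basis {b_i} are just its own components.
coeffs = B @ x
assert np.allclose(coeffs, x)
```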
B. Change of Basis
With this rigor we may now state more precisely what PCA
asks: Is there another basis, which is a linear combination of
the original basis, that best re-expresses our data set?
A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by restricting the set of potential bases. With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors.
Let X be the original data set, where each column is a single sample (or moment in time) of our data set (i.e., $\vec{X}$). In the toy example X is an $m \times n$ matrix where $m = 6$ and $n = 72000$. Let Y be another $m \times n$ matrix related by a linear transformation P. X is the original recorded data set and Y is a new representation of that data set.
$$PX = Y \qquad (1)$$
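As a concrete sketch of Equation 1 (the use of NumPy, the variable names, and the random orthonormal P are our own illustration; PCA's task will be to choose a particular P, not just any P):

```python
import numpy as np

m, n = 6, 72000                   # dimensions from the toy example
X = np.random.randn(m, n)         # original data set: one sample per column

# Any orthonormal matrix is a valid change of basis; the QR decomposition
# of a random matrix yields one such P as a stand-in.
P, _ = np.linalg.qr(np.random.randn(m, m))

Y = P @ X                         # Equation 1: the re-expressed data set
assert Y.shape == (m, n)
```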
Also let us define the following quantities.¹

• $p_i$ are the rows of P.
• $x_i$ are the columns of X (or individual $\vec{X}$).
• $y_i$ are the columns of Y.

¹ In this section $x_i$ and $y_i$ are column vectors, but be forewarned: in all other sections $x_i$ and $y_i$ are row vectors.
Equation 1 represents a change of basis and thus can have
many interpretations.
1. P is a matrix that transforms X into Y.
2. Geometrically, P is a rotation and a stretch which again transforms X into Y (see the sketch after this list).
3. The rows of P, $\{p_1, \ldots, p_m\}$, are a set of new basis vectors for expressing the columns of X.
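A small sketch of the geometric reading in interpretation 2, using a 2-D rotation and stretch of our own choosing (not from the text):

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],   # pure rotation by theta
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2.0, 0.5])                          # pure stretch along each axis

X = np.random.randn(2, 100)                      # 2-D toy data, one point per column
Y = (S @ R) @ X                                  # rotate, then stretch

# A pure rotation preserves the length of every data point; the stretch
# then rescales the new axes.
assert np.allclose(np.linalg.norm(R @ X, axis=0), np.linalg.norm(X, axis=0))
```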
The latter interpretation is not obvious but can be seen by writing out the explicit dot products of PX.
$$
PX = \begin{bmatrix} p_1 \\ \vdots \\ p_m \end{bmatrix}
\begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}
$$

$$
Y = \begin{bmatrix}
p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\
\vdots & \ddots & \vdots \\
p_m \cdot x_1 & \cdots & p_m \cdot x_n
\end{bmatrix}
$$
We can note the form of each column of Y.
$$
y_i = \begin{bmatrix} p_1 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}
$$
We recognize that each coefficient of $y_i$ is a dot product of $x_i$ with the corresponding row in P. In other words, the $j$-th coefficient of $y_i$ is a projection onto the $j$-th row of P. This is in fact the very form of an equation where $y_i$ is a projection onto the basis of $\{p_1, \ldots, p_m\}$. Therefore, the rows of P are a new set of basis vectors for representing the columns of X.
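This projection reading is easy to check numerically; a sketch under the same assumed NumPy setup as above:

```python
import numpy as np

m, n = 6, 10
X = np.random.randn(m, n)
P, _ = np.linalg.qr(np.random.randn(m, m))   # stand-in orthonormal basis
Y = P @ X

# The j-th coefficient of column y_i is the dot product of x_i with row p_j,
# i.e. the projection of x_i onto the j-th new basis vector.
i, j = 3, 2
assert np.isclose(Y[j, i], P[j] @ X[:, i])
```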
C. Questions Remaining
By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors $\{p_1, \ldots, p_m\}$ in this transformation will become the principal components of X. Several questions now arise.
• What is the best way to re-express X?
• What is a good choice of basis P?
These questions must be answered by next asking ourselves what features we would like Y to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.
IV. VARIANCE AND THE GOAL
Now comes the most important question: what does "best express the data" mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions.
A. Noise and Rotation
Measurement noise in any data set must be low or else, no
matter the analysis technique, no information about a signal