Chapter 1 Aspects of Multivariate Analysis
or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, $s_{12}$ will be positive. If large values from one variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no particular association between the values for the two variables, $s_{12}$ will be approximately zero.
The sample covariance
$$
s_{ik} = \frac{1}{n}\sum_{j=1}^{n}(x_{ji}-\bar{x}_i)(x_{jk}-\bar{x}_k), \qquad i = 1,2,\ldots,p,\quad k = 1,2,\ldots,p \tag{1-4}
$$
measures the association between the $i$th and $k$th variables. We note that the covariance reduces to the sample variance when $i = k$. Moreover, $s_{ik} = s_{ki}$ for all $i$ and $k$.
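As a concrete check, the sample covariance in (1-4) can be computed directly from its definition. The sketch below uses NumPy with a small invented data set chosen purely for illustration:

```python
import numpy as np

# Invented illustrative data: n = 5 observations on p = 2 variables
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2],
              [5.0, 9.8]])
n = X.shape[0]

def sample_cov(X, i, k):
    """Sample covariance s_ik from (1-4), using the divisor n."""
    xi_bar = X[:, i].mean()
    xk_bar = X[:, k].mean()
    return np.sum((X[:, i] - xi_bar) * (X[:, k] - xk_bar)) / n

s12 = sample_cov(X, 0, 1)   # association between variables 1 and 2
s11 = sample_cov(X, 0, 0)   # the covariance reduces to the variance when i = k
```

Note that `sample_cov(X, 0, 1)` and `sample_cov(X, 1, 0)` agree, reflecting the symmetry $s_{ik} = s_{ki}$.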
The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [14]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the $i$th and $k$th variables is defined as
$$
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\displaystyle\sum_{j=1}^{n}(x_{ji}-\bar{x}_i)(x_{jk}-\bar{x}_k)}{\sqrt{\displaystyle\sum_{j=1}^{n}(x_{ji}-\bar{x}_i)^2}\;\sqrt{\displaystyle\sum_{j=1}^{n}(x_{jk}-\bar{x}_k)^2}} \tag{1-5}
$$
for $i = 1,2,\ldots,p$ and $k = 1,2,\ldots,p$. Note $r_{ik} = r_{ki}$ for all $i$ and $k$.
The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that $r_{ik}$ has the same value whether $n$ or $n-1$ is chosen as the common divisor for $s_{ii}$, $s_{kk}$, and $s_{ik}$.
The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance. Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by the standardized values $(x_{ji}-\bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk}-\bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.
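Both claims above are easy to verify numerically. The sketch below, with invented paired data, checks that $r_{ik}$ equals the sample covariance of the standardized values, and that $r_{ik}$ is the same whether $n$ or $n-1$ is used as the common divisor:

```python
import numpy as np

# Invented data: n = 6 paired observations on two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = len(x)

# Sample variances and covariance with divisor n
sxx = np.sum((x - x.mean())**2) / n
syy = np.sum((y - y.mean())**2) / n
sxy = np.sum((x - x.mean()) * (y - y.mean())) / n

# r from (1-5)
r = sxy / np.sqrt(sxx * syy)

# Covariance of the standardized observations gives the same value
zx = (x - x.mean()) / np.sqrt(sxx)
zy = (y - y.mean()) / np.sqrt(syy)
r_as_cov = np.sum(zx * zy) / n

# Switching every divisor from n to n - 1 leaves r unchanged:
# the factors n/(n - 1) cancel in the ratio
r_nminus1 = (sxy * n / (n - 1)) / np.sqrt((sxx * n / (n - 1)) * (syy * n / (n - 1)))
```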
Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation $r$ has the following properties:
1. The value of $r$ must be between $-1$ and $+1$ inclusive.
2. Here $r$ measures the strength of the linear association. If $r = 0$, this implies a lack of linear association between the components. Otherwise, the sign of $r$ indicates the direction of the association: $r < 0$ implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and $r > 0$ implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of $r_{ik}$ remains unchanged if the measurements of the $i$th variable are changed to $y_{ji} = a x_{ji} + b$, $j = 1,2,\ldots,n$, and the values of the $k$th variable are changed to $y_{jk} = c x_{jk} + d$, $j = 1,2,\ldots,n$, provided that the constants $a$ and $c$ have the same sign.
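Property 3 can be checked directly. In the sketch below (invented data), transforming the variables linearly with same-sign constants $a$ and $c$ leaves $r$ unchanged, while an opposite-sign pair flips only its sign:

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation r from (1-5)."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Invented measurements on the ith and kth variables
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.5, 4.0, 6.5, 8.0])

r_original = pearson_r(x, y)
# y_ji = a*x_ji + b and y_jk = c*x_jk + d with a = 3, c = 0.5 (same sign)
r_scaled = pearson_r(3.0 * x + 10.0, 0.5 * y - 2.0)
# With a = -3 and c = 0.5 (opposite signs), only the sign of r flips
r_flipped = pearson_r(-3.0 * x + 10.0, 0.5 * y - 2.0)
```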
The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists.
In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present.
Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of $s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.
The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are
$$
w_{kk} = \sum_{j=1}^{n}(x_{jk}-\bar{x}_k)^2, \qquad k = 1,2,\ldots,p \tag{1-6}
$$
and
$$
w_{ik} = \sum_{j=1}^{n}(x_{ji}-\bar{x}_i)(x_{jk}-\bar{x}_k), \qquad i = 1,2,\ldots,p,\quad k = 1,2,\ldots,p \tag{1-7}
$$
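The quantities in (1-6) and (1-7) differ from $s_{kk}$ and $s_{ik}$ only by the divisor $n$, as a quick numeric check confirms (the small data matrix is invented for illustration):

```python
import numpy as np

# Invented data: n = 4 observations on p = 2 variables
X = np.array([[1.0, 3.0],
              [2.0, 5.0],
              [4.0, 4.0],
              [5.0, 8.0]])
n = X.shape[0]
d = X - X.mean(axis=0)          # deviations from the column means

w11 = np.sum(d[:, 0]**2)        # (1-6) with k = 1
w12 = np.sum(d[:, 0] * d[:, 1]) # (1-7) with i = 1, k = 2

# Dividing by n recovers the sample variance and covariance of (1-4)
s11 = w11 / n
s12 = w12 / n
```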
The descriptive statistics computed from $n$ measurements on $p$ variables can also be organized into arrays.
Arrays of Basic Descriptive Statistics

Sample means
$$
\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}
$$
Sample variances and covariances
$$
\mathbf{S}_n = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix} \tag{1-8}
$$
Sample correlations
$$
\mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}
$$
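These arrays are straightforward to assemble from a data matrix. A sketch with invented data follows; note that NumPy's `np.cov` defaults to the divisor $n-1$, so `bias=True` is needed to match the divisor-$n$ convention of (1-8):

```python
import numpy as np

# Invented data matrix: n = 5 observations (rows) on p = 3 variables (columns)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 4.0, 1.1],
              [3.0, 5.5, 0.9],
              [4.0, 8.1, 1.8],
              [5.0, 9.9, 2.2]])

x_bar = X.mean(axis=0)                       # vector of p sample means
S_n = np.cov(X, rowvar=False, bias=True)     # p x p array of s_ik (divisor n)
R = np.corrcoef(X, rowvar=False)             # p x p array of r_ik, 1's on the diagonal
```

Both `S_n` and `R` are symmetric, mirroring $s_{ik} = s_{ki}$ and $r_{ik} = r_{ki}$.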