The .25 and .75 quantiles are called the quartiles. The first quartile is called Q1, and the third quartile is
called Q3. (You’d think the second quartile would be called Q2, but we use “the median” instead.) These values are
reported by the R function summary(). More generally, there is a quantile() function which will compute any
quantile between 0 and 1. To find the quantiles mentioned above we can do
> data=c(10, 17, 18, 25, 28, 28)
> summary(data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   17.25   21.50   21.00   27.25   28.00
> quantile(data,.25)
25%
17.25
> quantile(data,c(.25,.75)) # two values of p at once
  25%   75%
17.25 27.25
There is a historically popular set of alternatives to the quartiles, called the hinges, which are somewhat easier
to compute by hand. The median is defined as above. The lower hinge is then the median of all the data to the left
of the median, not counting the median itself if it is one of the data points. The upper hinge is similarly defined.
For example, if your data is again 10, 17, 18, 25, 28, 28, then the median is 21.5, the lower hinge is the median
of 10, 17, 18 (which is 17), and the upper hinge is the median of 25, 28, 28 (which is 28). These are available in
the function fivenum(), and later appear in the boxplot function.
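The hand computation just described can be sketched in R as follows. This is only an illustration for this small, even-length data set; the variable names (half, lower, upper) are made up for the example, and the agreement with fivenum() is checked for this data only, not claimed in general.

```r
# Hinges by hand for the six-point example above.
x <- sort(c(10, 17, 18, 25, 28, 28))
n <- length(x)
half <- floor(n / 2)              # number of points on each side of the median

lower <- x[1:half]                # 10, 17, 18
upper <- x[(n - half + 1):n]      # 25, 28, 28

lower_hinge <- median(lower)      # 17
upper_hinge <- median(upper)      # 28
c(lower_hinge, median(x), upper_hinge)

# Compare with R's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(x)
```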
Here is an illustration with the sals data, which has n = 10. From above we should have the median at position
(10+1)/2 = 5.5, the lower hinge at the 3rd value and the upper hinge at the 8th value of the sorted data. By
contrast, Q1 should be at position 1 + (10 − 1)(1/4) = 3.25. We can check that this is the case by sorting the data
> sort(sals)
[1] 0.25 0.40 1.00 2.00 3.00 4.00 5.00 8.00 12.00 50.00
> fivenum(sals) # note 1 is the 3rd value, 8 the 8th.
[1] 0.25 1.00 3.50 8.00 50.00
> summary(sals) # note 3.25 value is 1/4 way between 1 and 2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.250   1.250   3.500   8.565   7.250  50.000
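To make the "3.25 value" concrete, here is a small sketch that computes Q1 from the position formula 1 + (n − 1)p by interpolating between neighboring sorted values. The sals vector is re-entered here from the sorted output shown above so the snippet stands alone, and the helper names (pos, lo, hi, frac) are made up for the illustration.

```r
# sals re-entered from the sorted output above (order does not matter here)
sals <- c(0.25, 0.40, 1, 2, 3, 4, 5, 8, 12, 50)
x <- sort(sals)
n <- length(x)

p <- 0.25
pos <- 1 + (n - 1) * p            # 3.25: a quarter of the way from the 3rd to the 4th value
lo <- floor(pos)                  # 3
hi <- ceiling(pos)                # 4
frac <- pos - lo                  # 0.25

q1 <- x[lo] + frac * (x[hi] - x[lo])   # 1 + 0.25 * (2 - 1) = 1.25
q1

# Same value from the built-in function (default settings)
quantile(x, p)
```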
Resistant measures of center and spread
The most used measures of center and spread are the mean and standard deviation, due to their relationship with
the normal distribution, but they suffer when the data has long tails or many outliers. Various measures of center
and spread have been developed to handle this. The median is just such a resistant measure. It is oblivious to a few
arbitrarily large values. That is, if you make a measurement mistake and get 1,000,000 for the largest value instead
of 10, the median will be indifferent.
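The measurement-mistake scenario can be tried directly. The five values below are hypothetical, chosen only so that the largest one is 10; they are not data from these notes.

```r
# Hypothetical measurements whose largest value is 10
x <- c(2, 4, 5, 7, 10)

# The same data with the largest value mistakenly recorded as 1,000,000
x_bad <- replace(x, which.max(x), 1e6)

c(median(x), median(x_bad))   # the median is 5 in both cases
c(mean(x), mean(x_bad))       # the mean jumps from 5.6 to 200003.6
```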
Other resistant measures are available. A common one for the center is the trimmed mean. This is useful if the
data has many outliers (like the CEO compensation data, although it works better if the data is symmetric). We trim
off a certain percentage of the data from the top and the bottom and then take the average. To do this in R we need
to tell mean() how much to trim.
> mean(sals,trim=1/10) # trim 1/10 off top and bottom
[1] 4.425
> mean(sals,trim=2/10)
[1] 3.833333
Notice that as we trim more and more, the value of the mean gets closer to the median, which is what we get when
trim=1/2. Again notice how we used a named argument to the mean function.
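As a check on what trimming does, the following sketch drops the values by hand and compares the result with mean(). The sals vector is re-entered from the sorted output shown earlier, and k and by_hand are names invented for this example.

```r
# sals re-entered from the sorted output shown earlier
sals <- c(0.25, 0.40, 1, 2, 3, 4, 5, 8, 12, 50)
x <- sort(sals)
n <- length(x)

# Trimming 1/10 of 10 values drops one value from each end
k <- floor(n * 0.1)
by_hand <- mean(x[(k + 1):(n - k)])   # average of the middle 8 values
by_hand
mean(sals, trim = 0.1)                # same value

# Trimming half from each side leaves only the median
mean(sals, trim = 0.5)
median(sals)
```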
The variance and standard deviation are also sensitive to outliers. Resistant measures of spread include the IQR
and the mad.
The IQR, or interquartile range, is the difference of the 3rd and 1st quartiles. The function IQR calculates it
for us
> IQR(sals)
[1] 6
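The value 6 is just Q3 minus Q1 from the summary() output above (7.25 − 1.25). A short sketch confirming this, with sals re-entered from the sorted output shown earlier; for comparison it also computes the distance between the two hinges, which is a different number for this data. The names q and hinge_spread are made up for the example.

```r
# sals re-entered from the sorted output shown earlier
sals <- c(0.25, 0.40, 1, 2, 3, 4, 5, 8, 12, 50)

q <- quantile(sals, c(0.25, 0.75))    # Q1 = 1.25, Q3 = 7.25
unname(q[2] - q[1])                   # 6
IQR(sals)                             # 6

# Distance between the hinges (2nd and 4th entries of fivenum): 8 - 1 = 7
hinge_spread <- fivenum(sals)[4] - fivenum(sals)[2]
hinge_spread
```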
simpleR – Using R for Introductory Statistics