An Introductory Guide to R: Simple Statistics and the simpleR Package

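"simpleR: Using R for Introductory Statistics" by John Verzani is a set of notes for beginners, meant to accompany an introductory statistics text such as Kitchens' "Exploring Statistics". Its aim is neither to show everything R can do nor to replace a standard textbook, but to cover the key features of R that can be learned over a one-semester introductory statistics course. The notes were written with R version 1.5.0 or later in mind. The author uses the equals sign (=) as the assignment operator rather than the traditional arrow (<-); this feature was added in R 1.4.0, so readers with an older version may need to adjust some of the examples. The data sets and functions used in the notes must be installed before use. For Windows users this usually means downloading a "zip" file and installing it as directed; the exact steps differ from system to system. The notes suit beginners who want to learn introductory statistics through R, and they show how to apply the ideas to real data.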

Univariate Data page 12

The .25 and .75 quantiles are denoted the quartiles. The first quartile is called Q1, and the third quartile is called Q3. (You’d think the second quartile would be called Q2, but use “the median” instead.) These values are in the R function summary. More generally, there is a quantile function which will compute any quantile between 0 and 1. To find the quantiles mentioned above we can do

> data=c(10, 17, 18, 25, 28, 28)

> summary(data)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.00 17.25 21.50 21.00 27.25 28.00

> quantile(data,.25)

25%

17.25

> quantile(data,c(.25,.75)) # two values of p at once

25% 75%

17.25 27.25

There is a historically popular set of alternatives to the quartiles, called the hinges, that are somewhat easier to compute by hand. The median is defined as above. The lower hinge is then the median of all the data to the left of the median, not counting this particular data point (if it is one). The upper hinge is similarly defined. For example, if your data is again 10, 17, 18, 25, 28, 28, then the median is 21.5, the lower hinge is the median of 10, 17, 18 (which is 17), and the upper hinge is the median of 25, 28, 28 (which is 28). These are available in the function fivenum(), and later appear in the boxplot function.

Here is an illustration with the sals data, which has n = 10. From above we should have the median at (10+1)/2 = 5.5, the lower hinge at the 3rd value and the upper hinge at the 8th value of the sorted data. Whereas, the value of Q1 should be at the 1 + (10 − 1)(1/4) = 3.25 value. We can check that this is the case by sorting the data

> sort(sals)

[1] 0.25 0.40 1.00 2.00 3.00 4.00 5.00 8.00 12.00 50.00

> fivenum(sals) # note 1 is the 3rd value, 8 the 8th.

[1] 0.25 1.00 3.50 8.00 50.00

> summary(sals) # note 3.25 value is 1/4 way between 1 and 2

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.250 1.250 3.500 8.565 7.250 50.000
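The positions quoted above can also be checked by hand. What follows is a minimal sketch, assuming sals holds the ten salaries entered later in these notes; the names pos, lo, frac, q1.by.hand, lower.hinge and upper.hinge are made up for illustration and are not part of R. The hinge computation here only covers the case of an even number of observations.

```r
# sals as entered later in these notes (salaries in millions)
sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25)
x = sort(sals)
n = length(x)

# Quartile position used by summary() and quantile(): 1 + (n - 1) * p
pos = 1 + (n - 1) * 0.25          # 3.25
lo = floor(pos)                   # the 3rd sorted value
frac = pos - lo                   # 1/4 of the way to the 4th value
q1.by.hand = x[lo] + frac * (x[lo + 1] - x[lo])

# Hinges for even n: medians of the lower and upper halves
lower.hinge = median(x[1:(n / 2)])
upper.hinge = median(x[(n / 2 + 1):n])

q1.by.hand                        # compare with quantile(sals, .25)
c(lower.hinge, upper.hinge)       # compare with fivenum(sals)[c(2, 4)]
```

The interpolated value is what summary() reports as the 1st Qu., while the hinges are actual data values here, which is why the two summaries disagree slightly.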

Resistant measures of center and spread

The most used measures of center and spread are the mean and standard deviation due to their relationship with the normal distribution, but they suffer when the data has long tails or many outliers. Various measures of center and spread have been developed to handle this. The median is just such a resistant measure. It is oblivious to a few arbitrarily large values. That is, if you make a measurement mistake and get 1,000,000 for the largest value instead of 10, the median will be indifferent.
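A small sketch of this point, using made-up measurements (the vectors good and bad are hypothetical and not data from these notes):

```r
# Hypothetical measurements, made up for illustration
good = c(2, 4, 5, 7, 10)
bad = c(2, 4, 5, 7, 1000000)   # largest value recorded wrongly

median(good)   # 5
median(bad)    # 5, unchanged by the mistake
mean(good)     # 5.6
mean(bad)      # 200003.6, dragged far from the bulk of the data
```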

Other resistant measures are available. A common one for the center is the trimmed mean. This is useful if the data has many outliers (like the CEO compensation, although better if the data is symmetric). We trim off a certain percentage of the data from the top and the bottom and then take the average. To do this in R we need to tell the mean() function how much to trim.

> mean(sals,trim=1/10) # trim 1/10 off top and bottom

[1] 4.425

> mean(sals,trim=2/10)

[1] 3.833333

Notice as we trim more and more, the value of the mean gets closer to the median, which is what we get when trim=1/2. Again notice how we used a named argument to the mean function.
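To see what trimming does, here is a rough sketch of the 1/10 trimmed mean computed by hand, assuming sals holds the ten salaries used throughout these notes. The names k and trimmed.by.hand are illustrative only.

```r
sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25)
x = sort(sals)
n = length(x)

k = floor(n * 1/10)                          # values dropped from each end
trimmed.by.hand = mean(x[(k + 1):(n - k)])   # average of what is left
trimmed.by.hand                              # compare with mean(sals, trim=1/10)

mean(sals, trim=1/2)                         # fully trimmed: same as median(sals)
```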

The variance and standard deviation are also sensitive to outliers. Resistant measures of spread include the IQR and the mad.

The IQR or interquartile range is the difference of the 3rd and 1st quartile. The function IQR calculates it for us

> IQR(sals)

[1] 6

simpleR – Using R for Introductory Statistics


The median absolute deviation (MAD) is also a useful, resistant measure of spread. It finds the median of the absolute differences from the median and then multiplies by a constant. (Huh?) Here is a formula

$\mathrm{median}\,|X_i - \mathrm{median}(X)| \cdot 1.4826$

That is, find the median, then find all the differences from the median. Take the absolute value and then find the median of this new set of data. Finally, multiply by the constant. It is easier to do with R than to describe.

> mad(sals)

[1] 4.15128

And to see that we could do this ourselves, we would do

> median(abs(sals - median(sals))) # without normalizing constant

[1] 2.8

> median(abs(sals - median(sals))) * 1.4826

[1] 4.15128

(The choice of 1.4826 makes the value comparable with the standard deviation for the normal distribution.)
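Where the constant comes from can be sketched in a couple of lines: it is approximately the reciprocal of the 0.75 quantile of the standard normal distribution. The simulated sample z below is made up for illustration (the seed and sample size are arbitrary choices), and the name const is not part of R.

```r
const = 1 / qnorm(0.75)    # reciprocal of the standard normal 0.75 quantile
round(const, 4)            # 1.4826

set.seed(1)                           # arbitrary seed, for reproducibility
z = rnorm(10000, mean = 0, sd = 2)    # simulated normal data with sd 2
sd(z)                                 # close to 2
mad(z)                                # also close to 2
```

For normal data the two measures of spread roughly agree; for data with outliers, such as sals, they can differ widely.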

Stem-and-leaf Charts

There is a range of graphical summaries of data. If the data set is relatively small, the stem-and-leaf diagram is very useful for seeing the shape of the distribution and the values. It takes a little getting used to. The number on the left of the bar is the stem, the number on the right is the leaf (the final digit). You put them together to find the observation.

Suppose you have the box score of a basketball game and find the following points scored by the players on both teams

2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5

To create a stem and leaf chart is simple

> scores = scan()

1: 2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5

21:

Read 20 items

> apropos("stem") # What exactly is the name?

[1] "stem" "system" "system.file" "system.time"

> stem(scores)

The decimal point is 1 digit(s) to the right of the |

0 | 000222344568

1 | 23446

2 | 38

3 | 1

R Basics: help, ? and apropos

Notice we use apropos() to help find the name for the function. It is stem() and not stemleaf(). The apropos() command is convenient when you think you know the function’s name but aren’t sure. The help command will help us find help on the given function or dataset once we know the name. For example help(stem) or the abbreviated ?stem will display the documentation on the stem function.
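A short sketch of the same lookup done non-interactively. The exact list returned by apropos() depends on the R version and on which packages are loaded, so no particular output is assumed here beyond stem itself being found; the name hits is illustrative.

```r
hits = apropos("stem")   # every visible object whose name contains "stem"
"stem" %in% hits         # TRUE: the function is called stem
args(stem)               # shows the arguments stem() accepts, including scale
```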

Suppose we wanted to break up the categories into groups of 5. We can do so by setting the “scale”

> stem(scores,scale=2)

The decimal point is 1 digit(s) to the right of the |

0 | 000222344

0 | 568

1 | 2344

1 | 6

2 | 3

2 | 8

3 | 1
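One way to see what the two displays are doing is to count the observations that fall on each row. This is only a sketch: it uses integer division to mimic the grouping, which is not how stem() is implemented, and the names by.tens and by.fives are illustrative.

```r
scores = c(2, 3, 16, 23, 14, 12, 4, 13, 2, 0, 0, 0, 6, 28, 31, 14, 4, 8, 2, 5)

by.tens = table(scores %/% 10)    # one count per stem of the default display
by.fives = table(scores %/% 5)    # one count per row of stem(scores, scale=2)
by.tens
by.fives
```

The counts match the number of leaves on each row of the two stem-and-leaf charts above.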


Example: Making numeric data categorical

Categorical variables can come from numeric variables by aggregating values. For example, the salaries could be placed into broad categories of 0-1 million, 1-5 million and over 5 million. To do this using R one uses the cut() function and the table() function.

Suppose the salaries are again

12 .4 5 2 50 8 3 1 4 .25

And we want to break that data into the intervals

[0, 1], (1, 5], (5, 50]

To use the cut command, we need to specify the cut points. In this case 0,1,5 and 50 (=max(sals)). Here is the

syntax

> sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25) # enter data

> cats = cut(sals,breaks=c(0,1,5,max(sals))) # specify the breaks

> cats # view the values

[1] (5,50] (0,1] (1,5] (1,5] (5,50] (5,50] (1,5] (0,1] (1,5] (0,1]

Levels: (0,1] (1,5] (5,50]

> table(cats) # organize

cats

(0,1] (1,5] (5,50]

3 4 3

> levels(cats) = c("poor","rich","rolling in it") # change labels

> table(cats)

cats

poor rich rolling in it

3 4 3

Notice, cut() answers the question “which interval is the number in?”. The output is the interval (as a factor). This is why the table command is used to summarize the result of cut. Additionally, the names of the levels were changed as an illustration of how to manipulate these.
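As a variation, the labels can be given in the call to cut() itself, and the first interval can be closed on the left so that it really is [0, 1]. This is a sketch of one possible alternative, not the approach taken in these notes.

```r
sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25)

# include.lowest=TRUE closes the first interval on the left: [0,1]
cats = cut(sals, breaks=c(0, 1, 5, max(sals)),
           labels=c("poor", "rich", "rolling in it"),
           include.lowest=TRUE)
table(cats)

# A salary of exactly 0 is only categorized when include.lowest=TRUE
cut(0, breaks=c(0, 1, 5, 50))                        # NA
cut(0, breaks=c(0, 1, 5, 50), include.lowest=TRUE)   # [0,1]
```

For the salaries above the counts are the same either way, since the smallest value is .25.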

Histograms

If there is too much data, or your audience doesn’t know how to read the stem-and-leaf, you might try other summaries. The most common is similar to the bar plot and is a histogram. The histogram defines a sequence of breaks and then counts the number of observations in the bins formed by the breaks. (This is identical to the features of the cut() function.) It plots these with a bar similar to the bar chart, but the bars are touching. The height can be the frequencies, or the proportions. In the latter case the areas sum to 1, a property that will sound familiar when you study probability distributions. In either case the area is proportional to probability.

Let’s begin with a simple example. Suppose the top 25 ranked movies made the following gross receipts for a

week

29.6 28.2 19.6 13.7 13.0 7.8 3.4 2.0 1.9 1.0 0.7 0.4 0.4 0.3

0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1

Let’s visualize it (figure 3). First we scan it in, then make some histograms

> x=scan()

1: 29.6 28.2 19.6 13.7 13.0 7.8 3.4 2.0 1.9 1.0 0.7 0.4 0.4 0.3 0.3

16: 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1

27:

Read 26 items

> hist(x) # frequencies

> hist(x,probability=TRUE) # proportions (or probabilities)

> rug(jitter(x)) # add tick marks


Figure 3: Histograms using frequencies and proportions (two panels, each titled “Histogram of x”)

Two graphs are shown. The first is the default graph which makes a histogram of frequencies (total counts). The second does a histogram of proportions which makes the total area add to 1. This is preferred as it relates better to the concept of a probability density. Note the only difference is the scale on the y axis.
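The numbers behind the two graphs can be inspected without drawing anything. Here is a minimal sketch, assuming x holds the 26 movie receipts scanned in above; the names h and by.cut are illustrative.

```r
x = c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0, 0.7, 0.4, 0.4,
      0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)

h = hist(x, plot=FALSE)           # compute the bins without drawing
h$breaks                          # the break points hist() chose
h$counts                          # bar heights of the frequency histogram
h$density                         # bar heights of the proportion histogram

sum(h$counts)                     # the number of observations
sum(h$density * diff(h$breaks))   # total area of the bars: 1

# The same counts via cut() and table()
by.cut = table(cut(x, breaks=h$breaks, include.lowest=TRUE))
by.cut
```

The density heights are the counts divided by the number of observations times the bin width, which is why only the y axis scale changes between the two graphs.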

A nice addition to the histogram is to plot the points using the rug command. It was used above in the second

graph to give the tick marks just above the x-axis. If your data is discrete and has ties, then the rug(jitter(x))

command will give a little jitter to the x values to eliminate ties.

Notice these commands opened up a graph window. The graph window in R has few options available using the

mouse, but many using command line options. The GGobi (http://www.ggobi.org/) package has more but requires

an extra software installation.

The basic histogram has a predefined set of break points for the bins. If you want, you can specify the number of breaks or your own break points (figure 4).

> hist(x,breaks=10) # 10 breaks, or just hist(x,10)

> hist(x,breaks=c(0,1,2,3,4,5,10,20,max(x))) # specify break points

Figure 4: Histograms with breakpoints specified (two panels, titled “many breaks” and “few breaks”, each with Density on the y axis)
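When the break points are not equally spaced, hist() draws densities rather than counts by default, so that the area of each bar still reflects the proportion of data in it. A small sketch of this, again assuming x holds the 26 movie receipts and using plot=FALSE so that nothing is drawn:

```r
x = c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0, 0.7, 0.4, 0.4,
      0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)

h = hist(x, breaks=c(0, 1, 2, 3, 4, 5, 10, 20, max(x)), plot=FALSE)
h$equidist                        # FALSE: the bins have different widths
h$counts                          # observations per bin
h$density                         # count / (n * bin width), the heights drawn
sum(h$density * diff(h$breaks))   # the area still totals 1
```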

From the histogram, you can easily make guesses as to the values of the mean, the median, and the IQR. To do so, you need to know that the median divides the histogram into two equal area pieces, the mean would be the point where the histogram would balance if you tried to balance it, and the IQR captures exactly the middle half of the data.

Boxplots

The boxplot (e.g. figure 5) is used to summarize data succinctly, quickly displaying if the data is symmetric or has suspected outliers. It is based on the 5-number summary. In its simplest usage, the boxplot has a box with lines at the lower hinge (basically Q1), the median, the upper hinge (basically Q3) and whiskers which extend to the min and max. To showcase possible outliers, a convention is adopted to shorten the whiskers to a length of at most 1.5 times the box length. Any points beyond that are plotted with points. These may further be marked differently if the data is more than 3 box lengths away. Thus the boxplot allows us to check quickly for symmetry (the shape looks unbalanced) and outliers (lots of data points beyond the whiskers). In figure 5 we see a skewed distribution with a long tail.

Figure 5: A typical boxplot (the annotations mark Min, Q1, Median, Q3, Q3 + 1.5*IQR, Max and the outliers, and point out the skewed distribution and the presence of outliers)

Footnote: Weekly movie receipts data such as that used for the histograms is available from movieweb.com (http://movieweb.com/movie/top25.html)
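The whisker convention can be checked numerically with boxplot.stats(), which returns the numbers a boxplot is drawn from. A minimal sketch, assuming sals holds the ten salaries used earlier; the names b, box.length and upper.fence are illustrative.

```r
sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25)

b = boxplot.stats(sals)   # the numbers behind boxplot(), default coef=1.5
b$stats                   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                     # points drawn individually as suspected outliers

box.length = b$stats[4] - b$stats[2]          # upper hinge minus lower hinge
upper.fence = b$stats[4] + 1.5 * box.length   # limit for the upper whisker
sals[sals > upper.fence]                      # here the same points as b$out
```

The upper whisker stops at the largest observation inside the fence rather than at the fence itself, and the one salary beyond it is flagged as an outlier.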

Example: Movie sales, reading in a dataset

In this example, we look at data on movie revenues for the 25 biggest movies of a given week. Along the way,

we also introduce how to “read-in” a built-in data set. The data set here is from the data sets accompanying these

notes.

> library("Simple") # read in library for these notes

> data(movies) # read in data set for gross.

> names(movies)

[1] "title" "current" "previous" "gross"

> attach(movies) # to access the names above

> boxplot(current,main="current receipts",horizontal=TRUE)

> boxplot(gross,main="gross receipts",horizontal=TRUE)

> detach(movies) # tidy up

We plot both the current sales and the gross sales in a boxplot (figure 6).

Notice, both distributions are skewed, but the gross sales are less so. This shows why Hollywood is so interested

in the “big hit”, as a real big hit can generate a lot more revenue than quite a few medium sized hits.

R Basics: Reading in datasets with library and data

In the above example we read in a built-in dataset. Doing so is easy. Let’s see how to read in a dataset from

the package ts (time series functions). First we need to load the package, and then ask to load the data. Here is how

> library("ts") # load the library

> data("lynx") # load the data

> summary(lynx) # Just what is lynx?

Min. 1st Qu. Median Mean 3rd Qu. Max.

39.0 348.3 771.0 1538.0 2567.0 6991.0
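A note for readers on a recent version of R: there the time series functions live in the stats package and the lynx data in the datasets package, both attached by default, so the library("ts") call above is no longer needed and may fail. A sketch of the equivalent steps under that assumption:

```r
data("lynx")        # optional: the datasets package is attached by default
class(lynx)         # "ts", a time series object
start(lynx)         # first year of the series
end(lynx)           # last year of the series
summary(lynx)       # same kind of summary as shown above
```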

The library and data commands can be used in several different ways

The data sets for these notes are available from the CSI math department (http://www.math.csi.cuny.edu/Statistics/R/simpleR) and must be installed prior to this.
