Extending the Linear Model with R: An E-Book Guide


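"Extending the Linear Model with R" is a book about extending linear models using the R language. This second edition covers generalized linear models, mixed effects models and nonparametric regression models. The e-book is delivered through the VitalSource® Bookshelf platform, which supports reading, note-taking and highlighting, as well as offline reading and synchronization across devices.

In data analysis, the linear model is a basic yet powerful tool for exploring and understanding relationships between variables. Real problems, however, often involve nonlinear relationships, categorical response variables or random effects, and these call for extensions of the linear model. The book aims to help readers understand and apply these extended models, particularly when working in R.

First, generalized linear models (GLMs) extend the classical linear model to response variables that are not normally distributed, such as binomial responses (binary outcomes such as success/failure) and Poisson responses (count data). The key ingredient of a GLM is the link function, which relates the linear predictor to the expected value of the response so that a range of distributions can be accommodated.

Second, mixed effects models account for hierarchical structure or correlation in the data, for example in time series data, repeated measures or nested designs. Such models contain both fixed effects and random effects: fixed effects are the parameters of direct interest to the researcher, while random effects reflect the inherent variation in the data. The `lme4` package is a commonly used tool for fitting these models in R.

Third, nonparametric regression models relax the assumptions about the form of the model and let the functional form adapt to the data. They include local linear regression, splines and kernel smoothing methods, which are useful when handling complex dependencies or unknown trends.

The book demonstrates how to build and analyze these models through examples and code in R. The VitalSource® Bookshelf e-book service lets readers read online or offline and synchronizes reading progress, notes and highlights across devices. To use VitalSource Bookshelf, readers first create an account or log in to an existing one, then enter the supplied redemption code to obtain online access to the e-book. After downloading the Bookshelf application to a personal computer, mobile device or Kindle Fire and logging in, the e-book can be read offline.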


Chapter 1

Introduction

This book is about extending the linear model methodology using R statistical software. Before setting off on this journey, it is worth reviewing both linear models and R. We shall not attempt a detailed description of linear models; the reader is advised to consult texts such as Faraway (2014) or Draper and Smith (1998). We do not intend this as a self-contained introduction to R as this may be found in books such as Dalgaard (2002) or Maindonald and Braun (2010) or from guides obtainable from the R website. Even so, a reader unfamiliar with R should be able to follow the intent of the analysis and learn a little R in the process without further preparation.

Let’s consider an example. The 2000 United States Presidential election generated much controversy, particularly in the state of Florida where there were some difficulties with the voting machinery. In Meyer (2002), data on voting in the state of Georgia is presented and analyzed.

Let’s take a look at this data using R. Please refer to Appendix B for details on obtaining and installing R along with the necessary add-on packages and data for running the examples in this text. In this book, we denote R commands with bold text in a grey box. You should type this in at the command prompt: >. We start by loading the data:

data(gavote, package="faraway")

The data command loads the particular dataset into R. The name of the dataset is gavote and it is being loaded from the package faraway. If you get an error message about a package not being found, it probably means you have not installed the faraway package. Please check the Appendix.
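The "package not found" situation can also be checked for before loading. The following is a minimal sketch, assuming only base R; the variable name loaded is introduced here for illustration and is not part of the text.

```r
# Check whether the faraway package is installed before asking for its data.
# 'loaded' is a hypothetical flag used only in this sketch.
if (requireNamespace("faraway", quietly = TRUE)) {
  data(gavote, package = "faraway")   # same command as in the text
  loaded <- exists("gavote")
} else {
  message("Package 'faraway' is not installed; try install.packages(\"faraway\")")
  loaded <- FALSE
}
loaded
```

requireNamespace returns FALSE rather than stopping with an error, which makes it convenient for this kind of check.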

An alternative means of making the data available is to load the faraway package:

library(faraway)

This will make all the data and functions in this package available for this R session.

In R, the object containing the data is called a dataframe. We can obtain definitions of the variables and more information about the dataset using the help command:

help(gavote)

You can use the help command to learn more about any of the commands we use.

For example, to learn about the quantile command:

help(quantile)

If you do not already know or guess the name of the command you need, use:

help.search("quantiles")

to learn about all commands that refer to quantiles.
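As a small illustration of what the quantile command returns once you have found it, here is a sketch on a made-up vector (the values of x are invented for this example and are not from the gavote data). The ? and ?? shortcuts shown in the comments are standard R equivalents of help and help.search.

```r
# ?quantile     is shorthand for help(quantile)
# ??quantiles   is shorthand for help.search("quantiles")

# Made-up values, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 10)

# Quartiles using R's default interpolation rule (type = 7)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q
```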

We can examine the contents of the dataframe simply by typing its name:

gavote

equip econ perAA rural atlanta gore bush other votes ballots

APPLING LEVER poor 0.182 rural notAtlanta 2093 3940 66 6099 6617

ATKINSON LEVER poor 0.230 rural notAtlanta 821 1228 22 2071 2149

....

The output in this text is shown in typewriter font. I have deleted most of the output to save space. This dataset is small enough to be comfortably examined in its entirety. Sometimes, we simply want to look at the first few cases. The head command is useful for this:

head(gavote)

equip econ perAA rural atlanta gore bush other votes ballots

APPLING LEVER poor 0.182 rural notAtlanta 2093 3940 66 6099 6617

ATKINSON LEVER poor 0.230 rural notAtlanta 821 1228 22 2071 2149

BACON LEVER poor 0.131 rural notAtlanta 956 2010 29 2995 3347

BAKER OS-CC poor 0.476 rural notAtlanta 893 615 11 1519 1607

BALDWIN LEVER middle 0.359 rural notAtlanta 5893 6041 192 12126 12785

BANKS LEVER middle 0.024 rural notAtlanta 1220 3202 111 4533 4773

The cases in this dataset are the counties of Georgia and the variables are (in order) the type of voting equipment used, the economic level of the county, the proportion of African Americans, whether the county is rural or urban, whether the county is part of the Atlanta metropolitan area, the number of voters for Al Gore, the number of voters for George Bush, the number of voters for other candidates, the number of votes cast, and ballots issued.

The str command is another useful way to examine an R object:

str(gavote)

'data.frame': 159 obs. of 10 variables:

$ equip : Factor w/ 5 levels "LEVER","OS-CC",..: 1 1 1 2 1 1 2 3 3 2 ...

$ econ : Factor w/ 3 levels "middle","poor",..: 2 2 2 2 1 1 1 1 2 2 ...

$ perAA : num 0.182 0.23 0.131 0.476 0.359 0.024 0.079 0.079 0.282 0.107 ...

$ rural : Factor w/ 2 levels "rural","urban": 1 1 1 1 1 1 2 2 1 1 ...

$ atlanta: Factor w/ 2 levels "Atlanta","notAtlanta": 2 2 2 2 2 2 2 1 2 2 ...

$ gore : int 2093 821 956 893 5893 1220 3657 7508 2234 1640 ...

$ bush : int 3940 1228 2010 615 6041 3202 7925 14720 2381 2718 ...

$ other : int 66 22 29 11 192 111 520 552 46 52 ...

$ votes : int 6099 2071 2995 1519 12126 4533 12102 22780 4661 4410 ...

$ ballots: int 6617 2149 3347 1607 12785 4773 12522 23735 5741 4475 ...

We can see that some of the variables, such as the equipment type, are factors. Factor variables are categorical. Other variables are quantitative. The perAA variable is continuous while the others are integer valued. We also see the sample size is 159.
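The distinction between factor, continuous and integer columns can be reproduced on a small scale without the package. The sketch below rebuilds a few columns of the six counties shown in the head output; the object names gavote6, is_factor and col_class are introduced here for illustration.

```r
# Six counties copied from the head(gavote) output; a subset of the columns
gavote6 <- data.frame(
  equip   = factor(c("LEVER", "LEVER", "LEVER", "OS-CC", "LEVER", "LEVER")),
  econ    = factor(c("poor", "poor", "poor", "poor", "middle", "middle")),
  perAA   = c(0.182, 0.230, 0.131, 0.476, 0.359, 0.024),
  votes   = c(6099L, 2071L, 2995L, 1519L, 12126L, 4533L),
  ballots = c(6617L, 2149L, 3347L, 1607L, 12785L, 4773L),
  row.names = c("APPLING", "ATKINSON", "BACON", "BAKER", "BALDWIN", "BANKS")
)

str(gavote6)                              # compact structure, as in the text
is_factor <- sapply(gavote6, is.factor)   # which columns are categorical
col_class <- sapply(gavote6, class)       # "factor", "numeric" or "integer"
levels(gavote6$equip)                     # only the levels present in these rows
```

Because only six rows are used, equip shows two levels here rather than the five reported for the full dataset.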

A potential voter goes to the polling station where it is determined whether he or she is registered to vote. If so, a ballot is issued. However, a vote is not recorded if the person fails to vote for President, votes for more than one candidate or the equipment fails to record the vote. For example, we can see that in Appling county, 6617 − 6099 = 518 ballots did not result in votes for President. This is called the undercount. The purpose of our analysis will be to determine what factors affect the undercount. We will not attempt a full and conclusive analysis here because our main purpose is to illustrate the use of linear models and R. We invite the reader to fill in some of the gaps in the analysis.
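The Appling calculation extends to several counties at once because R arithmetic is vectorized. A minimal sketch using the six counties from the head output (the name undercount_abs is introduced here for illustration):

```r
# Ballots issued and votes recorded, copied from the head(gavote) output
ballots <- c(APPLING = 6617, ATKINSON = 2149, BACON = 3347,
             BAKER = 1607, BALDWIN = 12785, BANKS = 4773)
votes   <- c(6099, 2071, 2995, 1519, 12126, 4533)

# Absolute undercount: ballots that did not result in a vote for President
undercount_abs <- ballots - votes
undercount_abs
```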

Initial Data Analysis: The first stage in any data analysis should be an initial graphical and numerical look at the data. A compact numerical overview is:

summary(gavote)

   equip       econ        perAA          rural          atlanta
 LEVER:74   middle:69   Min.   :0.000   rural:117   Atlanta   : 15
 OS-CC:44   poor  :72   1st Qu.:0.112   urban: 42   notAtlanta:144
 OS-PC:22   rich  :18   Median :0.233
 PAPER: 2               Mean   :0.243
 PUNCH:17               3rd Qu.:0.348
                        Max.   :0.765
      gore              bush             other           votes            ballots
 Min.   :   249   Min.   :   271   Min.   :   5   Min.   :   832   Min.   :   881
 1st Qu.:  1386   1st Qu.:  1804   1st Qu.:  30   1st Qu.:  3506   1st Qu.:  3694
 Median :  2326   Median :  3597   Median :  86   Median :  6299   Median :  6712
 Mean   :  7020   Mean   :  8929   Mean   : 382   Mean   : 16331   Mean   : 16927
 3rd Qu.:  4430   3rd Qu.:  7468   3rd Qu.: 210   3rd Qu.: 11846   3rd Qu.: 12251
 Max.   :154509   Max.   :140494   Max.   :7920   Max.   :263211   Max.   :280975

For the categorical variables, we get a count of the number of each type that occurs. We notice, for example, that only two counties used a paper ballot. This will make it difficult to estimate the effect of this particular voting method on the undercount. For the numerical variables, we have six summary statistics that are sufficient to get a rough idea of the distributions. In particular, we notice that the number of ballots cast ranges over orders of magnitude. This suggests that I should consider the relative, rather than the absolute, undercount. I create this new relative undercount variable, where we specify the variables using the dataframe$variable syntax:

gavote$undercount <- (gavote$ballots-gavote$votes)/gavote$ballots

summary(gavote$undercount)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0000 0.0278 0.0398 0.0438 0.0565 0.1880

We see that the undercount ranges from zero up to as much as 19%. The mean across counties is 4.38%. Note that this is not the same thing as the overall relative undercount which is:

with(gavote, sum(ballots-votes)/sum(ballots))

[1] 0.03518
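The two figures differ because the mean across counties weights every county equally, whereas the overall figure weights each county by its number of ballots. A small sketch on the six counties from the head output shows the same effect; the names g6, mean_of_ratios and ratio_of_sums are introduced here for illustration, and the results apply to these six rows only, not to the full dataset.

```r
# Six counties copied from the head(gavote) output
g6 <- data.frame(
  votes   = c(6099, 2071, 2995, 1519, 12126, 4533),
  ballots = c(6617, 2149, 3347, 1607, 12785, 4773)
)

# Mean of the per-county relative undercounts (each county counts equally)
mean_of_ratios <- with(g6, mean((ballots - votes) / ballots))

# Overall relative undercount (large counties carry more weight)
ratio_of_sums <- with(g6, sum(ballots - votes) / sum(ballots))

c(mean_of_ratios = mean_of_ratios, ratio_of_sums = ratio_of_sums)
```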

We have used with to save the trouble of prefacing all the subsequent variables with gavote$. Graphical summaries are also valuable in gaining an understanding of the data. Considering just one variable at a time, histograms are a well-known way of examining the distribution of a variable:

hist(gavote$undercount,main="Undercount",xlab="Percent Undercount")

The plot is shown in the left panel of Figure 1.1. A histogram is a fairly crude estimate of the density of the variable that is sensitive to the choice of bins. A kernel density estimate can be viewed as a smoother version of a histogram that is also a superior estimate of the density. We have added a “rug” to our display that makes it possible to discern the individual data points:

plot(density(gavote$undercount),main="Undercount")

rug(gavote$undercount)
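The sensitivity of a histogram to its bins can be inspected numerically, without drawing anything. The sketch below uses a made-up vector x (invented for illustration, not the gavote undercounts) and the plot = FALSE option of hist; density is the same function used in the command above.

```r
# Made-up proportions, for illustration only; NOT the gavote values
x <- c(0.001, 0.021, 0.028, 0.033, 0.039, 0.042, 0.047, 0.058, 0.091, 0.188)

# Same data, two bin widths: the counts, and so the apparent shape, change
h_coarse <- hist(x, breaks = seq(0, 0.2, by = 0.1),  plot = FALSE)
h_fine   <- hist(x, breaks = seq(0, 0.2, by = 0.02), plot = FALSE)
h_coarse$counts
h_fine$counts

# Kernel density estimate; d$bw is the bandwidth picked by the default rule
d <- density(x)
d$bw
```

The bandwidth plays a role similar to the bin width: a larger value gives a smoother curve.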

We can see that the distribution is slightly skewed and that there are two outliers in the right tail of the distribution. Such plots are invaluable in detecting mistakes or unusual points in the data. Categorical variables can also be graphically displayed. The pie chart is a popular method. We demonstrate this on the types of voting equipment:

pie(table(gavote$equip),col=gray(0:4/4))

Figure 1.1 Histogram of the undercount is shown on the left and a density estimate with a data rug is shown on the right.

The plot is shown in the first panel of Figure 1.2. I have used shades of grey for the slices of the pie because this is a monochrome book. If you omit the col argument, you will see a color plot by default. Of course, a color plot is usually preferable, but bear in mind that some photocopying machines and many laser printers are black and white only, so a good greyscale plot is still needed. Alternatively, the Pareto chart is a bar plot with categories in descending order of frequency:

barplot(sort(table(gavote$equip),decreasing=TRUE),las=2)

The plot is shown in the second panel of Figure 1.2. The las=2 argument means that the bar labels are printed vertically as opposed to horizontally, ensuring that there is enough room for them to be seen. The Pareto chart (or just a bar plot) is superior to the pie chart because lengths are easier to judge than angles.
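The ordering behind the Pareto chart can be seen directly from the table. This sketch rebuilds the equipment factor from the counts reported in the summary output above (the names equip, counts and shares are introduced here for illustration):

```r
# Equipment types with the counts shown in the summary(gavote) output
equip <- factor(rep(c("LEVER", "OS-CC", "OS-PC", "PAPER", "PUNCH"),
                    times = c(74, 44, 22, 2, 17)))

# Frequencies in descending order: this is what barplot() draws
counts <- sort(table(equip), decreasing = TRUE)
counts

# Share of counties using each type of equipment
shares <- round(prop.table(counts), 3)
shares
```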

Two-dimensional plots are also very helpful. A scatterplot is the obvious way to depict two quantitative variables. Let’s see how the proportion voting for Gore relates to the proportion of African Americans:

gavote$pergore <- gavote$gore/gavote$votes

plot(pergore ~ perAA, gavote, xlab="Proportion African American", ylab

,→ ="Proportion for Gore")

The ,→ character just indicates that the command ran over onto a second line. Don’t type ,→ in R; just type the whole command on a single line without hitting return until the end. The plot, seen in the first panel of Figure 1.3, shows a strong correlation between these variables. This is an ecological correlation because the data points are aggregated across counties. The plot, in and of itself, does not prove that individual African Americans were more likely to vote for Gore, although we know this to be true from other sources. We could also compute the proportion of voters for Bush, but this is, not surprisingly, strongly negatively correlated with the proportion of voters for Gore. We do not need both variables as the one explains the other. We will use the
