*Practical Regression and Anova using R*, by Julian J. Faraway, was published in 2002. It is not an introductory text: the author assumes the reader already understands statistical inference (parameter estimation, hypothesis testing, confidence intervals), has some experience with data analysis, and has a working knowledge of linear algebra and calculus. The book centres on the practical application of R to regression analysis and analysis of variance (ANOVA). The goal is for readers to master the practical use of each method and to understand when to choose which technique; the book is rich in examples that show clearly how each technique is used and what it actually achieves. Topics include:
1. Linear regression: building and interpreting linear models in R, and understanding relationships between variables and their predictive power.
2. Multiple regression: studying the effect of several predictors on a response, including techniques such as stepwise regression and principal components regression.
3. Regularization: preventing overfitting and controlling model complexity, e.g. ridge regression and the LASSO.
4. Analysis of variance: one-way and multi-way ANOVA designs, follow-up tests, and interaction analysis.
5. Error models and mixed effects: handling random error and non-independent data, as in repeated measures or hierarchical designs.
6. Robustness: carrying out robust regression when the data violate normality or homogeneity of variance.
Through the book, readers can not only master R's statistical capabilities but also improve their problem-solving skills, learning to make sound statistical decisions on real problems. It is therefore a valuable reference for professionals who want to understand R's practical use in data analysis, and for R users who want to extend their existing skills.
1.3 History
We have added the $y = x$ (solid) line to the plot. Now a student scoring, say, one standard deviation above average on the midterm might reasonably expect to do equally well on the final. We compute the least squares regression fit and plot the regression line (more on the details later). We also compute the correlations.
> g <- lm(final ~ midterm, stat500)
> abline(g$coef, lty=5)
> cor(stat500)
midterm final hw total
midterm 1.00000 0.545228 0.272058 0.84446
final 0.54523 1.000000 0.087338 0.77886
hw 0.27206 0.087338 1.000000 0.56443
total 0.84446 0.778863 0.564429 1.00000
We see that the student scoring 1 SD above average on the midterm is predicted to score somewhat less above average on the final (see the dashed regression line): 0.54523 SDs above average, to be exact. Correspondingly, a student scoring below average on the midterm might expect to do relatively better in the final, although still below average.
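The prediction "1 SD above on the midterm predicts $r$ SDs above on the final" is just the fact that the regression slope between standardized variables equals the correlation. A minimal NumPy sketch (simulated data standing in for the stat500 dataset; the book itself uses R):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
midterm = rng.normal(size=n)
final = 0.5 * midterm + rng.normal(size=n)  # correlated scores

# standardize both variables to mean 0, SD 1
zm = (midterm - midterm.mean()) / midterm.std()
zf = (final - final.mean()) / final.std()

# least-squares slope of zf on zm equals the correlation r,
# so a student 1 SD above average is predicted to be r SDs above average
slope = (zm * zf).sum() / (zm * zm).sum()
r = np.corrcoef(midterm, final)[0, 1]
print(slope, r)  # the two numbers agree
```

This is exactly why the predicted final score regresses toward the mean whenever $|r| < 1$.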
If exams managed to measure the ability of students perfectly, then provided that ability remained un-
changed from midterm to final, we would expect to see a perfect correlation. Of course, it’s too much to
expect such a perfect exam and some variation is inevitably present. Furthermore, individual effort is not
constant. Getting a high score on the midterm can partly be attributed to skill but also a certain amount of
luck. One cannot rely on this luck to be maintained in the final. Hence we see the “regression to mediocrity”.
Of course this applies to any $(x, y)$ situation like this — an example is the so-called sophomore jinx
in sports when a rookie star has a so-so second season after a great first year. Although in the father-son
example, it does predict that successive descendants will come closer to the mean, it does not imply the
same of the population in general since random fluctuations will maintain the variation. In many other
applications of regression, the regression effect is not of interest so it is unfortunate that we are now stuck
with this rather misleading name.
Regression methodology developed rapidly with the advent of high-speed computing. Just fitting a regression model used to require extensive hand calculation. As computing hardware has improved, the scope for analysis has widened.
Chapter 2
Estimation
2.1 Example
Let’s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g.
Suppose that the predictors are

1. $X_1$ — the weight of the car
2. $X_2$ — the horse power
3. $X_3$ — the no. of cylinders.

$X_3$ is discrete but that's OK. Using country of origin, say, as a predictor would not be possible within the
current development (we will see how to do this later in the course). Typically the data will be available in
the form of an array like this
$$\begin{array}{cccc} y_1 & x_{11} & x_{12} & x_{13} \\ y_2 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ y_n & x_{n1} & x_{n2} & x_{n3} \end{array}$$
where n is the number of observations or cases in the dataset.
2.2 Linear Model
One very general form for the model would be
$$Y = f(X_1, X_2, X_3) + \varepsilon$$
where $f$ is some unknown function and $\varepsilon$ is the error in this representation, which is additive in this instance.
Since we usually don’t have enough data to try to estimate f directly, we usually have to assume that it has
some more restricted form, perhaps linear as in
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
where $\beta_i$, $i = 0, 1, 2, 3$ are unknown parameters. $\beta_0$ is called the intercept term. Thus the problem is reduced to the estimation of four values rather than the complicated infinite dimensional $f$.
In a linear model the parameters enter linearly — the predictors do not have to be linear. For example
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 \log X_2 + \varepsilon$$
is linear but
$$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \varepsilon$$
is not. Some relationships can be transformed to linearity — for example $y = \beta_0 x^{\beta_1} \varepsilon$ can be linearized by
taking logs. Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. Truly non-linear models are rarely absolutely necessary and most often arise from a theory about the relationships between the variables rather than from an empirical investigation.
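The linearization by logs mentioned above can be illustrated numerically. A hedged sketch in NumPy with simulated data (the book's own examples use R): taking logs of $y = \beta_0 x^{\beta_1} \varepsilon$ gives $\log y = \log \beta_0 + \beta_1 \log x + \log \varepsilon$, an ordinary linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
b0, b1 = 2.0, 1.5
x = rng.uniform(1.0, 10.0, size=n)
eps = np.exp(rng.normal(scale=0.1, size=n))  # multiplicative error
y = b0 * x**b1 * eps

# after taking logs, fit the linear model log y = log b0 + b1 log x + log eps
X = np.column_stack([np.ones(n), np.log(x)])
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(beta_hat[0]), beta_hat[1])  # estimates of b0 and b1
```

The fitted intercept and slope on the log scale recover $\beta_0$ (after exponentiating) and $\beta_1$ up to noise.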
2.3 Matrix Representation
Given the actual data, we may write
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i, \qquad i = 1, \ldots, n$$
but the use of subscripts becomes inconvenient and conceptually obscure. We will find it simpler both
notationally and theoretically to use a matrix/vector representation. The regression equation is written as
$$y = X\beta + \varepsilon$$
where $y = (y_1, \ldots, y_n)^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, $\beta = (\beta_0, \ldots, \beta_3)^T$ and
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{pmatrix}$$
The column of ones incorporates the intercept term. A couple of examples of using this notation are the simple no-predictor, mean-only model $y = \mu + \varepsilon$:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \mu + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
We can assume that $E\varepsilon = 0$ since if this were not so, we could simply absorb the non-zero expectation for the error into the mean $\mu$ to get a zero expectation. For the two sample problem with a treatment group having the response $y_1, \ldots, y_m$ with mean $\mu_y$ and control group having response $z_1, \ldots, z_n$ with mean $\mu_z$ we have
$$\begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \mu_y \\ \mu_z \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{m+n} \end{pmatrix}$$
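As a check on the two-sample design matrix, a small NumPy sketch (made-up numbers; the book's computations are in R) confirms that least squares applied to this 0/1 matrix recovers the two group means:

```python
import numpy as np

y = np.array([3.0, 4.0, 5.0])   # treatment group, m = 3
z = np.array([10.0, 12.0])      # control group, n = 2

# stack the responses and build the 0/1 design matrix
resp = np.concatenate([y, z])
X = np.zeros((5, 2))
X[:3, 0] = 1.0   # treatment indicator column
X[3:, 1] = 1.0   # control indicator column

beta_hat, *_ = np.linalg.lstsq(X, resp, rcond=None)
print(beta_hat)  # the two group means: 4 and 11
```

Fitting this model is therefore the same computation as taking the mean within each group.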
2.4 Estimating β
We have the regression equation $y = X\beta + \varepsilon$: what estimate of $\beta$ would best separate the systematic component $X\beta$ from the random component $\varepsilon$? Geometrically speaking, $y \in \mathbb{R}^n$ while $\beta \in \mathbb{R}^p$ where $p$ is the number of parameters (if we include the intercept then $p$ is the number of predictors plus one).
[Figure: the data vector $y$, in $n$ dimensions, is projected onto the space spanned by $X$; the fit lies in $p$ dimensions and the residual in $n - p$ dimensions.]
Figure 2.1: Geometric representation of the estimation of $\beta$. The data vector $y$ is projected orthogonally onto the model space spanned by $X$. The fit is represented by the projection $\hat{y} = X\hat{\beta}$, with the difference between the fit and the data represented by the residual vector $\hat{\varepsilon}$.
The problem is to find $\beta$ such that $X\beta$ is close to $y$. The best choice of $\hat{\beta}$ is apparent in the geometrical representation shown in Figure 2.1. $\hat{\beta}$ is in some sense the best estimate of $\beta$ within the model space. The response predicted by the model is $\hat{y} = X\hat{\beta}$ or $Hy$, where $H$ is an orthogonal projection matrix. The difference between the actual response and the predicted response is denoted by $\hat{\varepsilon}$ — the residuals.
The conceptual purpose of the model is to represent, as accurately as possible, something complex — $y$, which is $n$-dimensional — in terms of something much simpler — the model, which is $p$-dimensional. Thus if our model is successful, the structure in the data should be captured in those $p$ dimensions, leaving just random variation in the residuals, which lie in an $(n - p)$-dimensional space. We have
$$\underbrace{\text{Data}}_{n \text{ dimensions}} = \underbrace{\text{Systematic Structure}}_{p \text{ dimensions}} + \underbrace{\text{Random Variation}}_{(n-p) \text{ dimensions}}$$
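One way to see the dimension count concretely is that the projection onto the model space has rank $p$ while the residual projection has rank $n - p$. A NumPy sketch with an arbitrary full-rank design matrix (an illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.normal(size=(n, p))   # a full-rank n x p design matrix

# orthogonal projection onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

# the model space has p dimensions, the residual space n - p
rank_fit = np.linalg.matrix_rank(H)
rank_resid = np.linalg.matrix_rank(np.eye(n) - H)
print(rank_fit, rank_resid)  # → 4 16
```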
2.5 Least squares estimation
The estimation of $\beta$ can be considered from a non-geometric point of view. We might define the best estimate of $\beta$ as that which minimizes the sum of the squared errors, $\varepsilon^T \varepsilon$. That is to say that the least squares estimate of $\beta$, called $\hat{\beta}$, minimizes
$$\sum \varepsilon_i^2 = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta)$$
Expanding this out, we get
$$y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$
Differentiating with respect to $\beta$ and setting to zero, we find that $\hat{\beta}$ satisfies
$$X^T X \hat{\beta} = X^T y$$
These are called the normal equations. We can derive the same result using the geometric approach. Now provided $X^T X$ is invertible,
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
$$X\hat{\beta} = X (X^T X)^{-1} X^T y = Hy$$
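The normal equations can be solved directly; a sketch in NumPy with simulated data (in practice R's `lm`, or a QR-based solver, is preferable to forming $X^T X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# the same answer from a numerically preferable least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)  # the two solutions agree
```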
$H = X (X^T X)^{-1} X^T$ is called the "hat-matrix" and is the orthogonal projection of $y$ onto the space spanned by $X$. $H$ is useful for theoretical manipulations, but you usually don't want to compute it explicitly as it is an $n \times n$ matrix.
Predicted values: $\hat{y} = Hy = X\hat{\beta}$.
Residuals: $\hat{\varepsilon} = y - X\hat{\beta} = y - \hat{y} = (I - H)y$.
Residual sum of squares: $\hat{\varepsilon}^T \hat{\varepsilon} = y^T (I - H)(I - H) y = y^T (I - H) y$.
Later we will show that the least squares estimate is the best possible estimate of $\beta$ when the errors $\varepsilon$ are uncorrelated and have equal variance, i.e. $\operatorname{var} \varepsilon = \sigma^2 I$.
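The projection properties of $H$, and the residual sum of squares identity above, are easy to verify numerically. A NumPy sketch with simulated data (forming $H$ is fine for a small demo, even though one would avoid it for large $n$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                   # fitted values
resid = (np.eye(n) - H) @ y     # residuals
rss = y @ (np.eye(n) - H) @ y   # y^T (I - H) y

# H is a symmetric, idempotent projection, and the RSS identity holds
print(np.allclose(H, H.T), np.allclose(H @ H, H))
print(np.isclose(rss, resid @ resid))
```

The idempotency $H H = H$ is exactly what collapses $y^T (I - H)(I - H) y$ to $y^T (I - H) y$.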
2.6 Examples of calculating $\hat{\beta}$
1. When $y = \mu + \varepsilon$, $X = \mathbf{1}$ and $\beta = \mu$, so $X^T X = \mathbf{1}^T \mathbf{1} = n$ and
$$\hat{\beta} = (X^T X)^{-1} X^T y = \frac{1}{n} \mathbf{1}^T y = \bar{y}$$
2. Simple linear regression (one predictor):
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
We can now apply the formula, but a simpler approach is to rewrite the equation as
$$y_i = \underbrace{\alpha + \beta\bar{x}}_{\alpha'} + \beta(x_i - \bar{x}) + \varepsilon_i$$
so now
$$X = \begin{pmatrix} 1 & x_1 - \bar{x} \\ \vdots & \vdots \\ 1 & x_n - \bar{x} \end{pmatrix} \qquad X^T X = \begin{pmatrix} n & 0 \\ 0 & \sum_{i=1}^n (x_i - \bar{x})^2 \end{pmatrix}$$
Now work through the rest of the calculation to reconstruct the familiar estimates, i.e.
$$\hat{\beta} = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2}$$
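The explicit slope formula can be checked against a generic least-squares fit. A NumPy sketch with simulated data (the book would do this with R's `lm`):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# explicit formula: beta_hat = sum((x_i - xbar) y_i) / sum((x_i - xbar)^2)
xc = x - x.mean()
beta_hat = (xc * y).sum() / (xc * xc).sum()

# compare with fitting the (intercept, slope) model by least squares
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, coef[1])  # the two slopes agree
```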
In higher dimensions, it is usually not possible to find such explicit formulae for the parameter estimates unless $X^T X$ happens to have a simple form.
2.7 Why is $\hat{\beta}$ a good estimate?
1. It results from an orthogonal projection onto the model space. It makes sense geometrically.
2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator. Loosely put, the maximum likelihood estimate is the value of $\beta$ that maximizes the probability of the data that was observed.
3. The Gauss-Markov theorem states that it is the best linear unbiased estimate (BLUE).