*Practical Regression and Anova using R*, by Julian J. Faraway, was published in 2002. It is not an introductory text: the author assumes the reader already understands statistical inference (parameter estimation, hypothesis testing, confidence intervals), has some experience with data analysis, and has a working knowledge of linear algebra and calculus. The book centres on the practical application of R to regression analysis and analysis of variance (ANOVA). The goal is for readers to master the practical use of each method and to understand when to choose which technique; the book is rich in examples that show clearly how each technique is used and what it actually achieves. Topics include:
1. Linear regression: building and interpreting linear models in R, and understanding relationships between variables and their predictive power.
2. Multiple regression: studying the effect of several predictors on a response, including techniques such as stepwise regression and principal components regression.
3. Regularization: preventing overfitting and controlling model complexity, e.g. ridge regression and the LASSO.
4. Analysis of variance: one-way and multi-way ANOVA designs, follow-up tests, and interaction analysis.
5. Error models and mixed effects: handling random error and non-independent data, as in repeated measures or hierarchical designs.
6. Robustness: carrying out robust regression when the data violate normality or homogeneity of variance.
Through the book, readers can not only master R's statistical capabilities but also improve their problem-solving skills, learning to make sound statistical decisions on real problems. It is therefore a valuable reference for professionals who want to understand R's practical use in data analysis, and for R users who want to extend their existing skills.
1.3 History
We have added the $y = x$ (solid) line to the plot. Now a student scoring, say, one standard deviation above average on the midterm might reasonably expect to do equally well on the final. We compute the least squares regression fit and plot the regression line (more on the details later). We also compute the correlations.
> g <- lm(final ~ midterm, stat500)
> abline(g$coef, lty=5)
> cor(stat500)
midterm final hw total
midterm 1.00000 0.545228 0.272058 0.84446
final 0.54523 1.000000 0.087338 0.77886
hw 0.27206 0.087338 1.000000 0.56443
total 0.84446 0.778863 0.564429 1.00000
We see that the student scoring 1 SD above average on the midterm is predicted to score somewhat less above average on the final (see the dashed regression line): 0.54523 SDs above average, to be exact. Correspondingly, a student scoring below average on the midterm might expect to do relatively better in the final, although still below average.
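The prediction "1 SD above on the midterm predicts $r$ SDs above on the final" is just the fact that the regression slope between standardized variables equals the correlation. A minimal NumPy sketch (simulated data standing in for the stat500 dataset; the book itself uses R):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
midterm = rng.normal(size=n)
final = 0.5 * midterm + rng.normal(size=n)  # correlated scores

# standardize both variables to mean 0, SD 1
zm = (midterm - midterm.mean()) / midterm.std()
zf = (final - final.mean()) / final.std()

# least-squares slope of zf on zm equals the correlation r,
# so a student 1 SD above average is predicted to be r SDs above average
slope = (zm * zf).sum() / (zm * zm).sum()
r = np.corrcoef(midterm, final)[0, 1]
print(slope, r)  # the two numbers agree
```

This is exactly why the predicted final score regresses toward the mean whenever $|r| < 1$.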
If exams managed to measure the ability of students perfectly, then provided that ability remained un-
changed from midterm to final, we would expect to see a perfect correlation. Of course, it’s too much to
expect such a perfect exam and some variation is inevitably present. Furthermore, individual effort is not
constant. Getting a high score on the midterm can partly be attributed to skill but also a certain amount of
luck. One cannot rely on this luck to be maintained in the final. Hence we see the “regression to mediocrity”.
Of course this applies to any $(x, y)$ situation like this — an example is the so-called sophomore jinx
in sports when a rookie star has a so-so second season after a great first year. Although in the father-son
example, it does predict that successive descendants will come closer to the mean, it does not imply the
same of the population in general since random fluctuations will maintain the variation. In many other
applications of regression, the regression effect is not of interest so it is unfortunate that we are now stuck
with this rather misleading name.
Regression methodology developed rapidly with the advent of high-speed computing. Just fitting a regression model used to require extensive hand calculation. As computing hardware has improved, the scope for analysis has widened.
Chapter 2
Estimation
2.1 Example
Let’s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g.
Suppose that the predictors are

1. $X_1$ — the weight of the car
2. $X_2$ — the horse power
3. $X_3$ — the no. of cylinders.

$X_3$ is discrete but that's OK. Using country of origin, say, as a predictor would not be possible within the
current development (we will see how to do this later in the course). Typically the data will be available in
the form of an array like this
$$\begin{array}{cccc} y_1 & x_{11} & x_{12} & x_{13} \\ y_2 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ y_n & x_{n1} & x_{n2} & x_{n3} \end{array}$$
where n is the number of observations or cases in the dataset.
2.2 Linear Model
One very general form for the model would be
$$Y = f(X_1, X_2, X_3) + \varepsilon$$
where $f$ is some unknown function and $\varepsilon$ is the error in this representation, which is additive in this instance.
Since we usually don’t have enough data to try to estimate f directly, we usually have to assume that it has
some more restricted form, perhaps linear as in
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
where $\beta_i$, $i = 0, 1, 2, 3$ are unknown parameters. $\beta_0$ is called the intercept term. Thus the problem is reduced to the estimation of four values rather than the complicated infinite dimensional $f$.
In a linear model the parameters enter linearly — the predictors do not have to be linear. For example
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 \log X_2 + \varepsilon$$
is linear but
$$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \varepsilon$$
is not. Some relationships can be transformed to linearity — for example $y = \beta_0 x^{\beta_1} \varepsilon$ can be linearized by
taking logs. Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. Truly non-linear models are rarely absolutely necessary and most often arise from a theory about the relationships between the variables rather than from an empirical investigation.
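The linearization by logs mentioned above can be illustrated numerically. A hedged sketch in NumPy with simulated data (the book's own examples use R): taking logs of $y = \beta_0 x^{\beta_1} \varepsilon$ gives $\log y = \log \beta_0 + \beta_1 \log x + \log \varepsilon$, an ordinary linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
b0, b1 = 2.0, 1.5
x = rng.uniform(1.0, 10.0, size=n)
eps = np.exp(rng.normal(scale=0.1, size=n))  # multiplicative error
y = b0 * x**b1 * eps

# after taking logs, fit the linear model log y = log b0 + b1 log x + log eps
X = np.column_stack([np.ones(n), np.log(x)])
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(beta_hat[0]), beta_hat[1])  # estimates of b0 and b1
```

The fitted intercept and slope on the log scale recover $\beta_0$ (after exponentiating) and $\beta_1$ up to noise.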
2.3 Matrix Representation
Given the actual data, we may write
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i, \qquad i = 1, \ldots, n$$
but the use of subscripts becomes inconvenient and conceptually obscure. We will find it simpler both
notationally and theoretically to use a matrix/vector representation. The regression equation is written as
$$y = X\beta + \varepsilon$$
where $y = (y_1, \ldots, y_n)^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, $\beta = (\beta_0, \ldots, \beta_3)^T$ and
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{pmatrix}$$
The column of ones incorporates the intercept term. A couple of examples of using this notation are the simple no-predictor, mean-only model $y = \mu + \varepsilon$:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \mu + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
We can assume that $E\varepsilon = 0$ since if this were not so, we could simply absorb the non-zero expectation for the error into the mean $\mu$ to get a zero expectation. For the two sample problem with a treatment group having the response $y_1, \ldots, y_m$ with mean $\mu_y$ and control group having response $z_1, \ldots, z_n$ with mean $\mu_z$ we have
$$\begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \mu_y \\ \mu_z \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{m+n} \end{pmatrix}$$
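As a check on the two-sample design matrix, a small NumPy sketch (made-up numbers; the book's computations are in R) confirms that least squares applied to this 0/1 matrix recovers the two group means:

```python
import numpy as np

y = np.array([3.0, 4.0, 5.0])   # treatment group, m = 3
z = np.array([10.0, 12.0])      # control group, n = 2

# stack the responses and build the 0/1 design matrix
resp = np.concatenate([y, z])
X = np.zeros((5, 2))
X[:3, 0] = 1.0   # treatment indicator column
X[3:, 1] = 1.0   # control indicator column

beta_hat, *_ = np.linalg.lstsq(X, resp, rcond=None)
print(beta_hat)  # the two group means: 4 and 11
```

Fitting this model is therefore the same computation as taking the mean within each group.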
2.4 Estimating β
We have the regression equation $y = X\beta + \varepsilon$: what estimate of $\beta$ would best separate the systematic component $X\beta$ from the random component $\varepsilon$? Geometrically speaking, $y \in \mathbb{R}^n$ while $\beta \in \mathbb{R}^p$ where $p$ is the number of parameters (if we include the intercept then $p$ is the number of predictors plus one).
[Figure: the data vector $y$, in $n$ dimensions, is projected onto the space spanned by $X$; the fit lies in $p$ dimensions and the residual in $n - p$ dimensions.]
Figure 2.1: Geometric representation of the estimation of $\beta$. The data vector $y$ is projected orthogonally onto the model space spanned by $X$. The fit is represented by the projection $\hat{y} = X\hat{\beta}$, with the difference between the fit and the data represented by the residual vector $\hat{\varepsilon}$.
The problem is to find $\beta$ such that $X\beta$ is close to $y$. The best choice of $\hat{\beta}$ is apparent in the geometrical representation shown in Figure 2.1. $\hat{\beta}$ is in some sense the best estimate of $\beta$ within the model space. The response predicted by the model is $\hat{y} = X\hat{\beta}$ or $Hy$, where $H$ is an orthogonal projection matrix. The difference between the actual response and the predicted response is denoted by $\hat{\varepsilon}$ — the residuals.
The conceptual purpose of the model is to represent, as accurately as possible, something complex — $y$, which is $n$-dimensional — in terms of something much simpler — the model, which is $p$-dimensional. Thus if our model is successful, the structure in the data should be captured in those $p$ dimensions, leaving just random variation in the residuals, which lie in an $(n - p)$-dimensional space. We have
$$\underbrace{\text{Data}}_{n \text{ dimensions}} = \underbrace{\text{Systematic Structure}}_{p \text{ dimensions}} + \underbrace{\text{Random Variation}}_{(n-p) \text{ dimensions}}$$
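One way to see the dimension count concretely is that the projection onto the model space has rank $p$ while the residual projection has rank $n - p$. A NumPy sketch with an arbitrary full-rank design matrix (an illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.normal(size=(n, p))   # a full-rank n x p design matrix

# orthogonal projection onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

# the model space has p dimensions, the residual space n - p
rank_fit = np.linalg.matrix_rank(H)
rank_resid = np.linalg.matrix_rank(np.eye(n) - H)
print(rank_fit, rank_resid)  # → 4 16
```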
2.5 Least squares estimation
The estimation of $\beta$ can be considered from a non-geometric point of view. We might define the best estimate of $\beta$ as that which minimizes the sum of the squared errors, $\varepsilon^T \varepsilon$. That is to say that the least squares estimate of $\beta$, called $\hat{\beta}$, minimizes
$$\sum \varepsilon_i^2 = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta)$$
Expanding this out, we get
$$y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$
Differentiating with respect to $\beta$ and setting to zero, we find that $\hat{\beta}$ satisfies
$$X^T X \hat{\beta} = X^T y$$
These are called the normal equations. We can derive the same result using the geometric approach. Now provided $X^T X$ is invertible,
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
$$X\hat{\beta} = X (X^T X)^{-1} X^T y = Hy$$
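The normal equations can be solved directly; a sketch in NumPy with simulated data (in practice R's `lm`, or a QR-based solver, is preferable to forming $X^T X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# the same answer from a numerically preferable least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)  # the two solutions agree
```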
$H = X (X^T X)^{-1} X^T$ is called the "hat-matrix" and is the orthogonal projection of $y$ onto the space spanned by $X$. $H$ is useful for theoretical manipulations, but you usually don't want to compute it explicitly as it is an $n \times n$ matrix.
Predicted values: $\hat{y} = Hy = X\hat{\beta}$.
Residuals: $\hat{\varepsilon} = y - X\hat{\beta} = y - \hat{y} = (I - H)y$.
Residual sum of squares: $\hat{\varepsilon}^T \hat{\varepsilon} = y^T (I - H)(I - H) y = y^T (I - H) y$.
Later we will show that the least squares estimate is the best possible estimate of $\beta$ when the errors $\varepsilon$ are uncorrelated and have equal variance, i.e. $\operatorname{var} \varepsilon = \sigma^2 I$.
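The projection properties of $H$, and the residual sum of squares identity above, are easy to verify numerically. A NumPy sketch with simulated data (forming $H$ is fine for a small demo, even though one would avoid it for large $n$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                   # fitted values
resid = (np.eye(n) - H) @ y     # residuals
rss = y @ (np.eye(n) - H) @ y   # y^T (I - H) y

# H is a symmetric, idempotent projection, and the RSS identity holds
print(np.allclose(H, H.T), np.allclose(H @ H, H))
print(np.isclose(rss, resid @ resid))
```

The idempotency $H H = H$ is exactly what collapses $y^T (I - H)(I - H) y$ to $y^T (I - H) y$.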
2.6 Examples of calculating $\hat{\beta}$
1. When $y = \mu + \varepsilon$, $X = \mathbf{1}$ and $\beta = \mu$, so $X^T X = \mathbf{1}^T \mathbf{1} = n$ and
$$\hat{\beta} = (X^T X)^{-1} X^T y = \frac{1}{n} \mathbf{1}^T y = \bar{y}$$
2. Simple linear regression (one predictor):
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
We can now apply the formula, but a simpler approach is to rewrite the equation as
$$y_i = \underbrace{\alpha + \beta\bar{x}}_{\alpha'} + \beta(x_i - \bar{x}) + \varepsilon_i$$
so now
$$X = \begin{pmatrix} 1 & x_1 - \bar{x} \\ \vdots & \vdots \\ 1 & x_n - \bar{x} \end{pmatrix} \qquad X^T X = \begin{pmatrix} n & 0 \\ 0 & \sum_{i=1}^n (x_i - \bar{x})^2 \end{pmatrix}$$
Now work through the rest of the calculation to reconstruct the familiar estimates, i.e.
$$\hat{\beta} = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2}$$
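The explicit slope formula can be checked against a generic least-squares fit. A NumPy sketch with simulated data (the book would do this with R's `lm`):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# explicit formula: beta_hat = sum((x_i - xbar) y_i) / sum((x_i - xbar)^2)
xc = x - x.mean()
beta_hat = (xc * y).sum() / (xc * xc).sum()

# compare with fitting the (intercept, slope) model by least squares
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, coef[1])  # the two slopes agree
```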
In higher dimensions, it is usually not possible to find such explicit formulae for the parameter estimates unless $X^T X$ happens to have a simple form.
2.7 Why is $\hat{\beta}$ a good estimate?
1. It results from an orthogonal projection onto the model space. It makes sense geometrically.
2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator. Loosely put, the maximum likelihood estimate is the value of $\beta$ that maximizes the probability of the data that was observed.
3. The Gauss-Markov theorem states that it is the best linear unbiased estimate (BLUE).