Statistics in the Big Data Era: ANOVA Design and Linear Regression Model Analysis

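"Analysis of Variance, Design, and Regression: Linear Modeling for Unbalanced Data, Second Edition, by Ronald Christensen, presents tools for analyzing unbalanced data and is written to help readers understand key statistical concepts in the big data era. As the electronic revolution accelerates, our ability to acquire data has grown exponentially. The main challenge today is no longer a scarcity of data but how to pick out the valuable information in massive amounts of it. This usually involves one of two approaches: a crude analysis of a large amount of data, or a careful analysis of a selected, meaningful, moderate-sized data set. Because large-scale data cannot be handled with the same care as a small amount of data, 'crude' is not pejorative here but a realistic assessment. The book's chapters cover fundamentals such as probability theory, random variables and their expected values, variances, and correlation. The author first introduces the basic concepts of probability, including the importance of expected values and variances, and the use of Chebyshev's inequality, which quantifies the uncertainty in a random variable's distribution. Next comes a discussion of covariance and the correlation coefficient, the key statistics for measuring the relationship between two random variables and for understanding the internal structure of a data set. In the part on continuous distributions, readers learn the theory and application of common distributions such as the normal and the uniform. The book also treats the binomial distribution, a discrete probability distribution widely used to describe combinations of success and failure events, as an extension of Bernoulli trials. The connection between Poisson sampling and the binomial distribution is examined in depth as well. Statistical models for the multivariate case, such as the multivariate normal and multinomial distributions, also feature prominently. The properties of independent Poissons and of the multinomial distribution are the basis for understanding complex experimental designs and classification problems. By reading this book, readers not only master the specific methods of analysis of variance, design, and linear regression modeling, but also learn how to use these tools in a big data setting to uncover and explain the meaning behind the data. For data analysts, statisticians, and readers interested in big data analysis, this is a practical and in-depth reference that helps them work more effectively in an age of information overload."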

The analysis of covariance chapter no longer includes an extensive discussion of how the covari-

ates must be chosen to maintain a valid experiment. That discussion has been moved to the chapter

Basic Experimental Designs. Tukey’s one degree of freedom test for nonadditivity is presented as

a test for the need to perform a power transformation rather than as a test for a particular type of

interaction. Tukey’s test is now part of the Model Checking chapter, not the ACOVA chapter.

The chapter on confounding and fractional replication has more discussion of analyzing such

data than many other books contain.

Acknowledgements

Many people provided comments that helped in writing this book. My colleagues Ed Bedrick,

Aparna Huzurbazar, Wes Johnson, Bert Koopmans, Frank Martin, Tim O’Brien, and Cliff Qualls

helped a lot. I got numerous valuable comments from my students at the University of New Mex-

ico. Marjorie Bond, Matt Cooney, Jeff S. Davis, Barbara Evans, Mike Fugate, Jan Mines, and Jim

Shields stand out in this regard. The book had several anonymous reviewers, some of whom made

excellent suggestions.

I would like to thank Martin Gilchrist and Springer-Verlag for permission to reproduce Exam-

ple 7.6.1 from Plane Answers to Complex Questions: The Theory of Linear Models. I also thank

the Biometrika Trustees for permission to use the tables in Appendix B.5. Professor John Deely and

the University of Canterbury in New Zealand were kind enough to support completion of the book

during my sabbatical there.

Now my only question is what to do with the chapters on quality control, $p^n$ factorials, and

response surfaces that ended up on the cutting room floor. I have pretty much given up on publishing

the quality control material. Response surfaces got into Advanced Linear Modeling (ALM) and I’m

hoping to get $p^n$ factorials into a new edition of ALM.

Ronald Christensen

Albuquerque, New Mexico

February 1996

Edited, October 2014

Computing

There are two aspects to computing: generating output and interpreting output. We cannot always

control the generation of output, so we need to be able to interpret a variety of outputs. The book

places great emphasis on interpreting the range of output that one might encounter when dealing

with the data structures in the book. This comes up most forcefully when dealing with multiple

categorical predictors because arbitrary choices must be made by computer programmers to produce

some output, e.g., parameter estimates. The book deals with the arbitrary choices that are most

commonly made. Methods for generating output have, for the most part, been removed from the

book and placed on my website.

R has taken over the Statistics computing world. While R code is in the book, illustrations

of all the analyses and all of the graphics have been performed in R and are available on

my website: www.stat.unm.edu/~fletcher. Also, substantial bodies of Minitab and SAS code

(particularly for SAS’s GENMOD and LOGISTIC procedures) are available on my website. While

Minitab and many versions of SAS are now menu driven, the menus essentially write the code for

running a procedure. Presenting the code provides the information needed by the programs and,

implicitly, the information needed in the menus. That information is largely the same regardless of

the program. The choices of R, Minitab, and SAS are not meant to denigrate any other software.

They are merely what I am most familiar with.

The online computing aids are chapter for chapter (and for the most part, section for section)

images of the book. Thus, if you want help computing something from Section 2.5 of the book, look

in Section 2.5 of the online material.

My strong personal preference is for doing whatever I can in Minitab. That is largely because

Minitab forces me to remember fewer arcane commands than any other system (that I am familiar

with). Data analysis output from Minitab is discussed in the book because it differs from the output

provided by R and SAS. For fitting large tables of counts, as discussed in Chapter 21, I highly

recommend the program BMDP 4F. Fortunately, this can now be accessed through some batch

versions of SAS. My website contains files for virtually all the data. But you need to compare

each file to the tabled data and not just assume that the file looks exactly like the table.

Finally, I would like to point out a notational issue. In both Minitab and SAS, “glm” refers

to fitting general linear models. In R, “glm” refers to fitting generalized linear models, which are

something different. Generalized linear models contain general linear models as a special case. The

models in Chapters 20, 21, and 22 are different special cases of generalized linear models. (I am not

convinced that generalized linear models are anything more than a series of special cases connected

by a remarkable computing trick, cf. Christensen, 1997, Chapter 9.)

BMDP Statistical Software was located at 1440 Sepulveda Boulevard, Los Angeles, CA 90025.

MINITAB is a registered trademark of Minitab, Inc., 3081 Enterprise Drive, State College, PA

16801, telephone: (814) 238-3280, telex: 881612.

Chapter 1

Introduction

Statistics has two roles in society. First, Statistics is in the business of creating stereotypes. Think of

any stereotype you like, but to keep me out of trouble let’s consider something innocuous, like the

hypothesis that Italians talk with their hands more than Scandinavians. To establish the stereotype,

you need to collect data and use it to draw a conclusion. Often the conclusion is that either the

data suggest a difference or that they do not. The conclusion is (almost) never whether a difference

actually exists, only whether or not the data suggest a difference and how strongly they suggest it.

Statistics has been filling this role in society for at least 100 years.

Statistics’ less recognized second role in society is debunking stereotypes. Statistics is about

appreciating variability. It is about understanding variability, explaining it, and controlling it. I ex-

pect that with enough data, one could show that, on average, Italians really do talk with their hands

more than Scandinavians. Collecting a lot of data helps control the relevant variability and allows

us to draw a conclusion. But I also expect that we will never be able to predict accurately whether a

random Italian will talk with their hands more than a random Scandinavian. There is too much vari-

ability among humans. Even when differences among groups exist, those differences often pale in

comparison to the variability displayed by individuals within the groups—to the point where group

differences are often meaningless when dealing with individuals. For statements about individuals,

collecting a lot of data only helps us to more accurately state the limits of our (very considerable)

uncertainty.
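
The contrast between group differences and individual variability can be made concrete with a small calculation. The sketch below is only an illustration under loudly stated assumptions: the "gesture score", the two group means, the common standard deviation, the sample size, and the normal model are all hypothetical and do not come from the book.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers chosen only for illustration (not from the book):
# a "gesture score" that is normal with the same spread in both groups.
MEAN_A = 10.2   # assumed mean for group A
MEAN_B = 10.0   # assumed mean for group B
SD = 1.0        # assumed within-group standard deviation
N = 10_000      # assumed number of people sampled from each group


def prob_individual_a_exceeds_b(mean_a, mean_b, sd):
    """P(X > Y) for independent X ~ N(mean_a, sd^2) and Y ~ N(mean_b, sd^2)."""
    # X - Y is normal with mean (mean_a - mean_b) and standard deviation sd*sqrt(2).
    difference = NormalDist(mean_a - mean_b, sd * sqrt(2))
    return 1.0 - difference.cdf(0.0)


def mean_difference_in_standard_errors(mean_a, mean_b, sd, n):
    """True mean difference divided by the standard error of the difference in sample means."""
    return (mean_a - mean_b) / (sd * sqrt(2.0 / n))


p_individual = prob_individual_a_exceeds_b(MEAN_A, MEAN_B, SD)
z_groups = mean_difference_in_standard_errors(MEAN_A, MEAN_B, SD, N)

print(f"P(random member of A exceeds random member of B) = {p_individual:.3f}")
print(f"group mean difference in standard errors = {z_groups:.1f}")
```

Under these assumed numbers the group difference is many standard errors from zero, so a large sample would detect it easily, yet the chance that a random member of group A exceeds a random member of group B is only a little above one half.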

Ultimately, Statistics is about what you can conclude and, equally, what you cannot conclude

from analyzing data that are subject to variability, as all data are. Statisticians use ideas from prob-

ability to quantify variability. They typically analyze data by creating probability models for the

data.

In this chapter we introduce basic ideas of probability and some related mathematical concepts

that are used in Statistics. Values to be analyzed statistically are generally thought of as random

variables; these are numbers that result from random events. The mean (average) value of a

population is defined in terms of the expected value of a random variable. The variance is introduced

as a measure of the variability in a random variable (population). We also introduce some special

distributions (populations) that are useful in modeling statistical data. The purpose of this chapter is

to introduce these ideas, so they can be used in analyzing data and in discussing statistical models.

In writing statistical models, we often use symbols from the Greek alphabet. A table of these

symbols is provided in Appendix B.6.

Rumor has it that there are some students studying Statistics who have an aversion to mathemat-

ics. Such people might be wise to focus on the concepts of this chapter and not let themselves get

bogged down in the details. The details are given to provide a more complete introduction for those

students who are not math averse.

1.1 Probability

Probabilities are numbers between zero and one that are used to explain random phenomena. We are

all familiar with simple probability models. Flip a standard coin; the probability of heads is 1/2. Roll

a die; the probability of getting a three is 1/6. Select a card from a well-shuffled deck; the probability

of getting the queen of spades is 1/52 (assuming there are no jokers). One way to view probability

models that many people find intuitive is in terms of random sampling from a fixed population.

For example, the 52 cards form a fixed population and picking a card from a well-shuffled deck is

a means of randomly selecting one element of the population. While we will exploit this idea of

sampling from fixed populations, we should also note its limitations. For example, blood pressure is

a very useful medical indicator, but even with a fixed population of people it would be very difficult

to define a useful population of blood pressures. Blood pressure depends on the time of day, recent

diet, current emotional state, the technique of the person taking the reading, and many other factors.

Thinking about populations is very useful, but the concept can be very limiting both practically and

mathematically. For measurements such as blood pressures and heights, there are difficulties in even

specifying populations mathematically.
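
The idea of random sampling from a fixed population can be sketched in a few lines. This is a minimal illustration, not anything from the book: the names `DECK`, `probability`, and `simulate` are hypothetical, and the simulation seed and number of draws are arbitrary choices.

```python
import random
from fractions import Fraction

# The fixed population: a standard 52-card deck with no jokers.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def probability(event, population):
    """Exact probability of an event when every element is equally likely to be drawn."""
    favorable = sum(1 for item in population if event(item))
    return Fraction(favorable, len(population))


def simulate(event, population, draws, seed=0):
    """Relative frequency of the event over repeated random draws (with replacement)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(draws) if event(rng.choice(population)))
    return hits / draws


def is_queen_of_spades(card):
    return card == ("Q", "spades")


p_exact = probability(is_queen_of_spades, DECK)
p_simulated = simulate(is_queen_of_spades, DECK, draws=100_000)

print(f"exact probability: {p_exact}")
print(f"simulated relative frequency: {p_simulated:.4f}")
```

The exact calculation works only because the population can be listed. For something like blood pressure there is no such list to enumerate, which is the limitation described above.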

For mathematical reasons, probabilities are defined not on particular outcomes but on sets of

outcomes (events). This is done so that continuous measurements can be dealt with. It seems much

more natural to define probabilities on outcomes as we did in the previous paragraph, but consider

some of the problems with doing that. For example, consider the problem of measuring the height of

a corpse being kept in a morgue under controlled conditions. The only reason for getting morbid here

is to have some hope of defining what the height is. Living people, to some extent, stretch and

contract, so a height is a nebulous thing. But even given that someone has a fixed height, we can never

know what it is. When someone’s height is measured as 177.8 centimeters (5 feet 10 inches), their

height is not really 177.8 centimeters, but (hopefully) somewhere between 177.75 and 177.85 centimeters.

There is really no chance that anyone’s height is exactly 177.8 cm, or exactly 177.8001 cm,

or exactly 177.800000001 cm, or exactly $56.5955\pi$ cm, or exactly $(76\sqrt{5} + 4.5\sqrt{3})$ cm. In any

neighborhood of 177.8, there are more numerical values than one could even imagine counting. The

height should be somewhere in the neighborhood, but it won’t be the particular value 177.8. The

point is simply that trying to specify all the possible heights and their probabilities is a hopeless

exercise. It simply cannot be done.

Even though individual heights cannot be measured exactly, when looking at a population of

heights they follow certain patterns. There are not too many people over 8 feet (244 cm) tall. There

are lots of males between 175.3 cm and 177.8 cm (5′9″ and 5′10″). With continuous values, each

possible outcome has no chance of occurring, but outcomes do occur and occur with regularity. If

probabilities are defined for sets instead of outcomes, these regularities can be reproduced

mathematically. Nonetheless, initially the best way to learn about probabilities is to think about outcomes

and their probabilities.
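
To see why probabilities on sets behave sensibly for continuous measurements, consider the following sketch. It assumes, purely for illustration, that heights follow a normal distribution with mean 175 cm and standard deviation 7 cm; those parameters and the name `interval_probability` are hypothetical and not taken from the book.

```python
from statistics import NormalDist

# Assumption for illustration only: heights ~ Normal(mean 175 cm, sd 7 cm).
HEIGHTS = NormalDist(mu=175.0, sigma=7.0)


def interval_probability(dist, low, high):
    """Probability that a value from `dist` falls in the interval (low, high]."""
    return dist.cdf(high) - dist.cdf(low)


# Shrinking intervals around 177.8 cm: the probability heads toward zero.
half_widths = [0.05, 0.005, 0.0005, 0.00005]
shrinking = [
    interval_probability(HEIGHTS, 177.8 - w, 177.8 + w) for w in half_widths
]
for w, p in zip(half_widths, shrinking):
    print(f"P(177.8 - {w} < height <= 177.8 + {w}) = {p:.2e}")

# A wider set of outcomes has a probability that is far from negligible.
p_band = interval_probability(HEIGHTS, 175.3, 177.8)
print(f"P(175.3 < height <= 177.8) = {p_band:.3f}")
```

Each tenfold narrowing of the interval cuts the probability roughly tenfold, which is the sense in which any single exact height has no chance of occurring while intervals of heights still carry meaningful probability.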

There are five key facts about probabilities:

1. Probabilities are between 0 and 1.

2. Something that happens with probability 1 is a sure thing.

3. If something has no chance of occurring, it has probability 0.

4. If something occurs with probability, say, .25, the probability that it will not occur is 1 − .25 = .75.

5. If two events are mutually exclusive, i.e., if they cannot possibly happen at the same time, then

the probability that either of them occurs is just the sum of their individual probabilities.

Individual outcomes are always mutually exclusive, e.g., you cannot flip a coin and get both heads

and tails, so probabilities for outcomes can always be added together. Just to be totally correct, I

should mention one other point. It may sound silly, but we need to assume that something occurring

is always a sure thing. If we flip a coin, we must get either heads or tails with probability 1. We

could even allow for the coin landing on its edge as long as the probabilities for all the outcomes

add up to 1.
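
The five facts, together with the requirement that the outcome probabilities add up to 1, can be checked on a fair six-sided die. This is a minimal sketch using exact fractions; the names `DIE` and `prob` and the particular events are hypothetical choices made for illustration.

```python
from fractions import Fraction

# A fair six-sided die: each outcome has probability 1/6 (exact arithmetic).
DIE = {face: Fraction(1, 6) for face in range(1, 7)}


def prob(event, model=DIE):
    """Probability of an event (a set of outcomes).

    Outcomes are mutually exclusive, so their probabilities are added.
    """
    return sum((model[outcome] for outcome in event), Fraction(0))


everything = set(DIE)   # "something occurs"
nothing = set()         # an event with no chance of occurring
evens = {2, 4, 6}
low_odds = {1, 3}       # shares no outcome with `evens`

p_three = prob({3})
p_not_three = prob(everything - {3})
p_either = prob(evens | low_odds)

print(f"P(three) = {p_three}")
print(f"P(not three) = {p_not_three}")
print(f"P(even or low odd) = {p_either}")
print(f"P(something occurs) = {prob(everything)}")
```

Because `evens` and `low_odds` cannot happen together, the probability of their union equals the sum of their separate probabilities. That shortcut would fail for overlapping events such as `evens` and `{2, 3}`.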

EXAMPLE 1.1.1. Consider the nine outcomes that are all combinations of three heights, tall (T),
