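A Practical Guide to Statistical Analysis and Data Mining

"Handbook of Statistical Analysis and Data Mining Applications": high-resolution English edition, a classic work suited to both new and experienced data miners.

The Handbook of Statistical Analysis and Data Mining Applications is an authoritative, in-depth guide to statistical analysis and data mining. It aims to offer practical insights and methods that help readers extract value from data in real work. Through numerous examples, tutorials, and supplementary DVD material, the authors provide a comprehensive set of learning resources. Statistical analysis is the foundation for understanding and interpreting data, while data mining is the key technique for discovering valuable information in large volumes of data. The opening quotation from H.G. Wells stresses the importance of statistical thinking in modern society, as essential to citizenship as reading and writing. Today, as businesses and organizations worldwide demand more data analysis, statistical analysis and data mining have become core tools for building accurate models, forecasting trends, and making sound strategic decisions.

The book covers the main areas of statistical analysis, including but not limited to descriptive statistics, inferential statistics, hypothesis testing, regression analysis, and time series analysis. The data mining material covers association rule learning, cluster analysis, classification algorithms (such as decision trees, random forests, and support vector machines), feature selection, and preprocessing and postprocessing techniques. The book also shows how to apply this theory to practical problems such as market segmentation, customer behavior prediction, and risk management.

For newcomers, the book is a comprehensive entry point that walks through the data mining process step by step: data collection, preprocessing, modeling, and interpretation of results. For experienced data miners, its in-depth discussion and practical advice can help sharpen analytical skills and tackle more complex problems. The book places particular emphasis on combining the art and the science of statistical analysis, that is, blending intuition with domain expertise to make models accurate and predictions reliable. The "God of Data Mining" idea promoted by the authors stresses how important efficient data analysis is in today's society: like reading and writing, it is an indispensable basic skill.

The Handbook of Statistical Analysis and Data Mining Applications is a comprehensive text that combines theory and practice. Whether you are a learner new to the field or a practitioner seeking to advance, you can gain a great deal from it and acquire the tools for effective analysis and decision-making in the era of big data.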


Foreword 1

This book will help the novice user become familiar with data mining. Basically, data mining is doing data analysis (or statistics) on data sets (often large) that have been obtained from potentially many sources. As such, the miner may not have control of the input data, but must rely on sources that have gathered the data. Consequently, there are problems that every data miner must be aware of as he or she begins (or completes) a mining operation. I strongly resonated with the material on “The Top 10 Data Mining Mistakes,” which gives a worthwhile checklist:

• Ensure you have a response variable and predictor variables, and that they are correctly measured.

• Beware of overfitting. With scads of variables, it is easy with most statistical programs to fit incredibly complex models, but they cannot be reproduced. It is good to save part of the sample to use to test the model. Various methods are offered in this book.

• Don’t use only one method. Using only linear regression can be a problem. Try dichotomizing the response or categorizing it to remove nonlinearities in the response variable. Often, there are clusters of values at zero, which messes up any normality assumption. This, of course, loses information, so you may want to categorize a continuous response variable and use an alternative to regression. Similarly, predictor variables may need to be treated as factors rather than linear predictors. A classic example is using marital status or race as a linear predictor when there is no order.

• Asking the wrong question: when looking for a rare phenomenon, it may be helpful to identify the most common pattern. These may lead to complex analyses, as in item 3, but they may also be conceptually simple. Again, you may need to take care that you don’t overfit the data.

• Don’t become enamored with the data. There may be a substantial history from earlier data or from domain experts that can help with the modeling.

• Be wary of using an outcome variable (or one highly correlated with the outcome variable) as a predictor and becoming excited about the result. The predictors should be “proper” predictors in the sense that (a) they are measured prior to the outcome and (b) they are not a function of the outcome.

• Do not discard outliers without solid justification. Just because an observation is out of line with others is insufficient reason to ignore it. You must check the circumstances that led to the value. In any event, it is useful to conduct the analysis with the observation(s) included and excluded to determine the sensitivity of the results to the outlier.

• Extrapolating is a fine way to go broke; the best example is the stock market. Stick within your data, and if you must go outside, add plenty of caveats. Better still, restrain the impulse to extrapolate. Beware that pictures are often far too simple and we can be misled. Political campaigns oversimplify complex problems (“My opponent wants to raise taxes”; “My opponent will take us to war”) when the realities may imply we have some infrastructure needs that can be handled only with new funding, or we have been attacked by some bad guys.
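The advice to save part of the sample for testing can be sketched in a few lines. The example below is illustrative only and is not taken from the book: the synthetic data, the 70/30 split, and the helper names (`fit_line`, `make_memorizer`, `mse`) are all assumptions made for this sketch. It compares a simple least-squares line with a deliberately overfit "memorizer" that returns the response of the nearest training point.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def make_memorizer(xs, ys):
    """Deliberately overfit model: return the response of the nearest training x."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda pair: abs(pair[0] - x))[1]

    return predict


def mse(predict, xs, ys):
    """Mean squared prediction error of `predict` on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical synthetic data: a linear signal plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Save part of the sample (here the last 30%) to test the fitted models.
split = 140
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

intercept, slope = fit_line(x_train, y_train)


def line(x):
    return intercept + slope * x


memorizer = make_memorizer(x_train, y_train)

results = {
    "line": (mse(line, x_train, y_train), mse(line, x_test, y_test)),
    "memorizer": (mse(memorizer, x_train, y_train), mse(memorizer, x_test, y_test)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.3f}   holdout MSE = {test_err:.3f}")
```

By construction the memorizer reproduces every training response, so its training error is zero and says nothing about how it will behave on new data. Only the holdout column gives an honest comparison between the two models.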

Be wary of your data sources. If you are combining several sets of data, they need to meet a few standards:

• The definitions of variables that are being merged should be identical. Often they are close but not exact (especially in meta-analysis where clinical studies may have somewhat different definitions due to different medical institutions or laboratories).

• Be careful about missing values. Often when multiple data sets are merged, missing values can be induced: one variable isn’t present in another data set, or what you thought was a unique variable name was slightly different in the two sets, so you end up with two variables that both have a lot of missing values.

• How you handle missing values can be crucial. In one example, I used complete cases and lost half of my sample: all variables had at least 85% completeness, but when put together the sample lost half of the data. The residual sum of squares from a stepwise regression was about 8. When I included more variables using mean replacement, almost the same set of predictor variables surfaced, but the residual sum of squares was 20. I then used multiple imputation and found approximately the same set of predictors but had a residual sum of squares (median of 20 imputations) of 25. I find that mean replacement is rather optimistic but surely better than relying on only complete cases. If using stepwise regression, I find it useful to replicate it with a bootstrap or with multiple imputation. However, with large data sets, this approach may be expensive computationally.
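The complete-case loss described above follows from simple arithmetic: if missingness is independent across variables, the expected fraction of complete rows is the product of the per-variable completeness rates, so five variables at 85% each leave roughly 0.85^5 ≈ 0.44 of the sample. The sketch below illustrates this on synthetic data. The number of variables, the independence of the missingness, and all names are assumptions made for illustration, not details from the foreword. It also shows that mean replacement keeps every row but shrinks the spread of the filled variable, which may be one reason it can look optimistic.

```python
import random
import statistics

rng = random.Random(42)
n_rows, n_cols = 1000, 5
n_missing_per_col = 150  # each variable is exactly 85% complete

# Hypothetical synthetic table; None marks a missing value.
data = [[rng.gauss(50.0, 10.0) for _ in range(n_cols)] for _ in range(n_rows)]
for col in range(n_cols):
    # Knock out a random 15% of rows independently for each variable.
    for row in rng.sample(range(n_rows), n_missing_per_col):
        data[row][col] = None

# Per-variable completeness versus complete-case (listwise) completeness.
observed_counts = [sum(row[col] is not None for row in data) for col in range(n_cols)]
complete_cases = [row for row in data if all(value is not None for value in row)]

# Mean replacement: fill each gap with the mean of that variable's observed values.
col_means = [
    statistics.mean(row[col] for row in data if row[col] is not None)
    for col in range(n_cols)
]
imputed = [
    [col_means[col] if value is None else value for col, value in enumerate(row)]
    for row in data
]

# Compare the spread of the first variable before and after mean replacement.
observed_first = [row[0] for row in data if row[0] is not None]
var_observed = statistics.pvariance(observed_first)
var_imputed = statistics.pvariance([row[0] for row in imputed])

print("observed values per variable:", observed_counts)
print("complete cases:", len(complete_cases), "of", n_rows)
print(f"variance of variable 0, observed only: {var_observed:.2f}")
print(f"variance of variable 0, after mean replacement: {var_imputed:.2f}")
```

Multiple imputation, which the foreword recommends as a check, is not shown here; it would replace each gap with several plausible draws rather than a single mean.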

To conclude, there is a wealth of material in this handbook that will repay study.

Peter A. Lachenbruch, Ph.D., Oregon State University

Past President, 2008, American Statistical Society

Professor, Oregon State University

Formerly: FDA and professor at Johns Hopkins University; UCLA, and University of Iowa, and University of North Carolina Chapel Hill


Foreword 2

A November 2008 search on Amazon.com for “data mining” books yielded over 15,000 hits, including 72 to be published in 2009. Most of these books either describe data mining in very technical and mathematical terms, beyond the reach of most individuals, or approach data mining at an introductory level without sufficient detail to be useful to the practitioner. The Handbook of Statistical Analysis and Data Mining Applications is the book that strikes the right balance between these two treatments of data mining.

This volume is not a theoretical treatment of the subject (the authors themselves recommend other books for this) but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.

While it would be easier for everyone if data mining were merely a matter of finding and applying the correct mathematical equation or approach for any given problem, the reality is that both “art” and “science” are necessary. The “art” in data mining requires experience: when one has seen and overcome the difficulties in finding solutions from among the many possible approaches, one can apply newfound wisdom to the next project. However, this process takes considerable time and, particularly for data mining novices, the iterative process inevitable in data mining can lead to discouragement when a “textbook” approach doesn’t yield a good solution.

This book is different; it is organized with the practitioner in mind. The volume is divided into four parts. Part I provides an overview of analytics from a historical perspective and frameworks from which to approach data mining, including CRISP-DM and SEMMA. These chapters will provide a novice analyst an excellent overview by defining terms and methods to use, and will provide program managers a framework from which to approach a wide variety of data mining problems. Part II describes algorithms, though without extensive mathematics. These will appeal to practitioners who are or will be involved with day-to-day analytics and need to understand the qualitative aspects of the algorithms. The inclusion of a chapter on text mining is particularly timely, as text mining has shown tremendous growth in recent years.

Part III provides a series of tutorials that are both domain-specific and software-specific. Any instructor knows that examples make the abstract concept more concrete, and these tutorials accomplish exactly that. In addition, each tutorial shows how the solutions were developed using popular data mining software tools, such as Clementine, Enterprise Miner, Weka, and STATISTICA. The step-by-step specifics will assist practitioners in learning not only how to approach a wide variety of problems, but also how to use these software

