Believe it or not, the population these observations come from is that of randomly
generated numbers. We generated a data frame of 50 columns, each containing 50
randomly generated numbers. We then examined all the correlations (manually) and
generated a scatterplot of the two attributes with the largest correlation we found.
The code is provided here, in case you want to check it yourself: line 1 sets the seed
so that you find the same results as we did, line 2 creates the data frame, line 3 fills
it with random numbers, column by column, line 4 generates the scatterplot, line 5
fits the regression line, and line 6 tests the significance of the correlation:
1 set.seed(1)
2 DF = data.frame(matrix(nrow=50,ncol=50))
3 for (i in 1:50) DF[,i] = runif(50)
4 plot(DF[[2]],DF[[16]], xlab = "Predictor", ylab = "Outcome")
5 abline(lm(DF[[16]]~DF[[2]]))
6 cor.test(DF[[2]], DF[[16]])
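Incidentally, rather than scanning the correlations by eye as we did, you could
locate the strongest pairwise correlation programmatically. Here is a minimal
sketch of one way to do it, reusing the DF generated above:
# Correlation matrix of all 50 columns; zero out the diagonal so the
# trivial self-correlations (always 1) are ignored.
cmat = cor(DF)
diag(cmat) = 0
# Row and column indices of the largest absolute correlation
# (each pair appears twice because the matrix is symmetric).
which(abs(cmat) == max(abs(cmat)), arr.ind = TRUE)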
How could this relationship happen, given that the odds of it arising by chance
were 2.4 in 1,000? Well, think of it: we correlated all 50 attributes pairwise, which
amounts to 1,225 distinct tests (50 × 49 / 2, not counting the correlation of each
attribute with itself). Such a spurious correlation was to be expected. The usual
threshold below which we consider a relationship significant is p = 0.05, as we will
discuss in Chapter 8, Probability Distributions, Covariance, and Correlation. This means
that we expect to be wrong about once in every 20 tests.
You would be right to suspect that there are other significant correlations in the
generated data frame (with 1,225 tests at p = 0.05, roughly 61 of them are expected
by chance alone). This is the reason why we should always correct for the number
of tests. In our example, as we performed 1,225 tests, our threshold for significance
should be about 0.0000408 (0.05 / 1,225). This is called the Bonferroni correction.
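If you want to verify this yourself, here is a minimal sketch, again reusing the DF
generated above, that collects the p-value of every distinct pairwise test and
compares the raw count of significant results against the Bonferroni-corrected one:
# Collect the p-value of cor.test for every distinct pair (i < j):
# 1,225 tests in total.
pvals = c()
for (i in 1:49)
  for (j in (i+1):50)
    pvals = c(pvals, cor.test(DF[[i]], DF[[j]])$p.value)
# Number of "significant" results at the uncorrected 0.05 threshold.
sum(pvals < 0.05)
# Number that survive the Bonferroni correction; p.adjust() multiplies
# each p-value by the number of tests (capping the result at 1).
sum(p.adjust(pvals, method = "bonferroni") < 0.05)
On random data, the first count should hover around 61, while the second will
usually be zero.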
Spurious correlations are always a possibility in data analysis, and this should be
kept in mind at all times. A related concept is that of overfitting. Overfitting
happens, for instance, when a weak classifier bases its prediction on the noise in
the data, as the sketch below illustrates. We will discuss overfitting in the book,
particularly when discussing cross-validation in Chapter 14, Cross-validation and
Bootstrapping Using Caret and Exporting Predictive Models Using PMML.
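For a quick taste of overfitting here, the following minimal sketch reuses the
random DF from above: we regress one noise column on 30 other noise columns (the
choice of 30 predictors is arbitrary, purely for illustration). The in-sample fit
looks respectable even though there is, by construction, no signal at all:
# Regress the noise column X1 on the noise columns X2 to X31.
fit = lm(X1 ~ ., data = DF[, 1:31])
# The raw in-sample R-squared is sizeable despite the data being pure noise...
summary(fit)$r.squared
# ...while the adjusted R-squared, which penalizes the number of
# predictors, drops to around zero (it can even go negative).
summary(fit)$adj.r.squared
All the chapters are listed in the following section.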
We hope you enjoy reading the book and that you learn a lot from us!