掌握R语言：数据科学实战指南

需积分: 9 165 浏览量更新于2024-07-18 2 收藏 32.31MB PDF 举报

"R for Data Science 是一本由 Hadley Wickham 和 Garrett Grolemund 合著的数据科学领域经典著作，旨在帮助读者掌握使用R语言进行数据处理的核心技能，包括数据导入、整理、转换、可视化和建模。" 《R for Data Science》这本书详细介绍了在数据科学实践中使用R语言的关键步骤，其核心理念是“IMPORT, TIDY, TRANSFORM, VISUALIZE, AND MODEL DATA”，即数据的导入、整理、转换、可视化和建模过程。以下是对这些关键概念的深入解析： 1. **数据导入（IMPORT）**：在数据分析的初始阶段，数据通常来自各种来源，如CSV文件、数据库或API。R语言提供了多种包，如`readr`和`data.table`，用于高效地读取和导入不同格式的数据。了解如何正确导入数据至关重要，因为它直接影响后续分析的质量。 2. **数据整理（TIDY）**：tidyverse是R中的一个核心概念，它是一系列包的集合，旨在提供一致且高效的工具来处理数据。其中，`dplyr`包用于数据操作，`tidyr`包则用于将数据转换成“整洁”格式，使得每个变量占据一列，每个观测占据一行，从而方便分析。 3. **数据转换（TRANSFORM）**：数据转换是数据分析的关键步骤，包括数据清洗、计算新变量、处理缺失值等。`dplyr`包提供了诸如`filter`, `select`, `mutate`, `summarize`等函数，使得这些操作变得简单易行。此外，`tidyr`的`gather`和`spread`函数可用于处理宽格式和长格式数据之间的转换。 4. **数据可视化（VISUALIZE）**：书中强调了数据可视化的重要性，`ggplot2`是R中最著名的图形包，它遵循层叠原则构建复杂图表，允许用户逐步添加元素，如几何对象、坐标轴、图例和颜色，以创建专业质量的图形。 5. **模型构建（MODEL DATA）**：在R中，有多种库支持统计建模和机器学习，如`lm`和`glm`用于线性模型和广义线性模型，`caret`提供了一种统一的接口来比较和选择不同模型，`randomForest`和`xgboost`用于构建决策树和随机森林模型。理解模型的原理以及如何评估和解释模型是这个阶段的重点。本书通过实际案例和代码示例，深入浅出地讲解了这些概念，适合数据科学初学者和经验丰富的专业人士。作者Hadley Wickham是R社区的重要贡献者，他开发了许多流行的数据科学工具，如`tidyverse`套件，而Garrett Grolemund是一位资深的数据科学家和教育家，他们两人的合作使得这本书成为R语言和数据科学实践的权威指南。

tory analysis). The focus of this book is unabashedly on hypothesis

generation, or data exploration. Here you’ll look deeply at the data

and, in combination with your subject knowledge, generate many

interesting hypotheses to help explain why the data behaves the way

it does. You evaluate the hypotheses informally, using your skepti‐

cism to challenge the data in multiple ways.

The complement of hypothesis generation is hypothesis confirma‐

tion. Hypothesis confirmation is hard for two reasons:

• You need a precise mathematical model in order to generate fal‐

sifiable predictions. This often requires considerable statistical

sophistication.

• You can only use an observation once to confirm a hypothesis.

As soon as you use it more than once you’re back to doing

exploratory analysis. This means to do hypothesis confirmation

you need to “preregister” (write out in advance) your analysis

plan, and not deviate from it even when you have seen the data.

We’ll talk a little about some strategies you can use to make this

easier in Part IV.

It’s common to think about modeling as a tool for hypothesis confir‐

mation, and visualization as a tool for hypothesis generation. But

that’s a false dichotomy: models are often used for exploration, and

with a little care you can use visualization for confirmation. The key

difference is how often you look at each observation: if you look

only once, it’s confirmation; if you look more than once, it’s explora‐

tion.

Prerequisites

We’ve made a few assumptions about what you already know in

order to get the most out of this book. You should be generally

numerically literate, and it’s helpful if you have some programming

experience already. If you’ve never programmed before, you might

find Hands-On Programming with R by Garrett to be a useful

adjunct to this book.

There are four things you need to run the code in this book: R,

RStudio, a collection of R packages called the tidyverse, and a hand‐

ful of other packages. Packages are the fundamental units of repro‐

xiv | Preface

• If we want to make it clear what package an object comes from,

we’ll use the package name followed by two colons, like

dplyr::mutate() or nycflights13::flights. This is also valid

R code.

Getting Help and Learning More

This book is not an island; there is no single resource that will allow

you to master R. As you start to apply the techniques described in

this book to your own data you will soon find questions that I do

not answer. This section describes a few tips on how to get help, and

to help you keep learning.

If you get stuck, start with Google. Typically, adding “R” to a query

is enough to restrict it to relevant results: if the search isn’t useful, it

often means that there aren’t any R-specific results available. Google

is particularly useful for error messages. If you get an error message

and you have no idea what it means, try googling it! Chances are

that someone else has been confused by it in the past, and there will

be help somewhere on the web. (If the error message isn’t in English,

run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re

more likely to find help for English error messages.)

If Google doesn’t help, try stackoverflow. Start by spending a little

time searching for an existing answer; including [R] restricts your

search to questions and answers that use R. If you don’t find any‐

thing useful, prepare a minimal reproducible example or reprex. A

good reprex makes it easier for other people to help you, and often

you’ll figure out the problem yourself in the course of making it.

There are three things you need to include to make your example

reproducible: required packages, data, and code:

•

Packages should be loaded at the top of the script, so it’s easy to

see which ones the example needs. This is a good time to check

that you’re using the latest version of each package; it’s possible

you’ve discovered a bug that’s been fixed since you installed the

package. For packages in the tidyverse, the easiest way to check

is to run tidyverse_update().

• The easiest way to include data in a question is to use dput() to

generate the R code to re-create it. For example, to re-create the

mtcars dataset in R, I’d perform the following steps:

xviii | Preface

剩余519页未读，继续阅读

丁兆海1991

粉丝: 11
资源: 13

掌握R语言：数据科学实战指南

R for data science solutions

R for Data Science 原版PDF by Wickham & Grolemund

R for Data Science(O'Reilly,2016)

R for Data Science：可视化，建模，转换，整理和导入数据R for Data Science: Visualize, Model, Transform, Tidy, and Import Data

R for Data Science 等三本

r for data science 笔记代码.R

R_for_data_science：Hadley Wickham的R for Data Science书籍中的记录和解决问题

R for Data Science 高清彩色带标签

R for Data Science 和书中数据

R for Data Science Visualize Model Transform Tidy and Import Data.pdf

最新资源