Data Analysis with R: Import, Tidy, Transform, Visualize, and Model


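R for Data Science, co-authored by Hadley Wickham and Garrett Grolemund, focuses on the data-processing workflow in data science: importing, tidying, transforming, visualizing, and modeling. The book aims to help readers apply R to data science and to strengthen their analysis skills through a set of practical tools and techniques.

In R for Data Science, the authors emphasize five key parts of the data science workflow:

1. **Import**: The first step of any analysis is loading data from sources such as CSV files, Excel workbooks, databases, or APIs into R. Packages such as `readr`, `data.table`, and `DBI` (whose `dbConnect()` function opens a database connection) make importing simple and efficient.
2. **Tidy**: Tidy data is the foundation of analysis: each column is a variable, each row is an observation, and each table describes a single, well-defined observational unit. The `tidyr` package reshapes irregular data structures into this form.
3. **Transform**: Data usually has to be transformed before it suits the analysis, for example by computing new variables, handling missing values, or standardizing and normalizing values. The `dplyr` package provides a powerful set of data-manipulation functions, such as `select()`, `filter()`, `mutate()`, and `arrange()`, while `purrr` offers functional-programming tools that simplify batch operations.
4. **Visualize**: Visualization is key to understanding data and communicating findings. `ggplot2` is the go-to R package for high-quality graphics. It is based on the Grammar of Graphics and lets users build complex charts while keeping the code concise and readable.
5. **Model**: Once the relationships in the data are understood, we usually build models to predict or explain a phenomenon. R offers a rich set of statistical modeling tools, such as the `lm()` and `glm()` functions and the `randomForest` and `caret` packages, covering linear regression, generalized linear models, and machine-learning methods.

In addition, the book discusses how to organize code, conduct reproducible research, and collaborate using version-control tools such as Git. Through practical cases and code examples, readers can study these skills in depth and apply R more effectively in data science projects.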
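The workflow above can be sketched in a few lines. This is only a minimal illustration that assumes base R so it runs without extra packages; the book itself works with tidyverse packages such as `readr` and `dplyr`. The inline CSV and the column names `group`, `x`, `y`, and `ratio` are hypothetical.

```r
# Import: read a tiny inline CSV (a stand-in for a real file; made-up data).
csv_text <- "group,x,y
a,1,2.1
a,2,3.9
b,3,6.2
b,4,7.8
b,5,10.1"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Transform: add a derived column, then keep only the rows of interest.
df$ratio <- df$y / df$x
big <- df[df$x >= 2, ]

# Model: fit a simple linear regression of y on x.
fit <- lm(y ~ x, data = df)
print(coef(fit))
```

With the tidyverse installed, the import step would typically use `readr::read_csv()` and the transform step `dplyr::mutate()` and `dplyr::filter()`.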

tory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you’ll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your skepticism to challenge the data in multiple ways.

The complement of hypothesis generation is hypothesis confirmation. Hypothesis confirmation is hard for two reasons:

• You need a precise mathematical model in order to generate falsifiable predictions. This often requires considerable statistical sophistication.
• You can only use an observation once to confirm a hypothesis. As soon as you use it more than once you’re back to doing exploratory analysis. This means to do hypothesis confirmation you need to “preregister” (write out in advance) your analysis plan, and not deviate from it even when you have seen the data. We’ll talk a little about some strategies you can use to make this easier in Part IV.
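One possible way to respect the use-each-observation-once constraint, offered here as an assumption rather than as the book's prescription, is to set rows aside before any exploration starts. The 60/40 split, the seed, and the names `explore` and `confirm` are arbitrary choices for illustration.

```r
# Reserve part of the data for confirmation before looking at any of it.
set.seed(1)                      # arbitrary seed so the split is repeatable
n <- nrow(datasets::mtcars)
idx <- sample(n)                 # random permutation of the row indices

explore_rows <- idx[seq_len(floor(0.6 * n))]
confirm_rows <- setdiff(idx, explore_rows)

explore <- datasets::mtcars[explore_rows, ]  # look at this as often as needed
confirm <- datasets::mtcars[confirm_rows, ]  # touch only once, at the end

print(c(explore = nrow(explore), confirm = nrow(confirm)))
```

Hypotheses generated freely on `explore` can then be checked a single time on `confirm` with a plan written down in advance.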

It’s common to think about modeling as a tool for hypothesis confirmation, and visualization as a tool for hypothesis generation. But that’s a false dichotomy: models are often used for exploration, and with a little care you can use visualization for confirmation. The key difference is how often you look at each observation: if you look only once, it’s confirmation; if you look more than once, it’s exploration.

Prerequisites

We’ve made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some programming experience already. If you’ve never programmed before, you might find Hands-On Programming with R by Garrett to be a useful adjunct to this book.

There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of repro‐

xiv | Preface

• If we want to make it clear what package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate() or nycflights13::flights. This is also valid R code.
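As a small illustration of the two-colon form, the sketch below uses packages that ship with R so it runs anywhere. The `dplyr` call is guarded because that package may not be installed, and the column name `kpl` and its approximate conversion factor are assumptions made for the example.

```r
# Call a function by its fully qualified name; no library() call is needed.
mpg_sd <- stats::sd(datasets::mtcars$mpg)
print(mpg_sd)

# The same pattern works for add-on packages once they are installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  # kpl: rough kilometres per litre (1 mpg is about 0.425 km/l).
  cars <- dplyr::mutate(datasets::mtcars, kpl = mpg * 0.425)
  print(head(cars$kpl))
}
```

Writing `pkg::name` also avoids ambiguity when two loaded packages export objects with the same name.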

Getting Help and Learning More

This book is not an island; there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips on how to get help, and to help you keep learning.

If you get stuck, start with Google. Typically, adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re more likely to find help for English error messages.)
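A minimal sketch of that tip follows. It switches the message language and then captures a warning as text so it can be pasted into a search box. The failing call `log(-1)` is just a convenient stand-in for whatever code produced your message, and on some platforms the language change may only take effect in a fresh session.

```r
# Ask R to emit its messages in English.
Sys.setenv(LANGUAGE = "en")

# Re-run the problematic code and capture the message text for searching.
msg <- tryCatch(
  log(-1),                                  # stand-in for the failing call
  warning = function(w) conditionMessage(w),
  error   = function(e) conditionMessage(e)
)
print(msg)
```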

If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer; including [R] restricts your search to questions and answers that use R. If you don’t find anything useful, prepare a minimal reproducible example or reprex. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it.

There are three things you need to include to make your example reproducible: required packages, data, and code:

• Packages should be loaded at the top of the script, so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; it’s possible you’ve discovered a bug that’s been fixed since you installed the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().
• The easiest way to include data in a question is to use dput() to generate the R code to re-create it. For example, to re-create the mtcars dataset in R, I’d perform the following steps:
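The preview text breaks off at this point, so the sketch below is not the book's own list of steps. It is only a hedged illustration of how `dput()` round-trips a data frame, using a three-row slice of `mtcars` to keep the output short. The names `small`, `code`, and `rebuilt` are hypothetical.

```r
# Take a small slice so the generated code stays short enough to paste.
small <- head(datasets::mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that re-creates the object; capture it as text.
code <- paste(capture.output(dput(small)), collapse = "\n")
cat(code, "\n")

# In a question you would paste that text after `small <-`.
# Evaluating it here shows that it rebuilds an equivalent object.
rebuilt <- eval(parse(text = code))
print(all.equal(rebuilt, small))
```

In practice you would copy the printed `structure(...)` call into the top of your reprex, below the `library()` lines, so that readers can run the example without access to your files.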
