Practical Tools for Exploring Data and Models with R

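Practical Tools for Exploring Data and Models with R is a guide written for data analysts and statistical modellers. It describes three key tools used in practical data processing and model building. The author organises the material around a typical data analysis workflow: obtaining data, visualising data, and iterating on models.

The first part deals with obtaining data and getting it into a state that is ready for analysis. This covers cleaning, tidying and loading data into the R environment, and may involve data import (for example `read.csv` or `read.table`) and preprocessing (handling missing values, outliers and format conversion).

The book then turns to data exploration, emphasising the central role of graphics in understanding distributions, relationships and trends. This part may cover R graphics libraries such as `ggplot2` for producing scatterplots, histograms, boxplots and similar displays that help the analyst see patterns and anomalies in the data.

Next comes the iterative process of model building. The author advocates refining quantitative summaries of the data step by step through the interplay of graphics and models. This may include linear regression, decision trees, clustering algorithms or deep learning models (for example with the `caret` or `tidymodels` packages), with graphical feedback at each step used to assess model performance and to adjust parameters or choose a different model structure. The book may also touch on statistical inference and hypothesis testing, for example with `t.test`, `anova` or `wilcox.test`, on interpreting model output and predictions, and on model validation and tuning techniques such as cross-validation and grid search.

Finally, the book discusses how to integrate exploratory results and models into reports or presentations so that conclusions and insights are communicated effectively. This may include using tools such as R Markdown or Shiny to produce interactive documents that others can understand and reproduce.

Overall, the book offers a comprehensive and practical set of methods for using R to explore data in depth and build models efficiently. Both beginners and experienced practitioners should find material in it that improves their data analysis skills.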

1 Introduction

(casting). This framework is implemented in the reshape package and the chapter has

been published in the Journal of Statistical Software (Wickham, 2007c).

1.2 Plotting data

Plotting data is a critical part of exploratory data analysis, helping us to see the bulk of our

data, as well as highlighting the unusual. As Tukey once said: “numerical quantities focus

on expected values, graphical summaries on unexpected values.”

Unfortunately, current open-source systems for creating graphics are sorely lacking from

a practical perspective. The R environment for statistical computing provides the richest set

of graphical tools, split into two libraries: base graphics (R Development Core Team, 2007)

and lattice (Sarkar, 2006). Base graphics has a primitive pen on paper model, and while

lattice is a step up, it has fundamental limitations. Compared to base graphics, lattice takes

care of many of the minor technical details that require manual tweaking in base graphics,

in particular providing matching legends and maintaining common scales across multiple

plots. However, attempting to extend lattice raises fundamental questions: why are there

separate functions for scatterplots and dotplots when they seem so similar? Why can you

only log transform scales and not use other functions? What makes adding error bars to

a plot so complicated? Extending lattice also reveals another problem. Once a lattice plot

object is created, it is very difficult to modify it in a maintainable way: the components of

the lattice model of graphics (Becker et al., 1996) are designed for a very specific type of

display, and do not generalise well to other graphics we may wish to produce.

To do better, we need a framework that incorporates a very wide range of graphics.

There have been two main attempts to develop such a framework of statistical graphics, by

Bertin and Wilkinson. Bertin (1983) focuses on geographical visualisation, but also lays

out principles for sound graphical construction, including suggested mappings between

different types of variables and visual properties. All graphics are hand drawn, and while

the underlying principles are sound, the practice of drawing graphics on a computer is

rather different. The Grammar of Graphics (Wilkinson, 2005) is more modern and presents

a way to concisely and formally describe a graphic. Instead of coming up with a new

name for your graphic, and giving a lengthy, textual description, you can instead describe

the exact components which define your graphic. The grammar is composed of seven

components, as follows:

• Data. The most important part of any plot. Data reshaping is the responsibility of

the algebra, which consists of three operators (nesting, crossing and blending).

• Transformations create new variables from functions of existing variables, e.g. log-

transforming a variable.

• Scales control the mapping between variables and aesthetic properties like colour

and size.

• The geometric element specifies the type of object used to display the data, e.g.

points, lines, bars.

• A statistic optionally summarises the data. Statistics are critical parts of certain

graphics (e.g. the bar chart and histogram).

• The coordinate system is responsible for computing positions on the 2D plane of

the plotting surface, which is usually the Cartesian coordinate system. A subset of

the coordinate system is facetting, which displays different subsets of the data in

small multiples, a generalisation of trellising (Becker et al., 1996) that allows for

non-rectangular layouts.

• Guides (axes and legends) enable the reading of data values from the graph.

Wilkinson’s grammar successfully describes a broad range of graphics, but is hampered

by a lack of an available implementation: we cannot use the grammar or test its claims.

These issues are discussed by Cox (2007), which provides a comprehensive review of the

book.

To resolve these two problems, I implemented the grammar in R. This started as a direct

implementation of the ideas in the book, but as I proceeded it became clear that there

are areas in which the grammar could be improved. This led to the development of

a grammar of layered graphics, described in Chapter 3. The work extends and refines

the work of Wilkinson, and is implemented in the R package ggplot2 (Wickham, 2008).

This chapter has been tentatively accepted by the Journal of Computational and Graphical

Statistics, and a revised version will be resubmitted shortly.
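As a rough illustration of how the components listed earlier combine into a single plot, the following sketch uses the present-day ggplot2 interface, which may differ from the 2008 version cited here. The choice of the built-in mtcars data and of the particular variables is an assumption made only for the example; it is not taken from the thesis.

```r
library(ggplot2)

# Data: the built-in mtcars data set (chosen here purely for illustration).
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Geometric element: points; colour is mapped to cylinder count.
  geom_point(aes(colour = factor(cyl))) +
  # Statistic: a fitted straight line summarising each panel.
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Scale carrying a transformation: weight on a log10 axis.
  scale_x_log10() +
  # Scale and guide: names the colour legend.
  scale_colour_discrete(name = "Cylinders") +
  # Coordinate system: Cartesian (the default, written out for clarity).
  coord_cartesian() +
  # Facetting: one panel per transmission type.
  facet_wrap(~ am)

print(p)
```

Each term added with `+` corresponds to one component, so swapping a single term (for example `geom_point()` for `geom_line()`) changes one aspect of the graphic without rewriting the rest.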

1.3 Visualising models

Graphics give us a qualitative feel for the data, helping us to make sense of what’s going

on. That is often not enough: many times we also need a precise mathematical model

which allows us to make predictions with quantifiable uncertainty. A model is also useful

as a concise mathematical summary, succinctly describing the main features of the data.

To build a good model, we need some way to compare it to the data and investigate

how well it captures the salient features. To understand the model and how well it fits the

data, we need tools for exploratory model analysis (Unwin et al., 2003; Urbanek, 2004).

Graphics and models make different assumptions and have different biases. Models are not

prone to human perceptual biases caused by the simplifying assumptions we make about

the world, but they do have their own set of simplifying assumptions, typically required

to make mathematical analysis tractable. Using one to validate the other allows us to

overcome the limitations of each.

Chapter 4 describes three strategies for visualising statistical models. These strategies

emphasise displaying the model in the context of the data, looking at many models, and

exploring the process of model fitting as well as the final result. This chapter pulls together

my experience building visualisations for classification, clustering and ensembles of linear

models, as implemented by the R packages clusterfly (Wickham, 2007b), classifly

(Wickham, 2007a), and meifly (Wickham, 2007a). I plan to submit this paper to

Computational Statistics.
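Two of these strategies, looking at many models and showing a model in the context of the data, can be sketched in base R. This is only a minimal illustration under assumptions of my own (the mtcars data, three arbitrary predictors, and adjusted R-squared and AIC as summaries); it is not how clusterfly, classifly or meifly are implemented.

```r
# Candidate predictors of fuel economy in the built-in mtcars data
# (an arbitrary choice for illustration).
predictors <- c("wt", "hp", "disp")

# Every non-empty subset of the predictors defines one candidate model.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each linear model and record a few summary statistics.
fits <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "mpg"), data = mtcars)
})
ensemble <- data.frame(
  terms  = vapply(subsets, paste, character(1), collapse = " + "),
  n_vars = lengths(subsets),
  adj_r2 = vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1)),
  aic    = vapply(fits, AIC, numeric(1))
)
print(ensemble)

# Many models at once: model size against fit quality.
plot(ensemble$n_vars, ensemble$adj_r2,
     xlab = "Number of predictors", ylab = "Adjusted R-squared")

# One model in the context of the data: observations with the fitted line.
fit_wt <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars)
abline(fit_wt)
```

Keeping the per-model summaries in one data frame means the ensemble itself can be explored with the same plotting tools as the raw data.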

Chapter 2

Reshaping data with the reshape package

Abstract

This paper presents the reshape package for R, which provides a common framework for

many types of data reshaping and aggregation. It uses a paradigm of ‘melting’ and

‘casting’, where the data are ‘melted’ into a form which distinguishes measured and identifying

variables, and then ‘cast’ into a new shape, whether it be a data frame, list, or

high-dimensional array. The paper includes an introduction to the conceptual framework, practical

advice for melting and casting, and a case study.

2.1 Introduction

Reshaping data is a common task in real-life data analysis, and it’s usually tedious and

frustrating. You’ve struggled with this task in Excel, in SAS, and in R: how do you get your

clients’ data into the form that you need for summary and analysis? This paper describes

version 0.8.1 of the reshape package for R (R Development Core Team, 2007), which

presents a new approach that aims to reduce the tedium and complexity of reshaping data.

Data often has multiple levels of grouping (nested treatments, split plot designs, or

repeated measurements) and typically requires investigation at multiple levels. For example,

from a long-term clinical study we may be interested in investigating relationships over

time, or between times or patients or treatments. To make your job even more difficult,

the data probably has been collected and stored in a way optimised for ease and accuracy

of collection, and in no way resembles the form you need for statistical analysis. You need

to be able to fluently and fluidly reshape the data to meet your needs, but most software

packages make it difficult to generalise these tasks, and new code needs to be written for

each new case.

While you’re probably familiar with the idea of reshaping, it is useful to be a little more

formal. Data reshaping involves a rearrangement of the form, but not the content, of the

data. Reshaping is a little like creating a contingency table, as there are many ways to

arrange the same data, but it is different in that there is no aggregation involved. The

tools presented in this paper work equally well for reshaping (retaining all existing data)

and aggregating (summarising the data), and later we will explore the connection between

the two.

In R, there are a number of general functions that can aggregate data, for example

tapply, by and aggregate, and a function specifically for reshaping data, reshape. Each

of these functions tends to deal well with one or two specific scenarios, and each requires

slightly different input arguments. In practice, you need careful thought to piece together

the correct sequence of operations to get your data into the form that you want. The

reshape package grew out of my frustrations with reshaping data for consulting clients,

and overcomes these problems with a general conceptual framework that uses just two

functions: melt and cast.
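To make the point about differing input arguments concrete, here is a small base R sketch that computes the same group means with tapply, aggregate and by. The built-in chickwts data set is an assumption chosen only for illustration.

```r
# Built-in data set: chick weights recorded for several feed types.
data(chickwts)

# tapply(): takes a vector and a grouping factor, returns a named array.
m_tapply <- tapply(chickwts$weight, chickwts$feed, mean)

# aggregate(): takes a formula and a data frame, returns a data frame.
m_aggregate <- aggregate(weight ~ feed, data = chickwts, FUN = mean)

# by(): takes a data frame, a grouping factor and a function that
# receives each sub-data-frame; returns an object of class "by".
m_by <- by(chickwts, chickwts$feed, function(d) mean(d$weight))

# The numbers agree, but each call needs its input in a different form
# and hands the result back in a different structure.
print(m_tapply)
print(m_aggregate)
print(m_by)
```

The three results hold identical means, yet combining them or feeding them into a further step requires a different conversion in each case, which is the friction the paragraph above describes.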

The paper introduces this framework, which will help you think about the fundamental

operations that you perform when reshaping and aggregating data, but the main emphasis

is on the practical tools, detailing the many forms of data that melt can consume and that

cast can produce. A few other useful functions are introduced, and the paper concludes

with a case study, using reshape in a real-life example.

2.2 Conceptual framework

To help us think about the many ways we might rearrange a data set, it is useful to think

about data in a new way. Usually, we think about data in terms of a matrix or data frame,

where we have observations in the rows and variables in the columns. For the purposes of

reshaping, we can divide the variables into two groups: identifier and measured variables.

1. Identifier (id) variables identify the unit that measurements take place on. Id

variables are usually discrete, and are typically fixed by design. In ANOVA notation ($Y_{ijk}$),

id variables are the indices on the variables ($i, j, k$); in database notation, id variables

are a composite primary key.

2. Measured variables represent what is measured on that unit ($Y$).

It is possible to take this abstraction one step further and say there are only id variables and

a value, where the id variables also identify what measured variable the value represents.

For example, we could represent this data set, which has two id variables (subject and

time):

  subject     time  age  weight  height
1 John Smith     1   33      90       2
2 Mary Smith     1                    2

as:

  subject     time  variable  value
1 John Smith     1  age          33
2 John Smith     1  weight       90
3 John Smith     1  height        2
4 Mary Smith     1  height        2

where each row now represents one observation of one variable. This operation is called

melting and produces ‘molten’ data. Compared to the original data set, the molten data

has a new id variable ‘variable’, and a new column ‘value’, which represents the value of
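The melting operation on the example data above can be reproduced by hand in base R, which makes the mechanics explicit. This is a sketch and not the reshape package's implementation; the data frame name `smiths` and the use of NA for Mary Smith's blank cells are assumptions. With the reshape package installed, a call such as `melt(smiths, id.vars = c("subject", "time"), na.rm = TRUE)` is intended to produce the equivalent molten form.

```r
# Example data from the text. NA marks the cells shown as blank for
# Mary Smith (an assumption about how the missing values are stored).
smiths <- data.frame(
  subject = c("John Smith", "Mary Smith"),
  time    = c(1, 1),
  age     = c(33, NA),
  weight  = c(90, NA),
  height  = c(2, 2),
  stringsAsFactors = FALSE
)

id_vars       <- c("subject", "time")
measured_vars <- setdiff(names(smiths), id_vars)

# Melt by hand: build one block of rows per measured variable,
# then stack the blocks vertically.
molten <- do.call(rbind, lapply(measured_vars, function(v) {
  data.frame(smiths[id_vars], variable = v, value = smiths[[v]],
             stringsAsFactors = FALSE)
}))

# Drop missing values, matching the molten table shown in the text.
molten <- molten[!is.na(molten$value), ]
rownames(molten) <- NULL
print(molten)
```

Every remaining row holds one observation of one variable, identified by subject, time and the new `variable` column.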
