The 2015 Data Mining Revolution in the Life Sciences: From Genome Sequencing to Statistical Analysis
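"Data Analysis for the Life Sciences" (2015) is a textbook by Rafael A. Irizarry and Michael I. Love. It examines the measurement revolution in the life sciences brought about by advances in digital technology during the second half of the 20th century. With the growth of genomics, new measurement technologies such as microarrays and next-generation sequencing let scientists observe molecular entities that were previously invisible, leading to discoveries comparable to the identification of microorganisms after the invention of the microscope. These technologies have transformed fields that traditionally relied on simple data analysis: where a study once measured the transcription level of a single gene, it can now measure thousands or tens of thousands of genes at once.

The book centers on bioinformatics and stresses the role of large datasets in life science research. The authors show how to process and analyze data with R, covering the installation of R and RStudio, basic R programming, and importing data. Statistical inference receives particular attention, including random variables, null hypothesis testing, distributions (such as the normal distribution), and the concepts of samples and estimates. The central limit theorem and the t-distribution are presented as key tools for inference and hypothesis testing on real data, with worked examples of the t-test.

Because large and complex datasets are hard to interpret, advanced statistical skills are needed to avoid being fooled by patterns that arise by chance. The book therefore covers the fundamentals of data analysis while emphasizing the growing importance of statistics in life science research, particularly in the shift from hypothesis-driven to discovery-driven research.

The book is practice-oriented and suits researchers, students, and anyone interested in bioinformatics. It is published on Leanpub, a platform that lets authors refine a work iteratively based on reader feedback. According to the copyright page, this version was published on September 23, 2015, and readers are encouraged to take part in the book's development.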
Getting Started
filename <- file.path(dir,"extdata/femaleMiceWeights.csv")
dat <- read.csv(filename)
Exercises
Here we will test some of the basics of R data manipulation which you should know or should have
learned by following the tutorials above. You will need to have the file femaleMiceWeights.csv
in your working directory. As we showed above, one way to do this is by using the downloader
package:
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv"
download(url, destfile=filename)
1. Read in the file femaleMiceWeights.csv and report the
exact name of the column containing the weights.
2. The [ and ] symbols can be used to extract specific rows and specific columns of the table.
What is the entry in the 12th row and second column?
3. You should have learned how to use the $ character to extract a column from a table and
return it as a vector. Use $ to extract the weight column and report the weight of the mouse
in the 11th row.
4. The length function returns the number of elements in a vector. How many mice are
included in our dataset?
5. To create a vector with the numbers 3 to 7, we can use seq(3,7) or, because they are
consecutive, 3:7. View the data and determine what rows are associated with the high fat
or hf diet. Then use the mean function to compute the average weight of these mice.
6. One of the functions we will be using often is sample. Read the help file for sample using
?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back
the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure
that everybody gets the same answer.
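The operations these exercises rely on can be sketched on a small made-up table. The `toy` data frame and its values below are hypothetical, chosen so as not to give away the answers for the mice data:

```r
# Hypothetical toy table; this is not the mice dataset
toy <- data.frame(Group = c("a", "a", "b", "b", "b"),
                  Value = c(10.5, 11.0, 20.0, 21.5, 19.0))

toy[2, 2]            # entry in row 2, column 2
vals <- toy$Value    # extract one column as a vector
length(vals)         # number of elements in the vector
mean(vals[3:5])      # average of rows 3 to 5 (the "b" rows)

set.seed(1)          # make the random draw reproducible
i <- sample(3:5, 1)  # one random row index between 3 and 5
vals[i]              # value in the sampled row
```

The same pattern should carry over to the exercises by swapping in `dat` and the row ranges they ask for.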
Brief Introduction to dplyr
The R markdown document for this section is available here²².
R syntax has a steep learning curve. One of the more difficult aspects that requires some getting
used to is subsetting data tables. The dplyr package brings these tasks closer to English and we
are therefore going to introduce two simple functions: one is used to subset rows and the other to select
columns.
²²https://github.com/genomicsclass/labs/tree/master/intro/dplyr_intro.Rmd
Take a look at the dataset we read in:
filename <- "femaleMiceWeights.csv"
dat <- read.csv(filename)
head(dat) # In RStudio use View(dat)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
There are two types of diets, which are denoted in the first column. If we want just the weights, we
only need the second column. So if we want the weights for mice on the chow diet, we subset and
filter like this:
library(dplyr)
chow <- filter(dat, Diet=="chow") #keep only the ones with chow diet
head(chow)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
And now we can select only the column with the values:
chowVals <- select(chow,Bodyweight)
head(chowVals)
## Bodyweight
## 1 21.51
## 2 28.14
## 3 24.04
## 4 23.45
## 5 23.68
## 6 19.79
A nice feature of the dplyr package is that you can perform consecutive tasks by using what is called
a “pipe”. In dplyr we use %>% to denote a pipe. This symbol tells the program to first do one thing and
then do something else to the result of the first. Hence, we can perform several data manipulations
in one line. For example:
chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight)
In the second task, we no longer have to specify the object we are editing since it is whatever comes
from the previous call.
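To make the pipe concrete, the sketch below checks that the piped version gives the same result as nesting the two calls. It uses a small hypothetical data frame `toy` rather than the mice data, and assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical toy table standing in for the mice data
toy <- data.frame(Diet = c("chow", "chow", "hf", "hf"),
                  Bodyweight = c(21.5, 28.1, 25.7, 26.4))

# Nested form: the inner call runs first
nested <- select(filter(toy, Diet == "chow"), Bodyweight)

# Piped form: the left-hand result becomes the first argument of the next call
piped <- filter(toy, Diet == "chow") %>% select(Bodyweight)

identical(nested, piped)  # compare the two results
```

The piped form reads left to right in the order the steps happen, which is the main reason to prefer it once more than two steps are chained.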
Also, note that if dplyr receives a data.frame it will return a data.frame.
class(dat)
## [1] "data.frame"
class(chowVals)
## [1] "data.frame"
For pedagogical reasons, we will often want the final result to be a simple numeric vector. To obtain
such a vector with dplyr, we can apply the unlist function, which flattens a list (a data.frame
is a list of columns) into a plain vector, here a numeric one:
chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
class( chowVals )
## [1] "numeric"
To do this in R without dplyr the code is the following:
chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]
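One detail worth noting about the base R version: selecting a single column with `[` returns a plain vector by default, so no `unlist` step is needed. A minimal sketch on a hypothetical toy table (`drop = FALSE` is the standard base R argument that keeps the data.frame structure instead):

```r
# Hypothetical toy table; values are made up for illustration
toy <- data.frame(Diet = c("chow", "chow", "hf"),
                  Bodyweight = c(21.5, 28.1, 25.7))

v <- toy[toy$Diet == "chow", "Bodyweight"]                # numeric vector
d <- toy[toy$Diet == "chow", "Bodyweight", drop = FALSE]  # one-column data.frame

class(v)
class(d)
mean(v)   # works directly on the vector
```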
Exercises
For these exercises, we will use a new dataset related to mammalian sleep. This data is described
at http://docs.ggplot2.org/0.9.3.1/msleep.html. Download the CSV file from this location:
We are going to read in this data, then test your knowledge of the key dplyr functions select
and filter. We are also going to review two different classes: data frames and vectors.
1. Read in the msleep_ggplot2.csv file with the function read.csv and use the function class
to determine what type of object is returned.
2. Now use the filter function to select only the primates. How many animals in the table are
primates? Hint: the nrow function gives you the number of rows of a data frame or matrix.
3. What is the class of the object you obtain after subsetting the table to only include primates?
4. Now use the select function to extract the sleep (total) for the primates. What class is this
object? Hint: use %>% to pipe the results of the filter function to select.
5. Now we want to calculate the average amount of sleep for primates (the average of the
numbers computed above). One challenge is that the mean function requires a vector so, if
we simply apply it to the output above, we get an error. Look at the help file for unlist and
use it to compute the desired average.
6. For the last exercise, we could also use the dplyr summarize function. We have not introduced
this function, but you can read the help file and repeat exercise 5, this time using just filter
and summarize to get the answer.
http://docs.ggplot2.org/0.9.3.1/msleep.html
Mathematical Notation
The R markdown document for this section is available here²³.
This book focuses on teaching statistical concepts and data analysis programming skills. We avoid
mathematical notation as much as possible, but we do use it. We do not want readers to be
intimidated by the notation though. Mathematics is actually the easier part of learning statistics.
²³https://github.com/genomicsclass/labs/tree/master/intro/math_notation.Rmd
Unfortunately, many textbooks use mathematical notation in what we believe to be an
overcomplicated way. For this reason, we try to keep the notation as simple as possible. However,
we do not want to water down the material, and some mathematical notation facilitates a deeper
understanding of the concepts. Here we describe a few specific symbols that we use often. If they
appear intimidating to you, please take some time to read this section carefully as they are actually
simpler than they seem. Because by now you should be somewhat familiar with R, we will make
the connection between mathematical notation and R code.
Indexing
Those of us dealing with data almost always have a series of numbers. To describe the concepts in
an abstract way, we use indexing. For example 5 numbers:
x <- 1:5
can be generally represented like this: $x_1, x_2, x_3, x_4, x_5$. We use dots to simplify this,
$x_1, \dots, x_5$, and indexing to simplify even more: $x_i, i = 1, \dots, 5$. If we want to
describe a procedure for a list of any size $n$, we write $x_i, i = 1, \dots, n$.
We sometimes have two indexes. For example, we may have several measurements (blood pressure,
weight, height, age, cholesterol level) for 100 individuals. We can then use double indexes:
$x_{i,j}, i = 1, \dots, 100, j = 1, \dots, 5$.
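In R, the index notation maps directly onto square-bracket subsetting. The sketch below uses made-up values for a five-element vector and simulated placeholder numbers for the 100 by 5 table, so only the indexing itself should be read as meaningful:

```r
x <- c(3, 1, 4, 1, 5)   # hypothetical values x_1, ..., x_5
n <- length(x)
x[2]                    # x_2

# Visit x_i for i = 1, ..., n
for (i in 1:n) {
  cat("x_", i, " = ", x[i], "\n", sep = "")
}

# Double index: 100 individuals (rows) by 5 measurements (columns)
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)  # simulated placeholder data
X[3, 2]                 # x_{3,2}: measurement 2 for individual 3
dim(X)                  # number of rows and columns
```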
Summation
A very common operation in data analysis is to sum several numbers. This comes up, for
example, when we compute averages and standard deviations. If we have many numbers, there
is a mathematical notation that makes it quite easy to express the following:
n <- 1000
x <- 1:n
S <- sum(x)
and it is the $\sum$ notation (capital S in Greek):

$$S = \sum_{i=1}^n x_i$$
Note that we make use of indexing as well. We will see that what is included inside the summation
can become quite complicated. However, the summation part should not confuse you as it is a simple
operation.
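To see that the summation symbol hides nothing more than repeated addition, the sketch below spells out the sum of $x_i$ for $i = 1, \dots, n$ as an explicit loop and compares it with `sum`. The variable name `S_loop` is introduced here only for illustration:

```r
n <- 1000
x <- 1:n

# Accumulate x_1 + x_2 + ... + x_n one term at a time
S_loop <- 0
for (i in 1:n) {
  S_loop <- S_loop + x[i]
}

S_loop == sum(x)            # the loop and sum() agree
sum(x) == n * (n + 1) / 2   # closed form for 1 + 2 + ... + n
sum(x) / n                  # the average is a sum divided by n
```

In practice `sum(x)` is the idiomatic choice; the loop is shown only to connect the code to the index $i$ in the notation.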