The 2015 Data Mining Revolution in the Life Sciences: From Genome Sequencing to Statistical Analysis

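Data Analysis for the Life Sciences (2015), by Rafael A. Irizarry and Michael I. Love, examines the measurement revolution in the life sciences driven by advances in digital technology since the second half of the 20th century. With the growth of genomics, new measurement technologies such as microarrays and next-generation sequencing let scientists observe, for the first time, molecular entities that were previously invisible, leading to discoveries comparable to the identification of microorganisms after the invention of the microscope. These technologies have transformed fields that traditionally relied on simple data analysis: where a study once measured the transcription level of a single gene, it can now measure thousands or even tens of thousands of genes at once.

The book centers on bioinformatics and stresses the role of large datasets in life science research. The authors show how to use R for data processing and analysis, including installing R and RStudio, learning basic R programming, and installing packages and importing data. Particular attention goes to statistical inference: random variables, null hypothesis testing, distributions (such as the normal distribution), and samples and estimates. The central limit theorem and the t-distribution are presented as key tools for inference and hypothesis testing on real data, and worked t-test examples are explained in detail. Because large, complex datasets are hard to interpret, the book argues that advanced statistical skills are needed to avoid being fooled by patterns that arise by chance.

The book therefore covers the basics of data analysis and also emphasizes the growing importance of statistics in life science research, especially as the field shifts from hypothesis-driven to discovery-driven research. It is practice-oriented and suited to researchers, students, and anyone interested in bioinformatics. It is published on Leanpub, a platform that lets authors refine a book iteratively using reader feedback. The copyright information indicates that this version was published on 2015-09-23 and encourages readers to take part in the book's development.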

Getting Started 10

filename <- file.path(dir,"extdata/femaleMiceWeights.csv")

dat <- read.csv(filename)

Exercises

Here we will test some of the basics of R data manipulation which you should know or should have learned by following the tutorials above. You will need to have the file femaleMiceWeights.csv in your working directory. As we showed above, one way to do this is by using the downloader package:

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv"
download(url, destfile=filename)

1. Read in the file femaleMiceWeights.csv and report the exact name of the column containing the weights.

2. The [ and ] symbols can be used to extract specific rows and specific columns of the table. What is the entry in the 12th row and second column?

3. You should have learned how to use the $ character to extract a column from a table and return it as a vector. Use $ to extract the weight column and report the weight of the mouse in the 11th row.

4. The length function returns the number of elements in a vector. How many mice are included in our dataset?

5. To create a vector with the numbers 3 to 7, we can use seq(3,7) or, because they are consecutive, 3:7. View the data and determine what rows are associated with the high fat or hf diet. Then use the mean function to compute the average weight of these mice.

6. One of the functions we will be using often is sample. Read the help file for sample using ?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure that everybody gets the same answer.
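The base R operations these exercises rely on can be sketched on a small made-up table. This is only an illustration: the data frame toy and its values are hypothetical stand-ins for the real file, although the column names Diet and Bodyweight follow the code shown elsewhere in this section. It does not give the answers for femaleMiceWeights.csv.

```r
# Hypothetical stand-in for femaleMiceWeights.csv; the values are made up.
toy <- data.frame(
  Diet = rep(c("chow", "hf"), each = 3),
  Bodyweight = c(21.5, 28.1, 24.0, 25.7, 26.4, 34.0)
)

colnames(toy)            # exact column names of the table
toy[2, 2]                # entry in row 2, column 2
toy$Bodyweight[3]        # third element of the weight column, via $
length(toy$Bodyweight)   # number of elements, one per mouse

hf_rows <- 4:6           # same as seq(4, 6); the hf rows in this toy table
hf_mean <- mean(toy$Bodyweight[hf_rows])

set.seed(1)              # fix the random number generator for reproducibility
i <- sample(hf_rows, 1)  # one randomly chosen row index among the hf rows
picked <- toy$Bodyweight[i]
```

The value of i depends on the R version's random number generator, so no expected output is shown for the sampled row.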

Brief Introduction to dplyr

The R markdown document for this section is available here²².

The learning curve for R syntax is slow. One of the more difficult aspects that requires some getting used to is subsetting data tables. The dplyr package brings these tasks closer to English.

²²https://github.com/genomicsclass/labs/tree/master/intro/dplyr_intro.Rmd


## [1] "numeric"

To do this in R without dplyr the code is the following:

chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]
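For comparison, the same subsetting can be written with dplyr. The sketch below assumes dplyr is installed and uses a made-up table, dat_toy, rather than the real dat. It shows one plausible pipeline (filter, then select, then unlist) and is not necessarily the exact code the book used on the preceding page.

```r
library(dplyr)

# Hypothetical stand-in for dat; the values are made up.
dat_toy <- data.frame(
  Diet = c("chow", "chow", "hf", "hf"),
  Bodyweight = c(21.5, 28.1, 25.7, 26.4)
)

# Base R: logical row index plus logical column index, as in the line above
base_vals <- dat_toy[dat_toy$Diet == "chow", colnames(dat_toy) == "Bodyweight"]

# dplyr: filter() keeps rows, select() keeps columns,
# and unlist() turns the one-column data frame into a plain vector
dplyr_vals <- dat_toy %>%
  filter(Diet == "chow") %>%
  select(Bodyweight) %>%
  unlist(use.names = FALSE)
```

Both approaches return the chow weights as a numeric vector. The dplyr version names each step, which is the sense in which it reads closer to English.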

Exercises

For these exercises, we will use a new dataset related to mammalian sleep. This data is described here. Download the CSV file from this location:

We are going to read in this data, then test your knowledge of the key dplyr functions select and filter. We are also going to review two different classes: data frames and vectors.

1. Read in the msleep_ggplot2.csv file with the function read.csv and use the function class to determine what type of object is returned.

2. Now use the filter function to select only the primates. How many animals in the table are primates? Hint: the nrow function gives you the number of rows of a data frame or matrix.

3. What is the class of the object you obtain after subsetting the table to only include primates?

4. Now use the select function to extract the sleep (total) for the primates. What class is this object? Hint: use %>% to pipe the results of the filter function to select.

5. Now we want to calculate the average amount of sleep for primates (the average of the numbers computed above). One challenge is that the mean function requires a vector so, if we simply apply it to the output above, we get an error. Look at the help file for unlist and use it to compute the desired average.

6. For the last exercise, we could also use the dplyr summarize function. We have not introduced this function, but you can read the help file and repeat exercise 5, this time using just filter and summarize to get the answer.
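The difference between a data frame and a vector that these exercises probe can be sketched on a made-up table. Everything below is an assumption for illustration: sleep_toy and its values are hypothetical, and the column names order and sleep_total are assumed to match the real file. The sketch requires dplyr to be installed.

```r
library(dplyr)

# Hypothetical miniature table; the values are made up, not msleep data.
sleep_toy <- data.frame(
  name = c("a", "b", "c", "d"),
  order = c("Primates", "Primates", "Rodentia", "Primates"),
  sleep_total = c(10, 12, 14, 8)
)

primates <- filter(sleep_toy, order == "Primates")
nrow(primates)        # number of rows kept by filter()
class(primates)       # "data.frame"

# select() returns a one-column data frame, not a vector
primate_sleep <- primates %>% select(sleep_total)
class(primate_sleep)  # "data.frame"

# unlist() flattens the data frame into a vector that mean() accepts
avg_unlist <- mean(unlist(primate_sleep))

# summarize() computes the mean inside the pipeline
# and returns a one-row data frame
avg_summarize <- sleep_toy %>%
  filter(order == "Primates") %>%
  summarize(avg = mean(sleep_total))
```

Note that summarize returns a data frame, so the number itself is reached with avg_summarize$avg.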

http://docs.ggplot2.org/0.9.3.1/msleep.html

Mathematical Notation

The R markdown document for this section is available here²³.

This book focuses on teaching statistical concepts and data analysis programming skills. We avoid mathematical notation as much as possible, but we do use it. We do not want readers to be intimidated by the notation though. Mathematics is actually the easier part of learning statistics.

²³https://github.com/genomicsclass/labs/tree/master/intro/math_notation.Rmd
