RNA-seq best practices in depth: from experimental design to functional analysis


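This document is a review of best practices for RNA-seq data analysis, written by Ana Conesa, Pedro Madrigal and colleagues. RNA-seq is a widely used sequencing technology applied to gene expression studies, transcript identification, small RNA analysis, and integration with other functional genomics methods. The article notes that, because RNA-seq data are complex and varied, no single analysis pipeline suits every case, so the authors work systematically through the key steps of the whole analysis workflow.

Experimental design comes first: sample selection, processing methods and standardized experimental conditions are what make results reliable and reproducible. Quality control (QC) then screens the raw data and removes low-quality reads so that downstream analysis is accurate. For read alignment, the article stresses that the choice of software (for example STAR or TopHat) and of parameter settings affects how correctly transcripts are identified. Quantifying gene and transcript expression requires an appropriate tool, such as Cufflinks or HTSeq-count, which take transcript structure into account; the resulting values can then be passed to differential expression packages such as DESeq2. Visualization is an important aid to understanding the data, and the article discusses genome browsers and R packages (such as the plotting functions of edgeR and DESeq2) for displaying expression differences and clustering results.

The article pays particular attention to differential gene expression analysis, which is the core of many studies: samples from different conditions are compared to find genes that are significantly up- or down-regulated. It also covers methods for detecting alternative splicing, which matters for understanding the regulation of gene expression. Functional analysis, such as enrichment analysis and pathway analysis, relates expression patterns to biological processes and signaling pathways. Detection of gene fusion events, which is valuable for research on certain diseases, is covered as well.

Finally, the authors discuss how RNA-seq can be combined with other genomics technologies (such as ChIP-seq and ATAC-seq), and how newer technologies (such as single-cell and single-nucleus RNA-seq) affect transcriptomics, including potential applications in personalized medicine and in understanding complex biological systems. Overall, the article describes the core steps of RNA-seq data analysis and their challenges, and aims to give researchers a comprehensive guide for choosing an analysis strategy that fits their specific research needs.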

experimental design, especially when the experiment involves a large number of samples that need to be processed in several batches. In this case, including controls, randomizing sample processing and smart management of sequencing runs are crucial to obtain error-free data (Fig. 1a; Box 2).

Analysis of the RNA-seq data

The actual analysis of RNA-seq data has as many variations as there are applications of the technology. In this section, we address all of the major analysis steps for a typical RNA-seq experiment, which involve quality control, read alignment with and without a reference genome, obtaining metrics for gene and transcript expression, and approaches for detecting differential gene expression. We also discuss analysis options for applications of RNA-seq involving alternative splicing, fusion transcripts and small RNA expression. Finally, we review useful packages for data visualization.

Quality-control checkpoints

The acquisition of RNA-seq data consists of several steps: obtaining raw reads, read alignment and quantification. At each of these steps, specific checks should be applied to monitor the quality of the data (Fig. 1a).

Raw reads

Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers and duplicated reads in order to detect sequencing errors, PCR artifacts or contaminations. Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30 % disagreement be discarded. FastQC [11] is a popular tool to perform these analyses on Illumina reads, whereas NGSQC [12] can be applied to any platform. As a general rule, read quality decreases towards the 3' end of reads, and if it becomes too low, bases should be removed to improve mappability. Software tools such as the FASTX-Toolkit [13] and Trimmomatic [14] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases.
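As an illustration of the 3' trimming idea described above, the following is a minimal sketch that removes trailing low-quality bases from a single read. It is not how the FASTX-Toolkit or Trimmomatic work internally (Trimmomatic, for instance, also offers sliding-window trimming); it is a simple end-trimming rule. The function names, the quality threshold of 20 and the Phred+33 encoding are assumptions made for this example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Drop bases from the 3' end while their quality is below min_quality.

    Hypothetical helper: the threshold of 20 is an illustrative choice,
    not a value taken from the text.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string, offset)
    keep = len(scores)
    # Walk backwards from the 3' end until a base passes the threshold.
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


if __name__ == "__main__":
    # 'I' encodes Phred 40; '#' and '!' encode Phred 2 and 0.
    print(trim_3prime("ACGTACGT", "IIIIII#!"))
```

A read whose bases all fall below the threshold comes back empty, and a real pipeline would typically discard it or apply a minimum-length filter afterwards.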

Read alignment

Reads are typically mapped to either a genome or a transcriptome, as will be discussed later. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well ('multi-mapping reads'). When reads are mapped against the transcriptome, we expect slightly lower total mapping percentages because reads coming from unannotated transcripts will be lost, and significantly more multi-mapping reads because of reads falling onto exons that are shared by different transcript isoforms of the same gene.
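The mapping checks above reduce to simple arithmetic on read counts. The sketch below computes the percentage of mapped reads and the share of multi-mapping reads among them, and flags a sample that falls below the lower end of the 70 to 90 % range quoted for human genome alignments. The class and function names are hypothetical, and the read counts are assumed to come from whatever summary the chosen aligner reports.

```python
from dataclasses import dataclass


@dataclass
class MappingSummary:
    percent_mapped: float        # mapped reads as % of all reads
    percent_multi_mapped: float  # multi-mapping reads as % of mapped reads
    below_expected: bool         # True if percent_mapped < expected minimum


def summarize_mapping(total_reads, uniquely_mapped, multi_mapped,
                      expected_min_percent=70.0):
    """Summarize alignment counts for one sample.

    expected_min_percent defaults to the lower bound quoted in the text
    for human genome alignments; other organisms or transcriptome
    alignments may need a different value.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    mapped = uniquely_mapped + multi_mapped
    if mapped > total_reads:
        raise ValueError("mapped reads exceed total reads")
    percent_mapped = 100.0 * mapped / total_reads
    percent_multi = 100.0 * multi_mapped / mapped if mapped else 0.0
    return MappingSummary(percent_mapped, percent_multi,
                          percent_mapped < expected_min_percent)


if __name__ == "__main__":
    print(summarize_mapping(1_000_000, 700_000, 150_000))
```

Comparing these summaries across samples of the same experiment is usually more informative than judging one sample against a fixed cut-off.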

Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads

Box 2. Experiment execution choices

RNA-seq library preparation and sequencing procedures include a number of steps (RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding, and lane loading) that might introduce biases into the resulting data [196]. Including exogenous reference transcripts ('spike-ins') is useful both for quality control [1, 197] and for library-size normalization [198]. For bias minimization, we recommend following the suggestions made by Van Dijk et al. [199], such as the use of adapters with random nucleotides at the extremities or the use of chemical-based fragmentation instead of RNase III-based fragmentation. If the RNA-seq experiment is large and samples have to be processed in different batches and/or Illumina runs, caution should be taken to randomize samples across library preparation batches and lanes so as to avoid technical factors becoming confounded with experimental factors. Another option, when samples are individually barcoded and multiple Illumina lanes are needed to achieve the desired sequencing depth, is to include all samples in each lane, which would minimize any possible lane effect.
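One way to randomize samples across library preparation batches without confounding batch and condition is to shuffle samples within each condition and then deal them out to batches in turn. The sketch below does this; it is an illustrative scheme, not a procedure prescribed by the text, and the function name, the input layout and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches, balancing each condition across batches.

    samples: list of (sample_id, condition) tuples (hypothetical layout).
    Returns a dict mapping sample_id to a batch index in [0, n_batches).
    """
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)                  # random order within the condition
        start = rng.randrange(n_batches)  # random starting batch
        for i, sample_id in enumerate(ids):
            # Deal samples out cyclically so batch sizes per condition
            # differ by at most one.
            assignment[sample_id] = (start + i) % n_batches
    return assignment


if __name__ == "__main__":
    design = [(f"ctrl_{i}", "control") for i in range(6)]
    design += [(f"trt_{i}", "treated") for i in range(6)]
    for sample_id, batch in sorted(assign_batches(design, 3).items()):
        print(sample_id, batch)
```

With six control and six treated samples and three batches, every batch receives two samples of each condition, so a batch effect cannot be mistaken for a treatment effect.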

Table 1 Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates

                                        Replicates per group
                                        3        5        10
Effect size (fold change)
  1.25                                  17 %     25 %     44 %
  1.5                                   43 %     64 %     91 %
  2                                     87 %     98 %     100 %
Sequencing depth (millions of reads)
  3                                     19 %     29 %     52 %
  10                                    33 %     51 %     80 %
  15                                    38 %     57 %     85 %

Example of calculations for the probability of detecting differential expression in a single test at a significance level of 5 %, for a two-group comparison using a Negative Binomial model, as computed by the RNASeqPower package of Hart et al. [190]. For a fixed within-group variance (package default value), the statistical power increases with the difference between the two groups (effect size), the sequencing depth, and the number of replicates per group. This table shows the statistical power for a gene with 70 aligned reads, which was the median coverage for a protein-coding gene for one whole-blood RNA-seq sample with 30 million aligned reads from the GTEx Project [214]
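The trends in Table 1 can be explored with a short calculation. The sketch below uses a common normal approximation for a two-group comparison, in which power depends on the per-gene read depth, the number of replicates per group, the biological coefficient of variation and the fold change. RNASeqPower is an R package and this Python function is not its implementation; the function name and the coefficient of variation of 0.25 used in the example are assumptions, so the output should be read as indicative rather than as a reproduction of the table.

```python
from math import log, sqrt
from statistics import NormalDist


def rnaseq_power(depth, n_per_group, cv, effect, alpha=0.05):
    """Approximate power to detect a fold change in a two-group comparison.

    depth        mean number of reads aligned to the gene
    n_per_group  replicates in each group
    cv           biological coefficient of variation (assumed by the caller)
    effect       fold change between the groups (> 0, not equal to 1)
    alpha        two-sided significance level of a single test
    """
    if depth <= 0 or n_per_group <= 0 or effect <= 0:
        raise ValueError("depth, n_per_group and effect must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    # Variance of the log expression estimate: counting noise (1/depth)
    # plus biological variability (cv squared).
    noise = 1.0 / depth + cv ** 2
    z = sqrt(n_per_group * log(effect) ** 2 / (2.0 * noise)) - z_alpha
    return standard_normal.cdf(z)


if __name__ == "__main__":
    # 70 aligned reads per gene as in the table note; cv = 0.25 is assumed.
    for fold_change in (1.25, 1.5, 2.0):
        row = [rnaseq_power(70, n, 0.25, fold_change) for n in (3, 5, 10)]
        print(fold_change, [f"{100 * p:.0f} %" for p in row])
```

Because counting noise enters as 1/depth, extra sequencing depth helps most for weakly covered genes, whereas additional replicates reduce both noise terms.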

Conesa et al. Genome Biology (2016) 17:13 Page 4 of 19
