Translational Biomedical Informatics: A Precision Medicine Perspective

Updated 2024-07-19, PDF, 6.28 MB

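"Translational Biomedical Informatics", edited by Bairong Shen, Haixu Tang, and Xiaoqian Jiang, is a book on translational biomedical informatics that offers in-depth insight from a precision medicine perspective. It focuses on the application of next-generation sequencing (NGS) technologies to personal genome sequencing, the characterization of genomic landscapes, and the detection of large numbers of sequence variants. The book outlines general approaches for identifying different types of sequence variants from NGS data and summarizes common strategies for analyzing and visualizing pathogenic variants associated with complex diseases. In the era of precision medicine, advances in NGS have brought revolutionary changes to disease diagnosis and treatment. With high-throughput sequencing, scientists can resolve an individual's genome quickly and efficiently and thereby discover genetic variants that may cause disease. In its chapters, the authors describe the sequence variant detection process in detail, which typically includes quality control, alignment, variant calling, and functional annotation. These steps are essential for understanding how genomic variation affects health and disease.

NGS data analysis involves a range of methods, such as short-read aligners (e.g., BWA or Bowtie), variant callers (e.g., GATK or FreeBayes), and annotation tools (e.g., ANNOVAR or SnpEff). These tools help researchers extract meaningful information from massive sequencing data sets and identify potentially pathogenic variants. The book may also cover how to use various bioinformatics software and databases to validate and interpret these variants, and how to turn this information into decision support for clinical practice. For complex diseases such as cancer or polygenic disorders, analysis and visualization become more complicated. The book notes that multiple data sources, including gene expression data, epigenetic data, and clinical information, usually need to be integrated to determine multifactorial models of disease. This process may involve network analysis, machine learning algorithms, and visualization tools (such as Cytoscape or IGV) to reveal the complex networks of disease-associated variants and potential therapeutic targets.

"Translational Biomedical Informatics" explores in depth the application of NGS in translational medicine, in particular how these technologies can be used for sequence variant analysis to advance precision medicine. With an understanding of these technologies, readers can better see how genomics research translates into practical medical strategies that improve patient outcomes. The book is a valuable resource for biomedical researchers, clinicians, and scholars interested in precision medicine.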

1.2.4 Structural Variant Discovery

Structural variants (SVs) are widespread in human genomes and play important roles in the development of human diseases. As a growing number of SVs have been shown to have clinical relevance, SV discovery is critical in precision medicine and cancer genomics. NGS technologies have revolutionized SV studies. Compared to traditional hybridization-based approaches such as array CGH and SNP microarrays, sequencing-based bioinformatics methods can detect multiple types of SVs across a wide size range [5]. Most of these approaches distinguish SVs based on two read mapping signatures: depth of coverage and paired-end mapping [39]. The first type of approach searches for regions with abnormal read counts; the second type examines the configurations of the paired-end mappings [60]. In this section, we describe the computational approaches (Table 1.2) based on these two signatures.

Depth of Coverage. These approaches assume that read mapping follows a Poisson distribution and that divergence from this distribution indicates an SV signature. Duplications have more reads mapping to the region, and deletions show significantly reduced coverage. CNVnator [2] can detect deletions and duplications using a statistical analysis of read mapping density for single-end and paired-end reads. It captures the read-depth signatures by dividing sequencing regions into equal-sized bins and computing the counts of reads in each bin. The partitioning of the signatures is based on a mean-shift approach with additional filters such as GC-bias correction. A statistical significance test is used to identify the regions with abnormal signals for detecting possible deletions or duplications. The read-depth approaches can predict the absolute copy numbers of genomic segments. However, they cannot detect balanced SVs such as translocations and inversions.
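The read-depth idea can be illustrated with a minimal sketch. It assumes read start positions are already available as integers, uses the median bin count as the expected Poisson rate, and applies an arbitrary significance cutoff; the function names and the `alpha` value are hypothetical choices for illustration. It does not implement CNVnator's mean-shift partitioning or GC-bias correction.

```python
import math
import statistics


def bin_read_counts(read_starts, region_length, bin_size):
    """Count reads whose start position falls in each equal-sized bin."""
    n_bins = math.ceil(region_length / bin_size)
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def flag_bins(counts, alpha=0.001):
    """Label each bin as 'deletion', 'duplication', or 'normal'.

    Assumption: the expected rate is the median bin count, so that a few
    aberrant bins do not shift the baseline. alpha is an illustrative cutoff.
    """
    lam = statistics.median(counts)
    labels = []
    for c in counts:
        p_low = poisson_cdf(c, lam)  # P(X <= c): unusually few reads?
        p_high = 1.0 - poisson_cdf(c - 1, lam) if c > 0 else 1.0  # P(X >= c)
        if p_low < alpha:
            labels.append("deletion")
        elif p_high < alpha:
            labels.append("duplication")
        else:
            labels.append("normal")
    return labels


# Toy bin counts: one depleted bin and one enriched bin.
print(flag_bins([100, 98, 103, 40, 101, 200, 99, 102]))
```

Because only read counts are examined, a sketch like this can flag copy-number changes but says nothing about balanced rearrangements, which matches the limitation noted above.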

Paired-End Mapping. These approaches can be classified into two types of strategy: read pair and split read. Read-pair methods analyze the span and orientation of paired-end reads and identify the read pairs that are mapped with discordant separation distances or orientation. Read-pair approaches can detect all classes of SVs. BreakDancer [12] can detect read pairs with mapping span and orientation that are inconsistent with the control. It has two models: BreakDancerMax can identify five types of SVs including insertions, deletions, inversions, and intrachromosomal and interchromosomal translocations, while BreakDancerMini is used to detect INDELs. Split-read approaches search for split-read signatures to identify the breakpoints of SVs. The deletions and duplications can be identified from the continuous stretch of gaps in the sequence reads or references, respectively. Split-read methods are suitable for long reads, but some algorithms can use short reads to identify the breakpoints of large SVs. For example, Pindel [61] uses a pattern growth algorithm to find large deletions and medium insertions from short paired-end reads. The algorithm can align the gapped short sequences to reference

1 NGS for Variants 9

sequences with local alignment, which can reduce memory use and increase speed when searching for potential split reads.
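The read-pair signature described above can be sketched as a simple classifier. This is a simplified illustration, not BreakDancer's algorithm: it assumes a forward/reverse library (left mate on "+", right mate on "-"), a known insert-size mean and standard deviation, and a hypothetical `ReadPair` record. The labels and the `n_sd` cutoff are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one aligned read pair."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Return a label for the mapping signature of one read pair."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal_translocation"

    # Order the two mates by mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )

    # Discordant orientation.
    if left_strand == right_strand:
        return "inversion"
    if (left_strand, right_strand) == ("-", "+"):
        return "everted_orientation"

    # Discordant separation distance.
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"  # mates map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # mates map closer together than expected
    return "concordant"


pair = ReadPair("chr1", 1000, "+", "chr1", 6000, "-")
print(classify_pair(pair, mean_insert=400, sd_insert=30))
```

A real caller would cluster many discordant pairs supporting the same event before reporting it; a single pair is weak evidence on its own.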

Structural variant discovery from targeted or whole-exome sequencing data is very challenging because the reads in exons are noncontiguous. Targeted sequencing introduces biases in sample collection, targeted genomic hybridization, and GC content. Multiple tools have been developed to overcome these biases. CONTRA [36] is a read-depth tool for CNV discovery. It uses BAM/SAM alignments as inputs and builds an average baseline across multiple samples as the control. CONTRA then computes the base-level log-ratios with corrections for imbalanced library size bias and GC content bias. It calculates two-tailed P-values to detect CNVs. XHMM [18] applies principal component analysis to normalize read depth in targets. It uses a hidden Markov model (HMM) to detect CNVs across multiple samples (>50 samples). In addition to VCF files, Browser Extensible Data (BED) format files can be used to store and display large structural variants for further analysis.
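The log-ratio step can be illustrated with a small sketch. It covers only the library-size correction: each depth profile is scaled by its total depth before the ratio is taken. GC-content correction and P-value calculation are omitted. The function name, the pseudocount, and the toy numbers are assumptions for illustration, not CONTRA's implementation.

```python
import math


def log_ratios(sample_depth, baseline_depth, pseudocount=1.0):
    """log2 ratio of sample depth to baseline depth at each position.

    Each profile is divided by its own total depth so that a sample
    sequenced more deeply overall does not look like a genome-wide gain.
    The pseudocount (an assumption) avoids log of zero at uncovered bases.
    """
    sample_total = sum(sample_depth)
    baseline_total = sum(baseline_depth)
    ratios = []
    for s, b in zip(sample_depth, baseline_depth):
        s_norm = (s + pseudocount) / sample_total
        b_norm = (b + pseudocount) / baseline_total
        ratios.append(math.log2(s_norm / b_norm))
    return ratios


# Toy data: the sample has about twice the baseline depth everywhere
# except the third position, which stands out as a relative loss.
sample = [200, 200, 100, 200]
baseline = [100, 100, 100, 100]
print([round(r, 2) for r in log_ratios(sample, baseline)])
```

Positions with clearly negative log-ratios are candidate losses and clearly positive ones candidate gains; deciding what counts as "clearly" is where a statistical test comes in.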

1.3 Variant Analysis

Causal variant discovery is the key step in precision medicine informatics. Identifying the disease-related variants promises to dramatically expand current aspects of biomedical research in disease diagnostics and drug design. Multiple bioinformatics tools have been developed to distinguish the causal variants associated with human diseases from the massive number of nonfunctional variants detected by NGS variant callers. Annotation methods determine the possible functional impact of all identified variants. Association analyses connect the variants with complex diseases or clinical traits. Visualization tools provide graphic views of identified causal variants. The disease-related causal variants can be identified by combining these approaches and stored in public variant databases such as ClinVar [25] and HGMD [54]. The Human Variome Project (http://www.humanvariomeproject.org/) has curated gene-/disease-specific databases to collect the sequence variants and genes associated with diseases. In this section, we summarize the variant analysis approaches for identifying the most promising causal variants underlying human diseases.

1.3.1 Variant Annotation

Variant annotation can be used to determine the effects of sequence variants on genes and proteins and to filter the functionally important variants from a background of neutral polymorphisms. Coding mutations, such as nonsynonymous SNVs, could change amino acid sequences and affect protein structures and functions. They are more likely to be involved in the development of diseases. Regulatory variants located in noncoding regions could modulate gene expression and act as causative modifiers of human diseases. Here, we describe the common

10 S. Teng

computational tools for predicting the effects of coding mutations and regulatory variants. We also introduce the generally used annotation toolkits to access the prediction results generated from these tools.

Damaging Nonsynonymous Mutation Prediction. With the advent of NGS technologies, particularly exome sequencing, there is a significant need to interpret coding variants. A number of tools have been developed to distinguish deleterious mutations from a large number of harmless nonsynonymous polymorphisms. Sorting Intolerant From Tolerant (SIFT) [41] is a commonly used method for predicting the effects of coding mutations on protein function. The algorithm assumes that important protein sites should be conserved throughout evolution and that mutations located in these sites could alter protein functions. SIFT searches for the target sequence in a protein database and constructs sequence alignments using closely related sequences. It computes the degree of conservation of protein residues to distinguish deleterious from neutral coding mutations. Polymorphism Phenotyping v2 (PolyPhen2) [4] is another popular tool for predicting deleterious missense mutations. The PolyPhen2 prediction is based on sequence annotations, structural attributes, and comparative evolutionary considerations. PolyPhen2 uses an iterative greedy algorithm to extract sequence-based and structure-based features. Then, it constructs supervised machine learning classifiers to predict missense variants as benign, possibly damaging, or probably damaging mutations. PolyPhen2 uses two data sets (HumDiv and HumVar) for training. The HumDiv data set collects all damaging mutations associated with human Mendelian diseases from UniProtKB and non-damaging mutations between the proteins and their closely related mammalian homologs. The HumDiv model can be used to analyze rare variants that are mildly deleterious at functionally important regions such as the regions involved in complex phenotypes or identified from genome-wide association studies (GWAS). The HumVar data set uses all disease-causing mutations from UniProtKB as positive data and the common sequence variants not involved in disease as negative instances. The HumVar model can be used to identify the damaging mutations with significant effects for Mendelian disease research. Other common in silico programs include the likelihood ratio test (LRT) [13], which identifies the damaging mutations that disrupt significantly conserved amino acid positions within the human proteome, and MutationTaster [51], which evaluates the deleterious sequence variants using a naive Bayesian model constructed from features including splice-site alterations, mRNA changes, loss of protein, and evolutionary conservation.
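The conservation-based reasoning behind tools such as SIFT can be sketched with a toy scoring function. The sketch assumes one column of an already-built protein alignment is given as a list of residues, estimates residue frequencies with a small pseudocount, and scales them by the most frequent residue. The function names, the pseudocount, and the 0.05 cutoff are illustrative assumptions; this is not the actual SIFT algorithm, which also handles the database search, alignment construction, and sequence weighting.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tolerance_scores(column, pseudocount=0.01):
    """Scaled frequency of each amino acid in one alignment column.

    A score near 1 means the residue is as common as the most frequent
    residue at this position; a score near 0 means it is rarely or never
    observed among the aligned homologs.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict_substitution(column, mutant, threshold=0.05):
    """Call a substitution 'deleterious' if the mutant residue scores low."""
    score = tolerance_scores(column)[mutant]
    return "deleterious" if score < threshold else "tolerated"


conserved = list("LLLLLLLLLL")  # strictly conserved leucine
variable = list("LIVLIVLIVM")   # several hydrophobic residues observed

print(predict_substitution(conserved, "P"))
print(predict_substitution(variable, "I"))
```

A substitution at the strictly conserved position is flagged, while a residue already seen among homologs at the variable position is not, which is the intuition the paragraph describes.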

Regulatory Variant Effect Prediction. The majority of disease-related variant hits identified from GWAS fall in noncoding DNA regions, which indicates that the regulatory variants located in noncoding regions are critical in human disease. Regulatory variants play important roles in gene expression and protein modification. Several bioinformatics tools have been developed for predicting the functional effects of regulatory variants. Genome-wide annotation of variants (GWAVA) [45] uses a random forest algorithm to construct three classifiers to distinguish the functional sequence variants in regulatory regions from a background of neutral variants. The classifiers integrate genomic features such as evolutionary conservation and GC content and a range of epigenomic annotations from the Encyclopedia of DNA Elements (ENCODE) project [15]. Combined Annotation Dependent Depletion (CADD) [22] is a score that can be used to prioritize the functional variants including coding variants and regulatory variants. The CADD tool constructs support vector machine classifiers to integrate various genomic and epigenomic annotations into a single measure (C score) for each sequence variant. Recently, deep learning algorithms have been applied to the interpretation of regulatory variants. DeepSEA [62] is a deep learning-based tool for predicting the effects of noncoding variants and prioritizing regulatory variants. The software uses deep learning algorithms to learn regulatory sequence code from large-scale chromatin-profiling data and predict the effects of noncoding variants on chromatin features such as DNase I sensitivity, transcription factor binding, and histone marks at regulatory elements.

General Variant Annotation. Multiple annotation toolkits have been developed to determine the impacts of sequence variants on genes and proteins and to access their functional effects from the above predictors. ANNOVAR [58] is a command-line Perl program for annotating SNVs and INDELs based on genes, regions, or filters. In gene-based annotation, it can annotate whether sequence variants affect protein amino acid sequences (nonsense, missense, splice site, etc.). In region-based annotation, it can identify the variants located in ENCODE-annotated regions such as transcribed regions, enhancer regions, DNase I hypersensitivity sites, transcription factor binding sites, and transcription factor ChIP-Seq data. In filter-based annotation, ANNOVAR can extract the information (allele frequency and identifier) of a sequence variant in public databases such as dbSNP [53], ClinVar [25], 1000 Genomes Project [1], and Exome Variant Server (http://evs.gs.washington.edu/EVS/). In addition, it can be used to access the annotations from damaging mutation predictors (SIFT, PolyPhen2, LRT, MutationTaster, etc.) for nonsynonymous mutations and CADD for regulatory variants. SnpEff [14] is another popular annotation package to estimate the functional effects of SNVs, INDELs, and multiple nucleotide polymorphisms. Based on the functional impacts of the sequence variants, SnpEff classifies the variants into four classes: high, moderate, low, and modifier. It also provides the annotations for regulatory variants. SnpEff provides a summary HTML page to display overall statistics for sequences and variants (Table 1.3).
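Filter-based annotation followed by impact-based prioritization can be sketched as a dictionary lookup plus a filter. The sketch assumes variants are plain dictionaries that already carry an impact class (high, moderate, low, or modifier) and that population allele frequencies are available in an in-memory table keyed by chromosome, position, and alleles. All names, thresholds, and data below are invented for illustration; this does not call ANNOVAR or SnpEff.

```python
# Lower rank means a more severe predicted impact.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def annotate(variants, frequency_db):
    """Attach a population allele frequency to each variant (None if absent)."""
    annotated = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        annotated.append({**v, "allele_freq": frequency_db.get(key)})
    return annotated


def prioritize(annotated, max_freq=0.01, max_impact="MODERATE"):
    """Keep rare (or unreported) variants at or above an impact class."""
    cutoff = IMPACT_RANK[max_impact]
    kept = [
        v for v in annotated
        if IMPACT_RANK[v["impact"]] <= cutoff
        and (v["allele_freq"] is None or v["allele_freq"] <= max_freq)
    ]
    return sorted(kept, key=lambda v: IMPACT_RANK[v["impact"]])


# Toy data, invented for illustration only.
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "impact": "HIGH"},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "impact": "MODERATE"},
    {"chrom": "2", "pos": 300, "ref": "G", "alt": "A", "impact": "MODIFIER"},
    {"chrom": "2", "pos": 400, "ref": "T", "alt": "C", "impact": "MODERATE"},
]
frequency_db = {
    ("1", 100, "A", "G"): 0.001,
    ("1", 200, "C", "T"): 0.25,
    ("2", 300, "G", "A"): 0.002,
}

for v in prioritize(annotate(variants, frequency_db)):
    print(v["chrom"], v["pos"], v["impact"], v["allele_freq"])
```

Treating variants that are missing from the frequency table as potentially rare is a design choice made here; a stricter pipeline might set them aside for manual review instead.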

1.3.2 Variant Association Testing

Understanding how genetic variants contribute to diseases is the key challenge in precision medicine. There are two hypotheses for interpreting the genetic contribution of sequence variants in complex diseases such as cancers and mental disorders [50]. The “common disease–common variant” hypothesis states that a few common variants, usually defined as those with an allele frequency greater than 1% in the population, make the major contributions to the genetic variance in complex disease susceptibility. In contrast, the “common disease–rare variant” hypothesis argues that multiple risk variants, each of which has low frequency (e.g., allele frequency less than 1%) in the population, are the major contributors to the genetic susceptibility to complex diseases. NGS technologies can detect the full spectrum of sequence variants including the rare variants that are difficult to capture with traditional genotyping arrays. Here, we describe the generally used case–control association approaches for common and rare variants.
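The 1% frequency boundary between common and rare variants can be made concrete with a short calculation from diploid genotype counts. The function names are hypothetical, and assigning a variant at exactly 1% to the rare class is an arbitrary choice made here, since the text only defines the two sides of the boundary.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Minor allele frequency from diploid genotype counts."""
    n_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    # Each heterozygote carries one alternate allele, each
    # alternate homozygote carries two.
    alt_freq = (n_het + 2 * n_alt_hom) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


def frequency_class(maf, threshold=0.01):
    """'common' above the threshold, otherwise 'rare' (boundary is a choice)."""
    return "common" if maf > threshold else "rare"


# 1000 samples: 980 reference homozygotes, 19 heterozygotes,
# 1 alternate homozygote, i.e. 21 alternate alleles out of 2000.
maf = minor_allele_frequency(980, 19, 1)
print(maf, frequency_class(maf))
```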

Case–Control Data QC. The first step in any case–control association analysis is data quality control [6]. The samples and variants with poor quality should be removed to reduce the numbers of false-positive and false-negative associations. The samples with outlying heterozygosity rates, high missing data rates, or discordant sex information have poor quality and should be removed first. In addition, related samples or samples from divergent ancestry should not be used for case–control analysis. If variants show a high rate of missing genotypes, departure from Hardy–Weinberg equilibrium, or a different missing genotype rate between cases and controls, they should be excluded from case–control analysis.
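The variant-level checks can be sketched as follows. The sketch assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele with `None` for a missing call, tests Hardy–Weinberg equilibrium with a one-degree-of-freedom chi-square test in controls only, and uses illustrative thresholds. The function names, the thresholds, and the choice to test controls only are assumptions, not values taken from the text.

```python
import math


def missing_rate(genotypes):
    """Fraction of samples with no genotype call."""
    if not genotypes:
        return 0.0
    return sum(g is None for g in genotypes) / len(genotypes)


def genotype_counts(genotypes):
    """Counts of genotypes 0, 1, 2 among called samples."""
    called = [g for g in genotypes if g is not None]
    return called.count(0), called.count(1), called.count(2)


def hwe_chi_square_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def variant_passes_qc(cases, controls, max_missing=0.05,
                      max_missing_diff=0.02, min_hwe_p=1e-6):
    """Apply the three variant-level filters; thresholds are illustrative."""
    if missing_rate(cases + controls) > max_missing:
        return False  # too many missing genotypes overall
    if abs(missing_rate(cases) - missing_rate(controls)) > max_missing_diff:
        return False  # missingness differs between cases and controls
    if hwe_chi_square_p(*genotype_counts(controls)) < min_hwe_p:
        return False  # controls depart from Hardy-Weinberg equilibrium
    return True


controls = [0] * 25 + [1] * 50 + [2] * 25
cases = [0] * 30 + [1] * 50 + [2] * 20
print(variant_passes_qc(cases, controls))
```

Sample-level checks such as heterozygosity outliers, sex discordance, relatedness, and ancestry need genome-wide data and are outside the scope of this sketch.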

Common-Variant Association Analysis. The genome-wide association study (GWAS) is a generally used approach to identify the common variants associated

Table 1.3 Variant annotation tools

Tool | Description | URL | Reference

Damaging nonsynonymous mutation prediction
SIFT | Tool can predict deleterious and neutral mutations based on sequence homology | http://sift.jcvi.org/ | [41]
PolyPhen2 | Tool can predict probably damaging, possibly damaging, and benign mutations based on sequence and structure features | http://genetics.bwh.harvard.edu/pph2/ | [4]
LRT | Tool can predict deleterious, neutral, or unknown mutations using likelihood ratio test | http://www.genetics.wustl.edu/jflab/lrt_query.html | [13]
MutationTaster | Tool can predict disease-causing and polymorphism mutations using naive Bayesian model | http://www.mutationtaster.org/ | [51]

Regulatory variant effect prediction
GWAVA | Tool can predict the regulatory variant effects using random forest algorithm | https://www.sanger.ac.uk/sanger/StatGen_Gwava | [45]
CADD | Tool can predict the effects of coding and noncoding variants using support vector machine algorithm | http://cadd.gs.washington.edu/ | [22]
DeepSEA | Tool can predict the regulatory variant effects using deep learning algorithm | http://deepsea.princeton.edu/ | [62]

General variant annotation
ANNOVAR | Perl annotation toolkit based on genes, regions, and filters | http://annovar.openbioinformatics.org/ | [58]
SnpEff | Java annotation package based on genes | http://snpeff.sourceforge.net/ | [14]
