Correcting for Population Structure in Random Forest Analysis


"这篇文章主要探讨了在随机森林分析中如何校正人群结构(stratification)带来的影响，以确保基因组广泛关联研究(GWAS)的准确性。作者包括来自哈佛大学和南京医科大学的多位专家，他们指出人群结构是GWAS中的重要干扰因素，可能导致虚假的关联结果。随机森林作为一种机器学习方法，在GWAS数据处理中日益受到重视，但其在处理人口结构时可能存在偏差。文章可能详细阐述了校正策略和方法，以提高遗传关联分析的可靠性。" 在基因组广泛关联研究(GWAS)中，人群结构(stratification)和混合(admixture)是一个重要的混淆因素，可能导致对基因与疾病关联的错误解读。这是因为不同人群中基因变异的频率可能有所不同，如果在分析中未考虑这些差异，可能会误将这些群体差异解释为与表型的关联。随机森林(Random Forest, RF)是一种强大的机器学习算法，由于其在特征选择和模型构建上的优势，近年来在GWAS中得到了广泛应用。随机森林通过构建大量的决策树并综合它们的预测结果来做出决策，对于识别复杂关系和避免过拟合有显著效果。然而，当面临人群结构问题时，随机森林可能也会产生偏倚。因为随机森林在选择特征时可能会优先选取那些在不同群体间变异较大的基因位点，这可能导致对真实因果关系的忽视，或者引入假阳性关联。为了纠正这种偏倚，文章可能探讨了多种策略和方法。首先，可以采用预处理步骤，如PCA(主成分分析)或使用遗传图谱来估计和去除人群结构的影响。其次，研究人员可能提出在随机森林构建过程中集成人群结构信息，例如通过在每个决策树的节点划分时考虑个体的祖先信息。此外，调整权重分配或引入新的分裂标准也可能有助于减少人群结构的干扰。此外，文章可能还讨论了实际应用中的案例，展示了所提出方法在GWAS数据上的性能，并与其他校正方法进行了比较。校正人群结构对于确保遗传关联研究的科学性和有效性至关重要，因为这直接影响到我们对遗传风险因子的理解和疾病的遗传基础。 "Correction for population stratification in random forest analysis"这篇研究工作深入研究了如何在随机森林分析中有效地处理人群结构问题，以提高GWAS的分析质量，确保发现的真实关联不受人群差异的影响。这种方法的应用和发展对于遗传学研究领域具有重要意义，有助于推动更准确的遗传风险预测和疾病预防策略的制定。

Correction for population stratification in random forest analysis

Yang Zhao (1,2), Feng Chen, Rihong Zhai, Xihong Lin, Zhaoxi Wang, Li Su and David C Christiani

Environmental and Occupational Medicine and Epidemiology Program, Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, MA, USA; Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China; and Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, MA, USA

*Corresponding author. Environmental and Occupational Medicine and Epidemiology Program, Department of Environmental Health, Harvard School of Public Health, Harvard University, 677 Huntington Avenue, Building 1, Room 1401, Boston, MA, USA. E-mail: dchris@hsph.harvard.edu

Accepted 27 September 2012

Background: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS), as it may produce spurious associations. Random forest (RF) has been increasingly applied in GWAS data analysis because of its advantage in analysing high dimensional genetic data. RF creates importance measures for single nucleotide polymorphisms (SNPs), which are helpful for feature selections. However, if PS is not appropriately corrected, RF tends to give high importance to disease-unrelated SNPs with different frequencies of allele or genotype among subpopulations, leading to inaccurate results.
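The confounding described in the Background can be made concrete with a small worked calculation. The sketch below uses hypothetical numbers chosen only for illustration (they are not from the paper): two subpopulations differ in both carrier frequency and disease prevalence, while the SNP has no effect on disease inside either subpopulation.

```python
# Hypothetical numbers for illustration only; none of them come from the paper.
# Within each subpopulation, carrier status and disease are independent.
subpops = {
    "A": {"n": 1000, "carrier_freq": 0.8, "prevalence": 0.30},
    "B": {"n": 1000, "carrier_freq": 0.2, "prevalence": 0.10},
}


def expected_table(n, carrier_freq, prevalence):
    """Expected 2x2 counts (carrier status x disease status) under independence."""
    carriers = n * carrier_freq
    noncarriers = n - carriers
    return {
        "carrier_case": carriers * prevalence,
        "carrier_ctrl": carriers * (1 - prevalence),
        "non_case": noncarriers * prevalence,
        "non_ctrl": noncarriers * (1 - prevalence),
    }


def odds_ratio(table):
    """Odds ratio of disease for carriers versus non-carriers."""
    return (table["carrier_case"] * table["non_ctrl"]) / (
        table["carrier_ctrl"] * table["non_case"]
    )


tables = {name: expected_table(**params) for name, params in subpops.items()}

# Pool the two subpopulations, as an analysis that ignores PS would do.
pooled = {cell: sum(t[cell] for t in tables.values()) for cell in tables["A"]}

stratum_or = {name: odds_ratio(t) for name, t in tables.items()}
pooled_or = odds_ratio(pooled)

print("within-subpopulation odds ratios:", stratum_or)
print("pooled odds ratio:", round(pooled_or, 2))
```

Within each subpopulation the odds ratio is 1, yet the pooled odds ratio is roughly 2.16. An importance measure computed on the pooled data would be exposed to the same artefact.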

Methods: In this study, the authors propose to correct for the confounding effect of PS by including the information of PS in RF analysis. The correction procedure starts by extracting the information of PS using EIGENSTRAT or multi-dimensional scaling clustering procedure from a large number of structure inference SNPs. Phenotype and genotypes adjusted by the information of PS are then used as the outcome and predictors in RF analysis.
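The two steps in the Methods can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are simulated, the principal components come from a plain SVD of standardized genotypes (in the spirit of EIGENSTRAT, which uses its own normalization), and "adjusted" is taken to mean linear-regression residuals, which this excerpt does not confirm. The names `top_pcs` and `residualize` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated toy data (hypothetical): two subpopulations of 200 individuals each.
n_per_pop, n_struct_snps = 200, 500
pop = np.repeat([0, 1], n_per_pop)

# Structure-inference SNPs: allele frequencies differ between subpopulations.
freq = np.where(pop[:, None] == 0,
                rng.uniform(0.1, 0.5, n_struct_snps),
                rng.uniform(0.5, 0.9, n_struct_snps))
struct_geno = rng.binomial(2, freq)  # genotypes coded 0/1/2, shape (400, 500)

# A SNP with no effect on the phenotype, but with a frequency difference.
null_snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8)).astype(float)
# The phenotype depends only on subpopulation membership plus noise.
phenotype = 2.0 * pop + rng.normal(size=pop.size)


def top_pcs(genotypes, k):
    """Scores on the top-k principal components of standardized genotypes."""
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    keep = sd > 0  # drop monomorphic SNPs, which cannot be standardized
    g = g[:, keep] / sd[keep]
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k] * s[:k]


def residualize(values, pcs):
    """Residuals from least-squares regression on an intercept plus the PCs."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


pcs = top_pcs(struct_geno, k=2)
adj_pheno = residualize(phenotype, pcs)
adj_snp = residualize(null_snp, pcs)

raw_corr = np.corrcoef(null_snp, phenotype)[0, 1]
adj_corr = np.corrcoef(adj_snp, adj_pheno)[0, 1]
print("correlation before adjustment:", round(raw_corr, 3))
print("correlation after adjustment:", round(adj_corr, 3))
```

In this toy setting the unadjusted SNP correlates with the phenotype purely through subpopulation membership, and the correlation shrinks toward zero once both are residualized on the principal components. The adjusted phenotype and genotypes would then be passed to a random forest implementation as the outcome and predictors; scikit-learn's `RandomForestRegressor` is one option, though the excerpt does not name the software used.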

Results: Extensive simulations indicate that the importance measure of the causal SNP is increased following the PS correction. By analysing a real dataset, the proposed correction removes the spurious association between the lactase gene and height.

Conclusion: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.

Keywords: Genome-wide association study, population stratification, random forest

Introduction

Genome-wide association study (GWAS) is a powerful tool to identify genetic markers with susceptibility to complex diseases [1–3]. Traditional analysis methods for population-based GWAS data, including Armitage's trend test, Pearson χ² test and unconditional logistic regression, are mainly based on the comparison of allele or genotype frequencies. A single nucleotide polymorphism (SNP) is suggested to be associated

Published by Oxford University Press on behalf of the International Epidemiological Association

International Journal of Epidemiology 2012;41:1798–1806

doi:10.1093/ije/dys183
