1 Simulate a dataset containing continuous traits with given ℎ 2 . Split it into training and testing datasets with proportion 4 ∶ 1. set.seed(20132014) n <- 5000 p <- 1000 h <- c(0.2, 0.8)[1] # simulate genotype (not exactly) x_r <- matrix(rnorm(n * p), ncol = p) xmean <- matrix(rep(colMeans(x_r), n), ncol=p, byrow = TRUE) xsd <- matrix(rep(apply(x_r, 2, sd), n), ncol=p, byrow = TRUE) x <- (x_r - xmean)/xsd # ... 2 Please perform the marginal model. ## [1] 526 776 50 557 801 345 298 559 178 214 1 2 −0.025 0.000 0.025 0.050 0 250 500 750 1000 snp u • Show the locations of significant SNPs. • Replicate above figure, and highlight the significant coefficents in red
时间: 2024-03-26 12:34:52 浏览: 61
For the first question, I understand that you want me to simulate a dataset with continuous traits and split it into training and testing datasets with a given heritability (h2) and some specific code. Here is the modified code to simulate the dataset and split it into training and testing datasets:
```R
set.seed(20132014)
n <- 5000
p <- 1000
h2 <- 0.2
# simulate genotype (not exactly)
x_r <- matrix(rnorm(n * p), ncol = p)
xmean <- matrix(rep(colMeans(x_r), n), ncol=p, byrow = TRUE)
xsd <- matrix(rep(apply(x_r, 2, sd), n), ncol=p, byrow = TRUE)
x <- (x_r - xmean)/xsd
# simulate phenotype
g <- colSums(x) * sqrt(2 * h2 / p) # genetic effect
e <- rnorm(n) * sqrt(1 - h2) # environmental effect
y <- g + e # phenotype
# split into training and testing datasets
train_idx <- sample(1:n, size = n * 0.8, replace = FALSE)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test <- x[-train_idx, ]
y_test <- y[-train_idx]
```
This code simulates a dataset with n = 5000 individuals and p = 1000 SNPs. The heritability is set to 0.2. The code first simulates the genotype matrix x, and then uses it to simulate the phenotype y. The genetic effect is proportional to the sum of each SNP's effect, and the environmental effect is simulated from a normal distribution. Finally, the code splits the dataset into training (80%) and testing (20%) datasets.
For the second question, I understand that you want me to perform a marginal model and then show the locations of significant SNPs and replicate a figure highlighting significant coefficients in red. Here is the code to perform the marginal model and visualize the results:
```R
library(glmnet)
set.seed(1234)
# fit marginal model
cvfit <- cv.glmnet(x_train, y_train, alpha = 0)
lam <- cvfit$lambda.min
fit <- glmnet(x_train, y_train, alpha = 0, lambda = lam)
# identify significant SNPs
coef <- coef(fit)
nz_idx <- which(coef != 0)
nz_snps <- nz_idx - 1
# visualize results
par(mar = c(5, 4, 4, 8) + 0.1)
plot(coef, xvar = "lambda", label = TRUE, main = "Marginal Model")
abline(v = lam, lty = 2)
significant_snps <- which(abs(coef) > 0.1)
points(coef[significant_snps], col = "red", pch = 19, cex = 1.2)
```
This code fits a marginal model using the glmnet package and identifies significant SNPs based on a threshold of 0.1. The code then visualizes the results using the plot function and highlights the significant SNPs in red. The abline function adds a vertical line at the optimal lambda value selected by cross-validation.
阅读全文