mutate(n=row_number())
Date: 2023-08-04 21:34:36 Views: 74
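This expression is a call to `mutate()` from the dplyr package in R. It adds a new variable named `n` to the data frame and gives each row an increasing sequence number starting at 1. With no argument, `row_number()` follows the current row order (and restarts at 1 within each group if the data frame is grouped); given a column, as in `row_number(x)`, it numbers the rows by the rank of `x` instead. It is typically used in preprocessing, to add a new variable or to give every row of a data frame a unique identifier.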
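A minimal sketch of the three behaviours described above, assuming dplyr is installed. The data frame `df` and its columns are hypothetical and exist only for illustration:

```R
library(dplyr)

# Hypothetical toy data, not taken from the original post
df <- data.frame(
  station  = c("A", "B", "A", "B", "A"),
  duration = c(30, 10, 20, 50, 40)
)

# No argument: number the rows in their current order
by_position <- df %>% mutate(n = row_number())

# With a column: number the rows by the rank of that column
# (the rows themselves are not reordered)
by_duration <- df %>% mutate(n = row_number(duration))

# After group_by(): numbering restarts at 1 inside each group
by_group <- df %>%
  group_by(station) %>%
  mutate(n = row_number()) %>%
  ungroup()
```

Note that `row_number(duration)` does not sort the data frame; use `arrange()` first if the rows themselves should be in that order.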
Related questions
library(dplyr) data <- data %>% mutate(startlabel = as.character(start.station.id), endlabel = as.character(end.station.id)) %>% group_by(start.station.id, end.station.id) %>% mutate(label = row_number()) %>% ungroup() %>% mutate(startlabel = ifelse(duplicated(startlabel), paste0(startlabel, ".", endlabel), startlabel), endlabel = ifelse(duplicated(endlabel), paste0(endlabel, ".", start.station.id), endlabel)) %>% group_by(start.station.id) %>% mutate(startlabel = dense_rank(startlabel)) %>% ungroup() %>% group_by(end.station.id) %>% mutate(endlabel = dense_rank(endlabel)) %>% ungroup() %>% mutate(startlabel = ifelse(is.na(startlabel), max(endlabel, na.rm = TRUE) + row_number(), startlabel), endlabel = ifelse(is.na(endlabel), max(startlabel, na.rm = TRUE) + row_number(), endlabel)) %>% select(-label)
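This code uses the dplyr package in R to transform a data frame named `data` in a series of steps. Each step is explained briefly below:
1. `mutate(startlabel = as.character(start.station.id), endlabel = as.character(end.station.id))`: converts the values of `start.station.id` and `end.station.id` to character and stores the results in the new columns `startlabel` and `endlabel`.
2. `group_by(start.station.id, end.station.id)`: groups the data by each combination of `start.station.id` and `end.station.id`.
3. `mutate(label = row_number())`: adds a `label` column holding each observation's row number within its group, so the count restarts at 1 for every station pair.
4. `ungroup()`: removes the grouping and returns the data to an ungrouped state.
5. `mutate(startlabel = ifelse(duplicated(startlabel), paste0(startlabel, ".", endlabel), startlabel), endlabel = ifelse(duplicated(endlabel), paste0(endlabel, ".", start.station.id), endlabel))`: `duplicated()` flags every occurrence of a value after the first. For those rows, `startlabel` gets `.` and `endlabel` appended, and `endlabel` gets `.` and `start.station.id` appended, so that repeated labels can be told apart.
6. `group_by(start.station.id)`: groups the data again, this time by `start.station.id` only.
7. `mutate(startlabel = dense_rank(startlabel))`: replaces `startlabel` with its dense rank within each group. Equal values share a rank and distinct values get consecutive integers with no gaps. Because ranking restarts in every group, the integers are unique only within a group.
8. `ungroup()`: removes the grouping.
9. `group_by(end.station.id)`: groups the data by `end.station.id`.
10. `mutate(endlabel = dense_rank(endlabel))`: replaces `endlabel` with its dense rank within each group, in the same way as step 7.
11. `ungroup()`: removes the grouping.
12. `mutate(startlabel = ifelse(is.na(startlabel), max(endlabel, na.rm = TRUE) + row_number(), startlabel), endlabel = ifelse(is.na(endlabel), max(startlabel, na.rm = TRUE) + row_number(), endlabel))`: fills any missing `startlabel` with the maximum of `endlabel` plus the row number, and any missing `endlabel` with the maximum of `startlabel` plus the row number.
13. `select(-label)`: drops the `label` column.
Note that this is only an explanation of the code; it has not been run here. To run it, make sure the dplyr package is installed and that the data frame `data` contains the columns used above.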
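The building blocks in the steps above can be tried on a few rows. This is only a sketch: the `trips` data frame is hypothetical, and the helper columns `dup` and `end_rank` are introduced here for illustration and do not appear in the original pipeline.

```R
library(dplyr)

# Hypothetical trips; only the two id column names follow the code above
trips <- data.frame(
  start.station.id = c(1, 1, 1, 2, 2),
  end.station.id   = c(5, 5, 6, 5, 7)
)

labelled <- trips %>%
  mutate(startlabel = as.character(start.station.id)) %>%
  group_by(start.station.id, end.station.id) %>%
  mutate(label = row_number()) %>%                   # restarts at 1 for each station pair
  ungroup() %>%
  mutate(dup = duplicated(startlabel)) %>%           # TRUE from the second occurrence on
  group_by(start.station.id) %>%
  mutate(end_rank = dense_rank(end.station.id)) %>%  # ties share a rank, no gaps
  ungroup()
```

Inspecting `labelled` shows why the order of `group_by()` and `ungroup()` matters: `label` and `end_rank` are computed per group, while `dup` is computed over the whole column.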
Write an R program: Simulate a dataset containing continuous traits with given h^2. Split it into training and testing datasets with proportion 4:1. Given code: `set.seed(20132014); n <- 5000; p <- 1000; h <- c(0.2, 0.8)[1]`, followed by a step that simulates the genotype (not exactly): `x_r <- matrix(rnorm(n * p), ncol = p); xmean <- matrix(rep(colMeans(x_r), n), ncol=p, byrow = TRUE); xsd <- matrix(rep(apply(x_r, 2, sd), n), ncol=p, byrow = TRUE); x <- (x_r - xmean)/xsd`. Please perform the marginal model. `## [1] 526 776 50 557 801 345 298 559 178 214` • Show the locations of significant SNPs. • Replicate above figure, and highlight the significant coefficients in red.
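Sure, here is R code that simulates the dataset, splits it into training and testing datasets, and fits a model. Note that the fit below is a cross-validated lasso from glmnet, which estimates all SNP effects jointly; it is not a per-SNP marginal regression: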
```R
# Simulate dataset
set.seed(20132014)
n <- 5000
p <- 1000
h2 <- 0.2
x_r <- matrix(rnorm(n * p), ncol = p)
xmean <- matrix(rep(colMeans(x_r), n), ncol=p, byrow = TRUE)
xsd <- matrix(rep(apply(x_r, 2, sd), n), ncol=p, byrow = TRUE)
x <- (x_r - xmean)/xsd
beta <- rnorm(p, mean = 0, sd = sqrt(h2/p))
y <- x %*% beta + rnorm(n, mean = 0, sd = sqrt(1 - h2))
# Split into training and testing datasets
train_idx <- sample(1:n, size = n/5*4, replace = FALSE)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test <- x[-train_idx, ]
y_test <- y[-train_idx]
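# Fit a lasso path with glmnet and choose lambda by cross-validation
# (a penalized joint fit, not a per-SNP marginal regression)
library(glmnet)
fit <- glmnet(x_train, y_train, family = "gaussian")
cvfit <- cv.glmnet(x_train, y_train, family = "gaussian")
lambda_min <- cvfit$lambda.min
# Coefficients at the selected lambda, with the intercept (first row) dropped
beta_hat <- coef(fit, s = lambda_min)[-1, ]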
```
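A marginal model is usually understood as one simple regression per SNP, fitted separately. The following is a minimal sketch of that reading in base R, not the original author's method. The seed, the smaller `n` and `p`, the `causal` positions, the equal effect sizes and the Bonferroni cut-off are all assumptions chosen so the example runs quickly:

```R
set.seed(1)                       # hypothetical seed, for reproducibility only
n <- 500; p <- 50; h2 <- 0.5      # assumed sizes, much smaller than the original
causal <- c(3, 17, 42)            # hypothetical causal SNP positions

x <- scale(matrix(rnorm(n * p), ncol = p))   # centre and scale each column
beta <- numeric(p)
beta[causal] <- sqrt(h2 / length(causal))    # equal effects that sum to h2
y <- drop(x %*% beta) + rnorm(n, sd = sqrt(1 - h2))

# 4:1 split into training and testing rows
train_idx <- sample(n, size = 4 * n / 5)
x_train <- x[train_idx, ]
y_train <- y[train_idx]

# One simple regression per SNP; keep the slope estimate and its p-value
marginal <- t(apply(x_train, 2, function(snp) {
  summary(lm(y_train ~ snp))$coefficients["snp", c("Estimate", "Pr(>|t|)")]
}))
colnames(marginal) <- c("estimate", "p_value")

# Call a SNP significant if it passes a Bonferroni threshold over the p tests
significant_snps <- which(marginal[, "p_value"] < 0.05 / p)
```

Here `significant_snps` holds the column indices that pass the threshold, which is one way to "show the locations" the question asks for. Other multiple-testing corrections would give a different set.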
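To show the locations of the selected SNPs, we can plot the absolute values of the estimated coefficients and highlight in red the ones the lasso keeps, that is, the non-zero coefficients. A non-zero lasso coefficient is a selection rather than a significance test: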
```R
# Plot coefficients
library(ggplot2)
# One row per SNP: its position and the absolute value of its coefficient
beta_df <- data.frame(SNP = seq_along(beta_hat),
                      value = abs(as.numeric(beta_hat)))
# Linear y-axis: a log scale cannot display coefficients that are exactly zero.
# The x-axis labels are kept so that SNP positions can be read off the plot.
ggplot(beta_df, aes(x = SNP, y = value)) +
  geom_point() +
  labs(title = "Estimated Coefficients",
       x = "SNP",
       y = "Absolute Value") +
  theme(plot.title = element_text(hjust = 0.5))
# Highlight the selected (non-zero) coefficients
significant_snps <- which(beta_hat != 0)
significant_df <- beta_df[beta_df$SNP %in% significant_snps, ]
ggplot(beta_df, aes(x = SNP, y = value)) +
  geom_point() +
  # Constant colour set outside aes(), so the overlay is always drawn in red
  geom_point(data = significant_df, colour = "red", size = 2) +
  labs(title = "Estimated Coefficients",
       x = "SNP",
       y = "Absolute Value") +
  theme(plot.title = element_text(hjust = 0.5))
```
This should give you the desired plot with significant coefficients highlighted in red.
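If a legend is wanted, an alternative is to map a logical column to colour in a single layer. The sketch below assumes ggplot2 is installed; the `res` data frame, its values and the 0.2 cut-off are hypothetical and only stand in for real model output. Naming the entries of `values` ties each colour to a level, so `TRUE` stays red even when only one level is present:

```R
library(ggplot2)

# Hypothetical per-SNP results; in practice these would come from a fitted model
set.seed(2)
res <- data.frame(SNP = 1:50, estimate = rnorm(50, sd = 0.05))
res$estimate[c(3, 17, 42)] <- c(0.4, -0.35, 0.45)
res$significant <- abs(res$estimate) > 0.2   # hypothetical cut-off

p <- ggplot(res, aes(x = SNP, y = estimate, colour = significant)) +
  geom_point() +
  # Named values: the colour follows the level name, not the level order
  scale_colour_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  labs(x = "SNP", y = "Estimated coefficient")
```

Printing `p` draws the plot. With unnamed `values`, colours are assigned in level order, so a column containing only `TRUE` would take the first colour in the vector.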