R: Assign data frame column names to specific groups based on the highest value in each column

3lxsmp7m, posted 2023-05-11 in Other

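I want to assign each column to a specific group (Cluster1, Cluster2 or Cluster3) according to which row holds that column's highest value. Function suggestions (dplyr?) would be appreciated.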

         Group Sample1 Sample2 Sample3
    1 Cluster1     0.1     0.0    0.10
    2 Cluster2     0.4     0.3    0.01
    3 Cluster3     0.0     0.9    0.92

Expected output

    Sample1 Cluster2
    Sample2 Cluster3
    Sample3 Cluster3

    df <- structure(list(Group = c("Cluster1", "Cluster2", "Cluster3"),
        Sample1 = c(0.1, 0.4, 0), Sample2 = c(0, 0.3, 0.9), Sample3 = c(0.1,
        0.01, 0.92)), class = "data.frame", row.names = c("1", "2", "3"))

#1 pexxcrt2

Convert to long format first and then aggregate, i.e.

    library(dplyr)
    library(tidyr)

    df %>%
      pivot_longer(-1) %>%      # stack every column except the first (Group)
      group_by(name) %>%        # one group per Sample column
      summarise(Group = Group[value == max(value)])
    # A tibble: 3 × 2
    #   name    Group
    #   <chr>   <chr>
    # 1 Sample1 Cluster2
    # 2 Sample2 Cluster3
    # 3 Sample3 Cluster3

#2 plicqrtu

With the tidyverse, you can stack the Sample columns and then apply slice_max() by group.

    library(dplyr)   # the `by` argument of slice_max() needs dplyr >= 1.1.0
    library(tidyr)

    df %>%
      pivot_longer(-Group) %>%
      slice_max(value, by = name)
    # A tibble: 3 × 3
    #   Group    name    value
    #   <chr>    <chr>   <dbl>
    # 1 Cluster2 Sample1  0.4
    # 2 Cluster3 Sample2  0.9
    # 3 Cluster3 Sample3  0.92

You can decide whether to keep ties by adjusting the with_ties argument of slice_max() (TRUE by default).
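
To make the effect of with_ties concrete, here is a minimal sketch on made-up data. The data frame df_ties is hypothetical (it is not from the question) and was chosen so that Sample1 has the same maximum in two clusters; the code assumes dplyr >= 1.1.0 for the `by` argument.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: Sample1 reaches 0.4 in both Cluster1 and Cluster2
df_ties <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.4, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9)
)

long <- pivot_longer(df_ties, -Group)

# Default (with_ties = TRUE): both tied rows for Sample1 are returned
keep_ties <- slice_max(long, value, by = name)

# with_ties = FALSE: exactly one row per sample
drop_ties <- slice_max(long, value, by = name, with_ties = FALSE)

nrow(keep_ties)  # 3
nrow(drop_ties)  # 2
```

If every sample must map to exactly one cluster, with_ties = FALSE is probably the safer setting.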

#3 txu3uszq

We can use which.max inside summarise and, if needed, finish with pivot_longer. This way we do not need group_by, which would be slower when there are many sample columns.

    library(dplyr)
    library(tidyr)

    # starts_with() ignores case by default, so "sample" matches Sample1..Sample3
    df |>
      summarise(across(starts_with("sample"),
                       ~Group[which.max(.x)])) |>
      pivot_longer(everything())
    # A tibble: 3 × 2
    #   name    value
    #   <chr>   <chr>
    # 1 Sample1 Cluster2
    # 2 Sample2 Cluster3
    # 3 Sample3 Cluster3

#4 eimct9ow

Using max.col + t, we can build the data frame like this:

    data.frame(
      Sample = names(df)[-1],
      Group = df$Group[max.col(t(df[-1]))]
    )

which gives

       Sample    Group
    1 Sample1 Cluster2
    2 Sample2 Cluster3
    3 Sample3 Cluster3
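
One caveat worth sketching: max.col() breaks ties at random by default (ties.method = "random"), so repeated runs can disagree when a sample has the same maximum in several clusters. The example below uses hypothetical data (df_ties is not from the question) and passes ties.method = "first" to make the result deterministic.

```r
# Hypothetical data: Sample1 reaches 0.4 in both Cluster1 and Cluster2
df_ties <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.4, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9)
)

# t() puts samples in rows, so max.col() returns the winning cluster index per sample
m <- t(df_ties[-1])

# ties.method = "first" always picks the earliest of the tied clusters
res <- data.frame(
  Sample = names(df_ties)[-1],
  Group  = df_ties$Group[max.col(m, ties.method = "first")]
)

res
#    Sample    Group
# 1 Sample1 Cluster1
# 2 Sample2 Cluster3
```
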

#5 lhcgjxsq

Another way is to use which.max inside lapply to subset df$Group.

    cbind(lapply(df[-1], \(x) df$Group[which.max(x)]))
    #         [,1]
    # Sample1 "Cluster2"
    # Sample2 "Cluster3"
    # Sample3 "Cluster3"

Or use vapply:

    cbind(vapply(df[-1], \(x) df$Group[which.max(x)], ""))

Or compute the row indices and build a data.frame:

    data.frame(Sample = names(df)[-1],
               Group = df$Group[vapply(df[-1], which.max, 0L)])
    #    Sample    Group
    # 1 Sample1 Cluster2
    # 2 Sample2 Cluster3
    # 3 Sample3 Cluster3
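
If this is needed more than once, the index-based variant can be wrapped in a small helper. This is only a sketch: assign_group() and its group_col parameter are hypothetical names, not part of any package, and the function assumes every non-group column is numeric with at least one non-NA value. Because which.max() returns the first maximum, ties resolve to the earliest row.

```r
# Data from the question
df <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.1, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9),
  Sample3 = c(0.1, 0.01, 0.92)
)

# assign_group() is a hypothetical helper name used for illustration
assign_group <- function(data, group_col = "Group") {
  value_cols <- setdiff(names(data), group_col)            # all Sample columns
  idx <- vapply(data[value_cols], which.max, integer(1))   # row of each column's maximum
  data.frame(Sample = value_cols, Group = data[[group_col]][idx])
}

out <- assign_group(df)
out
#    Sample    Group
# 1 Sample1 Cluster2
# 2 Sample2 Cluster3
# 3 Sample3 Cluster3
```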

Benchmark

    library(dplyr)
    library(tidyr)

    # check = FALSE because the answers return differently shaped results
    bench::mark(check = FALSE,
      "Darren Tsai" = {df %>%
          pivot_longer(-Group) %>%
          slice_max(value, by = name)},
      Sotos = {df %>%
          pivot_longer(-1) %>%
          group_by(name) %>%
          summarise(Group = Group[value == max(value)])},
      GuedesBF = {df |> summarise(across(starts_with("sample"),
                                         ~Group[which.max(.x)])) |>
          pivot_longer(everything())},
      ThomasIsCoding = data.frame(
        Sample = names(df)[-1],
        Group = df$Group[max.col(t(df[-1]))]
      ),
      GKi1 = cbind(lapply(df[-1], \(x) df$Group[which.max(x)])),
      GKi2 = cbind(vapply(df[-1], \(x) df$Group[which.max(x)], "")),
      GKi3 = data.frame(Sample = names(df)[-1],
                        Group = df$Group[vapply(df[-1], which.max, 0L)])
    )

Results

      expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
      <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
    1 Darren Tsai      5.66ms   5.74ms      173.    3.31MB     8.55    81     4
    2 Sotos            6.42ms   6.56ms      152.    1.11MB    11.1     68     5
    3 GuedesBF         5.81ms   5.91ms      169.  402.38KB     8.55    79     4
    4 ThomasIsCoding 292.09µs 308.49µs     3204.   96.38KB    10.3   1563     5
    5 GKi1             35.3µs  38.89µs    25198.    11.1KB    12.6   9995     5
    6 GKi2            36.09µs  39.67µs    24528.    7.96KB    14.7   9994     6
    7 GKi3           215.37µs 225.92µs     4348.        0B    10.2   2123     5