R: Assign data frame column names to specific groups based on the highest value in each column

3lxsmp7m  posted on 2023-05-11 in Other

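I want to assign each column to a specific group, i.e. Cluster1, Cluster2, or Cluster3, based on which row holds the column's highest value. Function suggestions (dplyr?) would be appreciated.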

Group                Sample1         Sample2        Sample3
1 Cluster1             0.1             0               0.1
2 Cluster2             0.4             0.3             0.01
3 Cluster3             0               0.9             0.92

Expected output

Sample1 Cluster2
Sample2 Cluster3
Sample3 Cluster3

Data

df <- structure(
  list(
    Group   = c("Cluster1", "Cluster2", "Cluster3"),
    Sample1 = c(0.1, 0.4, 0),
    Sample2 = c(0, 0.3, 0.9),
    Sample3 = c(0.1, 0.01, 0.92)
  ),
  class = "data.frame", row.names = c("1", "2", "3")
)

pexxcrt2  #1

Convert to long format first, then aggregate, i.e.

library(dplyr)
library(tidyr)

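df %>% 
  # stack every column except the first (Group) into name/value pairs
  pivot_longer(-1) %>% 
  group_by(name) %>% 
  # keep the Group whose value equals the maximum within each sample
  summarise(Group = Group[value == max(value)])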

# A tibble: 3 × 2
  name    Group   
  <chr>   <chr>   
1 Sample1 Cluster2
2 Sample2 Cluster3
3 Sample3 Cluster3

plicqrtu  #2

With the tidyverse, you can stack the Sample columns and then apply slice_max() by group.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-Group) %>%
  slice_max(value, by = name)

# # A tibble: 3 × 3
#   Group    name    value
#   <chr>    <chr>   <dbl>
# 1 Cluster2 Sample1  0.4 
# 2 Cluster3 Sample2  0.9 
# 3 Cluster3 Sample3  0.92

You can decide whether to keep ties by adjusting the with_ties argument of slice_max() (TRUE by default).
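
To see what with_ties changes, here is a small sketch on a hypothetical variant of the data (df_ties is made up for illustration and is not part of the question) where Sample1 reaches the same maximum in two clusters. It assumes dplyr 1.1.0 or later, where slice_max() accepts the by argument.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: Sample1 peaks at 0.4 in both Cluster1 and Cluster2
df_ties <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.4, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9)
)

long <- pivot_longer(df_ties, -Group)

# Default (with_ties = TRUE): both tied rows for Sample1 are kept
keep_ties <- slice_max(long, value, by = name)

# with_ties = FALSE: exactly one row per sample
one_each <- slice_max(long, value, by = name, with_ties = FALSE)

nrow(keep_ties)  # 3
nrow(one_each)   # 2
```

If every sample must map to exactly one group, with_ties = FALSE is probably the safer choice.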


txu3uszq  #3

We can use which.max inside summarise and, if needed, finish with pivot_longer. This way we avoid group_by, which would be slower when there are many sample columns.

library(dplyr)
library(tidyr)

df |> summarise(across(starts_with("sample"),
                       ~Group[which.max(.x)])) |> 
      pivot_longer(everything())

# A tibble: 3 × 2
  name    value   
  <chr>   <chr>   
1 Sample1 Cluster2
2 Sample2 Cluster3
3 Sample3 Cluster3
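
Note that the selector above is written as "sample" in lowercase while the columns are named Sample1 to Sample3. This works because starts_with() ignores case by default. A quick check, rebuilding the df from the question:

```r
library(dplyr)

df <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.1, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9),
  Sample3 = c(0.1, 0.01, 0.92)
)

# ignore.case = TRUE is the default, so "sample" matches "Sample1" etc.
matched <- names(select(df, starts_with("sample")))

# With ignore.case = FALSE the lowercase prefix matches nothing
strict <- names(select(df, starts_with("sample", ignore.case = FALSE)))

matched         # "Sample1" "Sample2" "Sample3"
length(strict)  # 0
```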

eimct9ow  #4

Using max.col + t, we can create a data frame like

data.frame(
    Sample = names(df)[-1],
    Group = df$Group[max.col(t(df[-1]))]
)

which gives

   Sample    Group
1 Sample1 Cluster2
2 Sample2 Cluster3
3 Sample3 Cluster3
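
One caveat worth knowing: max.col() breaks ties at random by default (ties.method = "random"), so repeated runs can differ when a column reaches the same maximum in two clusters. A minimal sketch on hypothetical tied data (df_ties is made up for illustration), using ties.method = "first" for a reproducible result:

```r
# Hypothetical data: Sample1 peaks at 0.4 in both Cluster1 and Cluster2
df_ties <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.4, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9)
)

# t() puts samples in rows and clusters in columns,
# so max.col() returns one cluster index per sample
m <- t(df_ties[-1])

res <- data.frame(
  Sample = names(df_ties)[-1],
  Group  = df_ties$Group[max.col(m, ties.method = "first")]
)
res
#    Sample    Group
# 1 Sample1 Cluster1
# 2 Sample2 Cluster3
```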

lhcgjxsq  #5

Another way is to use which.max inside lapply to subset df$Group

cbind(lapply(df[-1], \(x) df$Group[which.max(x)]))
#        [,1]      
#Sample1 "Cluster2"
#Sample2 "Cluster3"
#Sample3 "Cluster3"

Or using vapply

cbind(vapply(df[-1], \(x) df$Group[which.max(x)], ""))

Or use the indices and create a data.frame

data.frame(Sample = names(df)[-1],
  Group = df$Group[vapply(df[-1], which.max, 0L)])
#   Sample    Group
#1 Sample1 Cluster2
#2 Sample2 Cluster3
#3 Sample3 Cluster3
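
If this is needed in more than one place, the index-based variant can be wrapped in a small function. The sketch below is one possible shape: the name assign_groups and its group_col argument are hypothetical, and it assumes every non-group column is numeric with at least one non-missing value. which.max() skips NA values and returns the first maximum on ties.

```r
# Hypothetical helper: map each sample column to the group holding its maximum
assign_groups <- function(data, group_col = "Group") {
  samples <- setdiff(names(data), group_col)
  # Row index of the (first) maximum in each sample column
  idx <- vapply(data[samples], which.max, integer(1))
  data.frame(Sample = samples, Group = data[[group_col]][idx])
}

df <- data.frame(
  Group   = c("Cluster1", "Cluster2", "Cluster3"),
  Sample1 = c(0.1, 0.4, 0),
  Sample2 = c(0, 0.3, 0.9),
  Sample3 = c(0.1, 0.01, 0.92)
)

out <- assign_groups(df)
out
#    Sample    Group
# 1 Sample1 Cluster2
# 2 Sample2 Cluster3
# 3 Sample3 Cluster3
```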

Benchmark

library(dplyr)
library(tidyr)

bench::mark(
  check = FALSE,
  "Darren Tsai" = {
    df %>%
      pivot_longer(-Group) %>%
      slice_max(value, by = name)
  },
  Sotos = {
    df %>%
      pivot_longer(-1) %>%
      group_by(name) %>%
      summarise(Group = Group[value == max(value)])
  },
  GuedesBF = {
    df |>
      summarise(across(starts_with("sample"), ~Group[which.max(.x)])) |>
      pivot_longer(everything())
  },
  ThomasIsCoding = data.frame(
    Sample = names(df)[-1],
    Group = df$Group[max.col(t(df[-1]))]
  ),
  GKi1 = cbind(lapply(df[-1], \(x) df$Group[which.max(x)])),
  GKi2 = cbind(vapply(df[-1], \(x) df$Group[which.max(x)], "")),
  GKi3 = data.frame(
    Sample = names(df)[-1],
    Group = df$Group[vapply(df[-1], which.max, 0L)]
  )
)

Result

  expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 Darren Tsai      5.66ms   5.74ms      173.    3.31MB     8.55    81     4
2 Sotos            6.42ms   6.56ms      152.    1.11MB    11.1     68     5
3 GuedesBF         5.81ms   5.91ms      169.  402.38KB     8.55    79     4
4 ThomasIsCoding 292.09µs 308.49µs     3204.   96.38KB    10.3   1563     5
5 GKi1             35.3µs  38.89µs    25198.    11.1KB    12.6   9995     5
6 GKi2            36.09µs  39.67µs    24528.    7.96KB    14.7   9994     6
7 GKi3           215.37µs 225.92µs     4348.        0B    10.2   2123     5
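
Keep in mind that these timings are for a 3 × 3 data frame, so they mostly reflect fixed per-call overhead rather than how each approach scales. Below is a sketch for building a larger input to rerun the comparison on. The sizes, the seed, and the name big are arbitrary assumptions, not part of the original benchmark.

```r
set.seed(1)
n_groups  <- 50    # assumed number of clusters (rows)
n_samples <- 1000  # assumed number of sample columns

big <- data.frame(
  Group = paste0("Cluster", seq_len(n_groups)),
  matrix(runif(n_groups * n_samples), nrow = n_groups,
         dimnames = list(NULL, paste0("Sample", seq_len(n_samples))))
)

# Index-based base R variant on the larger input
idx <- vapply(big[-1], which.max, 0L)
res <- data.frame(Sample = names(big)[-1], Group = big$Group[idx])

# Sanity check: the value at each chosen row is that column's maximum
chosen <- mapply(function(col, i) col[i], big[-1], idx)
stopifnot(all(chosen == vapply(big[-1], max, 0)))
```

big can then be substituted for df in the bench::mark() call to see whether the ranking holds at a more realistic size.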
