How to summarise factor columns according to the group_by of another column

Posted by 2nc8po8w on 2023-06-19 in Other


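I have a data frame with, say, 3 factor columns: cluster, sex and vaccine. I would like to get a summary of the sex and vaccine columns grouped by the cluster column in an "automatic" way rather than one column at a time.
For example, to get one of them, this works: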

library(dplyr)

# note: cluster is drawn before any seed is set, so it differs between runs
cluster <- sample(1:4, size = 20, replace = TRUE)
set.seed(10)
sex <- sample(0:1, size = 20, replace = TRUE)
set.seed(20)
vaccine <- sample(0:1, size = 20, replace = TRUE)
df <- as.data.frame(cbind(cluster, sex, vaccine))
df <- as.data.frame(lapply(df, as.factor))
df %>%
  group_by(cluster, sex) %>%
  summarize(count = n())

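But that means I have to write this for every variable (in the real code there are 40 factors I want summaries of). I also tried the following, to do it automatically for all variables: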

df %>% 
  group_by(cluster) %>% 
  summarize(across(everything(), count = n()))

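But this gives me the following error:
Error in `summarize()`:
ℹ In argument: `across(everything(), count = n())`.
ℹ In group 1: `cluster = 1`.
Caused by error in `across()`:
! `...` must be empty.
✖ Problematic argument:
• count = n()
Is there any way to get the counts or percentages of all the other factor columns grouped by one column? (The output I want would look something like this, or show the percentage of each category.)
[image of the desired output not preserved]
Thanks in advance.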

Source: https://stackoverflow.com/questions/76472170/how-to-summarise-factor-columns-according-to-the-group-by-of-another-column

2 Answers


c3frrgcw  #1

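Here is one approach, but it only works because the sex and vaccine variables have the same number of factor levels. If the data set contains a variable with a different number of levels, the call to list_cbind will fail because the per-variable summaries have different numbers of rows.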

library(dplyr)
library(purrr)

set.seed(10)
cluster <- sample(1:4, size = 20, replace = TRUE)
sex <- sample(0:1, size = 20, replace = TRUE)
vaccine <- sample(0:1, size = 20, replace = TRUE)
df <- as.data.frame(cbind(cluster, sex, vaccine))
df <- as.data.frame(lapply(df, as.factor))

# function to calculate the counts by variable, this assumes that `cluster` will be a grouping variable in all cases.
# `arrange(cluster)` ensures that resultant data frames are in the same row order provided that the number of factor levels for each variable is the same.

fun <- function(var) {
  # create a summary column name based on the input variable name, e.g. "count_sex"
  sum_var <- paste0("count_", var)

  # `var` is a column name held as a string, so select it with all_of()
  df |>
    summarise(!!sum_var := n(), .by = c(cluster, all_of(var))) |>
    arrange(cluster)
}

# vector of all other column names in the data set
arg1 <- names(df)[-1]

# use purrr::map to loop through the variables and bind the resulting  dataframes. Finally a bit of tidying up to remove the duplicated cluster columns. 
 
map(arg1, fun) |> 
  list_cbind() |> 
  rename(cluster = cluster...1) |> 
  select(-starts_with("cluster..."))

#> New names:
#> • `cluster` -> `cluster...1`
#> • `cluster` -> `cluster...4`
#>   cluster sex count_sex vaccine count_vaccine
#> 1       1   1         1       1             1
#> 2       1   0         1       0             1
#> 3       2   1         4       1             3
#> 4       2   0         1       0             2
#> 5       3   0         6       0             3
#> 6       3   1         2       1             5
#> 7       4   0         3       1             2
#> 8       4   1         2       0             3
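The limitation described at the top of this answer (variables with different numbers of levels) can be sidestepped by stacking the results in long format instead of binding them side by side. The following is only a minimal sketch, not part of the original answer: it assumes the tidyr package is available, and the toy data frame with its three-level `dose` column is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `dose` has three levels, `sex` only two
df <- data.frame(
  cluster = factor(c(1, 1, 1, 2, 2, 2)),
  sex     = factor(c(0, 1, 1, 0, 0, 1)),
  dose    = factor(c("a", "b", "c", "a", "a", "b"))
)

long_counts <- df |>
  # convert to character so columns with different levels can share one column
  mutate(across(-cluster, as.character)) |>
  # one row per (observation, variable) pair
  pivot_longer(-cluster, names_to = "variable", values_to = "level") |>
  # count every cluster / variable / level combination that occurs
  count(cluster, variable, level, name = "count")

long_counts
```

Because each variable contributes its own rows, the number of levels per variable no longer has to match, and the same call should scale to 40 factor columns without naming them.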

Created on 2023-06-14 with reprex v2.0.2


5lwkijsr  #2

Try using merge on the resulting list of tibbles. An asymmetric data set is used here for illustration.
colnames(df)[-1] excludes cluster, leaving sex and vaccine to be counted.

library(dplyr)

setNames(
  purrr::reduce(
    lapply(colnames(df)[-1], \(x)
      df %>%
        group_by(cluster, !!rlang::sym(x)) %>%
        count()),
    merge, by = 1:2, all = TRUE
  ),
  c(colnames(df)[1], "values", paste0(colnames(df)[-1], "_count"))
)
  cluster values sex_count vaccine_count
1       1      0         1             4
2       1      1         4             1
3       2      0         4             2
4       2      1         1             3
5       3      0         5             5
6       3      1         3             3
7       4      0        NA             1
8       4      1         2             1

Data

df <- structure(list(cluster = structure(c(2L, 3L, 1L, 3L, 3L, 1L, 
1L, 1L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 1L, 4L, 4L, 2L), levels = c("1", 
"2", "3", "4"), class = "factor"), sex = structure(c(1L, 1L, 
2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 
2L, 1L), levels = c("0", "1"), class = "factor"), vaccine = structure(c(2L, 
1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L), levels = c("0", "1"), class = "factor")), class = "data.frame", 
row.names = c(NA, -20L))
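The question also asks for percentages, which the snippets above do not compute. As a hedged sketch using only base R (not taken from either answer), row-wise percentages within each cluster can be obtained with `table()` and `prop.table()`. The small data frame below is hypothetical and assumes, as in the question, that `cluster` is the first column.

```r
# Hypothetical toy data with `cluster` as the first column
df <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2)),
  sex     = factor(c(0, 1, 1, 1, 0)),
  vaccine = factor(c(1, 1, 0, 1, 0))
)

# For every column except `cluster`, cross-tabulate against cluster and
# convert each row (one cluster) to percentages that sum to 100
pct <- lapply(df[-1], function(x) {
  100 * prop.table(table(cluster = df$cluster, level = x), margin = 1)
})

pct$sex      # percentage of each sex level within each cluster
pct$vaccine  # percentage of each vaccine level within each cluster
```

The result is a named list with one table per factor column, so columns with different numbers of levels are handled independently.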
