How to create a new column in R based on the mode of other columns in a data frame

gt0wga4j posted on 2023-01-28 in Other

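I have a data frame like this:

| id | w1 | w2 | w3 | w4 | w5 | w6 |
| --- | --- | --- | --- | --- | --- | --- |
| 11 | light | light | light | light | light | light |
| 22 | light | light | light | light | medium | medium |
| 33 | light | light | medium | medium | medium | heavy |
| 44 | light | light | medium | NA | NA | NA |
| 55 | light | light | medium | medium | NA | NA |
| 66 | medium | medium | medium | NA | NA | NA |

For each id, I want the frequency counts of light, medium, and heavy across w1-w6, and I want the mode of w1-w6 as a new column.
The target df should look like this:

| id | w1 | w2 | w3 | w4 | w5 | w6 | N_light | N_medium | N_heavy | final |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11 | light | light | light | light | light | light | 6 | 0 | 0 | light |
| 22 | light | light | light | light | medium | medium | 4 | 2 | 0 | light |
| 33 | light | light | medium | medium | medium | heavy | 2 | 3 | 1 | medium |
| 44 | light | light | medium | NA | NA | NA | 2 | 1 | 0 | light |
| 55 | light | light | medium | medium | NA | NA | 2 | 2 | 0 | light |
| 66 | medium | medium | medium | NA | NA | NA | 0 | 3 | 0 | medium |

The real data frame has millions of rows, and I am struggling to find an efficient way to do this. Any ideas?
I tried the Mode function from the DescTools library in a for loop. It works on a limited number of rows, but it is too slow to run on the full data.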

Answer #1 (piv4azn7)

I know the question asks for dplyr, but in case others find base R useful: you can simply index the columns and use the *apply functions.

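# Distinct non-NA values across all columns except `id`
xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
# or xx <- c("light", "medium", "heavy")
newnames <- paste0("N_", xx)

# Per-row count of each value
df[newnames] <- sapply(xx, 
                       function(x) rowSums(df[,-1] == x, 
                                           na.rm = TRUE))
# Most frequent value per row; which.max() keeps the first one on ties
df["final"] <- xx[apply(df[newnames], 1, which.max)]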

Output:

  id     w1     w2     w3     w4     w5     w6 N_light N_medium N_heavy  final
1 11  light  light  light  light  light  light       6        0       0  light
2 22  light  light  light  light medium medium       4        2       0  light
3 33  light  light medium medium medium  heavy       2        3       1 medium
4 44  light  light medium   <NA>   <NA>   <NA>       2        1       0  light
5 55  light  light medium medium   <NA>   <NA>       2        2       0  light
6 66 medium medium medium   <NA>   <NA>   <NA>       0        3       0 medium
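
For anyone who wants to run this answer against the question's table, here is a minimal self-contained sketch. The data frame is retyped by hand from the question, and `max.col(..., ties.method = "first")` is swapped in for the row-wise `apply(..., 1, which.max)` on the assumption that a vectorised call may scale better to millions of rows; this has not been benchmarked here. The names `lev`, `m`, `counts`, and `res` are illustrative only.

```r
# Question's example data, retyped by hand
df <- data.frame(
  id = c(11, 22, 33, 44, 55, 66),
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", "medium", "medium", "medium"),
  w4 = c("light", "light", "medium", NA, "medium", NA),
  w5 = c("light", "medium", "medium", NA, NA, NA),
  w6 = c("light", "medium", "heavy", NA, NA, NA)
)

lev <- c("light", "medium", "heavy")  # fixed order, also the tie-break priority
m <- as.matrix(df[-1])                # character matrix of w1..w6

# One vectorised rowSums() call per level
counts <- sapply(lev, function(x) rowSums(m == x, na.rm = TRUE))
colnames(counts) <- paste0("N_", lev)

# Column index of the first maximum in each row, without a row-wise apply()
final <- lev[max.col(counts, ties.method = "first")]

res <- cbind(df, counts, final = final)
print(res)
```

One caveat: a row consisting only of `NA` has all counts equal to zero, so both this sketch and the original `which.max()` version would label it `light`.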

Answer #2 (uplii1fm)

In base R, you can do the following:

a <- table(cbind(dat[1], stack(dat, -1))[1:2])
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a)])

   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2 medium
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium
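
Note that row 55 in this output ends up as `medium`, whereas the question's target table has `light`. `max.col()` breaks ties at random by default (`ties.method = "random"`), so the 2-versus-2 tie can go either way between runs. Below is a minimal sketch of a deterministic variant, assuming ties should prefer `light`, then `medium`, then `heavy` (my reading of the target table). The matrix `a` is typed by hand to mimic the counts shown above rather than produced by the answer's code, and `prio` is an illustrative name.

```r
# Count matrix mimicking the output above (columns in alphabetical order)
a <- matrix(c(0, 6, 0,
              0, 4, 2,
              1, 2, 3,
              0, 2, 1,
              0, 2, 2,
              0, 0, 3),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("11", "22", "33", "44", "55", "66"),
                            c("heavy", "light", "medium")))

# Reorder columns by tie-break priority, then take the first maximum per row
prio <- c("light", "medium", "heavy")
final <- prio[max.col(a[, prio], ties.method = "first")]
print(final)
```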

Answer #3 (qxgroojn)

Here is a tidyverse solution:

library(dplyr)
library(tidyr)

df %>%
  # cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # omit rows with `NA`:
  filter(!is.na(value)) %>%
  # count each distinct value per `id`:
  count(id, value, name = "N") %>%
  # cast back wider, one `N_*` column per value:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA` with 0:
  mutate(across(starts_with("N_"), ~replace_na(., 0))) %>%
  # bind result back to original `df`
  # (relies on every `id` having at least one non-NA value, so rows line up):
  bind_cols(df %>% select(-id), .) %>%
  # reorder columns:
  select(id, everything())
  id     w1     w2     w3     w4 N_light N_medium N_heavy
1  1  light  light  light  light       4        0       0
2  2  light  light  light  light       4        0       0
3  3  light  light medium medium       2        2       0
4  4  light  light   <NA> medium       2        1       0
5  5  light  light medium medium       2        2       0
6  6 medium medium   <NA>  heavy       0        2       1
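
The output above has the counts but not the `final` column the question asks for. One possible tidyverse sketch for that part follows, assuming ties should go to whichever of `light`, `medium`, `heavy` comes first (my assumption, read off the question's target table). It uses this answer's own example data, retyped here so the block runs on its own; `final` and `res` are illustrative names.

```r
library(dplyr)
library(tidyr)

# This answer's example data, retyped by hand
df <- data.frame(
  id = 1:6,
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", NA, "medium", NA),
  w4 = c("light", "light", "medium", "medium", "medium", "heavy")
)

final <- df %>%
  pivot_longer(-id) %>%
  filter(!is.na(value)) %>%
  # factor levels fix the order used to break ties
  mutate(value = factor(value, levels = c("light", "medium", "heavy"))) %>%
  count(id, value, name = "N") %>%
  group_by(id) %>%
  # within each id: highest count first, then level order; keep the top row
  arrange(desc(N), value, .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  transmute(id, final = as.character(value))

# left_join() keeps ids whose values are all NA (their `final` is NA)
res <- left_join(df, final, by = "id")
print(res)
```

Pivoting to long format multiplies the row count by the number of `w` columns, so on millions of rows this is likely to be heavier than a matrix-based base R approach.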
Edit:

If the ultimate aim is to compute the mode of the three new columns, this may be one way to go:

# First define a function for the mode:

getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(table(match(v, uniqv)))]
}

# Second, do as before:

df %>%
  # cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # omit rows with `NA`:
  filter(!is.na(value)) %>%
  # count each distinct value per `id`:
  count(id, value, name = "N") %>%
  # cast back wider, one `N_*` column per value:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA` with 0:
  mutate(across(starts_with("N_"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df %>% select(-id), .) %>%
  select(id, everything()) %>%

  # Third, add to this the computation of the mode
  # (one value per `N_*` column, taken across all rows):
  summarise(across(starts_with("N_"), ~getmode(.)))
  N_light N_medium N_heavy
1       2        2       0
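
To make the behaviour of `getmode()` concrete, here is a small check on hand-made vectors (the second and third inputs are invented for illustration). It suggests that `NA`s are ignored and that, on ties, the value seen first wins, since `which.max()` returns the first maximum.

```r
getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(table(match(v, uniqv)))]
}

m1 <- getmode(c(4, 4, 2, 2, 2, 0))        # the N_light counts from the first output
m2 <- getmode(c("b", "a", NA, "a", "b"))  # tie between "b" and "a"
m3 <- getmode(c(NA, NA, 1))               # NAs do not count as a value

print(list(m1, m2, m3))
```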

Data:

df <- structure(list(
  id = 1:6,
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", NA, "medium", NA),
  w4 = c("light", "light", "medium", "medium", "medium", "heavy")
), class = "data.frame", row.names = c(NA, -6L))
