R: Non-rowwise method to change multiple values in columns depending on whether they appear in other columns

rfbsl7qr  posted on 2023-05-20  in Other

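In the example df below, for each row I want to determine whether each value in b1:b3 appears anywhere in a1:a3. Any b1:b3 value without a match should be changed to NA. Note that the number of a and b columns in df may vary, but the naming convention stays the same: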

df <- structure(list(row = c(1L, 12L, 17L, 44L, 55L, 90L), 
                     b1 = c(55L,17L, 12L, 12L, 1L, 12L), 
                     b2 = c(NA, 44L, 44L, 17L, NA, 17L),
                     b3 = c(NA, 90L, 90L, 90L, NA, 44L), 
                     a1 = c(55L, 17L, 12L, 12L, 1L, 12L), 
                     a2 = c(NA, 44L, 44L, 17L, NA, 28L), 
                     a3 = c(NA, 90L, NA, 90L, NA, 44L)), 
                row.names = c(NA, -6L), 
                class = "data.frame")

Desired output:

row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44

I can produce the desired output with the following for loop:

acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

for(i in 1:nrow(df)) {
  
  for(j in bcols) {
    
    if(df[i, j] %in% df[i, acols]) {
      
      next
      
    } else {
      
      df[i, j] <- NA
      
    }
      
  }

}

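But I am looking for a more efficient approach, because the real data is much larger. Is there a tidyverse or base R alternative? I have searched for approaches based on across(), but I could not adapt any of the solutions to my needs.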

pxiryf3j  #1

Here is a base R solution that should be quite fast:

acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

df[bcols] = lapply(
  df[bcols], 
  \(bb) {
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA_integer_
    bb
  })

df
#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
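
To make the comparison step easier to follow, here is a minimal sketch that inspects the intermediate values for a single column, b3. The names hits, n_match and no_match are introduced only for this illustration; the data is the example df from the question.

```r
# Example data from the question
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)
acols <- grep("^a", colnames(df))

# Compare one b column against every a column at once. The vector df$b3 is
# recycled down each column of df[acols], so the result is a logical matrix
# with one column per a column.
hits <- df$b3 == df[acols]

# Count matches per row; comparisons involving NA are dropped by na.rm = TRUE.
n_match <- rowSums(hits, na.rm = TRUE)
n_match
# values: 0 1 0 1 0 1

# Rows with zero matches are the ones whose b3 value gets set to NA.
no_match <- n_match == 0
```

The lapply() call in the answer repeats this for every b column, so the only explicit loop runs over columns rather than rows.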

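I scaled your data up to 500,000 rows; on that dataset this approach ran about 50 times faster and used about 1/50 of the memory. I suspect the gap will widen as the number of columns grows.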

single_loop = function(df){
  df[bcols] = lapply(
  df[bcols], 
  \(bb) {
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA
    bb
  })
  df
}

double_loop = function(df){
  for (i in 1:nrow(df)) {
    for (j in bcols) {
      if (df[i, j] %in% df[i, acols]) {
        next
      } else {
        df[i, j] <- NA
      }
    }
  }
  df
}
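
# Sample rows with replacement. The comma matters: without it, df[...] selects
# columns and returns a 6-row, 500,000-column data frame. The result column in
# the output below reads <df [6 × ...>, so the timings appear to come from that
# variant and should be treated as indicative only.
df_big = df[sample(1:nrow(df), size = 5e5, replace = TRUE), ]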

bench::mark(
  op = double_loop(df_big),
  gregor = single_loop(df_big)
)
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   
# 1 op          624.8ms  624.8ms      1.60     381MB     9.60     1     6      625ms <df [6 ×…
# 2 gregor       12.8ms   13.4ms     48.5     7.63MB     9.33    26     5      536ms <df [6 ×…
# # … with 3 more variables: memory <list>, time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
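
As a sanity check that is independent of the timings, the sketch below confirms that the vectorised version and the double loop agree on a row-resampled copy of the example data. The function names vectorised and looped, the object df_rows, the seed, and the sample size of 1000 are all choices made for this illustration, not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

# Column-wise version: one pass per b column
vectorised <- function(df) {
  df[bcols] <- lapply(df[bcols], function(bb) {
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] <- NA
    bb
  })
  df
}

# Cell-by-cell version from the question
looped <- function(df) {
  for (i in seq_len(nrow(df))) {
    for (j in bcols) {
      if (!(df[i, j] %in% df[i, acols])) df[i, j] <- NA
    }
  }
  df
}

set.seed(1)  # arbitrary seed, only for reproducibility
# Resample rows (note the trailing comma) and reset the row names
df_rows <- df[sample(nrow(df), size = 1000, replace = TRUE), ]
rownames(df_rows) <- NULL

# Both approaches should produce the same data frame
same_result <- isTRUE(all.equal(vectorised(df_rows), looped(df_rows)))
same_result
```

The sample is kept small so that the loop finishes quickly; the size can be raised to compare timings on data shaped like the real problem.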

bkhjykvo  #2

If you need to check for a value in any position, you can pivot the data to long format and back:

library(tidyr)
library(dplyr)

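df %>%
  pivot_longer(-row) %>%                  # one row per (row, column) pair
  mutate(grp = substr(name, 1L, 1L),      # "a" or "b", taken from the column name
         # Flag b values absent from the a values of the same row. The flags
         # cover only the leading b entries, so this relies on the b columns
         # coming before the a columns; [1:n()] pads the rest with NA, which
         # replace() leaves untouched.
         value = replace(value, !(value[grp == "b"] %in% value[grp == "a"])[1:n()], NA), 
         grp = NULL,
         .by = row) %>%                   # per-row grouping, needs dplyr >= 1.1.0
  pivot_wider()                           # back to one column per name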
  
# A tibble: 6 × 7
    row    b1    b2    b3    a1    a2    a3
  <int> <int> <int> <int> <int> <int> <int>
1     1    55    NA    NA    55    NA    NA
2    12    17    44    90    17    44    90
3    17    12    44    NA    12    44    NA
4    44    12    17    90    12    17    90
5    55     1    NA    NA     1    NA    NA
6    90    12    NA    44    12    28    44

If it is acceptable to check by column pairs (b1 against a1, b2 against a2, and so on), you can do the following:

df %>%
  mutate(replace(pick(b1:b3), pick(a1:a3) != pick(b1:b3) | is.na(pick(a1:a3)), NA))

  row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44
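
For comparison, the same pairwise check can be sketched in base R with matrices. This is an illustrative alternative rather than part of the answer above; the names a, b and same are introduced here, and it assumes the a and b columns line up in the same order.

```r
# Example data from the question
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

# Work on integer matrices so the comparison is a single vectorised step
a <- as.matrix(df[acols])
b <- as.matrix(df[bcols])

# TRUE only where both values are present and equal; the is.na() guards
# keep NA out of the logical index
same <- !is.na(a) & !is.na(b) & a == b

# Blank out every b cell that does not equal its a counterpart
b[!same] <- NA
df[bcols] <- as.data.frame(b)
df
```

Note that a pairwise check is stricter than the any-position check: a value in b1 that appears only in a2 would be set to NA here, although it would be kept by the other approaches. On the example data the two checks happen to give the same result.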
