grepl across a list and a text column in a data frame (combining grepl, strsplit and apply)

Asked by hjqgdpho on 2023-01-15, posted in: Other

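I want to know whether a string from one column occurs in a range of other columns. The search column can contain several strings, separated by ", ". I call them "search terms". I do not care whether one or several terms are found, only whether at least one of them also occurs in those columns. Here is some mock data: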

df <- data.frame(
  a=c("a","b","c, d","d"), 
  b=c(NA, "k", NA,"k"), 
  c=c("c1","c2","c3","c4, c5"), 
  search_terms=c("a",NA,"c, a","a, c5"))
df

     a    b      c search_terms
1    a <NA>     c1            a
2    b    k     c2         <NA>
3 c, d <NA>     c3         c, a
4    d    k c4, c5        a, c5

The result I am hoping for is:

                      test
1 search term found in a_c
2                     <NA>
3 search term found in a_c
4 search term found in a_c

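Explanation:
1. The search term "a" is in column a
2. There is no search term
3. The search term "c" is in column a
4. The search term "c5" is in column c

I can search for one fixed string across all substrings of the search column. The code below correctly identifies "c5" in row 4, but it does not do any row-wise matching.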

library(dplyr)
df %>% mutate(test=ifelse(sapply(strsplit(df$search_terms, ", "), 
                                 function(x) {any(x == "c5")}),
                          "search term found in a_c",NA)) %>%
  select(test)

                      test
1                     <NA>
2                     <NA>
3                     <NA>
4 search term found in a_c

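I also managed to check for a row-wise match, but not when the input is a list of strings. This code correctly identifies the first match, but not the third or the fourth.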

df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>% 
  mutate(test=ifelse(mapply(grepl, search_terms,a_c),
                      "search term found in a_c",NA))%>%
  select(test)

                      test
1 search term found in a_c
2                     <NA>
3                     <NA>
4                     <NA>

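I hoped to combine the two along the lines of the code below, but grepl only uses the first element. It therefore correctly identifies the first and third matches, but not the match in row 4. Why does any() not work here when it does work in the first code block?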

df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>% 
  mutate(test=ifelse(apply(.,1,function(x) {
    sapply(strsplit(x["search_terms"],", "), function(y) {
      any(grepl(y,x["a_c"]))
      })
    }),"search term found in a_c",NA)
    ) %>%
  select(test)

                      test
1 search term found in a_c
2                     <NA>
3 search term found in a_c
4                     <NA>

Warning messages:
1: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used 
2: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used

Answer 1 (eqzww0vc)

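There was an answer that now seems to have been deleted. It did not work the way I wanted, but it provided some key insights for the solution below: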

library(dplyr)
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
 mutate(test=ifelse(apply(.,1,function(x) {
   sapply(strsplit(x["search_terms"],", "), function(y) {
     any(sapply(y, function(z) grepl(z,x["a_c"])))
   })
 }),"search_term in a_c",NA)
 ) %>%
 select(test)

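The problem is that strsplit returns a list, and each element of that list is a character vector that can hold several search terms. grepl only uses the first element of its pattern argument, so an sapply call is needed inside any() to test every term. As far as I understand these nested *apply calls: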

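  1. apply(., 1, ...) makes sure the following function(x) is applied to each row of df.
  2. sapply(strsplit(...)) applies strsplit and outputs a list of search terms. function(y) is applied to each list of search terms.
  3. any(sapply(...)) applies grepl to every single term in the list of search terms.

However, I do not fully understand the logic, and I have the impression that there is a simpler way to solve this with fewer *apply functions. I can also imagine that there is a cleaner tidyr approach. Still, the function gives me the output I want, including on the slightly more complex df defined further down.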
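To make the three steps above more concrete, here is a minimal base R sketch of what happens for a single row. The values mirror row 4 of the first df in the question. The names terms, haystack and hits are only illustrative and do not appear in the original code.

```r
# Values taken from row 4 of the question's first df (illustrative names).
terms <- strsplit("a, c5", ", ")   # a list holding ONE character vector
str(terms)                         # List of 1: chr [1:2] "a" "c5"

haystack <- "d,k,c4, c5"           # the united a_c value for that row

# grepl() works with a single pattern. Given c("a", "c5") it would only
# test "a" (see the warning quoted in the question), which is the same as:
grepl(terms[[1]][1], haystack)     # FALSE, "c5" is never tested

# Looping over the terms tests each one separately:
hits <- sapply(terms[[1]], function(z) grepl(z, haystack))
hits                               # a = FALSE, c5 = TRUE
any(hits)                          # TRUE
```

This is why any() alone did not help in the question: it was only ever handed the single result that grepl produced for the first term.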
df <- data.frame(
  a=c("a","b","c, d","d", "e"),
  b=c(NA, "x", NA,"y", NA),
  c=c("c1","c2","c3","c4, c5", "c5, c6"),
  search_terms=c("a",NA,"c, a","x", "l, c6"),stringsAsFactors = FALSE)

library(dplyr)
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
  mutate(test=ifelse(apply(.,1,function(x) {
    sapply(strsplit(x["search_terms"],", "), function(y) {
      any(sapply(y, function(z) grepl(z,x["a_c"])))
    })
  }),"search_term in a_c",NA)
  ) %>%
  select(-a_c)

     a    b      c search_terms               test
1    a <NA>     c1            a search_term in a_c
2    b    x     c2         <NA>               <NA>
3 c, d <NA>     c3         c, a search_term in a_c
4    d    y c4, c5            x               <NA>
5    e <NA> c5, c6        l, c6 search_term in a_c
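As for a simpler way with fewer *apply calls, one possible sketch in base R is to split the terms once and walk over terms and text in parallel with mapply. This is not the method from the deleted answer. term_found is a hypothetical helper introduced only for this illustration, and the sketch assumes the search terms are literal strings, so it uses fixed = TRUE, whereas the code above treats them as regular expressions.

```r
df <- data.frame(
  a = c("a", "b", "c, d", "d", "e"),
  b = c(NA, "x", NA, "y", NA),
  c = c("c1", "c2", "c3", "c4, c5", "c5, c6"),
  search_terms = c("a", NA, "c, a", "x", "l, c6"),
  stringsAsFactors = FALSE
)

# Paste columns a to c row by row, skipping NA cells
# (comparable to tidyr::unite(..., na.rm = TRUE)).
a_c <- apply(df[c("a", "b", "c")], 1,
             function(r) paste(r[!is.na(r)], collapse = ","))

# Hypothetical helper: TRUE if any term occurs in `text`,
# NA when the row has no search terms at all.
term_found <- function(terms, text) {
  if (all(is.na(terms))) return(NA)
  any(vapply(terms, function(z) grepl(z, text, fixed = TRUE), logical(1)))
}

# strsplit() yields one character vector of terms per row;
# mapply() pairs each of them with the united text of the same row.
found <- mapply(term_found, strsplit(df$search_terms, ", "), a_c)

df$test <- ifelse(found, "search_term in a_c", NA)
df
```

Like the nested version, this is a substring match, so the term "c" also matches "c3". If whole-term matching is needed, the text columns would have to be split as well and compared with %in%.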
