mutate, case_when and str_detect: triggering more than the first positive case

Posted by 6ju8rftf on 2023-04-27 in Other

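I have a data frame df, and I want to create a new column filled with certain strings depending on which keywords/strings are found in the 'title' column.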

library(tidyverse)

df <- tibble::tibble(
  id = c(36933814, 36921141, 39601489, 36898335, 36432859, 33447951),
  treatment_modalities = c("HIFU", "UAE", "UAE; RFA; HIFU", "UAE; RFA", "UAE; HIFU", "UAE"),
  no_patients = c(32, NA, 152, NA, 15, 428),
  year = c(2023, 2022, 2023, 2023, 2023, 2023),
  title = c(
    "Title with keyword1 and keyword2 inside of it.",
    "A second title with kword3 and keyword1 inside of it.",
    "Here we have kword4 and nothing else to see.",
    "A title with kword4 and kword3 inside of it.",
    "And one with keyword1, keyword2 and kword3 in it.",
    "This title does not contain a keyword."
  ),
)

I can detect and write the first keyword/string that is found, but of course case_when stops there and does not trigger the other potential detections:

df2 <- df %>% 
  mutate(title_keyword = case_when(
    # inside mutate(), refer to the column as title rather than df$title
    str_detect(title, regex("keyword1", ignore_case = TRUE)) ~ "k1",
    str_detect(title, regex("keyword2", ignore_case = TRUE)) ~ "k2",
    str_detect(title, regex("kword3", ignore_case = TRUE)) ~ "k3",
    str_detect(title, regex("kword4", ignore_case = TRUE)) ~ "k4",
    TRUE ~ NA_character_), .after = year)
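
To see the first-match behaviour in isolation, here is a minimal sketch on a made-up two-element vector (titles is a hypothetical example, not part of the data above). case_when evaluates the conditions in order and, for each element, uses the first one that is TRUE:

```r
library(dplyr)
library(stringr)

# Hypothetical mini example: the first title contains two keywords.
titles <- c("keyword1 and keyword2", "only keyword2")

first_hit <- case_when(
  str_detect(titles, "keyword1") ~ "k1",   # checked first
  str_detect(titles, "keyword2") ~ "k2",   # only used where the line above was not TRUE
  TRUE ~ NA_character_
)
first_hit
```

For the first element only "k1" comes back, even though keyword2 is present as well.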

Is case_when the wrong helper function here? Should I perhaps mutate with if_else and paste instead?
The expected output would be:

tibble::tibble(
  id = c(36933814, 36921141, 39601489, 36898335, 36432859, 33447951),
  treatment_modalities = c("HIFU", "UAE", "UAE; RFA; HIFU", "UAE; RFA", "UAE; HIFU", "UAE"),
  no_patients = c(32, NA, 152, NA, 15, 428),
  year = c(2023, 2022, 2023, 2023, 2023, 2023),
  title_keyword = c("k1; k2", "k3; k1", "k4", "k4; k3", "k1; k2; k3", NA),
  title = c(
    "Title with keyword1 and keyword2 inside of it.", "A second title with kword3 and keyword1 inside of it.",
    "Here we have kword4 and nothing else to see.", "A title with kword4 and kword3 inside of it.",
    "And one with keyword1, keyword2 and kword3 in it.", "This title does not contain a keyword."
  ),
)

Thanks for your help!

5vf7fwbs #1

You can do this more simply with the stringr package: first extract all keywords with str_extract_all, then replace them with str_replace_all:

ll <- lapply(str_extract_all(df$title, regex("keyword1+|keyword2+|kword3+|kword4+", ignore_case = TRUE)),
             paste, collapse = "; ")
df$title_keyword <- unlist(lapply(ll, str_replace_all, regex(c("keyword1" = "k1",
                                                     "keyword2" = "k2",
                                                     "kword3" = "k3",
                                                     "kword4" = "k4"),
                                                   ignore_case = TRUE)))

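The + in the regular expression matches one or more repetitions of the character directly before it (here, the trailing digit of each keyword).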
Output (rows without a match get an empty string rather than NA):

# A tibble: 6 × 6
        id treatment_modalities no_patients  year title                                                 title_keyword
     <dbl> <chr>                      <dbl> <dbl> <chr>                                                 <chr>        
1 36933814 HIFU                          32  2023 Title with keyword1 and keyword2 inside of it.        "k1; k2"     
2 36921141 UAE                           NA  2022 A second title with kword3 and keyword1 inside of it. "k3; k1"     
3 39601489 UAE; RFA; HIFU               152  2023 Here we have kword4 and nothing else to see.          "k4"         
4 36898335 UAE; RFA                      NA  2023 A title with kword4 and kword3 inside of it.          "k4; k3"     
5 36432859 UAE; HIFU                     15  2023 And one with keyword1, keyword2 and kword3 in it.     "k1; k2; k3" 
6 33447951 UAE                          428  2023 This title does not contain a keyword.                ""
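
As a side note on the pattern above, here is a small sketch of what + does, using a made-up string. The quantifier applies only to the character directly before it, so keyword1+ means "keyword" followed by one or more 1s. The titles in df contain each digit only once, so the patterns without + should find the same matches there.

```r
library(stringr)

# Hypothetical string to illustrate the quantifier; not taken from df.
with_plus    <- str_extract("keyword111 here", "keyword1+")  # + repeats only the final "1"
without_plus <- str_extract("keyword111 here", "keyword1")   # stops after the first "1"

with_plus
without_plus
```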

jobtbby3 #2

One approach:

paste2 <- function(x,...){
  newx <- x[nzchar(x)]
  out <- paste(newx,...)
  if_else(nzchar(out)==0,NA_character_,out)
}

df2 <- df%>%
  rowwise()%>%
  mutate(key1=ifelse(str_detect(title, regex("keyword1", ignore_case = T)),"k1",""),
         key2=ifelse(str_detect(title, regex("keyword2", ignore_case = T)),"k2",""),
         key3=ifelse(str_detect(title, regex("kword3", ignore_case = T)),"k3",""),
         key4=ifelse(str_detect(title, regex("kword4", ignore_case = T)),"k4",""),
         new=paste2(c(key1,key2,key3,key4),sep=";",collapse=";")) %>%
  select(-starts_with("key")) %>%
  ungroup()
df2

# A tibble: 6 x 6
        id treatment_modalities no_patients  year title                                                 new     
     <dbl> <chr>                      <dbl> <dbl> <chr>                                                 <chr>   
1 36933814 HIFU                          32  2023 Title with keyword1 and keyword2 inside of it.        k1;k2   
2 36921141 UAE                           NA  2022 A second title with kword3 and keyword1 inside of it. k1;k3   
3 39601489 UAE; RFA; HIFU               152  2023 Here we have kword4 and nothing else to see.          k4      
4 36898335 UAE; RFA                      NA  2023 A title with kword4 and kword3 inside of it.          k3;k4   
5 36432859 UAE; HIFU                     15  2023 And one with keyword1, keyword2 and kword3 in it.     k1;k2;k3
6 33447951 UAE                          428  2023 This title does not contain a keyword.                NA

lawou6xi #3

You can flag each keyword in its own column and then combine the columns:

df %>% 
  mutate(
    k1 = case_when(str_detect(title, regex("keyword1", ignore_case = TRUE)) ~ "k1", TRUE ~ NA_character_),
    k2 = case_when(str_detect(title, regex("keyword2", ignore_case = TRUE)) ~ "k2", TRUE ~ NA_character_),
    k3 = case_when(str_detect(title, regex("kword3", ignore_case = TRUE)) ~ "k3", TRUE ~ NA_character_),
    k4 = case_when(str_detect(title, regex("kword4", ignore_case = TRUE)) ~ "k4", TRUE ~ NA_character_)
  ) %>%
  rowwise() %>%
  mutate(
    title_keyword = paste(c(na.omit(k1), na.omit(k2), na.omit(k3), na.omit(k4)), collapse = ";"),
    title_keyword = ifelse(title_keyword == "", NA, title_keyword)
  )
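
A possible variation on the same idea, offered as a sketch rather than as part of the answer: tidyr::unite() with na.rm = TRUE can combine the flag columns without rowwise(). The small tibble flags is a hypothetical stand-in for the k1 to k4 columns created by the first mutate() step.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the flag columns k1..k4 (not the real df).
flags <- tibble(
  k1 = c("k1", NA, NA),
  k2 = c("k2", NA, NA),
  k3 = c(NA, "k3", NA),
  k4 = c(NA, NA, NA_character_)
)

combined <- flags %>%
  # na.rm = TRUE drops NA flags before pasting; k1..k4 are replaced by title_keyword.
  unite("title_keyword", k1:k4, sep = ";", na.rm = TRUE) %>%
  # Rows where every flag was NA end up as "", so convert those to NA.
  mutate(title_keyword = na_if(title_keyword, ""))

combined
```

Unlike the rowwise() version, this also removes the helper columns in the same step.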

mrfwxfqh #4

Update

Based on the comments, here is a different, more general approach that avoids writing out so many conditions:

key <- c("keyword1", "keyword2", "kword3", "kword4")
value <- c("k1", "k2", "k3", "k4")

df |>
  mutate(title_keyword = map_chr(title, ~ str_flatten(value[str_detect(.x, key)], ";")),
         title_keyword = na_if(title_keyword, ""))

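How it works

1. First create vectors for the key-value pairs. key is what you want to match, and value is what is returned when the corresponding key matches.
2. Iterate over title. For each title, str_detect tests all keys and returns a logical vector, which we use to subset value.
3. Use str_flatten to collapse the returned values into a single string ("k1;k2").

  1. If no key matches, str_flatten returns "", so we use na_if to convert it to NA.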
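
The steps above can be sketched for a single title, one intermediate result at a time. The names one_title, hits, matched, flat and no_match are only illustrative. The last two lines are an assumption on my part: wrapping key in regex(..., ignore_case = TRUE) should give the case-insensitive matching used in the question.

```r
library(dplyr)
library(stringr)

key   <- c("keyword1", "keyword2", "kword3", "kword4")
value <- c("k1", "k2", "k3", "k4")

one_title <- "And one with keyword1, keyword2 and kword3 in it."

hits    <- str_detect(one_title, key)   # step 2: one logical per key
matched <- value[hits]                  # step 2: keep the values whose key matched
flat    <- str_flatten(matched, ";")    # step 3: collapse into one string

# Nested step: a title without any key collapses to "", which na_if turns into NA.
no_match <- na_if(str_flatten(value[str_detect("nothing here", key)], ";"), "")

# Assumption: case-insensitive matching via regex(), as in the question.
hits_ci <- str_detect("KEYWORD1 in capitals", regex(key, ignore_case = TRUE))
```

Note that the result follows the order of key, not the order in which the keywords appear in the title.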

Output

        id treatment_modalities no_patients  year title                  title~1
     <dbl> <chr>                      <dbl> <dbl> <chr>                  <chr>  
1 36933814 HIFU                          32  2023 Title with keyword1 a~ k1;k2  
2 36921141 UAE                           NA  2022 A second title with k~ k1;k3  
3 39601489 UAE; RFA; HIFU               152  2023 Here we have kword4 a~ k4     
4 36898335 UAE; RFA                      NA  2023 A title with kword4 a~ k3;k4  
5 36432859 UAE; HIFU                     15  2023 And one with keyword1~ k1;k2;~
6 33447951 UAE                          428  2023 This title does not c~ NA

Instead of setting the value of title_keyword conditionally, why not extract the values you are looking for directly:

df |> 
  mutate(title_keyword = map_chr(str_match_all(title, "(k)e?y?word(\\d)"), ~ str_c(str_c(.x[,2], .x[,3]), collapse = ";")),
         title_keyword = na_if(title_keyword, ""))

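1. Here we match both your kword and keyword variants and capture the k and the trailing digit (\\d).
2. We paste these captures together and then collapse them, separated by ;.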
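
To make the capture groups concrete, here is a minimal sketch for one title from the question. The names m, labels and collapsed are only illustrative. str_match_all returns one character matrix per input string: column 1 holds the full match and the following columns hold the capture groups.

```r
library(stringr)

one_title <- "A second title with kword3 and keyword1 inside of it."

# One row per match; columns: full match, group 1 ("k"), group 2 (the digit).
m <- str_match_all(one_title, "(k)e?y?word(\\d)")[[1]]

labels    <- str_c(m[, 2], m[, 3])          # paste "k" and the digit for each match
collapsed <- str_c(labels, collapse = ";")  # join all labels for this title
collapsed
```

Because the labels are built from the matches themselves, they come out in the order in which the keywords appear in the title.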
