I have a data frame df, and I want to create a new column filled with strings that depend on which keywords/strings are found in the 'title' column.
library(tidyverse)
df <- tibble::tibble(
id = c(36933814, 36921141, 39601489, 36898335, 36432859, 33447951),
treatment_modalities = c("HIFU", "UAE", "UAE; RFA; HIFU", "UAE; RFA", "UAE; HIFU", "UAE"),
no_patients = c(32, NA, 152, NA, 15, 428),
year = c(2023, 2022, 2023, 2023, 2023, 2023),
title = c(
"Title with keyword1 and keyword2 inside of it.",
"A second title with kword3 and keyword1 inside of it.",
"Here we have kword4 and nothing else to see.",
"A title with kword4 and kword3 inside of it.",
"And one with keyword1, keyword2 and kword3 in it.",
"This title does not contain a keyword."
),
)
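I can detect and write the first keyword/string that is found, but of course case_when stops at the first match, so any other potential detections are never triggered: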
df2 <- df %>%
  mutate(title_keyword = case_when(
    # inside mutate() refer to the column directly, not via df$title
    str_detect(title, regex("keyword1", ignore_case = TRUE)) ~ "k1",
    str_detect(title, regex("keyword2", ignore_case = TRUE)) ~ "k2",
    str_detect(title, regex("kword3", ignore_case = TRUE)) ~ "k3",
    str_detect(title, regex("kword4", ignore_case = TRUE)) ~ "k4",
    TRUE ~ NA_character_), .after = year)
Is case_when the wrong helper function here? Maybe mutate with if_else and paste instead?
The expected output would be:
tibble::tibble(
id = c(36933814, 36921141, 39601489, 36898335, 36432859, 33447951),
treatment_modalities = c("HIFU", "UAE", "UAE; RFA; HIFU", "UAE; RFA", "UAE; HIFU", "UAE"),
no_patients = c(32, NA, 152, NA, 15, 428),
year = c(2023, 2022, 2023, 2023, 2023, 2023),
title_keyword = c("k1; k2", "k3; k1", "k4", "k4; k3", "k1; k2; k3", NA),
title = c(
"Title with keyword1 and keyword2 inside of it.", "A second title with kword3 and keyword1 inside of it.",
"Here we have kword4 and nothing else to see.", "A title with kword3 and kword4 inside of it.",
"And one with keyword1, keyword2 and kword3 in it.", "This title does not contain a keyword."
),
)
Thanks for your help!
4 Answers
Answer 1
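You can do this more simply with str_extract_all from the stringr package: first extract all the keywords, then use str_replace_all to replace them. The + in the regular expression matches one or more occurrences of the pattern it follows.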
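A minimal sketch of this idea, not necessarily the answerer's exact code. The pattern k(?:ey)?word\\d+ is an assumption derived from the keywords in the question (its + matches one or more digits), and the df here contains only the title column to keep the example short.

```r
library(dplyr)
library(stringr)
library(purrr)

# Reduced version of the question's data: only the title column
df <- tibble::tibble(title = c(
  "Title with keyword1 and keyword2 inside of it.",
  "A second title with kword3 and keyword1 inside of it.",
  "Here we have kword4 and nothing else to see.",
  "A title with kword4 and kword3 inside of it.",
  "And one with keyword1, keyword2 and kword3 in it.",
  "This title does not contain a keyword."
))

df2 <- df %>%
  mutate(
    title_keyword = title %>%
      # 1. extract every keyword; returns one character vector per title
      str_extract_all(regex("k(?:ey)?word\\d+", ignore_case = TRUE)) %>%
      # 2. collapse each vector into one string ("" when nothing matched)
      map_chr(~ str_flatten(.x, collapse = "; ")) %>%
      # 3. shorten "keyword"/"kword" to "k", keeping the digits
      str_replace_all(regex("k(?:ey)?word", ignore_case = TRUE), "k") %>%
      # 4. titles without any keyword become NA
      na_if("")
  )
```

Because the keywords are extracted from each title in place, the labels come out in the order in which they occur in the title.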
Answer 2
One approach:
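The answer's code is not shown, so the following is only one possible approach, following the if_else and paste idea raised in the question. The helper tag() is hypothetical and introduced purely for illustration.

```r
library(dplyr)
library(stringr)

# Reduced version of the question's data: only the title column
df <- tibble::tibble(title = c(
  "Title with keyword1 and keyword2 inside of it.",
  "A second title with kword3 and keyword1 inside of it.",
  "Here we have kword4 and nothing else to see.",
  "A title with kword4 and kword3 inside of it.",
  "And one with keyword1, keyword2 and kword3 in it.",
  "This title does not contain a keyword."
))

# Hypothetical helper: "label; " where the pattern is found, "" otherwise
tag <- function(x, pattern, label) {
  if_else(str_detect(x, regex(pattern, ignore_case = TRUE)),
          str_c(label, "; "), "")
}

df2 <- df %>%
  mutate(
    title_keyword = str_c(
      tag(title, "keyword1", "k1"),
      tag(title, "keyword2", "k2"),
      tag(title, "kword3", "k3"),
      tag(title, "kword4", "k4")
    ) %>%
      str_remove("; $") %>%  # drop the trailing separator
      na_if("")              # no keyword found -> NA
  )
```

Note that this variant lists the labels in the fixed order k1 to k4, not in the order in which they occur in the title, so the second row yields "k1; k3" rather than "k3; k1".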
Answer 3
You can create separate columns that flag each keyword, then unite them:
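A minimal sketch of the flag-then-unite idea, assuming tidyr::unite with na.rm = TRUE for the merging step. The intermediate column names k1 to k4 are chosen for illustration and are not from the original answer.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Reduced version of the question's data: only the title column
df <- tibble::tibble(title = c(
  "Title with keyword1 and keyword2 inside of it.",
  "A second title with kword3 and keyword1 inside of it.",
  "Here we have kword4 and nothing else to see.",
  "A title with kword4 and kword3 inside of it.",
  "And one with keyword1, keyword2 and kword3 in it.",
  "This title does not contain a keyword."
))

df2 <- df %>%
  # one flag column per keyword: the label if found, NA otherwise
  mutate(
    k1 = if_else(str_detect(title, regex("keyword1", ignore_case = TRUE)), "k1", NA_character_),
    k2 = if_else(str_detect(title, regex("keyword2", ignore_case = TRUE)), "k2", NA_character_),
    k3 = if_else(str_detect(title, regex("kword3", ignore_case = TRUE)), "k3", NA_character_),
    k4 = if_else(str_detect(title, regex("kword4", ignore_case = TRUE)), "k4", NA_character_)
  ) %>%
  # merge the flag columns into one, skipping the NA entries
  unite("title_keyword", k1:k4, sep = "; ", na.rm = TRUE) %>%
  # rows where every flag was NA end up as "", so turn those into NA
  mutate(title_keyword = na_if(title_keyword, ""))
```

As with any fixed set of flag columns, the labels appear in column order (k1 to k4), not in the order of occurrence in the title.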
Answer 4
Update
Based on the comments, here is a different, more general approach that avoids writing out so many conditions:
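A minimal sketch of this approach, assuming purrr::map_chr for the iteration. The vector name keys is an assumption, and the original code may differ in its details.

```r
library(dplyr)
library(stringr)
library(purrr)

# Reduced version of the question's data: only the title column
df <- tibble::tibble(title = c(
  "Title with keyword1 and keyword2 inside of it.",
  "A second title with kword3 and keyword1 inside of it.",
  "Here we have kword4 and nothing else to see.",
  "A title with kword4 and kword3 inside of it.",
  "And one with keyword1, keyword2 and kword3 in it.",
  "This title does not contain a keyword."
))

# Key-value pairs: names are the patterns to match, values are the labels
keys <- c(keyword1 = "k1", keyword2 = "k2", kword3 = "k3", kword4 = "k4")

df2 <- df %>%
  mutate(
    title_keyword = map_chr(title, function(t) {
      # test this one title against all keys at once -> logical vector
      hit <- str_detect(t, regex(names(keys), ignore_case = TRUE))
      # keep the matching values and collapse them into a single string;
      # use collapse = "; " to match the question's expected output exactly
      str_flatten(keys[hit], collapse = ";")
    }) %>%
      na_if("")  # str_flatten gives "" when nothing matched
  )
```

Adding a keyword only requires a new entry in keys. The labels follow the order of keys rather than the order of occurrence in the title.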
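How it works
1. First, create a vector of key-value pairs. The key is what you want to match, and the value is the element you want returned when the key matches.
2. Iterate over title. For each title, we use str_detect to test all of the keys. This returns a logical vector, which we use to index the values.
3. Using str_flatten, we collapse the returned values into a single string ("k1;k2"). When nothing matches, str_flatten returns "", so we use na_if to convert it to NA.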
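Old
Rather than conditionally setting the value of title_keyword, why not directly extract the values you are looking for:
1. Here we match your kword and keyword, and extract the k and the digit at the end (\\d).
2. We paste these captures together and then collapse them, separated by ;.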
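The two steps above can be sketched as follows. The use of str_match_all with two capture groups is an assumption about how the captures were obtained, and the separator "; " is chosen to match the question's expected output.

```r
library(dplyr)
library(stringr)
library(purrr)

# Reduced version of the question's data: only the title column
df <- tibble::tibble(title = c(
  "Title with keyword1 and keyword2 inside of it.",
  "A second title with kword3 and keyword1 inside of it.",
  "Here we have kword4 and nothing else to see.",
  "A title with kword4 and kword3 inside of it.",
  "And one with keyword1, keyword2 and kword3 in it.",
  "This title does not contain a keyword."
))

df2 <- df %>%
  mutate(
    title_keyword = title %>%
      # 1. match "kword"/"keyword" and capture the k and the trailing digit;
      #    each title yields a matrix: full match, capture 1, capture 2
      str_match_all("(k)(?:ey)?word(\\d)") %>%
      # 2. paste the two captures of every match, then collapse per title
      map_chr(~ str_flatten(str_c(.x[, 2], .x[, 3]), collapse = "; ")) %>%
      # titles without a match give "", which becomes NA
      na_if("")
  )
```

This version only works as long as every keyword follows the kword/keyword plus digit naming scheme, which is why the update takes the more general key-value route.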