Matching a character vector with multiple patterns in R

Posted by r1zhe5dt on 2023-01-18 in Other


I have two data.tables. The first, DT_1, contains strings and match types, as follows:

library(data.table)  
DT_1 <- data.table(Source_name = c("Apple","Banana","Orange","Pear","Random"),
                   Match_type = c("Anywhere","Beginning","Anywhere","End","End"))

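I then want to return, for each "Source_name" string in DT_1, the first matching column name of DT_2 (shown below), using the match type specified in DT_1. The matching should be case-insensitive.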

DT_2 <- data.table(Pear_1 = 1,Eat_apple = 1,Split_Banana = 1,
                   Pear_2 = 1,Eat_pear = 1,Orange_peel = 1,Banana_chair = 1)

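For example, the string "Apple" can be found anywhere in a column name of DT_2. Its first match is "Eat_apple".
The next string, "Banana", must match at the beginning of a column name. Its first match is "Banana_chair".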
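To make the three match types concrete, here is a minimal sketch in base R using the column names from DT_2. The variable names `anywhere`, `beginning` and `ending` are only illustrative, and `grep()` with `ignore.case = TRUE` is one possible way to get case-insensitive matching, not necessarily the approach used elsewhere in this thread.

```r
# Column names of DT_2, written out so the snippet runs on its own
cols <- c("Pear_1", "Eat_apple", "Split_Banana", "Pear_2",
          "Eat_pear", "Orange_peel", "Banana_chair")

# "Anywhere": no anchor, the string may appear at any position
anywhere <- grep("banana", cols, ignore.case = TRUE, value = TRUE)

# "Beginning": "^" anchors the pattern to the start of the name
beginning <- grep("^banana", cols, ignore.case = TRUE, value = TRUE)

# "End": "$" anchors the pattern to the end of the name
ending <- grep("banana$", cols, ignore.case = TRUE, value = TRUE)

print(anywhere)
print(beginning)
print(ending)
```

Taking the first element of each result gives the "first match" for that match type.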
I have written some (very ugly) code to handle this, shown below:

library(purrr)      
DT_1[,Col_Name := names(DT_2)[unlist(pmap(.l = .SD[,.(x = Match_type,y = Source_name)],
              .f = function(x,y){
                  if(x == "Anywhere"){
                       grep(tolower(y),tolower(names(DT_2)))[1] # returns the first match if there is a match anywhere
                  }else if (x == "Beginning"){
                       grep(paste0("^",tolower(y),"."),tolower(names(DT_2)))[1] # returns the first match if the string is at the beginning (denoted by the anchor "^")
                  }else if (x == "End"){
                        grep(paste0(".",tolower(y),"$"),tolower(names(DT_2)))[1] # returns the first match if the string is at the end (denoted by the end anchor "$")
                  }}))]]

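I tried to reproduce the output with str_extract / str_detect from the stringr package, but it does not like the fact that the patterns and the column names of DT_2 have different lengths.
Can anyone offer some advice on how to improve the code? I am not tied to any particular approach.
Thanks in advance, Phil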

regex

Source: https://stackoverflow.com/questions/75131047/matching-a-character-vector-with-multiple-patterns-in-r

1 Answer


gcuhipw91#

One approach is to prepare the regular expressions first and then find the first corresponding match for each Source_name.

library(dplyr)
library(purrr)
library(stringr)

cols <- names(DT_2)

DT_1 %>%
  mutate(regex = case_when(Match_type == "Anywhere" ~ Source_name, 
                           Match_type == "Beginning" ~ str_c('^',Source_name), 
                           Match_type == "End" ~ str_c(Source_name, '$')),
         Col_Name = map_chr(regex, 
                    ~str_subset(cols, regex(.x, ignore_case = TRUE))[1]))

#   Source_name Match_type   regex     Col_Name
#1:       Apple   Anywhere   Apple    Eat_apple
#2:      Banana  Beginning ^Banana Banana_chair
#3:      Orange   Anywhere  Orange  Orange_peel
#4:        Pear        End   Pear$     Eat_pear
#5:      Random        End Random$         <NA>
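For readers who would rather stay within data.table and base R, the same idea can be sketched without dplyr, purrr or stringr. This is an alternative under stated assumptions, not the answer's own method: the `pattern` column and the `first_match()` helper are hypothetical names introduced here, and `grep()` with `ignore.case = TRUE` stands in for `str_subset()` with `regex(ignore_case = TRUE)`.

```r
library(data.table)

DT_1 <- data.table(Source_name = c("Apple", "Banana", "Orange", "Pear", "Random"),
                   Match_type = c("Anywhere", "Beginning", "Anywhere", "End", "End"))
DT_2 <- data.table(Pear_1 = 1, Eat_apple = 1, Split_Banana = 1,
                   Pear_2 = 1, Eat_pear = 1, Orange_peel = 1, Banana_chair = 1)

cols <- names(DT_2)

# Build one regex per row: add "^" for "Beginning", "$" for "End",
# and leave "Anywhere" unanchored.
DT_1[, pattern := paste0(ifelse(Match_type == "Beginning", "^", ""),
                         Source_name,
                         ifelse(Match_type == "End", "$", ""))]

# Hypothetical helper: first column name matching pattern p (NA if none)
first_match <- function(p) {
  grep(p, cols, ignore.case = TRUE, value = TRUE)[1]
}

# vapply() guarantees exactly one character value per pattern
DT_1[, Col_Name := vapply(pattern, first_match, character(1), USE.NAMES = FALSE)]

print(DT_1)
```

Like the answer's code, this treats Source_name as a regular expression, so names containing regex metacharacters would need escaping first.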

Note that the [1] after str_subset is useful in two cases:
1. When there are multiple matches, it returns only the first one.
2. When there is no match, it returns NA.
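Both cases can be checked in isolation with a short sketch. The variable names `several`, `none`, `first_of_several` and `first_of_none` are illustrative only; the column names are copied from DT_2.

```r
library(stringr)

cols <- c("Pear_1", "Eat_apple", "Split_Banana", "Pear_2",
          "Eat_pear", "Orange_peel", "Banana_chair")

# Case 1: several column names match "pear" (case-insensitive)
several <- str_subset(cols, regex("pear", ignore_case = TRUE))
first_of_several <- several[1]   # keeps only the first match

# Case 2: nothing matches, so str_subset() returns character(0)
none <- str_subset(cols, regex("Random$", ignore_case = TRUE))
first_of_none <- none[1]         # indexing past the end yields NA

print(first_of_several)
print(first_of_none)
```

Returning exactly one value in both cases matters because map_chr() expects a length-one character result for every pattern.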

Answered 2023-01-18
