R: Get the indices of matching characters in one string and apply them to another string

nbysray5, posted on 2023-05-26 in Other

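I have the data frame below, in which each row represents a change to a piece of text. I then use the adist() function to determine whether each change is a match (M), insertion (I), substitution (S), or deletion (D).
I need to find the indices of all the "I"s in the change column (shown in the insertion_idx column). Using those indices, I need to extract the corresponding characters from current_text (shown here as insertion_chars).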

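library(tibble)

df <- tibble(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"),
             # list column: c(c(1), c(2), c(3,4), "") would flatten into a single vector
             insertion_idx = list(1L, 2L, c(3L, 4L), integer(0)),
             insertion_chars = c("A","B","CD",""))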

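I have tried splitting the strings and comparing their differences, but with real-world data this gets very messy. How can I accomplish the task above?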
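
For context, a change string like the ones above can be obtained from adist() with counts = TRUE, which attaches a "trafos" attribute holding the sequence of edit operations. The question does not show that step, so the sketch below is only an assumption about how the column might have been built; get_change is a name made up for illustration.

```r
# Hypothetical helper: edit-operation string for turning `previous` into `current`
get_change <- function(previous, current) {
  d <- adist(previous, current, counts = TRUE)
  # "trafos" is a character matrix of M/I/D/S sequences; here it has one cell
  as.vector(attr(d, "trafos"))
}

previous_text <- c("A", "AB")
current_text  <- c("AB", "ABCD")

# Compare row by row instead of building the full pairwise matrix
change <- mapply(get_change, previous_text, current_text, USE.NAMES = FALSE)
change
```

Calling adist() once per row keeps the work linear in the number of rows; passing both full columns in one call would compute every pairwise combination.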

Answer 1 (p3rjfoxz)

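Turning my comment about using gregexpr() and regmatches() into an answer.
If you are looking for alternatives, much of this process is very similar to what is covered in this question: Extract a regular expression match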

df <- data.frame(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"))

df$insertion_idx <- gregexpr("I", df$change)
df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx), 
                             paste, collapse="")
df
##  current_text previous_text change insertion_idx insertion_chars
##1            A                    I             1               A
##2           AB             A     MI             2               B
##3         ABCD            AB   MMII          3, 4              CD
##4          ABZ          ABCD   MMSD            -1                
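
The insertion_idx column above stores the raw gregexpr() match objects, which carry attributes and use -1 to mean "no match". If plain integer indices are wanted instead, a small post-processing step along these lines might help. This is a sketch rather than part of the original answer, and clean_idx is a name introduced here.

```r
change       <- c("I", "MI", "MMII", "MMSD")
current_text <- c("A", "AB", "ABCD", "ABZ")

# fixed = TRUE treats "I" as a literal string rather than a regex
m <- gregexpr("I", change, fixed = TRUE)

# Drop the match attributes and turn the -1 "no match" marker into integer(0)
clean_idx <- lapply(m, function(x) as.integer(x[x > 0]))

# regmatches() only uses the positions and lengths stored in m,
# so the same match data can pull characters out of a different string
insertion_chars <- vapply(regmatches(current_text, m), paste,
                          character(1), collapse = "")
```

Keeping m around for regmatches() and storing clean_idx in the data frame separates the "where" from the "what" without running the search twice.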

Answer 2 (u2nhd7ah)

Try the following as an alternative to thelatemail's (excellent) suggestion, which works just as well:

quux <- structure(
  list(current_text = c("A", "AB", "ABCD", "ABZ"),
       previous_text = c("", "A", "AB", "ABCD"),
       change = c("I", "MI", "MMII", "MMSD")),
  row.names = c(NA, -4L),
  class = c("tbl_df", "tbl", "data.frame"))

quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
quux$insertion_chars <- mapply(function(ctxt, idx) {
  if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, quux$current_text, quux$insertion_idx)
quux
# # A tibble: 4 × 5
#   current_text previous_text change insertion_idx insertion_chars
#   <chr>        <chr>         <chr>  <list>        <chr>          
# 1 A            ""            I      <int [1]>     "A"            
# 2 AB           "A"           MI     <int [1]>     "B"            
# 3 ABCD         "AB"          MMII   <int [2]>     "CD"           
# 4 ABZ          "ABCD"        MMSD   <int [0]>     ""
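
One side effect worth knowing: when its first vectorised argument is a character vector, mapply() uses that vector as names for the result, so insertion_chars comes back as a named vector (the str() output shows this). If that is unwanted, a variant like the following, which only adds USE.NAMES = FALSE, should return a plain character vector. The standalone vectors here stand in for the quux columns.

```r
current_text  <- c("A", "AB", "ABCD", "ABZ")
insertion_idx <- list(1L, 2L, 3:4, integer(0))

insertion_chars <- mapply(function(ctxt, idx) {
  # substring() is vectorised over idx, so each index yields one character
  if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, current_text, insertion_idx, USE.NAMES = FALSE)

names(insertion_chars)  # NULL
```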

Note that insertion_idx is a list column containing the indices you are looking for:

str(quux)
# tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
#  $ current_text   : chr [1:4] "A" "AB" "ABCD" "ABZ"
#  $ previous_text  : chr [1:4] "" "A" "AB" "ABCD"
#  $ change         : chr [1:4] "I" "MI" "MMII" "MMSD"
#  $ insertion_idx  :List of 4
#   ..$ : int 1
#   ..$ : int 2
#   ..$ : int [1:2] 3 4
#   ..$ : int(0) 
#  $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
#   ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
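
Both answers index current_text with positions taken from change. That lines up for the sample data, but a deletion (D) occupies a slot in change without corresponding to any character in current_text, so an insertion that follows a deletion would be offset. A possible adjustment, sketched under the assumption that change describes turning previous_text into current_text, is to count only the non-D operations. The helper names and the "DMMI" example are made up for illustration.

```r
# Hypothetical helper: map "I" slots in a change string to positions in the
# current (target) text
insertion_positions <- function(change) {
  ops <- strsplit(change, "")[[1]]
  # Every operation except a deletion consumes one character of the target
  target_pos <- cumsum(ops != "D")
  target_pos[ops == "I"]
}

# Hypothetical helper: collect the inserted characters as one string
extract_insertions <- function(current_text, change) {
  pos <- insertion_positions(change)
  if (length(pos) == 0L) return("")
  paste(substring(current_text, pos, pos), collapse = "")
}

extract_insertions("ABCD", "MMII")  # "CD"
extract_insertions("ABC", "DMMI")   # "C": the I is slot 4 in change, position 3 in the text
```

For rows without any D before an I, this gives the same result as indexing by the raw position in change.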
