R: Parse a string by 3 consecutive letters

Posted by mspsb9vt on 2023-01-22 in Other

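I have item codes and descriptions in the same column. An item code can contain letters, digits, spaces and special characters. The description always starts with at least 4 letters, so I want to split the column at the space where those 4 letters begin. I have already converted the text field to lowercase.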

x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

Ideally, this would produce:

itemCode         itemDescription
1234 (a)-b       free vacation to aruba
1234:43-1b       free set of dishes

I tried splitting on spaces:

[c('a', 'b', 'c', 'd', 'e', 'f')] <- str_split_fixed(x, ' ', 6)

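Of course, because the item code sometimes contains embedded spaces, this does not give me what I want.
I have reviewed similar questions; they come close, but are not quite what I am looking for.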

Source: https://stackoverflow.com/questions/75194898/r-parse-a-string-by-3-consecutive-letters

3 Answers

pvabu6sv #1

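You can do this in base R with strsplit, using a lookahead to get the item code, and then using sub to remove the item code from the original string to get the description: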

x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

a <- sapply(strsplit(x, '(?=[a-z]{4})', perl = TRUE), function(x) x[1])
b <- unlist(Map(function(a, b) sub(a, "", b, fixed = TRUE), a, x))

data.frame(itemCode = a, itemDescription = b, row.names = NULL)
#>      itemCode        itemDescription
#> 1 1234 (a)-b  free vacation to aruba
#> 2 1234:43-1b      free set of dishes

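One small caveat: [a-z]{4} only works if the first 4 letters of the description contain no letters outside the standard 26 in that set (accented letters, for example).
Created on 2023-01-21 with reprex v2.0.2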
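
The trailing space that strsplit leaves on the item code, and the letter-class caveat, can both be sidestepped by locating the split point directly. The following is only a sketch in base R, assuming regexpr with the POSIX class [[:alpha:]] (whose coverage of accented letters depends on the locale); pos is an illustrative name, not something from the answer above:

```r
# Sample data from the question
x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

# Position of the whitespace that precedes the first run of 4 letters.
# Assumption: [[:alpha:]] is locale-dependent, so accented letters are only
# matched where the current locale classifies them as alphabetic.
pos <- as.integer(regexpr("[[:space:]]+[[:alpha:]]{4}", x))

# Everything before that position is the code;
# the remainder, with the leading whitespace trimmed, is the description.
itemCode        <- substr(x, 1, pos - 1)
itemDescription <- trimws(substring(x, pos))

result <- data.frame(itemCode, itemDescription)
```

A string with no run of four letters after whitespace gives pos equal to -1 and would need separate handling.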

cygmwpex #2

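Try the following code, which uses str_extract:

library(dplyr)
library(stringr)

data.frame(x = c('1234 (a)-b free vacation to aruba',
                 '1234:43-1b free set of dishes')) %>%
  # item code: from the first digits up to and including the whitespace
  # that is followed by 4 word characters
  mutate(itemCode = str_extract(trimws(x), '\\d+.*[\\-|\\d]\\w\\s(?=\\w{4})'),
         # description: whitespace, a 4-character word, whitespace, then the rest
         itemDescription = str_extract(trimws(x), '\\s\\w{4}\\s.*'))

Created on 2023-01-21 with reprex v2.0.2

x    itemCode         itemDescription
1 1234 (a)-b free vacation to aruba  1234 (a)-b   free vacation to aruba
2     1234:43-1b free set of dishes  1234:43-1b       free set of dishes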
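
As a hedged alternative that avoids the stray whitespace around the split point, both columns can be taken from one anchored pattern with two capture groups. This sketch uses base R's sub with backreferences instead of stringr, and the pattern is an assumption written for the two sample strings, not a general solution:

```r
# Sample data from the question
x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

# One anchored pattern, two capture groups:
#   (.*?)         item code, matched lazily so it stops as early as possible
#   \\s+          separating whitespace, dropped from both columns
#   ([a-z]{4}.*)  description, starting at a run of 4 lowercase letters
pattern <- "^(.*?)\\s+([a-z]{4}.*)$"

# perl = TRUE enables the lazy quantifier and \\s
itemCode        <- sub(pattern, "\\1", x, perl = TRUE)
itemDescription <- sub(pattern, "\\2", x, perl = TRUE)
```

If a string does not match the pattern, sub returns it unchanged in both columns, so unmatched rows are worth checking for.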

js81xvg6 #3

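This solution is based on the extract function from tidyr (the printed result has 12 rows, so x here is a longer sample than the two strings quoted in the question):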

library(tidyr)
library(dplyr)
data.frame(x) %>%
  extract(x,
          into = c("itemCode", "itemDescription"),
          regex = "([()0-9a-z-]+)[\\s-]+([a-z]{4,}\\s.*)"
  )
            itemCode                                                   itemDescription
1           04(4)(a)                                          vacation - 2-3 weeks obo
2         230(11)(a)                                          cars - - 18 plus winners
3                073                                                boxes of choclates
4         130(11)(a)                   wont be offering -- too expensive - see details
5      23-3057(a)(5)                     grand prize / cruise for >= 18 year old (min)
6             33-314                                      choice of prizes & $500 cash
7  656-2-316(a)(iii) free books < 100 / price < 13 & 27 dollars. / or choice of prizes
8           231-5510                                           airfare (more than 200)
9           5er20c1a                                               prizes (under $500)
10             520g2                                                   prizes over 500
11         35-42-4-9                                                   prizes 250 plus
12        130(11)(b)                                                  retired category

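How the regex works: essentially, each string in x is divided into two capture groups, which hold the content to be extracted into the two new columns:

([()0-9a-z-]+): first capture group; one or more parentheses, digits, lowercase letters or dashes (please modify as needed!)
[\\s-]+: one or more whitespace characters or dashes (not captured)
([a-z]{4,}\\s.*): second capture group; here we assert that there must be at least 4 lowercase letters, followed by a whitespace character and any further characters

Edit 1: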

Check this out:
regex = "([()0-9a-z-]+)[\\s-]+(.*)"
This seems to work as well!

Edit 2:

Based on the observation that itemCode is never interrupted by whitespace, this also works:
regex = "(\\S+)[\\s-]+(.*)"
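
Whether the whitespace assumption in Edit 2 holds is worth checking against the data at hand. As a rough sketch, the same pattern can be run on the two strings from the question with base R's regexec and regmatches (swapped in here for tidyr's extract so that no packages are needed); the names m and parts are only illustrative:

```r
# Sample data from the question
x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

# Pattern from Edit 2; perl = TRUE so that \\S and \\s inside [...] are understood
m <- regmatches(x, regexec("(\\S+)[\\s-]+(.*)", x, perl = TRUE))

# Each element of m is c(full match, group 1, group 2); keep the two groups
parts <- do.call(rbind, lapply(m, function(v) v[2:3]))
colnames(parts) <- c("itemCode", "itemDescription")
```

On this input the first code, which contains a space, is cut short: the first row comes out as "1234" and "(a)-b free vacation to aruba", while the second row splits as intended.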
