Regex to insert new lines at specific points in a long text string file

ppcbkaq5  posted on 2022-12-14  in  Other

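I have a text file of CSV data that runs to hundreds of thousands of what should be separate records, but whoever produced it forgot to put line breaks in. There is a repeating pattern that marks where each new line should start, though: a time, a comma, and a name, e.g. "07:04:08.401,Buzzard" in the sample below. Because that pattern occurs thousands of times inside one string, I can't use the start ^ or end $ anchors.
My plan was to match backwards from the start of each of these points to the preceding comma, so that I could str_replace() that chunk with itself plus "\n" at the end, and so insert the line breaks where I want them.
I need help with both parts.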

library(stringr)
library(data.table)

Data_raw <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.0007:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.3107:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")

look_x <- function(rx) str_view_all(Data_raw, rx) 
look_x("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

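That gets me the four characters before each timestamp. But the text between the timestamp and the preceding comma varies in length: it ranges from "0.00" to "-401.31" and "Obj 2 N.A.". So the comma has to be the anchor, and I have been trying along these lines: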

look_x("(?<=,).(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

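...in a failed attempt to get every character that has "," before it and hh:mm:ss.sss followed by Buzz after it.
I also need help with the next step. This is what I have tried: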

Data_st_rep_all_2 <- data.frame(str_replace_all("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)",
                                              paste0(str_extract(Data_raw, "[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)"),"\n"), Data_raw))

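Though I now wonder whether this could work at all, since every chunk the regex matches is different.
I'm stuck. Can anybody help?!
No doubt there is a really neat solution that I have completely missed!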

Thanks in advance.
The final result should look like this:
Data_1 <- data.frame(Records = c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00",
                                 "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31",
                                 "07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.",
                                 "07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."))
mysplits <- max(lengths(strsplit(Data_1$Records, ",")))
Data_2 <- setDT(Data_1)[, paste0("column", 1:mysplits) := tstrsplit(Records, ",", fixed=T)]
Data_2[, Records := NULL]

Or alternatively:

Data_raw_2 <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00\n07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31\n07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.\n07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")
library(readr)  # write_lines() comes from readr
wd <- getwd()
write_lines(Data_raw_2, paste0(wd, '/', 'Data_raw_2.txt'))

hjzp0vay1#

Is this what you need?

library(stringr)
str_split(Data_raw, "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)")
[[1]]
[1] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00"                 
[2] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31"              
[3] "07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."
[4] "07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."
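The split pieces can then be turned into a table, which is the first form of output the question asks for. This is a minimal sketch, assuming a shortened, hypothetical sample string (`raw_sample`) with the same structure as `Data_raw`, and reusing the `column1`, `column2`, ... naming from the question:

```r
library(stringr)
library(data.table)

# Hypothetical shortened sample: three records run together without line breaks
raw_sample <- "07:04:08.401,Buzzard Brook,0.00,0.0007:04:08.401,Buzzard Brook,0.00,-401.3107:02:55.357,Buzzard Brook,Obj 2 N.A.,Obj 2 N.A."

# Split before every timestamp + ",Buzzard Brook", except at the very start
records <- str_split(
  raw_sample,
  "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)"
)[[1]]

# One row per record, one column per comma-separated field
dt <- as.data.table(tstrsplit(records, ",", fixed = TRUE))
setnames(dt, paste0("column", seq_along(dt)))
print(dt)
```

All columns come back as character here; converting the numeric ones is left out because fields such as "Obj 2 N.A." would need a decision on how to treat missing values.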

How it works:

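  • (?<!^): negative look-behind, asserting that we do not split at the very start of the string
  • (?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook): positive look-ahead, asserting that the split point must be followed by a timestamp-like expression, a comma, and the string "Buzzard Brook"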
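If the goal is to keep a single string with line breaks inserted (the `Data_raw_2` form in the question) rather than a vector of pieces, one option is to capture the start of each record and put "\n" in front of it. The sketch below assumes a shortened, hypothetical sample string (`raw_sample`); note that it swaps the look-ahead for a capture group, so it is a variation on the pattern above rather than the same expression:

```r
library(stringr)

# Hypothetical shortened sample: three records run together without line breaks
raw_sample <- "07:04:08.401,Buzzard Brook,0.00,0.0007:04:08.401,Buzzard Brook,0.00,-401.3107:02:55.357,Buzzard Brook,Obj 2 N.A.,Obj 2 N.A."

# Capture each record start (except the first) and re-insert it after a newline
with_breaks <- str_replace_all(
  raw_sample,
  "(?<!^)(\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)",
  "\n\\1"
)
cat(with_breaks)
```

The resulting string could then be written out with `writeLines(with_breaks, "some_file.txt")`, where the file name is only a placeholder.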
