I have speech transcriptions with speaker IDs in `speaker` and `timestamp`s for when the speech occurred:
df
line speaker utterance timestamp
1007 0504 A <and then HH. somehow> and then 00:09:08.951 - 00:09:18.195
1009 0505 B [mhm] 00:09:13.518 - 00:09:13.802
1011 0506 B [yeah yeah yeah] 00:09:15.518 - 00:09:15.959
1013 0507 <NA> (0.484) 00:09:18.195 - 00:09:18.679
1015 0508 A I do n't know if you 00:09:18.679 - 00:09:21.478
1017 0509 <NA> (0.287) 00:09:21.478 - 00:09:21.765
1019 0510 B yeah the organization right? °yeah 00:09:21.765 - 00:09:23.285
1021 0511 A [yeah it 's a big] international 00:09:23.171 - 00:09:27.902
1023 0512 B [yeah] 00:09:25.096 - 00:09:25.316
1025 0513 B (0.393) 00:09:27.902 - 00:09:28.295
1027 0514 B mhm= 00:09:28.295 - 00:09:28.508
1029 0515 <NA> (0.019) 00:09:28.508 - 00:09:28.527
1031 0516 A =so they have like 00:09:28.527 - 00:09:29.133
1033 0517 A (0.500) 00:09:29.133 - 00:09:29.633
1035 0518 A normally I do n't know about 00:09:29.633 - 00:09:34.381
1037 0519 <NA> (1.497) 00:09:34.381 - 00:09:35.878
1039 0520 B one wi:th economics, [er like uh] 00:09:35.878 - 00:09:44.639
1041 0521 B [mhm] 00:09:37.389 - 00:09:38.041
1043 0522 B [mhm] 00:09:44.237 - 00:09:44.622
1045 0523 <NA> (0.645) 00:09:44.639 - 00:09:45.284
1047 0524 A U:m 00:09:45.284 - 00:09:45.647
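I need to collapse those rows that **(i) are by the same speaker and (ii) whose `utterance` does not start with an expression in square brackets (`[...]`)**. EDIT: I also need to *exempt* from collapsing those rows that are by the same `speaker` and follow a `[...]` row, up to the next `NA`. All of this while contracting the timestamps of the collapsed rows accordingly. I can do the operation for condition (i):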
library(dplyr)
library(stringr)
library(data.table)
df %>%
  # one group per run of consecutive rows by the same speaker
  group_by(grp = rleid(speaker)) %>%
  summarise(across(c(line, speaker), first),
            utterance = str_c(utterance, collapse = ' '),
            # keep the first start time and the last end time of the run
            timestamp = paste(unlist(strsplit(timestamp, "[- ]+"))[c(1, n()*2)], collapse = " - "),
            .groups = 'drop') %>%
  select(-grp)
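But I'm struggling to implement condition (ii). EDIT: Using `filter(!grepl("^\\[.*?\\]", utterance)) %>%` would at least remove the rows that start with `[...]`. But how to *not* collapse the rows that follow, up to the next `NA`, I don't know. Any help is much appreciated!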
Expected output:
df
line speaker utterance timestamp
1007 0504 A <and then HH. somehow> and then 00:09:08.951 - 00:09:18.195
1009 0505 B [mhm] 00:09:13.518 - 00:09:13.802
1011 0506 B [yeah yeah yeah] 00:09:15.518 - 00:09:15.959
1013 0507 <NA> (0.484) 00:09:18.195 - 00:09:18.679
1015 0508 A I do n't know if you 00:09:18.679 - 00:09:21.478
1017 0509 <NA> (0.287) 00:09:21.478 - 00:09:21.765
1019 0510 B yeah the organization right? °yeah 00:09:21.765 - 00:09:23.285
1021 0511 A [yeah it 's a big] international 00:09:23.171 - 00:09:27.902
1023 0512 B [yeah] 00:09:25.096 - 00:09:25.316
1025 0513 B (0.393) 00:09:27.902 - 00:09:28.295
1027 0514 B mhm= 00:09:28.295 - 00:09:28.508
1029 0515 <NA> (0.019) 00:09:28.508 - 00:09:28.527
1031 0516 A =so they have like (0.500) normally I do n't know about 00:09:28.527 - 00:09:34.381
1037 0519 <NA> (1.497) 00:09:34.381 - 00:09:35.878
1039 0520 B one wi:th economics, [er like uh] [mhm] [mhm] 00:09:35.878 - 00:09:44.622
1045 0523 <NA> (0.645) 00:09:44.639 - 00:09:45.284
1047 0524 A U:m 00:09:45.284 - 00:09:45.647
Reproducible data:
structure(list(line = c("0504", "0505", "0506", "0507", "0508",
"0509", "0510", "0511", "0512", "0513", "0514", "0515", "0516",
"0517", "0518", "0519", "0520", "0521", "0522", "0523", "0524"
), speaker = c("A", "B", "B", NA, "A", NA, "B", "A", "B", "B",
"B", NA, "A", "A", "A", NA, "B", "B", "B", NA, "A"), utterance = c("<and then HH. somehow> and then",
"[mhm]", "[yeah yeah yeah]", "(0.484)", "I do n't know if you",
"(0.287)", "yeah the organization right? °yeah", "[yeah it 's a big] international",
"[yeah]", "(0.393)", "mhm=", "(0.019)", "=so they have like",
"(0.500)", "normally I do n't know about", "(1.497)", "one wi:th economics, [er like uh]",
"[mhm]", "[mhm]", "(0.645)", "U:m"), timestamp = c("00:09:08.951 - 00:09:18.195",
"00:09:13.518 - 00:09:13.802", "00:09:15.518 - 00:09:15.959",
"00:09:18.195 - 00:09:18.679", "00:09:18.679 - 00:09:21.478",
"00:09:21.478 - 00:09:21.765", "00:09:21.765 - 00:09:23.285",
"00:09:23.171 - 00:09:27.902", "00:09:25.096 - 00:09:25.316",
"00:09:27.902 - 00:09:28.295", "00:09:28.295 - 00:09:28.508",
"00:09:28.508 - 00:09:28.527", "00:09:28.527 - 00:09:29.133",
"00:09:29.133 - 00:09:29.633", "00:09:29.633 - 00:09:34.381",
"00:09:34.381 - 00:09:35.878", "00:09:35.878 - 00:09:44.639",
"00:09:37.389 - 00:09:38.041", "00:09:44.237 - 00:09:44.622",
"00:09:44.639 - 00:09:45.284", "00:09:45.284 - 00:09:45.647")), row.names = c(1007L,
1009L, 1011L, 1013L, 1015L, 1017L, 1019L, 1021L, 1023L, 1025L,
1027L, 1029L, 1031L, 1033L, 1035L, 1037L, 1039L, 1041L, 1043L,
1045L, 1047L), class = "data.frame")
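One possible reading of condition (ii) and the EDIT, offered as a sketch and not as a definitive answer: a run of consecutive rows by the same speaker is left untouched when its first `utterance` starts with `[`, and is collapsed otherwise. This reading is an assumption, chosen because it appears to match the expected output in the question, where lines 0520 to 0522 are merged even though 0521 and 0522 start with `[`. The helper name `collapse_runs` and the `demo` excerpt (lines 0511 to 0519 of the data) are introduced here for illustration only.

```r
library(dplyr)
library(stringr)

# Excerpt of the question's data (lines 0511-0519) so the sketch runs on its own
demo <- data.frame(
  line = c("0511", "0512", "0513", "0514", "0515", "0516", "0517", "0518", "0519"),
  speaker = c("A", "B", "B", "B", NA, "A", "A", "A", NA),
  utterance = c("[yeah it 's a big] international", "[yeah]", "(0.393)", "mhm=",
                "(0.019)", "=so they have like", "(0.500)",
                "normally I do n't know about", "(1.497)"),
  timestamp = c("00:09:23.171 - 00:09:27.902", "00:09:25.096 - 00:09:25.316",
                "00:09:27.902 - 00:09:28.295", "00:09:28.295 - 00:09:28.508",
                "00:09:28.508 - 00:09:28.527", "00:09:28.527 - 00:09:29.133",
                "00:09:29.133 - 00:09:29.633", "00:09:29.633 - 00:09:34.381",
                "00:09:34.381 - 00:09:35.878"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse same-speaker runs unless the run opens with "[...]"
collapse_runs <- function(df) {
  df %>%
    # run id for consecutive rows by the same speaker (NA counts as its own value)
    mutate(run = data.table::rleid(speaker)) %>%
    group_by(run) %>%
    # assumption: a run whose first utterance starts with "[" is exempt
    mutate(exempt = str_detect(first(utterance), "^\\[")) %>%
    ungroup() %>%
    # exempt rows each get their own group; other runs share one group
    mutate(grp = cumsum(exempt | run != lag(run, default = 0L))) %>%
    group_by(grp) %>%
    summarise(
      line = first(line),
      speaker = first(speaker),
      utterance = str_c(utterance, collapse = " "),
      # start time of the first row, end time of the last row in the group
      timestamp = str_c(str_extract(first(timestamp), "^\\S+"), " - ",
                        str_extract(last(timestamp), "\\S+$")),
      .groups = "drop"
    ) %>%
    select(-grp)
}

res <- collapse_runs(demo)
print(res)
```

On this excerpt, only lines 0516 to 0518 are merged into one row; the B rows after `[yeah]` stay separate. Note that `summarise()` returns a tibble, so the original row names (1007, 1009, ...) are not carried over.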
1 Answer