Collapse rows based on a regex pattern

I have speech transcriptions with speaker IDs in speaker and, in timestamp, the times at which the speech occurred:

df
     line speaker                          utterance                   timestamp
1007 0504       A    <and then HH. somehow> and then 00:09:08.951 - 00:09:18.195
1009 0505       B                              [mhm] 00:09:13.518 - 00:09:13.802
1011 0506       B                   [yeah yeah yeah] 00:09:15.518 - 00:09:15.959
1013 0507    <NA>                            (0.484) 00:09:18.195 - 00:09:18.679
1015 0508       A               I do n't know if you 00:09:18.679 - 00:09:21.478
1017 0509    <NA>                            (0.287) 00:09:21.478 - 00:09:21.765
1019 0510       B yeah the organization right? °yeah 00:09:21.765 - 00:09:23.285
1021 0511       A   [yeah it 's a big] international 00:09:23.171 - 00:09:27.902
1023 0512       B                             [yeah] 00:09:25.096 - 00:09:25.316
1025 0513       B                            (0.393) 00:09:27.902 - 00:09:28.295
1027 0514       B                               mhm= 00:09:28.295 - 00:09:28.508
1029 0515    <NA>                            (0.019) 00:09:28.508 - 00:09:28.527
1031 0516       A                 =so they have like 00:09:28.527 - 00:09:29.133
1033 0517       A                            (0.500) 00:09:29.133 - 00:09:29.633
1035 0518       A       normally I do n't know about 00:09:29.633 - 00:09:34.381
1037 0519    <NA>                            (1.497) 00:09:34.381 - 00:09:35.878
1039 0520       B  one wi:th economics, [er like uh] 00:09:35.878 - 00:09:44.639
1041 0521       B                              [mhm] 00:09:37.389 - 00:09:38.041
1043 0522       B                              [mhm] 00:09:44.237 - 00:09:44.622
1045 0523    <NA>                            (0.645) 00:09:44.639 - 00:09:45.284
1047 0524       A                                U:m 00:09:45.284 - 00:09:45.647

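I need to collapse those rows that are **(i) by the same speaker and (ii) whose utterance does not start with an expression in square brackets ([...])**. EDIT: I also need to *exempt* from collapsing those rows that are by the same speaker and follow a [...] row, until the next NA. All this while contracting the timestamps of the collapsed rows accordingly. I can do the operation for condition (i):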

library(dplyr)
library(stringr)
library(data.table)
df %>%
  group_by(grp = rleid(speaker)) %>%
  summarise(across(c(line, speaker), first), 
            utterance = str_c(utterance, collapse = ' '), 
            timestamp = paste(unlist(strsplit(timestamp, "[- ]+"))[c(1, n()*2)], collapse = " - "), .groups = 'drop') %>%
  select(-grp)
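
For illustration, here is a minimal base-R sketch of what the timestamp expression above does for a single group. The three timestamps are copied from lines 0516 to 0518 of the data; the variable names ts and parts are made up for this example. Inside summarise(), n() is the number of rows in the group, so n()*2 indexes the last of the split-out time points.

```r
# Three consecutive rows by speaker A (lines 0516-0518 in the data).
ts <- c("00:09:28.527 - 00:09:29.133",
        "00:09:29.133 - 00:09:29.633",
        "00:09:29.633 - 00:09:34.381")

# Splitting on runs of hyphens/spaces yields start1, end1, start2, end2, ...
parts <- unlist(strsplit(ts, "[- ]+"))

# Keep the first start and the last end; length(ts) plays the role of n().
contracted <- paste(parts[c(1, length(ts) * 2)], collapse = " - ")
print(contracted)
```

Note that this keeps the end time of the last row in the group, which is not necessarily the latest end time when utterances overlap.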

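But I'm struggling to implement condition (ii). EDIT: Using filter(!grepl("^\\[.*?\\]", utterance)) %>% would at least remove the [...] rows. But I don't know how to *not* collapse the rows that follow, until the next NA. Any help is much appreciated!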
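
To make concrete what that pattern matches, here is a small base-R check on a few utterances copied from the data; the vector name sample_utterances is made up for this example. Because the pattern is anchored with ^, a bracketed expression later in the utterance does not count.

```r
# A few utterances taken from the data above.
sample_utterances <- c("[mhm]",
                       "[yeah it 's a big] international",
                       "one wi:th economics, [er like uh]",
                       "(0.484)")

# TRUE only where the utterance *starts* with a [...] expression.
starts_with_bracket <- grepl("^\\[.*?\\]", sample_utterances)
print(starts_with_bracket)
```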

Expected result:

df
     line speaker                                               utterance                   timestamp
1007 0504       A                         <and then HH. somehow> and then 00:09:08.951 - 00:09:18.195
1009 0505       B                                                   [mhm] 00:09:13.518 - 00:09:13.802
1011 0506       B                                        [yeah yeah yeah] 00:09:15.518 - 00:09:15.959
1013 0507    <NA>                                                 (0.484) 00:09:18.195 - 00:09:18.679
1015 0508       A                                    I do n't know if you 00:09:18.679 - 00:09:21.478
1017 0509    <NA>                                                 (0.287) 00:09:21.478 - 00:09:21.765
1019 0510       B                      yeah the organization right? °yeah 00:09:21.765 - 00:09:23.285
1021 0511       A                        [yeah it 's a big] international 00:09:23.171 - 00:09:27.902
1023 0512       B                                                  [yeah] 00:09:25.096 - 00:09:25.316
1025 0513       B                                                 (0.393) 00:09:27.902 - 00:09:28.295
1027 0514       B                                                    mhm= 00:09:28.295 - 00:09:28.508
1029 0515    <NA>                                                 (0.019) 00:09:28.508 - 00:09:28.527
1031 0516       A =so they have like (0.500) normally I do n't know about 00:09:28.527 - 00:09:34.381
1037 0519    <NA>                                                 (1.497) 00:09:34.381 - 00:09:35.878
1039 0520       B           one wi:th economics, [er like uh] [mhm] [mhm] 00:09:35.878 - 00:09:44.622
1045 0523    <NA>                                                 (0.645) 00:09:44.639 - 00:09:45.284
1047 0524       A                                                     U:m 00:09:45.284 - 00:09:45.647


Reproducible data:

structure(list(line = c("0504", "0505", "0506", "0507", "0508", 
"0509", "0510", "0511", "0512", "0513", "0514", "0515", "0516", 
"0517", "0518", "0519", "0520", "0521", "0522", "0523", "0524"
), speaker = c("A", "B", "B", NA, "A", NA, "B", "A", "B", "B", 
"B", NA, "A", "A", "A", NA, "B", "B", "B", NA, "A"), utterance = c("<and then HH. somehow> and then", 
"[mhm]", "[yeah yeah yeah]", "(0.484)", "I do n't know if you", 
"(0.287)", "yeah the organization right? °yeah", "[yeah it 's a big] international", 
"[yeah]", "(0.393)", "mhm=", "(0.019)", "=so they have like", 
"(0.500)", "normally I do n't know about", "(1.497)", "one wi:th economics, [er like uh]", 
"[mhm]", "[mhm]", "(0.645)", "U:m"), timestamp = c("00:09:08.951 - 00:09:18.195", 
"00:09:13.518 - 00:09:13.802", "00:09:15.518 - 00:09:15.959", 
"00:09:18.195 - 00:09:18.679", "00:09:18.679 - 00:09:21.478", 
"00:09:21.478 - 00:09:21.765", "00:09:21.765 - 00:09:23.285", 
"00:09:23.171 - 00:09:27.902", "00:09:25.096 - 00:09:25.316", 
"00:09:27.902 - 00:09:28.295", "00:09:28.295 - 00:09:28.508", 
"00:09:28.508 - 00:09:28.527", "00:09:28.527 - 00:09:29.133", 
"00:09:29.133 - 00:09:29.633", "00:09:29.633 - 00:09:34.381", 
"00:09:34.381 - 00:09:35.878", "00:09:35.878 - 00:09:44.639", 
"00:09:37.389 - 00:09:38.041", "00:09:44.237 - 00:09:44.622", 
"00:09:44.639 - 00:09:45.284", "00:09:45.284 - 00:09:45.647")), row.names = c(1007L, 
1009L, 1011L, 1013L, 1015L, 1017L, 1019L, 1021L, 1023L, 1025L, 
1027L, 1029L, 1031L, 1033L, 1035L, 1037L, 1039L, 1041L, 1043L, 
1045L, 1047L), class = "data.frame")


library(tidyverse)

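# Assign a group id to every row. Consecutive rows by the same speaker share
# an id until a row starting with "[" appears; from that row on, every row
# gets its own id until the speaker changes (NA rows count as a change).
process_transcript <- function(utterances, speaker_col) {
  utterance_group <- 1
  out <- numeric(length(utterances))
  # Treat a missing speaker as the literal string "NA" so it can be compared.
  current_speaker <- if (is.na(speaker_col[1])) "NA" else speaker_col[1]
  square_brackets <- FALSE

  for (i in seq_along(utterances)) {
    speaking <- speaker_col[i]
    if (is.na(speaking)) {
      speaking <- "NA"
    }
    # Sticky flag: once a "[" row is seen, it stays TRUE for this speaker's run.
    square_brackets <- substr(utterances[i], 1, 1) == "[" || square_brackets
    if (speaking != current_speaker) {
      # New speaker: start a new group and reset the flag for this row.
      utterance_group <- utterance_group + 1
      current_speaker <- speaking
      square_brackets <- substr(utterances[i], 1, 1) == "["
    } else if (square_brackets) {
      # Same speaker, but at or after a "[" row: do not collapse.
      utterance_group <- utterance_group + 1
    }

    out[i] <- utterance_group
  }

  return(out)
}

df %>%
  separate(timestamp, c("start", "end"), sep = " - ") %>%
  mutate(utterance_group = process_transcript(utterance, speaker)) %>%
  group_by(utterance_group) %>%
  mutate(utterance = paste(utterance, collapse = " "),
         start = min(start),
         end = max(end)) %>%
  ungroup()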

# A tibble: 21 × 6
   line  speaker utterance                           start end   utterance_group
   <chr> <chr>   <chr>                               <chr> <chr>           <dbl>
 1 0504  A       <and then HH. somehow> and then     00:0… 00:0…               1
 2 0505  B       [mhm]                               00:0… 00:0…               2
 3 0506  B       [yeah yeah yeah]                    00:0… 00:0…               3
 4 0507  NA      (0.484)                             00:0… 00:0…               4
 5 0508  A       I do n't know if you                00:0… 00:0…               5
 6 0509  NA      (0.287)                             00:0… 00:0…               6
 7 0510  B       yeah the organization right? °yeah  00:0… 00:0…               7
 8 0511  A       [yeah it 's a big] international    00:0… 00:0…               8
 9 0512  B       [yeah]                              00:0… 00:0…               9
10 0513  B       (0.393)                             00:0… 00:0…              10
11 0514  B       mhm=                                00:0… 00:0…              11
12 0515  NA      (0.019)                             00:0… 00:0…              12
13 0516  A       =so they have like (0.500) normall… 00:0… 00:0…              13
14 0517  A       =so they have like (0.500) normall… 00:0… 00:0…              13
15 0518  A       =so they have like (0.500) normall… 00:0… 00:0…              13
16 0519  NA      (1.497)                             00:0… 00:0…              14
17 0520  B       one wi:th economics, [er like uh]   00:0… 00:0…              15
18 0521  B       [mhm]                               00:0… 00:0…              16
19 0522  B       [mhm]                               00:0… 00:0…              17
20 0523  NA      (0.645)                             00:0… 00:0…              18
21 0524  A       U:m                                 00:0… 00:0…              19
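
The output above still has one row per original line (rows 13 to 15 repeat the merged utterance), because the grouped step uses mutate(). A minimal sketch of reducing it to one row per group, which is not part of the answer as posted: it swaps mutate() for summarise() and glues start and end back into a single timestamp column. The data frame toy and its hard-coded utterance_group values are assumptions made to keep the example self-contained; in practice the ids would come from process_transcript().

```r
library(dplyr)
library(tidyr)

# Toy input: lines 0516-0519 from the data, with group ids filled in by hand.
toy <- data.frame(
  line = c("0516", "0517", "0518", "0519"),
  speaker = c("A", "A", "A", NA),
  utterance = c("=so they have like", "(0.500)",
                "normally I do n't know about", "(1.497)"),
  timestamp = c("00:09:28.527 - 00:09:29.133", "00:09:29.133 - 00:09:29.633",
                "00:09:29.633 - 00:09:34.381", "00:09:34.381 - 00:09:35.878"),
  utterance_group = c(13, 13, 13, 14)
)

collapsed <- toy %>%
  separate(timestamp, c("start", "end"), sep = " - ") %>%
  group_by(utterance_group) %>%
  # summarise() returns one row per group instead of repeating the values.
  summarise(line = first(line),
            speaker = first(speaker),
            utterance = paste(utterance, collapse = " "),
            start = min(start),  # fixed-width times sort correctly as strings
            end = max(end),
            .groups = "drop") %>%
  unite("timestamp", start, end, sep = " - ") %>%
  select(-utterance_group)

print(collapsed)
```

Taking min(start) and max(end) as text only works because every time has the same HH:MM:SS.mmm width; with variable-width times a proper time class would be needed.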
