R: How to flatten / merge overlapping time periods

cqoc49vn, posted on 2023-05-20 in Other

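I have a large data set of time periods, each defined by a "start" and an "end" column. Some of the periods overlap.
I would like to combine (flatten / merge / collapse) all overlapping time periods so that each merged period has one "start" value and one "end" value.
Some example data: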

ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15

Expected result:

ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15

What I have tried:

require(dplyr)
  data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"), 
    start = structure(c(1356998400, 1356998400, 1357084800, 1357257600, 
    1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200, 
    1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct", 
    "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA, 
-7L), class = "data.frame")

remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)  
}
data2 <- na.omit(data2)}

data <- remove.overlaps(data)

Answer #1

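Here is a possible solution. The basic idea is to use the cummax function to compare the lagged start date with the maximum end date "until now", and to create an index that separates the data into groups: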

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15

Answer #2

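@David Arenburg's answer is great, but I ran into a problem when an earlier interval ended after a later one: using last in the summarise call then gave the wrong end date. I suggest changing first(start) and last(end) to min(start) and max(end):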

data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))

Also, as @Jonno Bourne mentioned, it is important to sort by start and by any grouping variables before applying this method.
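
A minimal base R illustration of why max(end) is the safer choice. The two intervals below are borrowed from the edge case discussed in a later answer, and the names end_by_last and end_by_max are hypothetical:

```r
# Two intervals that belong to the same group: the second lies entirely
# inside the first, so the first row carries the latest end date.
starts <- as.Date(c("2018-05-01", "2018-07-07"))
ends   <- as.Date(c("2018-12-01", "2018-07-14"))

# What last(end) would pick: the end of the last row, which is too early.
end_by_last <- ends[length(ends)]

# What max(end) picks: the true end of the merged interval.
end_by_max <- max(ends)

format(c(last = end_by_last, max = end_by_max))
```

Here the last row ends on 2018-07-14, while the merged interval actually runs until 2018-12-01.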


Answer #3

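For the sake of completeness, the IRanges package on Bioconductor has some neat functions for dealing with date or date-time ranges. One of them is the reduce() function, which merges overlapping or adjacent ranges.
However, IRanges has a drawback: because it works on integer ranges (hence the name), the convenience of its functions comes at the cost of converting Date or POSIXct objects back and forth.
Also, dplyr does not seem to play well with IRanges (at least in my limited experience with dplyr), so I use data.table: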

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A variant of the code is:

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]

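Both variants use as_datetime() from the lubridate package, which avoids having to specify the origin when converting numbers to POSIXct objects.
It would be interesting to see a benchmark comparing the IRanges approach with David's answer.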
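
For reference, the back-and-forth conversion described above can be sketched in base R without IRanges or lubridate. The variable names are illustrative only, and the origin is spelled out explicitly here, which is the step as_datetime() saves:

```r
# A POSIXct value like the ones in the question's data (UTC).
x <- as.POSIXct("2013-01-01", tz = "UTC")

# POSIXct -> number of seconds since 1970-01-01 UTC.
secs <- as.numeric(x)

# Number -> POSIXct again; base R needs an explicit origin
# (and a time zone, if UTC display is wanted).
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

secs
back == x
```

The value of secs is the first number in the dput() output of the question's start column.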


Answer #4

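I think you can solve this quite nicely with dplyr and the ivs package, which is designed for working with interval vectors like the ones you have here. It is inspired by IRanges, but it is better suited to the tidyverse and is completely generic, so it handles date intervals automatically (no need to convert to numeric and back).
The key is to combine the start/end boundaries into a single interval vector column and then use iv_groups(). This merges all overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.
It looks like you want to do this by ID, so I have also grouped by ID.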

library(ivs)
library(dplyr)

data <- tribble(
  ~ID,       ~start,         ~end,
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-02", "2013-01-03",
  "A", "2013-01-04", "2013-01-06",
  "A", "2013-01-07", "2013-01-09",
  "A", "2013-01-08", "2013-01-11",
  "A", "2013-01-12", "2013-01-15"
) %>%
  mutate(
    start = as.Date(start),
    end = as.Date(end)
  )

data
#> # A tibble: 7 × 3
#>   ID    start      end       
#>   <chr> <date>     <date>    
#> 1 A     2013-01-01 2013-01-05
#> 2 A     2013-01-01 2013-01-05
#> 3 A     2013-01-02 2013-01-03
#> 4 A     2013-01-04 2013-01-06
#> 5 A     2013-01-07 2013-01-09
#> 6 A     2013-01-08 2013-01-11
#> 7 A     2013-01-12 2013-01-15

# Combine `start` and `end` into a single interval vector column
data <- data %>%
  mutate(interval = iv(start, end), .keep = "unused")

# Note that this is a half-open interval!
data  
#> # A tibble: 7 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-05)
#> 2 A     [2013-01-01, 2013-01-05)
#> 3 A     [2013-01-02, 2013-01-03)
#> 4 A     [2013-01-04, 2013-01-06)
#> 5 A     [2013-01-07, 2013-01-09)
#> 6 A     [2013-01-08, 2013-01-11)
#> 7 A     [2013-01-12, 2013-01-15)

# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-06)
#> 2 A     [2013-01-07, 2013-01-11)
#> 3 A     [2013-01-12, 2013-01-15)

Created on 2022-04-05 by the reprex package (v2.0.1)


Answer #5

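It looks like I'm a bit late to the party, but I took @zach's code and rewrote it with data.table below. I did not do comprehensive testing, but it seems to run about 20% faster than the tidy version. (I could not test the IRanges method because the package was not yet available for R 3.5.1.)
Also, fwiw, the accepted answer does not capture the edge case in which one date range lies entirely within another (e.g., 2018-07-07 to 2018-07-14 falls within 2018-05-01 to 2018-12-01). @zach's answer does capture this edge case.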

library(data.table)

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(shift(start_col, type = "lead")) > cummax(as.numeric(end_col)))[-.N])),
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]

Answer #6

A benchmark, along with a faster data.table solution

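First, I'll echo @enmyj and @zach: the solution in the accepted answer gives incorrect results when one range lies completely within another.
A faster approach, reminiscent of the one proposed in the accepted answer:
1. Sort by ID, then by all dates (start and end combined).
2. Take the cumulative sum of the number of start dates minus the cumulative sum of the number of end dates.
3. Find the indices where that sum is 0. The dates on these rows are the end dates of each set of overlapping date ranges. The date on the following row is the start date of the next set of overlapping date ranges. These indices can also be used to easily perform summary calculations on other columns.
This involves only a few vectorized calls and no grouping operations, so it performs very well.
As a function: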

flatten <- function(dt) {
  setorder(dt[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))], V1, V2)[
    , .(
      ID = V1[i <- which(!cumsum(V3))],
      start = V2[c(1L, i[-length(i)] + 1L)],
      end = V2[i]
    )
  ]
}
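
The three steps can also be traced in base R for a single ID, using the sample dates from the question. This is a sketch for illustration rather than a replacement for flatten: the names events, delta and open are made up here, and ordering starts before ends on equal dates is an assumption of this sketch (it makes ranges that touch on the same date merge).

```r
starts <- as.Date(c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-04",
                    "2013-01-07", "2013-01-08", "2013-01-12"))
ends   <- as.Date(c("2013-01-05", "2013-01-05", "2013-01-03", "2013-01-06",
                    "2013-01-09", "2013-01-11", "2013-01-15"))

# Step 1: stack all dates, marking starts with +1 and ends with -1, then sort.
events <- data.frame(
  date  = c(starts, ends),
  delta = rep(c(1L, -1L), each = length(starts))
)
events <- events[order(events$date, -events$delta), ]

# Step 2: running count of ranges that are currently open.
open <- cumsum(events$delta)

# Step 3: wherever the count drops to 0, a merged range ends;
# the next row (if any) starts the following merged range.
i <- which(open == 0L)
merged <- data.frame(
  start = events$date[c(1L, i[-length(i)] + 1L)],
  end   = events$date[i]
)
merged
```

On this input the running count is 1 2 3 2 3 2 1 0 1 2 1 0 1 0, so the merged ranges end on the 8th, 12th and 14th sorted dates.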

Benchmarking

Benchmarking with a larger data.table:

library(data.table)
library(dplyr)
library(ivs)

data <- data.table(
  ID = sample(1e3, 1e5, 1),
  start = as.Date(sample(1e4:2e4, 1e5, 1), origin = "1970-01-01")
)[, end := start + sample(100)]

fCum <- function(dt) {
  # adapted from https://stackoverflow.com/a/47337684/9463489
  dt %>%
    arrange(ID, start) %>%
    group_by(ID) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                                cummax(as.numeric(end)))[-n()])) %>%
    group_by(ID, indx) %>%
    reframe(start = min(start), end = max(end)) %>%
    select(-indx)
}

fivs <- function(dt) {
  # adapted from https://stackoverflow.com/a/71754454/9463489
  dt %>%
    mutate(interval = iv(start, end), .keep = "unused") %>%
    group_by(ID) %>%
    reframe(interval = iv_groups(interval)) %>%
    mutate(start = iv_start(interval), end = iv_end(interval)) %>%
    select(-interval)
}

squish <- function(dt) {
  # adapted from https://stackoverflow.com/a/53890653/9463489
  setkey(dt, ID, start, end)
  dt[,.(START_DT = start, 
        END_DT = end, 
        indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-.N])),
     keyby=ID
  ][,.(start=min(START_DT), 
       end = max(END_DT)),
    by=c("ID","indx")
  ][, indx := NULL]
}

Timings:

microbenchmark::microbenchmark(
  flatten = flatten(dt),
  fCum = setDT(fCum(dt)),
  fivs = setDT(fivs(dt)),
  squish = squish(dt),
  times = 10,
  check = "equal",
  setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  flatten   11.4732   11.8141   13.86760   12.36580   15.9228   19.1775    10
#>     fCum 1827.1197 1876.7701 1898.24285 1908.88640 1926.6548 1939.2919    10
#>     fivs  160.2568  163.9617  173.31783  173.32095  177.3789  192.7755    10
#>   squish   62.5197   64.9126   66.26047   65.08515   67.1685   70.9916    10

Aggregating other columns

The approach used by flatten also makes it easy to aggregate other columns in the data.table.

data[, v := runif(1e5)]

setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
  , .(
    ID = V1[i <- which(!cumsum(V3))],
    start = V2[c(1L, i[-length(i)] + 1L)],
    end = V2[i],
    v = diff(c(0, cumsum(V4)[i]))
  )
]
#>          ID      start        end          v
#>     1:    1 1997-09-25 1997-09-27 0.40898255
#>     2:    1 1997-11-09 1997-11-30 0.44067634
#>     3:    1 1998-04-27 1998-07-17 1.73142460
#>     4:    1 1999-08-05 1999-11-05 0.41103832
#>     5:    1 1999-12-09 2000-01-26 0.90639735
#>    ---                                      
#> 60286: 1000 2023-01-06 2023-03-28 0.54727106
#> 60287: 1000 2023-07-20 2023-10-16 1.74270130
#> 60288: 1000 2024-03-24 2024-06-23 0.07110824
#> 60289: 1000 2024-07-13 2024-07-31 0.63888263
#> 60290: 1000 2024-10-02 2024-10-19 0.22872167
