How do I create timestamp buckets with an interval of x minutes from given timestamp values?

Posted by 20jt8wwn on 2021-07-24 in Java

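I have a table mydf with a timestamp column recorded per device. I want to keep merging timestamps for as long as the difference between two consecutive timestamps is 30 minutes or less. The first timestamp of a run is the start timestamp. Once a gap exceeds 30 minutes, the visit ends, and the last timestamp before the gap is recorded as the end timestamp, as in the example below.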

df<-data.frame(customer=rep("XYZ",4),device=rep("x",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:50:06"))
df1<-data.frame(customer=rep("XYZ",3),device=rep("y",3),time_stamps=c("2020-05-14 07:50:06","2020-05-14 08:15:06","2020-05-14 08:25:06"))
df2<-data.frame(customer=rep("XYZ",1),device=rep("z",1),time_stamps=c("2020-05-16 09:50:06"))
df3<-data.frame(customer=rep("XYZ",2),device=rep("a",2),time_stamps=c("2020-05-16 09:50:06","2020-05-16 19:50:06"))
df4<-data.frame(customer=rep("XYZ",2),device=rep("b",2),time_stamps=c("2020-05-17 09:50:06","2020-05-17 10:15:06"))
df5<-data.frame(customer=rep("XYZ",4),device=rep("c",4),time_stamps=c("2020-05-13 07:50:06","2020-05-13 07:55:06","2020-05-13 08:05:06","2020-05-13 08:32:06"))

mydf<-rbind(df,df1,df2,df3,df4,df5)

This is the data frame I expect:

expected_df<-data.frame(customer=rep("XYZ",8),device=c("x","x","y","z","a","a","b","c"),
        start_timestamp=c("2020-05-13 07:50:06","2020-05-13 08:50:06","2020-05-14 07:50:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 09:50:06","2020-05-13 07:50:06"),
        end_startstamp=c("2020-05-13 08:05:06","2020-05-13 08:50:06","2020-05-14 08:25:06","2020-05-16 09:50:06","2020-05-16 09:50:06","2020-05-16 19:50:06","2020-05-17 10:15:06","2020-05-13 08:32:06"))

Answer #1 (hiz5n14c)

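The key is to build a grouping variable that we can group_by. To do that, we flag each timestamp that falls within 30 * 60 seconds of the previous one for the same device, then use rle to turn the runs of that flag into group numbers: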

library(dplyr)

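mydf %>% 
  group_by(customer, device) %>% 
  mutate(time_stamps = as.POSIXct(time_stamps),
         # gap to the previous row; the first row of each device gets a gap of 0
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         # TRUE when the gap is at most 30 minutes (30 * 60 seconds)
         same_group_as_lag = diff <= 30*60,
         # number the runs of identical flags 1, 2, ...
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths)))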

#> # A tibble: 16 x 6
#> # Groups:   customer, device [6]
#>    customer device time_stamps         diff       same_group_as_lag  group
#>    <fct>    <fct>  <dttm>              <drtn>     <lgl>              <int>
#>  1 XYZ      x      2020-05-13 07:50:06     0 secs TRUE                   1
#>  2 XYZ      x      2020-05-13 07:55:06   300 secs TRUE                   1
#>  3 XYZ      x      2020-05-13 08:05:06   600 secs TRUE                   1
#>  4 XYZ      x      2020-05-13 08:50:06  2700 secs FALSE                  2
#>  5 XYZ      y      2020-05-14 07:50:06     0 secs TRUE                   1
#>  6 XYZ      y      2020-05-14 08:15:06  1500 secs TRUE                   1
#>  7 XYZ      y      2020-05-14 08:25:06   600 secs TRUE                   1
#>  8 XYZ      z      2020-05-16 09:50:06     0 secs TRUE                   1
#>  9 XYZ      a      2020-05-16 09:50:06     0 secs TRUE                   1
#> 10 XYZ      a      2020-05-16 19:50:06 36000 secs FALSE                  2
#> 11 XYZ      b      2020-05-17 09:50:06     0 secs TRUE                   1
#> 12 XYZ      b      2020-05-17 10:15:06  1500 secs TRUE                   1
#> 13 XYZ      c      2020-05-13 07:50:06     0 secs TRUE                   1
#> 14 XYZ      c      2020-05-13 07:55:06   300 secs TRUE                   1
#> 15 XYZ      c      2020-05-13 08:05:06   600 secs TRUE                   1
#> 16 XYZ      c      2020-05-13 08:32:06  1620 secs TRUE                   1
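
For readers unfamiliar with the rle idiom in the group column, here is a minimal base R sketch of what it does on the four flags of device x. The variable names runs and group are only illustrative.

```r
# Flags for device "x", copied from the same_group_as_lag column above:
# TRUE means "within 30 minutes of the previous row".
same_group_as_lag <- c(TRUE, TRUE, TRUE, FALSE)

# rle() compresses the vector into runs of equal values.
runs <- rle(same_group_as_lag)
runs$lengths  # 3 1
runs$values   # TRUE FALSE

# Number the runs 1, 2, ... and repeat each number once per row of its run.
group <- rep(seq_along(runs$lengths), runs$lengths)
group         # 1 1 1 2
```

The with(rle(...), ...) call in the pipeline is the same computation written in one expression.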

Then, summarise:

mydf %>% 
  group_by(customer, device) %>% 
  mutate(time_stamps = as.POSIXct(time_stamps),
         diff = time_stamps - lag(time_stamps, default = first(time_stamps)),
         same_group_as_lag = diff <= 30*60,
         group = with(rle(same_group_as_lag), rep(seq_along(lengths), lengths))) %>% 
  # add group to the existing grouping (dplyr >= 1.0.0 spells this .add = TRUE)
  group_by(group, add = TRUE) %>% 
  # one row per customer, device and group: first and last timestamp
  summarise(start_timestamp = min(time_stamps),
            end_startstamp = max(time_stamps))

#> # A tibble: 8 x 5
#> # Groups:   customer, device [6]
#>   customer device group start_timestamp     end_startstamp
#>   <fct>    <fct>  <int> <dttm>              <dttm>
#> 1 XYZ      x          1 2020-05-13 07:50:06 2020-05-13 08:05:06
#> 2 XYZ      x          2 2020-05-13 08:50:06 2020-05-13 08:50:06
#> 3 XYZ      y          1 2020-05-14 07:50:06 2020-05-14 08:25:06
#> 4 XYZ      z          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 5 XYZ      a          1 2020-05-16 09:50:06 2020-05-16 09:50:06
#> 6 XYZ      a          2 2020-05-16 19:50:06 2020-05-16 19:50:06
#> 7 XYZ      b          1 2020-05-17 09:50:06 2020-05-17 10:15:06
#> 8 XYZ      c          1 2020-05-13 07:50:06 2020-05-13 08:32:06

Created on 2020-06-25 by the reprex package (v0.3.0)
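
One caveat that seems worth checking on real data: numbering runs of the TRUE/FALSE flag matches the expected output for the sample above, but as far as I can tell it can diverge from the "new visit after a gap of more than 30 minutes" rule. This happens when two long gaps follow each other, or when a short gap follows a long one. A possible alternative is to count the long gaps with cumsum. The sketch below is base R only. session_id and the timestamps in ts are made up for illustration and are not part of the answer above, and it assumes the timestamps are already sorted within a device.

```r
# Hypothetical helper (not from the answer): label visits by counting long gaps.
session_id <- function(time_stamps, max_gap_mins = 30) {
  ts_ct <- as.POSIXct(time_stamps, tz = "UTC")
  # Gap to the previous timestamp in minutes; the first row has no predecessor.
  gap <- c(0, as.numeric(diff(ts_ct), units = "mins"))
  # Every gap above the threshold starts a new visit.
  cumsum(gap > max_gap_mins) + 1L
}

# Made-up timestamps with gaps of 60, 10, 110 and 120 minutes.
ts <- c("2020-05-13 07:00:00", "2020-05-13 08:00:00", "2020-05-13 08:10:00",
        "2020-05-13 10:00:00", "2020-05-13 12:00:00")

cumsum_id <- session_id(ts)
cumsum_id  # 1 2 2 3 4

# The run-numbering idiom on the same data, for comparison.
ts_ct <- as.POSIXct(ts, tz = "UTC")
same_group_as_lag <- c(0, as.numeric(diff(ts_ct), units = "mins")) <= 30
rle_id <- with(rle(same_group_as_lag), rep(seq_along(lengths), lengths))
rle_id     # 1 2 3 4 4
```

Here the run numbering separates 08:00 from 08:10, which are only 10 minutes apart, and puts 10:00 and 12:00 in the same group. Inside the pipeline, the helper could presumably be dropped in as mutate(group = session_id(time_stamps)) after group_by(customer, device), with the rest unchanged.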
