R programming - data cleaning - date/time

trnvg8h3  posted on 2023-02-17 in Other

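Hello Stack Overflow community,
I am currently working with a large dataset that contains date/time variables and a numeric variable quantifying the time spent in physical activity of a certain intensity.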

data_raw <- structure(list(`Bout Start` = c("2/8/2017 9:01:00 AM", "2/8/2017 9:23:00 AM", "2/8/2017 9:42:00 AM", "2/8/2017 11:49:00 AM", "2/8/2017 1:39:00 PM"), `Bout End` = c("2/8/2017 9:12:00 AM", "2/8/2017 9:38:00 AM", "2/8/2017 9:52:00 AM", "2/8/2017 12:05:00 PM", "2/8/2017 1:58:00 PM"),`Time in Bout` = c(11, 15, 10, 16, 19)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

I need the dataset in the following format:

data_processed <- structure(list(Date = structure(c(Date5306 = 17205, Date5307 = 17205, Date5308 = 17205, Date5309 = 17205, Date5310 = 17205), class = "Date"), Hour = structure(c(28800, 32400, 36000, 39600, 43200), class = c("hms", "difftime"), units = "secs"), `Time in Bout (Hourly)` = c(0, 36, 0, 11, 5)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

Can anyone help me with this? Thanks in advance!

kqlmhetl #1

First, we need to convert the date-time strings in data_raw into actual date-time variables:

data <- within(data_raw, {
  `Bout Start` <- as.POSIXct(`Bout Start`, format = "%m/%d/%Y %I:%M:%S %p")
  `Bout End`   <- as.POSIXct(`Bout End`,   format = "%m/%d/%Y %I:%M:%S %p")
})

Your data now looks like this:

data
#> # A tibble: 5 x 3
#>   `Bout Start`        `Bout End`          `Time in Bout`
#>   <dttm>              <dttm>                       <dbl>
#> 1 2017-02-08 09:01:00 2017-02-08 09:12:00             11
#> 2 2017-02-08 09:23:00 2017-02-08 09:38:00             15
#> 3 2017-02-08 09:42:00 2017-02-08 09:52:00             10
#> 4 2017-02-08 11:49:00 2017-02-08 12:05:00             16
#> 5 2017-02-08 13:39:00 2017-02-08 13:58:00             19

Next, we create a vector of hours against which the bouts will be checked:

times <- seq(as.POSIXct("2017-02-08 08:00"), by = "hour", len = 7)

The tricky part is computing, for each hour, the number of minutes during which a bout took place:

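mins <- rowSums(sapply(seq_len(nrow(data)), function(i) {
   # minutes from each hour mark to the bout end; keep only offsets inside that hour
   a <- as.numeric(difftime(data$`Bout End`[i], times, units = "mins"))
   a <- ifelse(a > 0 & a < 60, a, 0)
   # same for the bout start (explicit units avoid difftime's automatic unit choice)
   b <- as.numeric(difftime(data$`Bout Start`[i], times, units = "mins"))
   b <- ifelse(b > 0 & b < 60, b, 0)
   # the modulo covers a bout that runs past the end of the hour (a is 0 there)
   (a - b) %% 60
}))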

Finally, we build the resulting data frame:

data.frame(Date = as.Date(head(times, -1)),
           Hour = strftime(head(times, -1), "%H:%M:%S"),
           `Time in bout` = head(mins, -1), check.names = FALSE)
#>         Date     Hour Time in bout
#> 1 2017-02-08 08:00:00            0
#> 2 2017-02-08 09:00:00           36
#> 3 2017-02-08 10:00:00            0
#> 4 2017-02-08 11:00:00           11
#> 5 2017-02-08 12:00:00            5
#> 6 2017-02-08 13:00:00           19

Created on 2023-02-15 with reprex v2.0.2
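One limitation worth noting: the per-hour calculation in this answer only looks at where a bout starts and ends, so a bout that covers an entire clock hour would contribute 0 minutes to that hour. A minimal sketch of a more general interval-overlap calculation follows. The helper name `hourly_overlap` and the third bout (13:30 to 15:10) are hypothetical, added only to illustrate the multi-hour case; they are not part of the original answer or data.

```r
# Hypothetical helper: minutes of all bouts that fall inside each clock hour,
# computed as the overlap between [start, end] and [hour, hour + 1h].
hourly_overlap <- function(start, end, hours) {
  s  <- as.numeric(start)   # seconds since the epoch
  e  <- as.numeric(end)
  h0 <- as.numeric(hours)
  sapply(h0, function(h) {
    lo <- pmax(s, h)          # later of bout start and hour start
    hi <- pmin(e, h + 3600)   # earlier of bout end and hour end
    sum(pmax(hi - lo, 0)) / 60  # negative overlap means no overlap; seconds -> minutes
  })
}

# Two bouts from the question plus one made-up bout spanning three clock hours
start <- as.POSIXct(c("2017-02-08 09:01:00", "2017-02-08 11:49:00",
                      "2017-02-08 13:30:00"), tz = "UTC")
end   <- as.POSIXct(c("2017-02-08 09:12:00", "2017-02-08 12:05:00",
                      "2017-02-08 15:10:00"), tz = "UTC")
hours <- seq(as.POSIXct("2017-02-08 08:00:00", tz = "UTC"),
             by = "hour", length.out = 8)

mins <- hourly_overlap(start, end, hours)
mins
```

With this input the 14:00 hour should receive the full 60 minutes of the made-up bout, which the start/end-offset approach would miss.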

vfhzx4xs #2

This is a fairly complex task; here is a tidyverse approach:

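  • Build a date/hour sequence so that missing hours can be filled in (dd1)
  • Split bouts that cross an hour boundary into their corresponding hourly bins (dd2)
  • Join dd1 and dd2
  • Along the way, convert the strings to date-times and to the full hour they fall in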

Note that this is a dynamic approach: the first and last hour/date are taken from the raw data rather than hard-coded.
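The "full hour" step in the list above is done in the code below by formatting each timestamp with "%H:00:00" and parsing it again. As a small aside, lubridate's `floor_date()` expresses the same rounding directly; the sketch below assumes timestamps parsed with `mdy_hms()` (UTC by default) and uses two of the question's values. The variable names `x`, `hour_bin` and `hour_seq` are illustrative only.

```r
library(lubridate)

# Two timestamps from the question, parsed as UTC date-times
x <- mdy_hms(c("2/8/2017 9:01:00 AM", "2/8/2017 1:39:00 PM"))

# Round each timestamp down to the start of its clock hour
hour_bin <- floor_date(x, unit = "hour")
hour_bin

# Hourly sequence from the first to the last bin, the same idea as dd1
hour_seq <- seq(min(hour_bin), max(hour_bin), by = "hour")
hour_seq
```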

library(dplyr) # >= v1.1.0 for ".by" in full_join's summarize and consecutive_id
library(tidyr) # separate and replace_na
library(lubridate) # date functions

dd1 <- tibble(ID = seq(
         ymd_hms(format(first(mdy_hms(data_raw$`Bout Start`)),
         "%Y-%m-%d %H:00:00")), 
         ymd_hms(format(last(mdy_hms(data_raw$`Bout Start`)), 
         "%Y-%m-%d %H:00:00")), 3600))

dd1
# A tibble: 5 × 1
  ID                 
  <dttm>             
1 2017-02-08 09:00:00
2 2017-02-08 10:00:00
3 2017-02-08 11:00:00
4 2017-02-08 12:00:00
5 2017-02-08 13:00:00

dd2 <- data_raw %>% 
  mutate(`Bout Start` = mdy_hms(`Bout Start`), 
         `Bout End` = mdy_hms(`Bout End`), 
         is = format(`Bout Start`, "%H") != format(`Bout End`, "%H")) %>%
  uncount(is + 1) %>% 
  group_by(grp = consecutive_id(is)) %>% 
  mutate(`Bout Start` = if_else(is & row_number() == 2, 
     ymd_hms(format(first(`Bout End`), "%Y-%m-%d %H:00:00")), `Bout Start`), 
         `Bout End` = if_else(is & row_number() == 1, 
     ymd_hms(format(first(`Bout End`), "%Y-%m-%d %H:00:00")), `Bout End`), 
         `Time in Bout` = `Bout End` - `Bout Start`, 
         ID = ymd_hms(format(`Bout Start`, "%Y-%m-%d %H:00:00")), is = NULL) %>% 
  ungroup() %>% 
  select(-grp)

dd2
# A tibble: 6 × 4
  `Bout Start`        `Bout End`          `Time in Bout` ID                 
  <dttm>              <dttm>              <drtn>         <dttm>             
1 2017-02-08 09:01:00 2017-02-08 09:12:00 11 mins        2017-02-08 09:00:00
2 2017-02-08 09:23:00 2017-02-08 09:38:00 15 mins        2017-02-08 09:00:00
3 2017-02-08 09:42:00 2017-02-08 09:52:00 10 mins        2017-02-08 09:00:00
4 2017-02-08 11:49:00 2017-02-08 12:00:00 11 mins        2017-02-08 11:00:00
5 2017-02-08 12:00:00 2017-02-08 12:05:00  5 mins        2017-02-08 12:00:00
6 2017-02-08 13:39:00 2017-02-08 13:58:00 19 mins        2017-02-08 13:00:00

Join dd1 and dd2, separate ID into Date and Hour, and replace the NAs for missing dates/hours with 0:

full_join(dd1, dd2, multiple="all") %>% 
  mutate(`Time in Bout` = replace_na(`Time in Bout`, duration(0))) %>% 
  summarize(`Time in Bout (Hourly)` = sum(`Time in Bout`), .by = ID) %>% 
  separate(ID, c("Date", "Hour"), sep=" ")
Joining with `by = join_by(ID)`
# A tibble: 5 × 3
  Date       Hour     `Time in Bout (Hourly)`
  <chr>      <chr>    <drtn>                 
1 2017-02-08 09:00:00 36 mins                
2 2017-02-08 10:00:00  0 mins                
3 2017-02-08 11:00:00 11 mins                
4 2017-02-08 12:00:00  5 mins                
5 2017-02-08 13:00:00 19 mins
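In the result above, Date is a character column and the hourly total is a difftime (`<drtn>`), whereas the format requested in the question has a Date column and plain numbers. A minimal base R sketch of those two conversions, using stand-in vectors `d` and `date_chr` rather than the actual joined tibble:

```r
# Stand-in for the `Time in Bout (Hourly)` column: a difftime in minutes
d <- as.difftime(c(36, 0, 11, 5, 19), units = "mins")

# Plain numeric minutes; naming the units makes the conversion explicit
mins_num <- as.numeric(d, units = "mins")
mins_num

# Stand-in for the character Date column produced by separate()
date_chr <- rep("2017-02-08", 5)
date_val <- as.Date(date_chr)
class(date_val)
```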
