Manually setting the last value in difftime output in an R script

cuxqih21  posted on 2023-04-27  in  Other

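I am coding in R on Databricks.
I want the time interval (in hours) between consecutive date1 entries, with the rows ordered by pid, med and date1.
For each date's sequence of events, I want the value for the latest date1 entry to be set manually to 24 hours.
A cohort is a set of rows where pid, med and the date of date1 are the same.
Any change in these ends the cohort, and its last row should get hours_output == 24.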
df

pid       med                date1                    
1  1       drugA             2023-02-02 09:00:00         
2  1       drugA             2023-02-02 12:00:00         
3  1       drugA             2023-02-02 14:00:00        
4  1       drugB             2023-02-03 10:00:00         
5  1       drugB             2023-02-03 18:00:00

The script I tried:

df1 <- df %>%
  arrange(pid, med, date1) %>%
  mutate(hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")))

# Replace the last duration value with 24 hours
df1$hours_output[last(nrow(df1))] <- 24

df1 <- df1 %>% select(med, date1, hours_output)
head(df1)

Actual output

pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        20.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        18.00 (18 hours to the next row - not shown)

Desired output

pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        24.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        24.00

nbewdwxp1#

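Since you want this calculated per patient and per drug, you should use group_by (or .by=), so that differences are not mistakenly computed across different groups.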

library(dplyr) # 1.1.0 or later for .by=
df %>%
  mutate(date1 = as.POSIXct(date1)) %>% # may not be needed with your real data
  mutate(
    # convert the gaps to hours first, then append 24 for the last row of each group;
    # this way the 24 never inherits the automatically chosen difftime unit
    hours_output = c(as.numeric(diff(date1), units = "hours"), 24),
    .by = c(pid, med)
  )
#   pid   med               date1 hours_output
# 1   1 drugA 2023-02-02 09:00:00            3
# 2   1 drugA 2023-02-02 12:00:00            2
# 3   1 drugA 2023-02-02 14:00:00           24
# 4   1 drugB 2023-02-03 10:00:00            8
# 5   1 drugB 2023-02-03 18:00:00           24

I am using .by=, which is new in dplyr 1.1.0; if you have an earlier version, use group_by explicitly:

df %>%
  mutate(date1 = as.POSIXct(date1)) %>%
  group_by(pid, med) %>%
  # convert to hours before appending 24, so the 24 is always in hours
  mutate(hours_output = c(as.numeric(diff(date1), units = "hours"), 24)) %>%
  ungroup()
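
The same result can also be reached while staying close to the question's original lead()/difftime() attempt. The following is only a minimal sketch: it assumes the sample data from the question, parses date1 with tz = "UTC" for reproducibility, and uses coalesce() to replace the NA that lead() produces on the last row of each group with 24.

```r
library(dplyr)

# Sample data copied from the question (time zone fixed to UTC as an assumption)
df <- data.frame(
  pid   = c(1, 1, 1, 1, 1),
  med   = c("drugA", "drugA", "drugA", "drugB", "drugB"),
  date1 = as.POSIXct(c("2023-02-02 09:00:00", "2023-02-02 12:00:00",
                       "2023-02-02 14:00:00", "2023-02-03 10:00:00",
                       "2023-02-03 18:00:00"), tz = "UTC")
)

res <- df %>%
  arrange(pid, med, date1) %>%
  group_by(pid, med) %>%
  mutate(
    # lead() is NA on the last row of each group; coalesce() turns that NA into 24
    hours_output = coalesce(
      as.numeric(difftime(lead(date1), date1, units = "hours")),
      24
    )
  ) %>%
  ungroup()

res$hours_output
```

Because the grouping happens before lead(), the last drugA row no longer sees the first drugB row, which is what produced the 20-hour value in the question.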

p3rjfoxz2#

I was able to get this working in Databricks as an example.
A working example using dummy data:

library(dplyr)

# Sample dataframe with datetime values
df <- data.frame(datetime_col = c("2023-02-02 09:00:00", "2023-02-02 12:00:00", "2023-02-02 14:00:00"))

# Convert datetime column to POSIXct object
df$datetime_col <- as.POSIXct(df$datetime_col, format = "%Y-%m-%d %H:%M:%S")

# Sort chronologically (only possible once df exists and the column is parsed)
df <- df %>% arrange(datetime_col)

# Calculate durations between consecutive datetime values, converted to hours
durations <- as.numeric(diff(df$datetime_col), units = "hours")

# Append 24 hours as the duration for the last row
durations <- c(durations, 24)

durations <- round(durations, 2)

df$duration <- durations

df_f <- df %>% select(datetime_col, duration)
df_f
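
One detail worth illustrating is why the durations are converted to hours explicitly: diff() on date-times picks its unit automatically from the size of the gaps, so the raw numbers are not always hours. The sketch below uses made-up timestamps (not from the question) with a 30-minute gap to show this.

```r
# Made-up timestamps: a 30-minute gap followed by a 3-hour gap
x <- as.POSIXct(c("2023-02-02 09:00:00", "2023-02-02 09:30:00",
                  "2023-02-02 12:30:00"), tz = "UTC")

gaps <- diff(x)
gap_units <- units(gaps)   # unit chosen automatically; here it is minutes, not hours

# Explicit conversion makes the result independent of the automatic unit
hours <- as.numeric(gaps, units = "hours")

# Append the fixed 24 only after the conversion to hours
hours <- c(hours, 24)
hours
```

Appending the 24 after the conversion keeps it from being interpreted in whatever unit diff() happened to choose.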

lrl1mhuk3#

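The following works on the dummy data (as shown at the top of the question) when run in Databricks.
It gives the desired output.
(It also works on the real data in Databricks.)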

library(dplyr)
library(lubridate)

# Convert datetime column to POSIXct object
df$date1 <- ymd_hms(df$date1)

df <- df %>% arrange(date1)

# Calculate duration between consecutive datetime values, including last interval
durations <- c(diff(df$date1), 0)

# Convert durations to hours and round to 2 decimal places
durations <- round(as.numeric(durations, units = "hours"), 2)

# Replace any negative values with 0
durations[durations < 0] <- 0

# Find last timestamp for each date and replace duration with 24 hours
last_times <- df %>%
  group_by(Date = as.Date(date1)) %>%
  slice_tail(n = 1) %>%
  ungroup()

durations[df$date1 %in% last_times$date1] <- 24

df$duration <- durations

df1 <- df %>% select(date1, duration)

head(df1, 10)
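
If the 24-hour value should be set per pid, med and calendar day together (as the question's definition of a cohort suggests), the grouping can be done in a single step instead of matching timestamps afterwards. This is a rough sketch under assumptions: the data below is made up to show one drug spanning two days, and the helper column day is a hypothetical name introduced only for illustration.

```r
library(dplyr)

# Made-up data: one patient, one drug, recorded across two calendar days
df <- data.frame(
  pid   = 1,
  med   = "drugA",
  date1 = as.POSIXct(c("2023-02-02 09:00:00", "2023-02-02 12:00:00",
                       "2023-02-03 08:00:00", "2023-02-03 10:00:00"), tz = "UTC")
)

res <- df %>%
  mutate(day = as.Date(date1)) %>%      # hypothetical helper column: calendar day
  arrange(pid, med, date1) %>%
  group_by(pid, med, day) %>%
  # gaps within each cohort in hours, with 24 for the cohort's last row
  mutate(hours_output = c(as.numeric(diff(date1), units = "hours"), 24)) %>%
  ungroup()

res$hours_output
```

With this grouping, the overnight gap between 12:00 on the first day and 08:00 on the next is never computed, because those rows fall into different cohorts.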
