R: Count the unique days of service use per group

wko9yo5t  posted on 2023-03-27  in Other

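I have a dataset with the start and end dates of individuals' service use (one row per episode). Sometimes these periods overlap, and I want to count the number of unique days in a year on which a user was in contact with the service (using R). I tried the ivs package but ran into problems with rows where the start and end dates fall on the same day. How can I count the distinct days when the same person has both single-day episodes and multi-day episodes?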

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

What I tried:

ab <- a %>%
  mutate(
    start_dt = as.Date(ActivityStartDate, format = "%m/%d/%Y"),
    end_dt = as.Date(ActivityEndDate, format = "%m/%d/%Y")
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )

c <-ab %>%
  group_by(ID) %>%
  mutate(group = iv_identify_group(range)) %>%
  group_by(group, .add = TRUE)

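But this does not work for records whose start and end dates fall on the same day. I would also like the output to be a data frame with a date variable rather than a vector, so that I can count the total number of active days (without counting the same day more than once).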


v6ylcynt1#

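One approach is to filter the data for each id, build a date sequence for every row, pool those sequences, and count the unique dates. I'm not sure what you mean by needing the output as a data frame with a date variable, but I converted the result to a data frame, hoping it is close to what you are after. Note that row 10 of your data has a start date that falls after its end date, so that has to be fixed before the approach below will work. I have assumed the two dates were simply swapped.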

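library(dplyr)

# Row 10 has its dates reversed; swap them first, otherwise seq.Date() errors.
eg_data[10, c("start_dt", "end_dt")] <- c("03/04/2016", "09/04/2016")

DayTotals <- sapply(unique(eg_data$id), function(Current_id) {
  Current_id_data <- eg_data %>% filter(id == Current_id)
  # One daily date sequence per episode (row) of this id
  Current_id_dates <- lapply(seq_len(nrow(Current_id_data)), function(i) {
    seq.Date(from = as.Date(Current_id_data$start_dt[i], format = "%m/%d/%Y"),
             to   = as.Date(Current_id_data$end_dt[i],   format = "%m/%d/%Y"),
             by   = "day")
  })
  # Pool all dates and count each calendar day once
  Current_id_dates %>% unlist %>% unique %>% length
})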

DayTotalsDF <- data.frame(id=unique(eg_data$id),
                          NoOfDays=DayTotals)

> DayTotalsDF
  id NoOfDays
1  1      402
2  2      333
3  3      298
4  4      232
5  5      268
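
A quick way to spot rows like row 10 before building any sequences is to compare the parsed dates directly. The following is a minimal base R sketch, using three rows of id 3 copied from the question. The names `eg`, `start`, `end` and `reversed` are only illustrative, and swapping the dates is one possible repair that rests on an assumption about what the data should be.

```r
# Three episodes of id 3, copied from the question (the middle one is row 10)
eg <- data.frame(
  id       = c(3, 3, 3),
  start_dt = c("07/01/2016", "09/04/2016", "10/10/2016"),
  end_dt   = c("08/31/2016", "03/04/2016", "11/29/2016")
)

start <- as.Date(eg$start_dt, format = "%m/%d/%Y")
end   <- as.Date(eg$end_dt,   format = "%m/%d/%Y")

# Positions of rows whose start date is after the end date
reversed <- which(start > end)
eg[reversed, ]

# One possible repair (an assumption): treat the two dates as swapped
eg$start <- pmin(start, end)
eg$end   <- pmax(start, end)
```

Whether swapping is the right fix depends on the data; the third answer in this thread instead treats the row as a single-day episode on 09/04/2016, which is why its total for id 3 differs.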

0tdrvxhp2#

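Compute the minimum and maximum of the two dates in each row so that start <= end always holds. Then use mapply() with seq.Date() to generate one date sequence per row. Combine those sequences into a single vector with unlist(), remove duplicate dates with unique(), and take the length of the result to get the total number of active days per id.
See: https://www.mycompiler.io/view/Jau8tbboisq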

library(dplyr)

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

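eg_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y"),
    # Guard against reversed rows: the earlier date becomes the start
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  group_by(id) %>%
  summarize(
    # SIMPLIFY = FALSE keeps a list even when all sequences have equal length
    total_days = length(unique(unlist(
      mapply(seq.Date, min_date, max_date, by = "day", SIMPLIFY = FALSE)
    )))
  )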

This gives:

     id total_days
  <dbl>      <int>
1     1        402
2     2        333
3     3        298
4     4        232
5     5        268

If this is not the desired result, please provide the desired result and, if possible, explain how you arrived at it (from the sample data alone).
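
If what is wanted is a data frame with one row per id and date, rather than only a count, the same expand-and-deduplicate idea can be kept in data frame form. This is a minimal base R sketch using only the four episodes of id 5 from the question. The names `eg`, `per_episode`, `days` and `counts` are made up for illustration.

```r
# The four episodes of id 5, copied from the question
eg <- data.frame(
  id       = c(5, 5, 5, 5),
  start_dt = c("01/01/2016", "06/05/2016", "08/25/2016", "11/01/2016"),
  end_dt   = c("06/10/2016", "07/25/2016", "08/25/2016", "12/30/2016")
)
start <- as.Date(eg$start_dt, format = "%m/%d/%Y")
end   <- as.Date(eg$end_dt,   format = "%m/%d/%Y")

# One small data frame per episode, listing every day it covers
per_episode <- lapply(seq_len(nrow(eg)), function(i) {
  data.frame(id = eg$id[i], date = seq(start[i], end[i], by = "day"))
})

# Stack the pieces and drop days that appear in more than one episode
days <- unique(do.call(rbind, per_episode))

# Count the distinct days per id
n <- tapply(days$date, days$id, length)
counts <- data.frame(id = as.numeric(names(n)), total_days = as.vector(n))
counts   # id 5 has 268 distinct days
```

`days` keeps the `date` column as a Date, so it can be filtered by year or joined to other tables before counting.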


izkcnapc3#

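I think this is a great use case for ivs. You only need to adjust your thinking slightly, from closed intervals like [ ] to half-open intervals like [ ). All you have to do is add 1 to your end dates, and then it "just works".
Half-open intervals also make the arithmetic work out nicely.
(This requires dplyr 1.1.0 or later.)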
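
To make the arithmetic point concrete, here is a small base R sketch (no ivs needed). It uses the single-day episode of 12/02/2016 from the question; the variable names are only illustrative. As the error message quoted at the end of this answer indicates, `iv()` requires `start` to be strictly less than `end`, so a same-day row only becomes a valid interval once the end is shifted by one day.

```r
start <- as.Date("2016-12-02")
end   <- as.Date("2016-12-02")   # same-day episode, closed interval [start, end]

# Closed interval: the day count needs an explicit + 1
closed_days <- as.integer(end - start) + 1L

# Half-open interval [start, end + 1): the day count is just the difference
end_open <- end + 1L
half_open_days <- as.integer(end_open - start)

closed_days      # 1
half_open_days   # 1

# Adjacent half-open intervals add up without double counting the boundary
jan <- c(as.Date("2016-01-01"), as.Date("2016-02-01"))
feb <- c(as.Date("2016-02-01"), as.Date("2016-03-01"))
two_months <- as.integer(jan[2] - jan[1]) + as.integer(feb[2] - feb[1])
two_months       # 31 + 29 = 60 (2016 is a leap year)
```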

library(dplyr, warn.conflicts = FALSE)
library(ivs)

df <- tibble(
  id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
  start_dt = c(
    "01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016", 
    "03/04/2016", "01/01/2016", "03/05/2016", "05/07/2016", 
    "07/01/2016", "09/04/2016", "10/10/2016", "01/01/2016",
    "05/28/2016", "01/01/2016", "06/05/2016", "08/25/2016", 
    "11/01/2016"
  ),  
  end_dt = c(
    "12/01/2016", "12/02/2016", "05/15/2017", "05/15/2016", 
    "12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016",
    "08/31/2016", "09/04/2016", "11/29/2016", "05/31/2016", 
    "08/19/2016", "06/10/2016", "07/25/2016", "08/25/2016", 
    "12/30/2016"
  )
)

df <- df %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y") + 1L
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )

df %>%
  reframe(range = iv_groups(range), .by = id) %>%
  mutate(days = as.integer(iv_end(range) - iv_start(range))) %>%
  summarise(count = sum(days), .by = id)
#> # A tibble: 5 × 2
#>      id count
#>   <dbl> <int>
#> 1     1   402
#> 2     2   333
#> 3     3   286
#> 4     4   232
#> 5     5   268

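You'll notice that my result for id 3 differs from the other solutions. That is because I think row 10 of your original dataset contains a typo: the end date is clearly before the start date. I changed it to 09/04/2016 in the data above. With the original values, the row looks like this: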

df[10,]
#> # A tibble: 1 × 3
#>      id start_dt   end_dt    
#>   <dbl> <chr>      <chr>     
#> 1     3 09/04/2016 03/04/2016

ivs detected this for me automatically:

#> Error in `mutate()`:
#> ℹ In argument: `range = iv(start_dt, end_dt)`.
#> Caused by error in `iv()`:
#> ! `start` must be less than `end`.
#> ℹ `start` is not less than `end` at locations: `10`.
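
For readers without ivs, the merge-then-sum idea can also be sketched in base R. `count_covered_days()` below is a hypothetical helper written for illustration. It is not an ivs function and makes no claim about how ivs is implemented. It takes half-open intervals [start, end) for a single id, merges the ones that overlap or touch, and sums the lengths of the merged groups.

```r
# Hypothetical helper, not part of ivs: days covered by the union of
# half-open intervals [start, end) belonging to one id.
count_covered_days <- function(start, end) {
  stopifnot(all(start < end))          # same rule that iv() enforces
  o <- order(start)
  start <- as.numeric(start[o])
  end   <- as.numeric(end[o])
  # Latest end seen among all earlier intervals
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  # A new merged group begins where an interval starts after all earlier ones ended
  group <- cumsum(start > prev_max_end)
  group_start <- tapply(start, group, min)
  group_end   <- tapply(end, group, max)
  sum(group_end - group_start)
}

# Episodes of id 5 from the question, with 1 added to each end date
s <- as.Date(c("01/01/2016", "06/05/2016", "08/25/2016", "11/01/2016"),
             format = "%m/%d/%Y")
e <- as.Date(c("06/10/2016", "07/25/2016", "08/25/2016", "12/30/2016"),
             format = "%m/%d/%Y") + 1L
total <- count_covered_days(s, e)
total   # 268, matching the count for id 5 above
```

Unlike the expand-every-day approaches, this never materializes individual dates, which may matter for long episodes or many rows.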
