R: Find the first row of a run of a specific value after a specific date, by group

Posted by 093gszye on 2023-01-28 in Other

I have time-series data on different bears ("ID") at different positions ("Position"; Land or Ice). Here is a simplified version with two individuals (A and B):

ID <- rep.int(c("A", "B"), times = c(10, 10))
Dates <- c(seq(as.Date("2011-06-11"), as.Date("2011-06-20"), by = "days"),
               seq(as.Date("2011-05-27"), as.Date("2011-06-05"), by="days"))
Position <- c("Land", "Ice", "Land", "Land", "Ice", "Ice", "Land", "Land", "Land", "Land",
              "Land", "Land", "Land", "Ice", "Ice", "Land", "Land", "Land", "Ice", "Ice")

data <- data.frame(ID, Dates, Position)
   ID      Dates Position
1   A 2011-06-11     Land
2   A 2011-06-12      Ice
3   A 2011-06-13     Land
4   A 2011-06-14     Land
5   A 2011-06-15      Ice
6   A 2011-06-16      Ice
7   A 2011-06-17     Land
8   A 2011-06-18     Land
9   A 2011-06-19     Land
10  A 2011-06-20     Land
11  B 2011-05-27     Land
12  B 2011-05-28     Land
13  B 2011-05-29     Land
14  B 2011-05-30      Ice
15  B 2011-05-31      Ice
16  B 2011-06-01     Land
17  B 2011-06-02     Land
18  B 2011-06-03     Land
19  B 2011-06-04      Ice
20  B 2011-06-05      Ice

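I want to create a variable Arrival that marks the date each bear arrives on land. I define arrival on land as the date of the first row of a run of three consecutive "Land" values in Position. That row should be set to "Arrival" and all other rows to NA. The date must also fall after May 31.
For this dataset, the arrival dates look like this: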

   ID      Dates Position Arrival
1   A 2011-06-11     Land      NA
2   A 2011-06-12      Ice      NA
3   A 2011-06-13     Land      NA
4   A 2011-06-14     Land      NA
5   A 2011-06-15      Ice      NA
6   A 2011-06-16      Ice      NA
7   A 2011-06-17     Land Arrival
8   A 2011-06-18     Land      NA
9   A 2011-06-19     Land      NA
10  A 2011-06-20     Land      NA
11  B 2011-05-27     Land      NA
12  B 2011-05-28     Land      NA
13  B 2011-05-29     Land      NA
14  B 2011-05-30      Ice      NA
15  B 2011-05-31      Ice      NA
16  B 2011-06-01     Land Arrival
17  B 2011-06-02     Land      NA
18  B 2011-06-03     Land      NA
19  B 2011-06-04      Ice      NA
20  B 2011-06-05      Ice      NA

Is there a way to do this in R, preferably with dplyr?

ldfqzlk8 #1

We can do this with zoo::rollapply.

library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(
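    # TRUE only where the date is after May 31, the previous row was not Land,
    # and this row starts a window of 3 Land rows. Note: partial = TRUE also
    # accepts shorter all-Land windows at the end of a group.
    Arrival = Dates > "2011-05-31" &
         lag(Position != "Land", default = FALSE) &
         zoo::rollapply(Position == "Land", 3, align = "left", FUN = all, partial = TRUE)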
  ) %>%
  ungroup()
# # A tibble: 20 × 4
#    ID    Dates      Position Arrival
#    <chr> <date>     <chr>    <lgl>  
#  1 A     2011-06-11 Land     FALSE  
#  2 A     2011-06-12 Ice      FALSE  
#  3 A     2011-06-13 Land     FALSE  
#  4 A     2011-06-14 Land     FALSE  
#  5 A     2011-06-15 Ice      FALSE  
#  6 A     2011-06-16 Ice      FALSE  
#  7 A     2011-06-17 Land     TRUE   
#  8 A     2011-06-18 Land     FALSE  
#  9 A     2011-06-19 Land     FALSE  
# 10 A     2011-06-20 Land     FALSE  
# 11 B     2011-05-27 Land     FALSE  
# 12 B     2011-05-28 Land     FALSE  
# 13 B     2011-05-29 Land     FALSE  
# 14 B     2011-05-30 Ice      FALSE  
# 15 B     2011-05-31 Ice      FALSE  
# 16 B     2011-06-01 Land     TRUE   
# 17 B     2011-06-02 Land     FALSE  
# 18 B     2011-06-03 Land     FALSE  
# 19 B     2011-06-04 Ice      FALSE  
# 20 B     2011-06-05 Ice      FALSE
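
The output above is a logical column, whereas the question asks for "Arrival" on the matching row and NA elsewhere. A minimal sketch of that conversion, using a toy vector (`arrival_flag` is a hypothetical name standing in for the computed column):

```r
# Toy logical vector standing in for the Arrival column computed above
arrival_flag <- c(FALSE, FALSE, TRUE, FALSE)

# ifelse() yields "Arrival" where the flag is TRUE and NA elsewhere
arrival_label <- ifelse(arrival_flag, "Arrival", NA_character_)
arrival_label
```

Inside the pipeline, the same expression could be wrapped around the logical condition in mutate().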

Base R with zoo

data$prevnotland <- ave(
  data$Position != "Land", data$ID, 
  FUN = function(z) c(FALSE, z[-length(z)]))
data$Arrival <- data$prevnotland & ave(
  data$Dates > "2011-05-31" & data$Position == "Land", data$ID,
  FUN = function(z) zoo::rollapply(z, 3, FUN=all, align="left", partial=TRUE))
data
#    ID      Dates Position prevnotland Arrival
# 1   A 2011-06-11     Land       FALSE   FALSE
# 2   A 2011-06-12      Ice       FALSE   FALSE
# 3   A 2011-06-13     Land        TRUE   FALSE
# 4   A 2011-06-14     Land       FALSE   FALSE
# 5   A 2011-06-15      Ice       FALSE   FALSE
# 6   A 2011-06-16      Ice        TRUE   FALSE
# 7   A 2011-06-17     Land        TRUE    TRUE
# 8   A 2011-06-18     Land       FALSE   FALSE
# 9   A 2011-06-19     Land       FALSE   FALSE
# 10  A 2011-06-20     Land       FALSE   FALSE
# 11  B 2011-05-27     Land       FALSE   FALSE
# 12  B 2011-05-28     Land       FALSE   FALSE
# 13  B 2011-05-29     Land       FALSE   FALSE
# 14  B 2011-05-30      Ice       FALSE   FALSE
# 15  B 2011-05-31      Ice        TRUE   FALSE
# 16  B 2011-06-01     Land        TRUE    TRUE
# 17  B 2011-06-02     Land       FALSE   FALSE
# 18  B 2011-06-03     Land       FALSE   FALSE
# 19  B 2011-06-04      Ice       FALSE   FALSE
# 20  B 2011-06-05      Ice        TRUE   FALSE

mv1qrgav #2

library(dplyr)
left_join(data,
  data %>%
    arrange(ID, Dates) %>% # if not in OP order already
    group_by(ID, loc_grp = cumsum(Position != lag(Position, 1, ""))) %>%
    filter(Dates > as.Date("2011-05-31"), Position == "Land",  # strictly after May 31
           n() >= 3, row_number() == 1) %>%
    ungroup() %>%
    transmute(ID, Dates, Position, Arrival = "Arrival"))

Result

Joining with `by = join_by(ID, Dates, Position)`
   ID      Dates Position Arrival
1   A 2011-06-11     Land    <NA>
2   A 2011-06-12      Ice    <NA>
3   A 2011-06-13     Land    <NA>
4   A 2011-06-14     Land    <NA>
5   A 2011-06-15      Ice    <NA>
6   A 2011-06-16      Ice    <NA>
7   A 2011-06-17     Land Arrival
8   A 2011-06-18     Land    <NA>
9   A 2011-06-19     Land    <NA>
10  A 2011-06-20     Land    <NA>
11  B 2011-05-27     Land    <NA>
12  B 2011-05-28     Land    <NA>
13  B 2011-05-29     Land    <NA>
14  B 2011-05-30      Ice    <NA>
15  B 2011-05-31      Ice    <NA>
16  B 2011-06-01     Land Arrival
17  B 2011-06-02     Land    <NA>
18  B 2011-06-03     Land    <NA>
19  B 2011-06-04      Ice    <NA>
20  B 2011-06-05      Ice    <NA>

68de4m5k #3

Not as concise as the other solutions, but it works step by step with a few temporary variables.

library(tidyverse)

ddf <- data |>
  arrange(ID, Dates) |>
  group_by(ID) |>
  mutate(n = lead(Position, n = 1)) |>
  mutate(nn = lead(Position, n = 2)) |>
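  # keep rows that start three consecutive Land positions, strictly after May 31
  filter(Position == "Land" & Position == n & Position == nn & Dates > "2011-05-31") |>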
  slice_head(n = 1) |>
  select(-(n:nn)) |>
  mutate(Arrival = "Arrival")

ddf |> right_join(data) |> arrange(ID, Dates)
#> Joining, by = c("ID", "Dates", "Position")
#> # A tibble: 20 × 4
#> # Groups:   ID [2]
#>    ID    Dates      Position Arrival
#>    <chr> <date>     <chr>    <chr>  
#>  1 A     2011-06-11 Land     <NA>   
#>  2 A     2011-06-12 Ice      <NA>   
#>  3 A     2011-06-13 Land     <NA>   
#>  4 A     2011-06-14 Land     <NA>   
#>  5 A     2011-06-15 Ice      <NA>   
#>  6 A     2011-06-16 Ice      <NA>   
#>  7 A     2011-06-17 Land     Arrival
#>  8 A     2011-06-18 Land     <NA>   
#>  9 A     2011-06-19 Land     <NA>   
#> 10 A     2011-06-20 Land     <NA>   
#> 11 B     2011-05-27 Land     <NA>   
#> 12 B     2011-05-28 Land     <NA>   
#> 13 B     2011-05-29 Land     <NA>   
#> 14 B     2011-05-30 Ice      <NA>   
#> 15 B     2011-05-31 Ice      <NA>   
#> 16 B     2011-06-01 Land     Arrival
#> 17 B     2011-06-02 Land     <NA>   
#> 18 B     2011-06-03 Land     <NA>   
#> 19 B     2011-06-04 Ice      <NA>   
#> 20 B     2011-06-05 Ice      <NA>

bzzcjhmw #4

I hope your *preferably with dplyr* means you are still open to other possibilities :) If so, here is a data.table alternative.

library(data.table)
setDT(data)

data[Dates > "2011-05-31",
     Arrival := if(.N > 2 & Position[1] == "Land") c("Arrival", rep(NA, .N - 1)),
     by = .(ID, rleid(Position))]

    ID      Dates Position Arrival
 1:  A 2011-06-11     Land    <NA>
 2:  A 2011-06-12      Ice    <NA>
 3:  A 2011-06-13     Land    <NA>
 4:  A 2011-06-14     Land    <NA>
 5:  A 2011-06-15      Ice    <NA>
 6:  A 2011-06-16      Ice    <NA>
 7:  A 2011-06-17     Land Arrival
 8:  A 2011-06-18     Land    <NA>
 9:  A 2011-06-19     Land    <NA>
10:  A 2011-06-20     Land    <NA>
11:  B 2011-05-27     Land    <NA>
12:  B 2011-05-28     Land    <NA>
13:  B 2011-05-29     Land    <NA>
14:  B 2011-05-30      Ice    <NA>
15:  B 2011-05-31      Ice    <NA>
16:  B 2011-06-01     Land Arrival
17:  B 2011-06-02     Land    <NA>
18:  B 2011-06-03     Land    <NA>
19:  B 2011-06-04      Ice    <NA>
20:  B 2011-06-05      Ice    <NA>
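
Explanation:

Select the relevant rows (Dates > "2011-05-31"). Create groups by ID and by consecutive runs of Position (by = .(ID, rleid(Position))). Within each group, if the run has more than 2 rows (.N > 2) and its value is "Land" (Position[1] == "Land"), create a result whose first value is "Arrival" and whose remaining values (.N - 1) are NA. Add the new column by reference (:=).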
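
To make the run grouping concrete, here is a small base R sketch using rle() on bear A's positions. It only illustrates the run logic (run ids, run lengths, run starts) and leaves out the date filter; the variable names are illustrative, not taken from the answer.

```r
# Positions of bear A from the question
pos_a <- c("Land", "Ice", "Land", "Land", "Ice", "Ice",
           "Land", "Land", "Land", "Land")

runs <- rle(pos_a)

# One id per run of identical consecutive values (what rleid() returns)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Length of the run each row belongs to (the role of .N within each group)
run_len <- rep(runs$lengths, runs$lengths)

# A row starts a run when its run id differs from the previous row's
is_run_start <- c(TRUE, run_id[-1] != run_id[-length(run_id)])

# First row of a Land run that is at least 3 rows long
candidate <- is_run_start & pos_a == "Land" & run_len >= 3
which(candidate)  # row 7, i.e. 2011-06-17
```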

6ovsh4lw #5

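This dplyr approach uses a relative (not hard-coded) year for the date condition. rleid comes from the data.table package. It can be replaced, but it is very convenient.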

library(dplyr)

data %>% 
  group_by(ID) %>% 
  mutate(grp = data.table::rleid(Position)) %>% 
  group_by(ID, grp) %>% 
  mutate(Arrival = if_else(n() >= 3 & Position == "Land" &  row_number() == 1 &
                     Dates > paste0(format(Dates, "%Y"), "-05-31"), 
                       "Arrival", NA_character_)) %>% 
  ungroup() %>% 
  select(-grp)
# A tibble: 20 × 4
   ID    Dates      Position Arrival
   <chr> <date>     <chr>    <chr>  
 1 A     2011-06-11 Land     NA     
 2 A     2011-06-12 Ice      NA     
 3 A     2011-06-13 Land     NA     
 4 A     2011-06-14 Land     NA     
 5 A     2011-06-15 Ice      NA     
 6 A     2011-06-16 Ice      NA     
 7 A     2011-06-17 Land     Arrival
 8 A     2011-06-18 Land     NA     
 9 A     2011-06-19 Land     NA     
10 A     2011-06-20 Land     NA     
11 B     2011-05-27 Land     NA     
12 B     2011-05-28 Land     NA     
13 B     2011-05-29 Land     NA     
14 B     2011-05-30 Ice      NA     
15 B     2011-05-31 Ice      NA     
16 B     2011-06-01 Land     Arrival
17 B     2011-06-02 Land     NA     
18 B     2011-06-03 Land     NA     
19 B     2011-06-04 Ice      NA     
20 B     2011-06-05 Ice      NA
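
As noted, rleid can be replaced. One possible dplyr-only variant, assuming dplyr 1.1.0 or later, uses consecutive_id() for the run ids. This is a sketch that rebuilds the question's data so it runs on its own; `run` and `res` are illustrative names.

```r
library(dplyr)

# Example data from the question
data <- data.frame(
  ID = rep(c("A", "B"), each = 10),
  Dates = c(seq(as.Date("2011-06-11"), by = "day", length.out = 10),
            seq(as.Date("2011-05-27"), by = "day", length.out = 10)),
  Position = c("Land", "Ice", "Land", "Land", "Ice", "Ice", "Land", "Land", "Land", "Land",
               "Land", "Land", "Land", "Ice", "Ice", "Land", "Land", "Land", "Ice", "Ice")
)

res <- data %>%
  # consecutive_id() starts a new id whenever ID or Position changes
  group_by(ID, run = consecutive_id(ID, Position)) %>%
  mutate(Arrival = if_else(
    n() >= 3 & Position == "Land" & row_number() == 1 &
      Dates > as.Date(paste0(format(Dates, "%Y"), "-05-31")),
    "Arrival", NA_character_)) %>%
  ungroup() %>%
  select(-run)

which(!is.na(res$Arrival))  # rows 7 and 16
```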
