For each individual id, filter rows that have consecutive values in the year column

Posted by csbfibhn on 2023-04-09 in Other


I want to create a balanced panel dataset from the dataframe below:

id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
1     2012         1
1     2013         0
2     2007         0
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
2     2013         1
3     2007         1
3     2008         0
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1

For each id, I want to select 5 rows with consecutive program_year values in which value == 1.
The expected output should look like this:

id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1

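I have experimented with lead() and lag(), but without success. Once I have the desired output, the next step is to index the years so that the dataframe becomes a balanced panel.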

Source: https://stackoverflow.com/questions/75903945/for-each-individual-id-filter-rows-that-have-consecutive-values-in-the-year-col

4 Answers


f4t66c6m1#

Not sure if this is what you need:

Data <- "id  program_year  value
1     2007         1
1     2008         1
1     2009         1
1     2010         1
1     2011         1
1     2012         1
1     2013         0
2     2007         0
2     2008         1
2     2009         1
2     2010         1
2     2011         1  
2     2012         1
2     2013         1
3     2007         1
3     2008         0
3     2009         1
3     2010         1
3     2011         1
3     2012         1
3     2013         1"

DF <- read.table(text = Data, header = TRUE)

library(dplyr)

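DF %>%
  arrange(id, program_year) %>%                          # sort by id, then year
  group_by(id) %>%
  # keep rows at least one year after the previous row; lag() is NA on each id's first row, so that row is dropped
  filter((program_year - lag(program_year)) >= 1) %>% 
  mutate(consecutive = program_year - row_number()) %>%  # constant within a run of consecutive years
  group_by(id, consecutive) %>%
  filter(n() >= 5) %>%                                   # keep runs of at least 5 rows
  slice_head(n = 5) %>%                                  # first 5 rows of each run
  ungroup() %>%
  filter(value == 1) %>%                                 # applied last, so a run can end up with fewer than 5 rows
  select(id, program_year, value)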

It returns the following:

# A tibble: 14 × 3
      id program_year value
   <int>        <int> <int>
 1     1         2008     1
 2     1         2009     1
 3     1         2010     1
 4     1         2011     1
 5     1         2012     1
 6     2         2008     1
 7     2         2009     1
 8     2         2010     1
 9     2         2011     1
10     2         2012     1
11     3         2009     1
12     3         2010     1
13     3         2011     1
14     3         2012     1

Edited the answer to satisfy your condition where value == 1.
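The output above has 14 rows: 2007 is missing for id 1 and id 3 has only four rows, so it does not quite match the expected output in the question. One possible adjustment, offered as a sketch rather than as the answerer's method, is to filter on value == 1 first and to skip the lag() step. The column name run is just an illustrative choice, and the sample data is rebuilt inline so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question, rebuilt inline.
DF <- data.frame(
  id = rep(1:3, each = 7),
  program_year = rep(2007:2013, times = 3),
  value = c(1, 1, 1, 1, 1, 1, 0,
            0, 1, 1, 1, 1, 1, 1,
            1, 0, 1, 1, 1, 1, 1)
)

res <- DF %>%
  filter(value == 1) %>%                          # drop value == 0 rows before looking for runs
  arrange(id, program_year) %>%
  group_by(id) %>%
  mutate(run = program_year - row_number()) %>%   # constant within a run of consecutive years
  group_by(id, run) %>%
  filter(n() >= 5) %>%                            # keep runs spanning at least 5 years
  group_by(id) %>%
  slice_head(n = 5) %>%                           # first 5 qualifying rows per id
  ungroup() %>%
  select(id, program_year, value)

res
```

On the sample data this keeps 2007-2011 for id 1, 2008-2012 for id 2, and 2009-2013 for id 3.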

2023-04-09

yyyllmsg2#

Here is a solution using dplyr::consecutive_id() and per-operation grouping. Make sure you are using a recent version of dplyr (1.1.0 or later).

library(dplyr) # >= v1.1.0

dat %>%
  mutate(c_id = consecutive_id(value), .by = id) %>%
  filter(value == 1, n() >= 5, .by = c(id, c_id)) %>%
  filter(row_number() <= 5, .by = id) %>%
  select(!c_id)

   id program_year value
1   1         2007     1
2   1         2008     1
3   1         2009     1
4   1         2010     1
5   1         2011     1
6   2         2008     1
7   2         2009     1
8   2         2010     1
9   2         2011     1
10  2         2012     1
11  3         2009     1
12  3         2010     1
13  3         2011     1
14  3         2012     1
15  3         2013     1
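To make the run ids concrete, here is a minimal base R sketch of what consecutive_id() computes for a single non-empty vector without missing values. The helper name run_id is hypothetical and is not part of dplyr.

```r
# Hypothetical helper: a base R stand-in for dplyr::consecutive_id()
# on one non-empty vector with no NA values.
run_id <- function(x) {
  # A new run starts at the first element and wherever the value changes.
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

value <- c(1, 0, 1, 1, 1, 1, 1)   # the value column for id 3 in the question
ids <- run_id(value)
ids
# [1] 1 2 3 3 3 3 3
```

Because the run id is built from value alone, this approach relies on each id having one row per year with no gaps, which holds for the sample data.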

2023-04-09

6yoyoihd3#

2023-04-09

e5nszbig4#

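In by, for each ID we can first subset to the non-zero values, create a group u that labels sets of consecutive years, and then subset x to the group that yields the maximum number of observations (via which.max on a table). Next, we apply head to each element of the list-like "by" object, using the minimum across IDs of those maximum counts, and finally rbind the pieces.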

by(dat, dat$id, \(x) {
  x <- subset(x, x$value == 1)
  u <- cumsum(c(1, diff(x$program_year)) != 1) + 1
  tbl <- table(u)
  subset(x, u == which.max(tbl))
}) |> {\(.) lapply(., \(x) {
  m <- min(sapply(., nrow))
  transform(head(x, m), period=seq_len(m))
})}() |>  ## or `tail` instead of `head`
  do.call(what='rbind')
#      id program_year value period
# 1.1   1         2007     1      1
# 1.2   1         2008     1      2
# 1.3   1         2009     1      3
# 1.4   1         2010     1      4
# 1.5   1         2011     1      5
# 2.9   2         2008     1      1
# 2.10  2         2009     1      2
# 2.11  2         2010     1      3
# 2.12  2         2011     1      4
# 2.13  2         2012     1      5
# 3.17  3         2009     1      1
# 3.18  3         2010     1      2
# 3.19  3         2011     1      3
# 3.20  3         2012     1      4
# 3.21  3         2013     1      5

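This gives the largest possible subset of consecutive observations for each ID. As requested in the OP, the years do not match across IDs, but a new period variable has been added.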
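As a small worked example of the grouping vector u used in the code above, take the years that remain for id 3 once the value == 0 row is dropped. The variable names here are only for illustration.

```r
# Years left for id 3 after removing the row with value == 0 (2008).
years <- c(2007L, 2009L, 2010L, 2011L, 2012L, 2013L)

# Same expression as in the answer: the counter goes up wherever the
# step from the previous year is not exactly 1.
u <- cumsum(c(1, diff(years)) != 1) + 1
u
# [1] 1 2 2 2 2 2

tbl <- table(u)            # run 1 has 1 row, run 2 has 5 rows
longest <- which.max(tbl)  # position of the longest run, here 2
years[u == longest]
# [1] 2009 2010 2011 2012 2013
```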

Data:

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), program_year = c(2007L, 
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L), value = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-21L))

2023-04-09
