R:根据组标准过滤行

unguejic  于 2023-02-20  发布在  其他
关注(0)|答案(2)|浏览(129)

我有一个超过100个的 Dataframe ,000行,大约40列。schools列大约有100所不同的学校。我有1980年到2023年的数据。我希望保留2018年到2022年每年至少有10行的学校的所有数据。不符合该条件的学校应删除所有行。在我的最小示例Schools中,我有三所学校。计算一个表可以明显看出只有华盛顿应该保留。亚当斯只有5行表示2018年,Jefferson有0行表示2018年。Schools2是结果应该是什么样子。我如何使用表计算或dplyr计算来执行过滤?

Schools = 
  data.frame(school = c(rep('Washington', 60),
                        rep('Adams',70),
                        rep('Jefferson', 100)), 
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5), 
                      rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
                      rep(2019:2023, each = 20)),
             stuff = rnorm(230)
             )

Schools2 = 
  data.frame(school = c(rep('Washington', 60)),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)), 
             stuff = rnorm(60)
  )

table(Schools$school, Schools$year)

Schools |> group_by(school, year) |> summarize(counts = n())
g6ll5ycj

g6ll5ycj1#

filter的数据中仅保留2018年至2022年,然后按学校、年份和filter添加频率计数列(仅"学校"),所有计数均大于或等于10,并且如果存在all,则为范围中的年份

library(dplyr)# version >= 1.1.0
Schools %>%
   filter(all(table(year[year %in% 2018:2022]) >= 10) & 
          all(2018:2022 %in% year), .by = c("school")) %>%
   as_tibble()
  • 输出
# A tibble: 60 × 3
   school      year   stuff
   <chr>      <dbl>   <dbl>
 1 Washington  2016  0.680 
 2 Washington  2016 -1.14  
 3 Washington  2016  0.0420
 4 Washington  2016 -0.603 
 5 Washington  2016  2.05  
 6 Washington  2018 -0.810 
 7 Washington  2018  0.692 
 8 Washington  2018 -0.502 
 9 Washington  2018  0.464 
10 Washington  2018  0.397 
# … with 50 more rows

或者使用count

library(magrittr)
Schools %>% 
  filter(tibble(year) %>%
        filter(year %in% 2018:2022) %>%
        count(year) %>%
       pull(n) %>% 
       is_weakly_greater_than(10) %>% 
       all, all(2018:2022 %in% year) , .by = "school")
7fyelxc5

7fyelxc52#

事实证明,一个朋友刚刚帮我想出了一个基R的解决方案。

# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)

# say want years 2018-2022 having lots of rows in school data
sdTable = sdTable[,3:7]

# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))

# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)

# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newHousing = Schools[whichOrigRowsToKeep,]
newHousing

相关问题