R noob(仍然)在这里,在tidyverse
/ RStudio中工作。
我有一个整洁的数据集,其中每一行都有一个日期、一个分组特征和一个值(实际的数据集更复杂,但这是它的核心):
对于每个Date
,我按Group
对数据进行分组,并计算Value
的一些摘要统计信息,生成每个日期的分组摘要。
grouped <- data %>% group_by(Date, Group) %>% summarise(mean = mean(Value))
head(grouped)
# A tibble: 6 × 3
# Groups: Date [4]
Date Group mean
<date> <fct> <dbl>
1 2021-02-18 A 37.4
2 2021-02-19 B 25.5
3 2021-02-19 A 26.1
4 2021-02-22 B 34.2
5 2021-02-22 A 26.4
6 2021-02-23 B 34.2
(Note:数据如下所示,用于重现性。)
到目前为止一切顺利。现在我想用Group
来计算这些汇总统计信息的移动平均值(在本例中为mean
,但也可以是其他值)。我用zoo::rollmean
尝试了以下方法:
grouped <- grouped %>%
group_by(Group) %>%
mutate(rolling = zoo::rollmean(mean, window_length, fill=NA))
但这里出现了一个问题--理想情况下,移动平均线应该是严格意义上的某个数字天,而不是记录,但一组或两组都缺少一些天。
确保移动平均线正确考虑缺失天数x组,并根据需要将其视为NA
的最佳方法是什么?
(我从this answer了解到zoo::rollmean
不能处理NA
值,但zoo::rollapply
应该可以。)
我尝试过创建一个简单的日历 Dataframe ,其中包含分组数据到join
的完整日期集,但是Group
变量也保留为NA
,因此rollmean / rollapply
函数仍然忽略x组中缺失的日期。
希望这一切都说得通!
data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680,
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680,
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683,
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683,
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684,
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685,
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687,
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688,
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689,
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690,
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691,
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692,
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694,
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694,
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695,
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696,
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697,
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698,
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701,
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701,
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703), class = "Date"), Group = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"),
Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93,
28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78,
25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76,
29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6,
32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12,
16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69,
39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61,
15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39,
19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25,
34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21,
20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85,
17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87,
19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35,
31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6,
41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49,
20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32,
38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68,
14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03,
29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42,
37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3,
37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92,
15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74,
36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33,
16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33,
38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44,
30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53,
23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75,
19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
)), row.names = c(NA, -250L), class = c("tbl_df", "tbl",
"data.frame"))
1条答案
按热度按时间kq4fsx7k1#