dplyr: summarise with filtering on two conditions

rmbxnbpk, posted 2023-04-18 in Other

My data are not unlike foo below. Observations are ordered in time and are measured for different groups.

library(tidyverse)
n <- 500
r <- 0.001
a <- 5
foo <- data.frame(Time = 1:n) %>%
  mutate(A = a*(1+r)^Time + rnorm(n), B = rnorm(n,mean=a,sd=2)) %>%
  pivot_longer(cols = -Time, names_to = "Group", values_to = "Measurement")

foo %>% ggplot(mapping = aes(x=Time,y=Measurement,color=Group)) +
  geom_line()

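What I want to know is, for a given time period (say Time > 250 in this example), how many measurements are above or below given quantile values of a group, where the quantiles are computed over the whole data set. For example, the 10th, 50th and 90th quantiles calculated using the whole data set.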

fooQuants <- foo %>% group_by(Group) %>%
  reframe(Measurement=quantile(Measurement, probs = c(0.1,0.5,0.9))) %>%
  add_column(Quant = rep(paste0("Q",c(0.1,0.5,0.9)),2))

# eg for group A with Time > 250 this is the number of times that Measurement is above 90th quant
A90th <- fooQuants %>% filter(Group == "A", Quant == "Q0.9") %>% pull(Measurement)
foo %>% filter(Time > 250, Group == "A") %>% summarise(n=sum(Measurement >= A90th))
# eg for group B with Time > 250 this is the number of times that Measurement is above 90th quant
B90th <- fooQuants %>% filter(Group == "B", Quant == "Q0.9") %>% pull(Measurement)
foo %>% filter(Time > 250, Group == "B") %>% summarise(n=sum(Measurement >= B90th))

I'd like to do this for all the groups in foo and all the different quantile levels in fooQuants.

#foo %>% filter(Time > 250) %>% select(Group,Measurement) %>%
#  group_by(Group) %>%
#  summarise(`the number of instances where the measurement is gt or lt each quantile in fooQuants`)

Thanks for any tips!

Answer 1, by gojuced7

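library(dplyr) # v1.1.0+ for join_by(), closest() and .by
foo %>%
  # rolling join: match each row to the largest quantile of its group that is <= its Measurement
  left_join(fooQuants, join_by(Group, closest(Measurement >= Measurement))) %>% 
  # rows below the lowest quantile find no match, so label them "Q0"
  mutate(Quant = coalesce(Quant, "Q0")) %>%
  filter(Time > 250) %>%
  # rows per group and quantile bin, then each bin's share within its group
  count(Group, Quant) %>%
  mutate(share = n / sum(n), .by = Group)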

Result

# A tibble: 8 × 4
  Group Quant     n share
  <chr> <chr> <int> <dbl>
1 A     Q0        4 0.016
2 A     Q0.1     65 0.26 
3 A     Q0.5    131 0.524
4 A     Q0.9     50 0.2  
5 B     Q0       28 0.112
6 B     Q0.1     92 0.368
7 B     Q0.5    105 0.42 
8 B     Q0.9     25 0.1
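
The bins above are disjoint: each row is assigned only to the highest quantile it reaches. If cumulative counts are wanted instead (how many measurements sit at or above, and below, each quantile), a plain join on Group followed by a grouped summary is one possible sketch. It is not part of the answer above. The set.seed() call, the Threshold column and the counts name are assumptions introduced for illustration, and foo is rebuilt directly in long form so that only dplyr is needed.

```r
library(dplyr)  # v1.1.0+ for reframe(), .by and the relationship argument

set.seed(1)  # assumption: seed added only to make this sketch reproducible
n <- 500

# Same shape as foo in the question (Time, Group, Measurement), built in long form
foo <- data.frame(
  Time = rep(1:n, times = 2),
  Group = rep(c("A", "B"), each = n),
  Measurement = c(5 * 1.001^(1:n) + rnorm(n), rnorm(n, mean = 5, sd = 2))
)

# Per-group quantiles over the whole data set; labels are derived from probs,
# so nothing depends on the number of groups
probs <- c(0.1, 0.5, 0.9)
fooQuants <- foo %>%
  reframe(
    Quant = paste0("Q", probs),
    Threshold = quantile(Measurement, probs = probs, names = FALSE),
    .by = Group
  )

# Pair every row in the time window with every quantile of its group,
# then count rows at or above, and below, each threshold
counts <- foo %>%
  filter(Time > 250) %>%
  inner_join(fooQuants, by = "Group", relationship = "many-to-many") %>%
  summarise(
    n_above = sum(Measurement >= Threshold),
    n_below = sum(Measurement < Threshold),
    .by = c(Group, Quant)
  )

counts
```

On the same data, n_above at Q0.9 should match the Q0.9 bin from the rolling join, and n_above at a lower quantile should match the sum of that bin and every higher one.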
