R: How does `ggdist::stat_halfeye()` scale the posterior predictive density?

vohkndzv, posted 12 months ago in Other

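My goal is to compute 95% prediction intervals using a Bayesian zero-inflated beta model in R. I have no problem doing this, but when I plot the posterior predictive density with `stat_halfeye()`, the density looks very flat. Is this due to the zeros in the data (the density is infinite at zero) and the way the density is scaled between zero and one? To better understand what `stat_halfeye()` does, I tried to reproduce this behavior with `geom_density()` using scaled densities (from 0 to 1), but the `geom_density()` result does not look flat. In fact, it looks similar to the histogram approach using `stat_histinterval()`.
My question is: how does `stat_halfeye()` scale the density? I looked through `?stat_halfeye` and `?stat_slabinterval` but could not find the answer.
Here is a reproducible example: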

library(tidyverse)
library(tidybayes)
library(brms)

set.seed(1232)
# set zero inflation
zero_inflation <- 0.3
# generate roughly 30% zeros and 70% ones
zero_obs <- rbinom(n = 30,
                   size = 1,
                   prob = 1 - zero_inflation)
# generate beta random observations
cover_obs <- rbeta(n = 30, shape1 = 2, shape2 = 9)
# multiply the above vectors to get the final zero-inflated observation vector
combined_obs <- tibble(y_beta_zero = zero_obs * cover_obs)

# fit zero-inflated beta model
beta_zero_model <-
  brm(y_beta_zero ~ 1, family = zero_inflated_beta(),
      data = combined_obs)

# now I want the 95% prediction interval from the posterior predictive
combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction)) +
  stat_halfeye(.width = c(.95)) + 
  ggtitle("stat_halfeye")


# It looks like the density in stat_halfeye is scaled (y axis from 0 to 1)
# Tried to reproduce with the `geom_density()` function but it didn't work.

combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction, after_stat(scaled))) + 
  geom_density() +
  ggtitle("geom_density, scaled")


# also tried the histogram approach and this looks more like the `geom_density()` approach
combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction)) +
  stat_histinterval(.width = c(.95)) +
  ggtitle("stat_histinterval")

Created on 2023-12-05 with reprex v2.0.2

vtwuwzda #1

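The main difference here is that the two approaches use different bandwidth estimators by default. `ggplot2::geom_density()` uses the default bandwidth estimator of `density()`, which is `"nrd0"` (i.e. `bw.nrd0()`), whereas `ggdist::stat_slabinterval()` uses `"dpi"`. Notably, the documentation of `density()` recommends `"SJ"` over `"nrd0"`, and `"nrd0"` is only the default in that function for historical reasons.
You can see what is happening by plotting the densities from both estimators on the same scale. I will use `ggdist::stat_slab()` instead of `stat_slabinterval()`, because the interval makes the comparison harder to see here: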
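To get a feel for how far apart the two defaults can be, the sketch below compares the base R selectors on simulated zero-inflated beta draws. The vector `x` is a stand-in for the `.prediction` column (an assumption; no model is fitted here), and `bw.SJ(x, method = "dpi")` is used on the assumption that it corresponds to the direct plug-in selector behind ggdist's `"dpi"` option.

```r
set.seed(1232)
n <- 2000
# stand-in for posterior predictive draws: ~30% exact zeros, rest Beta(2, 9)
x <- rbinom(n, size = 1, prob = 0.7) * rbeta(n, shape1 = 2, shape2 = 9)

bw <- c(
  nrd0 = bw.nrd0(x),               # default of density() and geom_density()
  dpi  = bw.SJ(x, method = "dpi")  # Sheather-Jones direct plug-in selector
)
print(bw)
```

Printing `bw` shows both bandwidths side by side; the smaller one yields the narrower, taller spike at zero.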

combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction)) +
  stat_slab(density = ggdist::density_bounded(bandwidth = "nrd0"), color = "gray25") +
  stat_slab(color = "red", fill = NA, linetype = "11") +  
  # ensure both slabs are scaled together
  scale_thickness_shared() +
  theme_ggdist()

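The gray slab is (roughly) what you get from `geom_density()`, except that `ggdist::stat_slab()` also corrects for the boundedness of the data. That correction affects the size of the spike, but its effect is dwarfed by the bandwidth difference. The red dashed line is what you get from `ggdist::stat_slab()` by default.
This is a bit hard to see, so let's zoom in: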

combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction)) +
  stat_slab(density = ggdist::density_bounded(bandwidth = "nrd0"), color = "gray25") +
  stat_slab(color = "red", fill = NA, linetype = "11") +  
  scale_thickness_shared() +
  # zoom in
  coord_cartesian(ylim = c(0, 0.2)) +
  theme_ggdist()



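What is basically happening is that ggdist's default bandwidth estimator picks a smaller bandwidth, so the spike is narrower and taller. Because the slab is rescaled so that its highest point fills the available height, a taller spike makes the rest of the distribution look flatter. Apart from that, the two estimators largely agree on the shape of the rest of the distribution. Arguably (though I am biased here), ggdist's choice is more accurate, because it does not show a thick section of high density at values near (but not equal to) 0.
Ultimately, though, the problem with point masses in densities or histograms is that you can make them arbitrarily high by adjusting the bandwidth. Here is an example with histograms: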

combined_obs %>%
  add_predicted_draws(beta_zero_model) %>%
  ggplot(aes(x = .prediction)) +
  stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.05), align = "boundary", color = "gray70", fill = NA) +
  stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.04), align = "boundary", color = "gray60", fill = NA) +
  stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.03), align = "boundary", color = "gray50", fill = NA) +
  stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.02), align = "boundary", color = "gray40", fill = NA) +
  stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.01), align = "boundary", color = "gray30", fill = NA) +
  scale_thickness_shared() +
  theme_ggdist()
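The arithmetic behind this can be checked directly. With a share p0 of exact zeros, the first histogram bin has a height of at least p0 / width on the density scale, and a Gaussian kernel density estimate gets a contribution of roughly p0 / (bw * sqrt(2 * pi)) at zero, so both grow without bound as the bin width or bandwidth shrinks. A base R sketch on simulated draws (the vector `x` is a stand-in, not the model's predictions):

```r
set.seed(1232)
n <- 4000
# ~30% exact zeros, rest Beta(2, 9)
x <- rbinom(n, size = 1, prob = 0.7) * rbeta(n, shape1 = 2, shape2 = 9)
p_zero <- mean(x == 0)

# height of the first histogram bin [0, width] on the density scale
widths <- c(0.05, 0.04, 0.03, 0.02, 0.01)
first_bin <- sapply(widths, function(w) mean(x <= w) / w)

# peak of a Gaussian kernel density estimate for decreasing bandwidths
bws <- c(0.05, 0.01, 0.002)
kde_peak <- sapply(bws, function(b) max(density(x, bw = b)$y))

print(data.frame(width = widths, first_bin = first_bin))
print(data.frame(bw = bws, kde_peak = kde_peak))
```

Both columns of heights keep rising as the width or bandwidth gets smaller, which is why the height of the spike says more about the smoothing choice than about the data.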

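The correct solution here is probably something like a plot that shows the proportion of zeros alongside the density of the rest of the distribution. ggdist does not currently have an out-of-the-box solution for this, but there is an open issue related to it.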
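Until something like that exists, one workaround is to separate the two parts by hand. The sketch below only illustrates the idea in plain ggplot2 and is not a ggdist feature: it reports the share of exact zeros as a probability and draws the density of the non-zero draws on its own, scaled by the non-zero share. The simulated `draws` data frame stands in for the output of `add_predicted_draws()`, and all object names are made up for the example.

```r
library(ggplot2)

set.seed(1232)
n <- 4000
# stand-in for posterior predictive draws: ~30% exact zeros, rest Beta(2, 9)
draws <- data.frame(
  .prediction = rbinom(n, size = 1, prob = 0.7) * rbeta(n, shape1 = 2, shape2 = 9)
)

# point mass at zero, reported as a probability rather than a density
p_zero <- mean(draws$.prediction == 0)
nonzero <- draws[draws$.prediction > 0, , drop = FALSE]

p <- ggplot(nonzero, aes(x = .prediction)) +
  # density of the continuous part, scaled so that it integrates to 1 - p_zero
  geom_density(aes(y = after_stat(density) * (1 - p_zero)), fill = "gray80") +
  labs(
    title = sprintf("P(y = 0) = %.2f; density shown for y > 0", p_zero),
    x = ".prediction",
    y = "density (non-zero part)"
  )
print(p)
```

Keeping the point mass out of the density estimate means the shape of the continuous part no longer depends on how tall the smoothing happens to make the spike at zero.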
