R ggplot2: histogram with normal curve

pqwbnv8z  posted on 2022-12-27 in Other

I've been trying to overlay a normal curve on a histogram with ggplot2.
My code:

data <- read.csv (path...)

ggplot(data, aes(V2)) + 
  geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)

I tried several things:

+ stat_function(fun=dnorm)

...changed nothing.

+ stat_density(geom = "line", colour = "red")

...drew a straight red line along the x-axis.

+ geom_density()

does not work for me, because I want to keep frequency counts on the y-axis and do not want density values.
Any suggestions?
Solution found!
[inline code lost in extraction]

p1tboqfb1#

I think I've got it:

library(ggplot2)

set.seed(1)

df <- data.frame(PF = 10*rnorm(1000))
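ggplot(df, aes(x = PF)) +
    # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
    geom_histogram(aes(y = after_stat(density)),
                   breaks = seq(-50, 50, by = 10),
                   colour = "black",
                   fill = "white") +
    # overlay a normal density using the sample mean and sd
    stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))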

drkbr07n2#

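This has been answered here and partially here.
The area under a density curve equals 1. The area under a histogram equals the sum of each bar's width times its height, which for a count histogram with equal bins is the bin width times the total number of non-missing observations. To show both on the same plot, one of them has to be rescaled so that the areas match.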
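To make the area argument concrete, here is a minimal base-R sketch on made-up data. The names x and bw and the simulated sample are assumptions for illustration, not the asker's data. It checks that the area of a count histogram is binwidth times n, while the fitted normal density integrates to (essentially) 1, so the factor needed to bring the curve onto the count scale is binwidth times n.

```r
# Illustrative data only (assumption): 500 draws from a normal distribution
set.seed(1)
x  <- rnorm(500, mean = 20, sd = 5)
bw <- 2

# Count histogram with equal-width bins that cover the whole sample
brks <- seq(floor(min(x) / bw) * bw, ceiling(max(x) / bw) * bw, by = bw)
h <- hist(x, breaks = brks, plot = FALSE)

# Area of the histogram: sum of (bar height * bar width)
area_counts <- sum(h$counts * diff(h$breaks))

# Area under the fitted normal density over mean +/- 8 sd (essentially 1)
area_density <- integrate(dnorm,
                          lower = mean(x) - 8 * sd(x),
                          upper = mean(x) + 8 * sd(x),
                          mean = mean(x), sd = sd(x))$value

# Factor that rescales a density curve to the count scale
scale_factor <- bw * sum(!is.na(x))

c(area_counts = area_counts, area_density = area_density, scale_factor = scale_factor)
```

Multiplying the density by scale_factor (or dividing the counts by it) is what each of the options amounts to.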
If you want frequency counts on the y-axis, there are a number of options.
First, simulate some data.

library(ggplot2)

set.seed(1)
dat_hist <- data.frame(
  group = c(rep("A", 200), rep("B",150)),
  value = c(rnorm(200, 20, 5), rnorm(150,25,10)))

# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))

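Option 1: Plot both the histogram and the density curve as densities, then rescale the y-axis

This is perhaps the easiest approach for a single histogram. Using the approach suggested by Carlos, plot both the histogram and the density curve on the density scale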

g <- ggplot(dat_hist, aes(value)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), binwidth = bw, colour = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))

Then rescale the y-axis.

ybreaks = seq(0,50,5) 
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)

## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
  trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
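The relabelling in option 1 relies on the identity count = density * binwidth * n. A quick way to convince yourself, assuming ggplot2 is installed, is to pull the per-bin values that stat_bin computes and compare them. The data frame dat below is made up for illustration.

```r
library(ggplot2)

# Illustrative data only (assumption)
set.seed(1)
dat   <- data.frame(value = rnorm(300, 20, 5))
bw    <- 2
n_obs <- sum(!is.na(dat$value))

p <- ggplot(dat, aes(value)) +
  geom_histogram(binwidth = bw)

# layer_data() returns the values computed for the first layer,
# including both `count` and `density` for every bin
bins <- layer_data(p)

# density * binwidth * n should reproduce the counts
recovered <- bins$density * bw * n_obs
all.equal(recovered, bins$count)
```

This is why dividing the desired count breaks by bw * n_obs places them correctly on the density axis, and why multiplying by bw * n_obs works for the secondary axis.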

Option 2: Rescale the density curve using stat_function

Code tidied up as per PatrickT's answer.

ggplot(dat_hist, aes(value))  + 
  geom_histogram(colour = "black", binwidth = bw) + 
  stat_function(fun = function(x) 
    dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)

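Option 3: Create an external dataset and plot it using geom_line.

Unlike the options above, this one works with facets. (Edited to provide a dplyr-based rather than a plyr-based solution.) Note that the summarised dataset is used as the primary dataset, and the raw data is passed in for the histogram only.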

library(tidyverse)

dat_hist %>% 
  group_by(group) %>% 
  nest(data = c(value)) %>% 
  mutate(y = map(data, ~ dnorm(
    .$value, mean = mean(.$value), sd = sd(.$value)
    ) * bw * sum(!is.na(.$value)))) %>% 
  unnest(c(data,y)) %>% 
  
  ggplot(aes(x = value)) +
  geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
  geom_line(aes(y = y)) + 
  facet_wrap(~ group)
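With facets, each panel needs its own scale factor: the bin width is shared, but the number of observations differs by group. A small base-R sketch of that bookkeeping, on simulated data with the same shape as the example (the names dat and stats are assumptions for illustration):

```r
# Illustrative data only (assumption): two groups of unequal size
set.seed(1)
dat <- data.frame(
  group = c(rep("A", 200), rep("B", 150)),
  value = c(rnorm(200, 20, 5), rnorm(150, 25, 10)))
bw <- 2

# One row per group: mean, sd, n and the resulting count-scale factor
stats <- do.call(rbind, lapply(split(dat, dat$group), function(d) {
  v <- d$value[!is.na(d$value)]
  data.frame(group = d$group[1],
             mean  = mean(v),
             sd    = sd(v),
             n     = length(v),
             scale = bw * length(v))
}))
stats
```

Each group's curve is then dnorm(x, mean, sd) * scale, using that group's own row.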

Option 4: Create external functions to edit the data on the fly

A bit over the top perhaps, but it might be useful to someone?

## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
  .x <- na.omit(data[,x])
  if(is.null(xlim))
    xlim = c(min(.x), max(.x))
  x_range = seq(xlim[1], xlim[2], length.out = 101)
  setNames(
    data.frame(
    x = x_range,
    y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
    c(x, "y"))
}

## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
  dat_hists <- lapply(
    split(data, data[, group]), dnorm_scaled,
      x = x, binwidth = binwidth, xlim = xlim)
  for(g in names(dat_hists))
    dat_hists[[g]][, "group"] <- g
  setNames(do.call(rbind, dat_hists), c(x, "y", group))
}

## Single histogram
ggplot(dat_hist, aes(value)) + 
  geom_histogram(binwidth = bw, colour = "black") + 
  geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw), 
            aes(y = y)) 

## With a single faceting variable
ggplot(dat_hist, aes(value))  + 
  geom_histogram(binwidth = 2, colour = "black") + 
  geom_line(data = ~ dnorm_scaled_group(
    ., x = "value", group = "group", binwidth = 2, xlim = c(0,50)), 
    aes(y = y)) +
  facet_wrap(~ group)

6tdlim6h3#

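This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around with it I found a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.
Note that JWilliman's answer puts counts on the y-axis and uses a "hack" to scale the corresponding normal density approximation (which would otherwise cover a total area of 1 and therefore have a much lower peak).
The main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.

    aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)

This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it is not very different, but hopefully someone will find it interesting.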

# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))

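ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
    theme_bw() +
    # linewidth replaces size for line widths in ggplot2 >= 3.4.0 (use size on older versions)
    geom_histogram(binwidth = binwidth,
        colour = "white", fill = "cornflowerblue", linewidth = 0.1) +
    # scale the normal density by n * binwidth to put it on the count scale
    stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
        color = "darkred", linewidth = 1)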
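One caveat, as far as I can tell: the anonymous function passed to stat_function finds mean, sd, n and binwidth through R's ordinary lexical scoping (they are global variables here), rather than through aes(). A sketch that makes that dependency explicit with a small factory function follows. scaled_dnorm is a hypothetical helper name, not part of ggplot2, and it estimates the mean and sd from the sample instead of using the true parameters.

```r
# Hypothetical helper (assumption): builds a function for stat_function(fun = ...)
# that returns a normal density scaled to the count axis.
scaled_dnorm <- function(x, binwidth) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  n <- length(x)
  # the returned closure remembers m, s, n and binwidth
  function(q) dnorm(q, mean = m, sd = s) * n * binwidth
}

set.seed(1)
x <- rnorm(1000)
f <- scaled_dnorm(x, binwidth = 0.3)

# Peak height on the count scale: n * binwidth / (sd * sqrt(2 * pi))
f(mean(x))
```

In a plot this would be used as stat_function(fun = scaled_dnorm(df$x, binwidth)), with no extra aesthetics and no reliance on global variables.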

bakd9h0s4#

This code should do it:

set.seed(1)
z <- rnorm(1000)

qplot(z, geom = "blank") +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density))) +
  # kernel density estimate
  stat_density(geom = "line", aes(colour = "bla")) +
  # standard normal density (dnorm defaults: mean = 0, sd = 1)
  stat_function(fun = dnorm, aes(x = z, colour = "blabla")) +
  scale_colour_manual(name = "", values = c("red", "green"),
                      breaks = c("bla", "blabla"),
                      labels = c("kernel_est", "norm_curv")) +
  theme(legend.position = "bottom", legend.direction = "horizontal")

Note: I used qplot, but you can use the more versatile ggplot.

vuktfyat5#

Here is a tidyverse-informed version:

Setup

library(tidyverse)

Some data

d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")

Prepare data

We will use a "total" histogram for the whole sample. To that end, we need to remove the grouping information from the data.

d2 <-
  d |> 
  select(-gender)

Here is a dataset with summary data:

d_summary <-
  d %>% 
  group_by(gender) %>% 
  summarise(height_m = mean(height, na.rm = TRUE),
            height_sd = sd(height, na.rm = TRUE))

d_summary

Plot it

d %>% 
  ggplot() +
  aes() +
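  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density), x = height, fill = gender)) +
  facet_wrap(~ gender) +
  # grey histogram of the whole sample, repeated in every facet
  geom_histogram(data = d2, aes(y = after_stat(density), x = height),
                 alpha = .5) +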
  stat_function(data = d_summary %>% filter(gender == "female"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "female")$height_m,
                            sd = filter(d_summary, 
                                        gender == "female")$height_sd)) +
  stat_function(data = d_summary %>% filter(gender == "male"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "male")$height_m,
                            sd = filter(d_summary, 
                                        gender == "male")$height_sd)) +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Facetted histograms with overlaid normal curves",
       caption = "The grey histograms show the whole distribution over both groups, i.e. women and men") +
  scale_fill_brewer(type = "qual", palette = "Set1")
