Predicting values with group_by in R

Posted by uz75evzq on 2023-04-18 in Other

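I want to create a column that contains the predicted values of a regression fitted separately within each previously defined group.
This is what I tried:
Data: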

city<-c("a","a","a","b","b","b","a")
gender<-c("male","female","female","male","male","female","male")
age<-c(24,25,26,78,65,34,23)
death<-c(0,0,1,1,0,0,0)

df<-data.frame(city,gender,age,death)

Code:

df_1<-df%>%
  group_by(city)%>%
  glm(death~gender+age,data=df,family="poisson")%>%
  mutate(death_p=predict(glm))

Result:

Error in model.frame.default(formula = ., data = df, weights = death ~  : 
  invalid type (language) for variable '(weights)'
Answer 1 (blpfk2vs)

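1. "Never" use df inside a pipeline that starts from df. Whenever the data has been appended to, filtered, augmented, or (as here) grouped, referring to df again will not give you the expected result, because df is still the original, ungrouped data. Use cur_data() instead (in dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick()).
2. We can store the model in a list column. Because we are not summarizing, this is a little inefficient: it stores a redundant copy of the model in every row of the group. We can live with that for now.
Try this: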

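library(dplyr)

out <- df %>%
  group_by(city) %>%
  mutate(
    # fit one model per city; list() lets the model live in a list column
    mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson")),
    # predict() returns values on the link (log) scale by default
    pred = predict(mdl[[1]], newdata = cur_data())
  ) %>%
  ungroup()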
out
# Warning: Problem while computing `mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson"))`.
# ℹ glm.fit: fitted rates numerically 0 occurred
# ℹ The warning occurred in group 1: city = "a".
# # A tibble: 7 × 6
#   city  gender   age death mdl         pred
#   <chr> <chr>  <dbl> <dbl> <list>     <dbl>
# 1 a     male      24     0 <glm>  -2.31e+ 1
# 2 a     female    25     0 <glm>  -2.31e+ 1
# 3 a     female    26     1 <glm>   0       
# 4 b     male      78     1 <glm>  -2.84e-14
# 5 b     male      65     0 <glm>  -2.33e+ 1
# 6 b     female    34     0 <glm>  -2.33e+ 1
# 7 a     male      23     0 <glm>  -4.61e+ 1

You can do other things with the mdl column, such as extracting a summary per group:

out %>%
  group_by(city) %>%
  summarize(smry = list(summary(mdl[[1]]))) %>%
  pull(smry)
# [[1]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals: 
#          1           2           3           4  
# -1.389e-05  -1.389e-05   0.000e+00  -2.110e-08  
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -599.62 1606083.51       0        1
# gendermale       23.06  138127.43       0        1
# age              23.06   61772.44       0        1
# (Dispersion parameter for poisson family taken to be 1)
#     Null deviance: 2.7726e+00  on 3  degrees of freedom
# Residual deviance: 3.8564e-10  on 1  degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 22
# [[2]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals: 
# [1]  0  0  0
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -84.248 195033.586       0        1
# gendermale     -55.568 245825.832       0        1
# age              1.793   5357.985       0        1
# (Dispersion parameter for poisson family taken to be 1)
#     Null deviance: 2.1972e+00  on 2  degrees of freedom
# Residual deviance: 3.0330e-10  on 0  degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 21
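If the fitted models are not needed afterwards, the redundant list column can be avoided altogether. The following is a minimal sketch, assuming dplyr's group_modify() (which, as far as I know, the dplyr documentation marks as experimental); the names add_pred, grp, key and out2 are made up for this illustration. It fits one model per city, adds only the predictions, and asks for them on the response scale rather than the default link scale.

```r
library(dplyr)

# same toy data as in the question
df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

# hypothetical helper: grp is the data of one group (without the grouping
# column), key is a one-row tibble holding the group's value of city
add_pred <- function(grp, key) {
  # the toy groups are tiny, so glm() warns about fitted rates of 0;
  # suppressWarnings() only keeps the illustration quiet
  fit <- suppressWarnings(
    glm(death ~ gender + age, data = grp, family = "poisson")
  )
  # type = "response" returns expected counts instead of log rates
  mutate(grp, death_p = predict(fit, type = "response"))
}

out2 <- df %>%
  group_by(city) %>%
  group_modify(add_pred) %>%
  ungroup()
out2
```

Like do(), this returns the rows ordered by group rather than in their original order.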
Answer 2 (rta7y2nd)

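We can do this with do() and a few minor changes: pass the formula to glm() explicitly and supply the group's data as data = ., so that weights keeps its default of NULL: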

library(dplyr)

df %>%
  group_by(city) %>%
  do(data.frame(., death_p = predict(glm(death ~ gender + age, data = ., family = "poisson"))))

  city  gender   age death   death_p
  <chr> <chr>  <dbl> <dbl>     <dbl>
1 a     male      24     0 -2.31e+ 1
2 a     female    25     0 -2.31e+ 1
3 a     female    26     1  0       
4 a     male      23     0 -4.61e+ 1
5 b     male      78     1 -2.84e-14
6 b     male      65     0 -2.33e+ 1
7 b     female    34     0 -2.33e+ 1
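The error in the question mentions '(weights)' because of how R matches arguments: the pipe puts the data frame into glm()'s first argument, formula, and the unnamed formula then falls through to the next free positional argument. The sketch below is only an illustration and does not fit anything; it uses base R's match.call() on an unevaluated call that mimics what the pipe builds (piped_call and matched are names invented here).

```r
# what `df %>% glm(death ~ gender + age, data = df, family = "poisson")`
# effectively calls: the piped object becomes the first argument
piped_call <- quote(glm(df, death ~ gender + age, data = df, family = "poisson"))

# match the supplied arguments against glm()'s formal arguments
matched <- as.list(match.call(glm, piped_call))

matched$formula  # the data frame symbol landed in `formula`
matched$weights  # the model formula landed in `weights`
```

This mirrors the call shown in the error message, where formula is the piped data and weights is death ~ gender + age.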
Answer 3 (u5rb5r59)

Here is a data.table version:

library(data.table)

setDT(df)[, death_p:=exp(predict(glm(death~age+gender, family="poisson"))), city]

Output:

   city gender age death      death_p
1:    a   male  24     0 9.640864e-11
2:    a female  25     0 9.640864e-11
3:    a female  26     1 1.000000e+00
4:    b   male  78     1 1.000000e+00
5:    b   male  65     0 7.582560e-11
6:    b female  34     0 7.582560e-11
7:    a   male  23     0 9.294626e-21
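These numbers differ from the ones in the other answers because of exp(): predict() on a glm returns the linear predictor by default, which for a Poisson model with the default log link is the log of the expected count. A small sketch of the relationship follows, assuming a single pooled model death ~ age on the question's data purely for illustration (it is not one of the per-city models above).

```r
# toy data from the question (only the columns used here)
df <- data.frame(
  age   = c(24, 25, 26, 78, 65, 34, 23),
  death = c(0, 0, 1, 1, 0, 0, 0)
)

# illustrative pooled Poisson fit, no grouping
fit <- glm(death ~ age, data = df, family = "poisson")

log_rate <- predict(fit)                     # link scale: log of expected count
rate     <- predict(fit, type = "response")  # response scale: expected count

# exponentiating the link-scale prediction recovers the response scale
all.equal(exp(log_rate), rate)
```

So exp(predict(fit)) and predict(fit, type = "response") can be used interchangeably here; the second form states the intent more directly.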
