R语言用估计的边际平均值解释栅栏模型

我的目标是通过估计的边际均值来解释障碍模型的系数。我更喜欢解释概率（从logit量表反向转换），而不是对数优势（模型系数）或比值比（exp（对数优势））。我想使用emmeans()来实现这个目标，因为它与许多模型兼容，并且我有在线性模型和二项式GLM中使用它的经验。

栅栏模型的栅栏部分产生与二项式GLM相同的系数，这在其他地方已经评论过（here或here）。
然而，我不完全理解为什么，根据设置，从跨栏估计的平均值不匹配的二项式GLM。特别是

lin.pred = FALSE。这从二项式GLM产生不同的概率。在emmeansdocumentation之后，我认为它们是在概率尺度上平均的。
lin.pred = TRUE, type = "response"。此设置产生与二项式GLM相同的概率。根据emmeans文档，我认为它们是在logit标度上平均的。

我有两个问题：

1.如果我们的目标是解释概率，而不是logits，那么哪种设置更可取？
1.Inf自由度的含义是什么？
下面是一个可重复的例子，使用生物化学学生在至少3年的博士学位期间发表的论文数量，重点是两个因素水平（单身与已婚）之间的成对比较。
我的首选方法是与二项式GLM的估计概率相匹配的方法。我的理解是，已婚博士生比单身博士生更有可能发表至少一篇论文。已婚学生有74%的机会至少完成一篇论文，而单身学生有67%的机会：这7%的变化是一个很小的差异，并不显著。
提前感谢！

library(emmeans)
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
data("bioChemists", package = "pscl")
# str(bioChemists)
hurdle <- hurdle(art ~ ., data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glm <- glm(art > 0 ~ ., data = bioChemists, family = binomial(link = "logit"))

# Same coefficients
coef(summary(hurdle))$zero
#>                Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept)  0.23679601 0.29551883  0.8012891 4.229643e-01
#> femWomen    -0.25115113 0.15910522 -1.5785222 1.144457e-01
#> marMarried   0.32623358 0.18081823  1.8042074 7.119880e-02
#> kid5        -0.28524872 0.11113043 -2.5667921 1.026441e-02
#> phd          0.02221940 0.07955710  0.2792887 7.800233e-01
#> ment         0.08012135 0.01301763  6.1548321 7.515710e-10
coef(glm)
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.23679601 -0.25115113  0.32623358 -0.28524872  0.02221940  0.08012135

####### 
## lin.pred = FALSE
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = FALSE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.806 0.167 903    0.478     1.13
#>  Married  0.858 0.122 903    0.619     1.10
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married  -0.0522 0.213 903  -0.245  0.8066
#> 
#> Results are averaged over the levels of: fem

####### 
## lin.pred = TRUE, type = "response"
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE, type = "response")
#> $emmeans
#>  mar      prob     SE  df lower.CL upper.CL
#>  Single  0.677 0.0306 903    0.615    0.734
#>  Married 0.744 0.0200 903    0.703    0.781
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> 
#> $contrasts
#>  contrast         odds.ratio   SE  df null t.ratio p.value
#>  Single / Married      0.722 0.13 903    1  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Tests are performed on the log odds ratio scale

emmeans::emmeans(glm, specs = pairwise ~ mar, regrid = "response")
#> $emmeans
#>  mar     response     SE  df asymp.LCL asymp.UCL
#>  Single     0.677 0.0305 Inf     0.617     0.736
#>  Married    0.743 0.0201 Inf     0.704     0.783
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate     SE  df z.ratio p.value
#>  Single - Married  -0.0667 0.0377 Inf  -1.772  0.0764
#> 
#> Results are averaged over the levels of: fem

#######
# Estimates and contrasts at the logit scale appear to match
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.741 0.140 903    0.466     1.02
#>  Married  1.068 0.105 903    0.862     1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married   -0.326 0.181 903  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.

emmeans::emmeans(glm, specs = pairwise ~ mar)
#> $emmeans
#>  mar     emmean    SE  df asymp.LCL asymp.UCL
#>  Single   0.741 0.140 Inf     0.467      1.02
#>  Married  1.068 0.105 Inf     0.862      1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df z.ratio p.value
#>  Single - Married   -0.326 0.181 Inf  -1.804  0.0712
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1          emmeans_1.8.5-9000004
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.2.0      highr_0.9           tools_4.2.0        
#>  [4] digest_0.6.29       evaluate_0.15       lifecycle_1.0.3    
#>  [7] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [10] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [13] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [16] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [19] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [22] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
#> [25] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
#> [28] TH.data_1.1-1       magrittr_2.0.3      codetools_0.2-18   
#> [31] htmltools_0.5.6     splines_4.2.0       MASS_7.3-57        
#> [34] xtable_1.8-4        numDeriv_2016.8-1.1 sandwich_3.0-1     
#> [37] estimability_1.4.1  stringi_1.7.6       zoo_1.8-10

创建于2023-09-06使用reprex v2.0.2

Russell Lenth（emmeans包的开发者）在GitHub上提供了一个答案。我把它贴在这里，用emmeans和glmmTMB拟合的障碍模型之间的比较，显示了一致的结果。

library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.2.3
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.2.3
#> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
#> TMB was built with Matrix version 1.6.1
#> Current Matrix version is 1.4.1
#> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis

# Create datasets: number of papers during a Phd
data("bioChemists", package = "pscl")

# Declare models
pscl.hurdle <- pscl::hurdle(art ~ fem + mar, data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glmmtmb.hurdle <- glmmTMB::glmmTMB(art ~ fem + mar, data = bioChemists, family = truncated_poisson(link = "log"), zi = ~ fem + mar)

# Truncated count
emmeans(pscl.hurdle, ~ mar, mode = "count", lin.pred = TRUE, type = "response")
#>  mar     count     SE  df lower.CL upper.CL
#>  Single   2.07 0.1122 909     1.86     2.31
#>  Married  2.09 0.0807 909     1.94     2.26
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the log scale
emmeans(glmmtmb.hurdle, ~ mar, comp = "cond", type = "response")
#>  mar     rate     SE  df asymp.LCL asymp.UCL
#>  Single  2.07 0.1122 Inf      1.86      2.30
#>  Married 2.09 0.0807 Inf      1.94      2.26
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the log scale

# Binomial
emmeans(pscl.hurdle, ~ mar, mode = "prob0")
#>  mar     emmean     SE  df lower.CL upper.CL
#>  Single   0.313 0.0265 909    0.261    0.365
#>  Married  0.297 0.0191 909    0.259    0.334
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95
emmeans(glmmtmb.hurdle, ~ mar, comp = "zi", type = "response")
#>  mar     response     SE  df asymp.LCL asymp.UCL
#>  Single     0.313 0.0267 Inf     0.263     0.367
#>  Married    0.296 0.0190 Inf     0.261     0.335
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale

# Response (binomial * truncated count)
emmeans(pscl.hurdle, ~ mar, mode = "response")
#>  mar     emmean     SE  df lower.CL upper.CL
#>  Single    1.64 0.0900 909     1.47     1.82
#>  Married   1.69 0.0633 909     1.57     1.82
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95
emmeans(glmmtmb.hurdle, ~ mar, comp = "response")
#>  mar     emmean     SE  df asymp.LCL asymp.UCL
#>  Single    1.64 0.0900 Inf      1.47      1.82
#>  Married   1.69 0.0633 Inf      1.57      1.82
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1  glmmTMB_1.1.7 emmeans_1.8.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10         nloptr_2.0.1        compiler_4.2.0     
#>  [4] highr_0.9           TMB_1.9.6           tools_4.2.0        
#>  [7] boot_1.3-28         lme4_1.1-29         digest_0.6.29      
#> [10] nlme_3.1-157        evaluate_0.15       lifecycle_1.0.3    
#> [13] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [16] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [19] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [22] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [25] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [28] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
#> [31] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
#> [34] minqa_1.2.4         TH.data_1.1-1       magrittr_2.0.3     
#> [37] codetools_0.2-18    htmltools_0.5.6     splines_4.2.0      
#> [40] MASS_7.3-57         xtable_1.8-4        numDeriv_2016.8-1.1
#> [43] sandwich_3.0-1      estimability_1.4.1  stringi_1.7.6      
#> [46] zoo_1.8-10

创建于2023-09-14带有reprex v2.0.2

R语言用估计的边际平均值解释栅栏模型

1条答案

相关问题

热门标签

最新问答

R语言 用估计的边际平均值解释栅栏模型

1条答案

相关问题

热门标签

最新问答

R语言用估计的边际平均值解释栅栏模型