R语言 从GLM系数提取参考电平

k5ifujac  于 2022-12-20  发布在  其他
关注(0)|答案(2)|浏览(145)

我知道参考电平不包括在内,但我希望有一种方法能够采取拟合glm对象,并找出什么是参考电平(即不使用原始数据集的知识)。这是存储在glm拟合对象的任何地方吗?
示例数据如下:

> btest <- data.frame(var1 = sample(c(1,2,3), 100, replace = T),
+                     var2 = sample(c('a','b','c'), 100, replace = T),
+                     var3 = sample(c('e','f','g'), 100, replace = T),
+                     var4 = rnorm(100, mean = 3, 2),
+                     var5 = sample(c('yes','no'), 100, replace = T))
> summary(glm(var5 ~ var1 + var2 + var3 + var4, data = btest, family = 'binomial'))

Call:
glm(formula = var5 ~ var1 + var2 + var3 + var4, family = "binomial", 
    data = btest)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6988  -1.0457  -0.6213   1.1224   1.8904  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.81827    0.73173  -1.118   0.2635  
var1         0.55923    0.27279   2.050   0.0404 *
var2b       -0.60998    0.53435  -1.142   0.2536  
var2c       -0.60250    0.51706  -1.165   0.2439  
var3f       -0.81899    0.53345  -1.535   0.1247  
var3g        0.21215    0.51907   0.409   0.6828  
var4         0.04429    0.12650   0.350   0.7263  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 137.99  on 99  degrees of freedom
Residual deviance: 128.35  on 93  degrees of freedom
AIC: 142.35

Number of Fisher Scoring iterations: 4

这里我想知道var1var4没有引用,但是var2var3的引用水平分别是'a''e',因为我最终输出的是一个表,其中包含这些变量在这些引用水平下的NAEstimate
编辑:对于后来来的人,我也想知道当结合下面的答案时,挖掘glm拟合对象的terms元素能在多大程度上有所帮助...

> btest2 <- glm(var5 ~ var1 + var3 + var2 + var4, data = btest, family = 'binomial')
> btest2$terms
var5 ~ var1 + var3 + var2 + var4
attr(,"variables")
list(var5, var1, var3, var2, var4)
attr(,"factors")
     var1 var3 var2 var4
var5    0    0    0    0
var1    1    0    0    0
var3    0    1    0    0
var2    0    0    1    0
var4    0    0    0    1
attr(,"term.labels")
[1] "var1" "var3" "var2" "var4"
attr(,"order")
[1] 1 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(var5, var1, var3, var2, var4)
attr(,"dataClasses")
     var5      var1      var3      var2      var4 
 "factor" "numeric"  "factor"  "factor" "numeric" 
> attr(btest2$terms, 'dataClasses')
     var5      var1      var3      var2      var4 
 "factor" "numeric"  "factor"  "factor" "numeric"
gk7wooem

gk7wooem1#

如果你将拟合保存到变量my_fit中,你就可以对所有分类变量执行my_fit$xlevels,然后你会看到它们的所有水平。
然后,您可以将其与模型相关联。例如,var1不在x个水平中,因此它是连续的。Var2有3个水平(a、B、c),并且您有b和c的估计值。这意味着a是参考。Var3有类别e、f、g,并且您有f和g的估计值,因此e必须是参考。

my_fit <- glm(var5 ~ var1 + var2 + var3 + var4, data = btest, family = 'binomial')

my_fit$xlevels
$var2
[1] "a" "b" "c"

$var3
[1] "e" "f" "g"

> summary(my_fit)

Call:
glm(formula = var5 ~ var1 + var2 + var3 + var4, family = "binomial", 
    data = btest)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0344  -1.1100   0.5975   0.9605   1.9985  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.7321     0.8111   0.903  0.36673   
var1         -0.5061     0.2846  -1.778  0.07533 . 
var2b        -0.1929     0.5904  -0.327  0.74385   
var2c        -0.1968     0.5442  -0.362  0.71764   
var3f        -1.1015     0.5816  -1.894  0.05824 . 
var3g        -0.2004     0.5629  -0.356  0.72187   
var4          0.3945     0.1290   3.058  0.00223 **
---
pw9qyyiw

pw9qyyiw2#

下面是一个函数,它提取xlevels并使用broom::tidy(通过一些额外的操作),以便引用级别与所有其他项一起位于 Dataframe 中:

library(tidyverse)
library(broom)

tidy_coefs_with_ref <- function(mod_obj, sep = "_"){

  tidy_coefs <- tidy(mod_obj) %>% 
    separate(term, c("variable", "level"), sep, remove = FALSE) %>% 
    mutate(variable = paste0(variable, sep))

  xlevels <- mod_obj$xlevels  

  missing_levels <- xlevels %>% 
    enframe() %>% 
    unnest() %>% 
    set_names(c("variable", "level"))

  missing_levels %>% 
    anti_join(tidy_coefs) %>% 
    bind_rows(tidy_coefs) %>% 
    arrange(variable, level)

}

btest <- tibble(var1 = sample(c(1,2,3), 100, replace = T),
                var2 = sample(c('a','b','c'), 100, replace = T),
                var3 = sample(c('e','f','g'), 100, replace = T),
                var4 = rnorm(100, mean = 3, 2),
                var5 = sample(c(TRUE, FALSE), 100, replace = T)) %>% 
  rename_if(is.character, funs(paste0(., "_")))

btest2 <- glm(var5 ~ ., data = btest, family = 'binomial')

tidy_coefs_with_ref(btest2)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1,
#> 2, 7].
#> Joining, by = c("variable", "level")
#> # A tibble: 9 x 7
#>   variable     level term         estimate std.error statistic p.value
#>   <chr>        <chr> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)_ <NA>  (Intercept)   0.904       0.835    1.08     0.279
#> 2 var1_        <NA>  var1         -0.126       0.269   -0.468    0.640
#> 3 var2_        a     <NA>         NA          NA       NA       NA    
#> 4 var2_        b     var2_b       -0.719       0.515   -1.40     0.162
#> 5 var2_        c     var2_c       -0.632       0.525   -1.21     0.228
#> 6 var3_        e     <NA>         NA          NA       NA       NA    
#> 7 var3_        f     var3_f       -0.379       0.496   -0.764    0.445
#> 8 var3_        g     var3_g        0.429       0.517    0.829    0.407
#> 9 var4_        <NA>  var4         -0.00833     0.111   -0.0749   0.940

reprex package(v0.2.1)于2019年2月28日创建
(Step可以清除。)
同样可能相关的是,这里有一个链接,指向我使用上面的effect coding函数(的扩展)来提取降低的级别的影响的幅度的要点:https://gist.github.com/brshallo/f923b9b5c6360ce09beda35c3d1d55e9

相关问题