Recode different "types" of missing values in R from a .dta file

Posted by smdnsysy on 2023-05-26 in Other


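I have a Stata file (.dta) that contains different types of missing data: a value is missing either because the question was not relevant to the person (.) or because the person did not know the answer (.r). These differences matter for my analysis. I don't have access to Stata and would like to do this analysis in R. I have looked at the {sjlabelled}, {labelled}, and {haven} packages but could not find a way to recode these different types of missing data.
The Stata command tab q2, m gives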

q2 |
xxxxxxxxxxxxxxxxx |
x sorry sensitive |
      xxxxxxxxxxx |
       xxxxxxxxxx |      Freq.     Percent        Cum.
------------------+-----------------------------------
               No |        342       14.43       14.43
              Yes |        673       28.40       42.83
                . |      1,234       52.07       94.89
               .r |        121        5.11      100.00
------------------+-----------------------------------
            Total |      2,370      100.00

However, in R there is no distinction between . and .r

table(mydf$q2, useNA = "always")

gives

   0    1 <NA> 
 342  673 1355

However, R does recognize that there are different "types" of missing (NA and NA(r))

sjlabelled::tidy_labels(mydf$q2)
<labelled<double>[2370]>: q2: xxxxx?
   [1]     1     NA   NA(r)     1     1    NA    NA    NA    NA    NA     0     0     1     1    NA     1    NA     1    NA    NA    NA    NA    NA     1    NA    NA    NA    NA(r)    NA    NA    NA(r)    NA

and

> get_labels(mydf$q2, values = "n", drop.na = FALSE)  
               -888                   0                   1 
"Unsure/Don’t Know"                "No"               "Yes"

How can I recode the Unsure/Don’t Know category as a valid value of the variable rather than as missing, while keeping the other missing values actually missing?

Update

As requested, here is the output of str() as well

> str(mydf$q2)
 dbl+lbl      1,     1,    NA,     1,    NA,     0,    NA,    NA,    NA,     0,    NA,    NA,     1,    NA, NA(r),    NA,    NA,    NA, NA(r),    NA,    NA
 @ label       : chr "xxxx?"
 @ format.stata: chr "%19.0g"
 @ labels      : Named num [1:3] -888 0 1
  ..- attr(*, "names")= chr [1:3] "Unsure/Don’t Know" "No" "Yes"

Data

Here is a link to a small dataset with the same data structure

Tags: r

Source: https://stackoverflow.com/questions/70664868/recode-different-types-of-missing-in-r-from-a-dta-file

1 answer


v6ylcynt1#

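There may be a better way to solve your problem, but one approach I thought of is to create a variable that records the different types of missing values, so that you can see them in R. This variable can then guide your later data analysis.
1. You can add a label in Stata. For example, using a dataset I wrote: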

use test_data.dta, clear 
gen index = "" 
replace index = "no missing" if !missing(s2)
replace index = "missing value type 1" if s2 == .r
replace index = "missing value type 2" if s2 == .
save test_data_labeled.dta, replace

2. In R, you can import the data together with this label variable. You can then generate a new character variable (s2_mod in this case) that covers all the possible missing-value types.

# Import data and packages 
rm(list = ls())

library(pacman)

p_load(rio, 
       here, 
       tidyverse, 
       magrittr)

df <- import(here("datos", "test_data_labeled.dta"))

# Recode the variable according to the missing-value type
df <- df %>%
  mutate(mv1 = ".r", mv2 = ".")

df2 <- df %>%
  transmute(s2 = as.character(s2),
            s2_mod = case_when(
              index == "missing value type 1" ~ mv1,
              index == "missing value type 2" ~ mv2))

# Fill the remaining rows (non-missing s2) with the original value, row by row
df2$s2_mod <- coalesce(df2$s2_mod, df2$s2)

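In this case, the variable s2_mod is a character variable that holds all the original values, with ".r" and "." marking the two types of missing.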
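If Stata is not available, the two kinds of missing can probably be separated in R alone. The sketch below assumes the column holds haven-style tagged missing values, as the NA(r) entries in the question's output suggest. It uses a toy vector in place of the real q2, and the code 2 for "Unsure/Don't Know" is an arbitrary choice made for illustration.

```r
library(haven)

# Toy stand-in for q2 (not the asker's data): 0 = No, 1 = Yes,
# NA = Stata's system missing (.), tagged_na("r") = Stata's .r
q2 <- c(1, NA, tagged_na("r"), 0, 1, tagged_na("r"), NA)

# na_tag() gives the tag letter for tagged NAs and NA for everything else
tags <- na_tag(q2)

# Turn .r into a real value (2 is an arbitrary, assumed code); plain NA stays NA
q2_num <- ifelse(is_tagged_na(q2, "r"), 2, q2)

q2_fct <- factor(q2_num,
                 levels = c(0, 1, 2),
                 labels = c("No", "Yes", "Unsure/Don't Know"))

table(q2_fct, useNA = "always")
```

For a column imported with haven::read_dta(), the same helpers are meant to apply to the imported vector, though this has not been run against the linked file.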

Answered on 2023-05-26
