R: Is there an equivalent of Stata's codebookout command?

Posted by vshtjzan on 2023-09-27 in Other

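In Stata, I can use the codebookout command to create an Excel workbook that stores the name, label, and storage type of every variable in an existing dataset, together with the corresponding values and value labels.
I would like to find an equivalent function in R. So far I have come across the memisc library, which has a function called codebook, but it does not do the same thing as the Stata command.
For example, in Stata the codebook output looks like this (see below; this is what I want):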

Variable Name   Variable Label    Answer Label  Answer Code    Variable Type
    hhid               hhid           Open ended                    String
    inter_month        inter_month    Open ended                    long
    year               year           Open ended                    long
    org_unit           org_unit                                     long
                                      Balaka         1  
                                      Blantyre       2  
                                      Chikwawa       3  
                                      Chiradzulu     4

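That is, each column of the data frame is evaluated to produce the values of 5 different columns:

  • Variable Name: the name of the column.
  • Variable Label: also the name of the column.
  • Answer Label: the unique values in the column. If there are no unique values, the variable is considered open ended.
  • Answer Code: the number assigned to each category in Answer Label. It is blank if the Answer Label is not categorical.
  • Variable Type: int, string, long (date).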

Here is my attempt:

CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                                                  rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                                                   rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i], 
                                                  rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)

  }
  return(table)
}

Unfortunately, I get the following warning messages:

Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
  invalid factor level, NA generated

The output I get has the Answer Code column garbled:

Variable Name Variable Label Variable Type Answer Code Answer Label
hhid                   hhid           hhid    Open ended                character
month                 month          month    Open ended                  integer
year                   year           year    Open ended                  integer
org_unit           org_unit       org_unit    Open ended                character
v000                   v000           v000    Open ended                character
v001                   v001           v001    Open ended                  integer
v002                   v002           v002    Open ended                  integer
v003                   v003           v003    Open ended                  integer
v005                   v005           v005    Open ended                  integer
v006                   v006           v006    Open ended                  integer
v007                   v007           v007    Open ended                  integer
v021                   v021           v021    Open ended                  numeric
2285                   v024           v024       central        <NA>       factor
1                                                  north        <NA>             
7119                                               south        <NA>             
11                     v025           v025         rural        <NA>       factor
1048                   v025           v025         urban        <NA>       factor
district_name district_name  district_name    Open ended                character
coords_x1         coords_x1      coords_x1    Open ended                  numeric
coords_x2         coords_x2      coords_x2    Open ended                  numeric
itn_color         itn_color      itn_color    Open ended                  numeric
piped                 piped          piped    Open ended                  numeric
sanit                 sanit          sanit    Open ended                  numeric
sanit_cd           sanit_cd       sanit_cd    Open ended                  numeric
water                 water          water    Open ended                  numeric
Answer 1

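I decided to try this for my own amusement, using the built-in Titanic data set. I do have one quibble with your definitions, though: you say "if there are no unique values it is considered open ended". But every variable of length > 0 has some unique values; do you mean "if every value is unique"? Even that definition won't necessarily work as expected: in the Titanic data set the response is integer-valued and happens to have only 22 unique values out of a total of 32. I don't think anyone would really want to enumerate those, so I tested for type factor instead (but you could substitute the commented-out length(u)==length(x) line below if you really wanted to).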
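The counts mentioned above can be checked with a few lines of base R. This is only a quick sketch; it assumes the same `as.data.frame.table(Titanic)` conversion that the example in this answer uses, and the names `n_total` and `n_unique` are illustrative.

```r
## Titanic ships with base R (package datasets) as a 4-way table
xx <- as.data.frame.table(Titanic)

n_total  <- length(xx$Freq)          # number of rows in the Freq column
n_unique <- length(unique(xx$Freq))  # number of distinct counts

## Freq repeats some values, so an "every value is unique" rule would
## enumerate it as if it were categorical; is.factor() does not
print(c(total = n_total, unique = n_unique))
print(is.factor(xx$Freq))
```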

## utility function: pad vector with blanks to specified length
pad <- function(x,n,p="") {
    return(c(x,rep(p,n-length(x))))
}
## process a single column
proc_col <- function(x,nm) {
    u <- unique(x)
    ## if (length(u)==length(x)) {
    if (!is.factor(x)) {
        n <- 1
        u <- "open ended"
        cc <- ""
    } else {
        cc <- as.numeric(u)
        n <- length(u)
    }
    dd <- data.frame(`Variable Name`=pad(nm,n),
               `Variable Label`=pad(nm,n),
               `Answer Label`=u,
               `Answer Code`=cc,
               `Variable Type`=pad(class(x),n),
               stringsAsFactors=FALSE)
    return(dd)
}
## process all columns
proc_df <- function(x) {
    L <- Map(proc_col,x,names(x))
    dd <- do.call(rbind,L)
    rownames(dd) <- NULL
    return(dd)
}

Example:

xx <- as.data.frame.table(Titanic)
proc_df(xx)

##    Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type
## 1          Class          Class          1st           1       factor
## 2                                        2nd           2              
## 3                                        3rd           3              
## 4                                       Crew           4              
## 5            Sex            Sex         Male           1       factor
## 6                                     Female           2              
## 7            Age            Age        Child           1       factor
## 8                                      Adult           2              
## 9       Survived       Survived           No           1       factor
## 10                                       Yes           2              
## 11          Freq           Freq   open ended                  numeric

I didn't leave a blank row before the list of code values, but you can make that adjustment yourself.
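One way that adjustment might look is sketched below. It is not part of the answer's code: `proc_col_blank` and `proc_df_blank` are hypothetical names, and the sketch makes two further assumptions that differ from the code above. It takes codes from `levels(x)` (so unused levels are listed too, in level order rather than order of appearance), and it passes `check.names = FALSE` so the column names keep their spaces.

```r
## pad a vector with blanks to length n (same idea as pad() above)
pad <- function(x, n, p = "") c(x, rep(p, n - length(x)))

## hypothetical variant of proc_col(): a factor column gets one header
## row with a blank label and code, followed by one row per level
proc_col_blank <- function(x, nm) {
    if (!is.factor(x)) {
        lab  <- "open ended"
        code <- ""
    } else {
        lev  <- levels(x)
        lab  <- c("", lev)
        code <- c("", seq_along(lev))   # integer codes in level order
    }
    n <- length(lab)
    data.frame(`Variable Name`  = pad(nm, n),
               `Variable Label` = pad(nm, n),
               `Answer Label`   = lab,
               `Answer Code`    = code,
               `Variable Type`  = pad(class(x)[1], n),
               check.names = FALSE,
               stringsAsFactors = FALSE)
}

## apply the variant to every column and stack the pieces
proc_df_blank <- function(x) {
    dd <- do.call(rbind, Map(proc_col_blank, x, names(x)))
    rownames(dd) <- NULL
    dd
}

cb <- proc_df_blank(as.data.frame.table(Titanic))
print(head(cb, 6))
```

The result is an ordinary data frame, so something like `write.csv(cb, "codebook.csv", row.names = FALSE)` (the file name is just an example) gives a file that Excel can open; writing a native .xlsx workbook would need an add-on package.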

Answer 2

Here is my solution:

CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                                                  rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                                                   rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i],
                                                  rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
    # names must follow the column order passed to data.frame() above
    names(df) <- c("Variable Name", "Variable Label", "Answer Label", "Answer Code", "Variable Type")
    table <- rbind(table, df)

  }
  rownames(table) <- 1:nrow(table)
  return(table)
}

Output:

   Variable Name Variable Label  Answer Label Answer Code Variable Type
1           brid           brid    Open ended                character
2          month          month    Open ended                  integer
3           year           year    Open ended                  integer
4       org_unit       org_unit    Open ended                character
5           v000           v000    Open ended                character
6           v001           v001    Open ended                  integer
7           v002           v002    Open ended                  integer
8           v003           v003    Open ended                  integer
9           v005           v005    Open ended                  integer
10          v006           v006    Open ended                  integer
11          v007           v007    Open ended                  integer
12          v021           v021    Open ended                  numeric
13          v024           v024       central           1       factor
14                                      north           2             
15                                      south           3             
16          v025           v025         rural           1       factor
17                                      urban           2             
18          bidx           bidx    Open ended                  integer
19 district_name  district_name    Open ended                character
20     coords_x1      coords_x1    Open ended                  numeric
21     coords_x2      coords_x2    Open ended                  numeric
22          anc4           anc4    Open ended                  numeric
23    antimal_48     antimal_48    Open ended                  numeric
24         carep          carep    Open ended                  numeric
25          csec           csec    Open ended                  numeric
26          dptv           dptv    Open ended                  numeric
27       ebreast        ebreast    Open ended                  numeric
28       fans_48        fans_48    Open ended                  numeric
29        ideliv         ideliv    Open ended                  numeric
30          iptp           iptp    Open ended                  numeric
31        iron90         iron90    Open ended                  numeric
32      measlesv       measlesv    Open ended                  numeric
33           ors            ors    Open ended                  numeric
34           ort            ort    Open ended                  numeric
35         pncwm          pncwm    Open ended                  numeric
36       sstools        sstools    Open ended                  numeric
37            tt             tt    Open ended                  numeric
38          vita           vita    Open ended                  numeric
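Two changes relative to the code in the question appear to account for the difference. The test `length(AnswerCode) > 1` replaces `length(AnswerCode) - 1 > 1`, which skipped the padding for two-level factors such as v025. And `stringsAsFactors = FALSE` keeps the text columns as character: before R 4.0.0, `data.frame()` converted strings to factors by default, so the empty Answer Code strings became a factor and the integer codes appended later by `rbind()` were not valid levels. The sketch below reproduces that second situation in isolation by forcing `stringsAsFactors = TRUE`; the names `open_rows`, `fac_rows`, `bad`, and `good` are illustrative only.

```r
## rows for a non-factor variable: the code column holds ""
open_rows <- data.frame(code = "", stringsAsFactors = TRUE)
## rows for a factor variable: the code column holds integers
fac_rows  <- data.frame(code = 1:3)

## rbind() tries to store 1:3 in a factor whose only level is ""
bad <- withCallingHandlers(
    rbind(open_rows, fac_rows),
    warning = function(w) {
        message("caught: ", conditionMessage(w))
        invokeRestart("muffleWarning")
    })
print(bad$code)    # the integer codes have been replaced by NA

## with a character column the codes survive, stored as strings
good <- rbind(data.frame(code = "", stringsAsFactors = FALSE), fac_rows)
print(good$code)
```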
