Need help processing multiple-response strings from Google Forms using R


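I'm trying to process the results of a Google Form in R, but I've run into trouble handling the string data.
Google returns the results in a single column, with each response separated by a comma.
They end up looking like this: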

ID | Type of Research
=====================
1  | Policy analysis, Review of other research
2  | Bla
3  | Review of other research, Original empirical research
4  | Policy analysis, Theoretical 
5  | Review of other research

I have used grepl to create logical columns for the three predefined responses, and combined them in a data.frame.

Private$ResearchTypeOriginal <- grepl("Original", Private$ResearchType)
Private$ResearchTypeReview <- grepl("Review", Private$ResearchType)
Private$ResearchTypePolicy <- grepl("Policy", Private$ResearchType)

ResearchTypeGrid <- data.frame(Private$ResearchTypeOriginal, Private$ResearchTypeReview, Private$ResearchTypePolicy)


This works fine. However, I also need to pull out the "other" responses. I'm using

ResearchTypeOther <- subset(Private, !grepl("Original", Private$ResearchType) & !grepl("Review", Private$ResearchType) & !grepl("Policy", Private$ResearchType), select=c(ID, ResearchType, PubLang, Reviewer))
ResearchTypeOther <- na.omit(ResearchTypeOther)


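But I've just realised that this approach loses any response that contains both a predefined option and an open-ended answer. It gives me the "Bla" response fine, but it only returns responses that consist entirely of "other" text.
In other words, I get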

ID |  Type of Research
=======================
2  |  Bla


But what I want is

ID |  Type of Research
======================
2  |  Bla
4  |  Policy analysis, Theoretical


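This is my first post on SO, and I'm obviously new to R, so please forgive any mistakes in how I've asked this. I apologise if the wording is poor. I have about 20 other questions with the same issue, so I need a flexible solution.
Thanks for your help.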

Answer 1

You could "regex your way through", along these lines:

# Sample data as a character vector, so the snippet also runs non-interactively
doc <- c("1  | Policy analysis, Review of other research",
         "2  | Bla",
         "3  | Review of research, Original empirical research",
         "4  | Policy analysis, Theoretical ",
         "5  | Review of other research")

items <- c("Review of other research", 
           "Original empirical research", 
           "Policy analysis")
(others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "", 
           sub(".*\\|\\s(.*)", "\\1", doc)))
# [1] ""                   "Bla"                "Review of research"
# [4] "Theoretical "       ""  

sub(sprintf("(,\\s)?(%s)(,\\s)?", paste(others[others != ""], collapse = "|")), "", doc)
# [1] "1  | Policy analysis, Review of other research"
# [2] "2  | "                                         
# [3] "3  | Original empirical research"              
# [4] "4  | Policy analysis"                          
# [5] "5  | Review of other research"
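An alternative to removing the known options with a regular expression is to split each response on commas and compare the pieces with the list of options. The following is only a minimal sketch under stated assumptions: `responses` is a stand-in holding the same strings as `doc` without the ID prefix, and `extract_other()` is a hypothetical helper name, not a function from any package.

```r
# Stand-in data: the same strings as `doc`, without the "ID | " prefix.
responses <- c("Policy analysis, Review of other research",
               "Bla",
               "Review of research, Original empirical research",
               "Policy analysis, Theoretical ",
               "Review of other research")

items <- c("Review of other research",
           "Original empirical research",
           "Policy analysis")

# extract_other() is a hypothetical helper: it returns, for each response,
# the comma-separated answers that are not among the predefined items.
extract_other <- function(x, items) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)                      # drop spaces around each answer
    p <- p[!is.na(p) & nzchar(p)]       # drop missing and empty pieces
    paste(p[!p %in% items], collapse = ", ")
  }, character(1))
}

others <- extract_other(responses, items)
# others is c("", "Bla", "Review of research", "Theoretical", "")
```

Because each answer is compared as a whole string, no regex escaping is needed if an option contains characters such as parentheses. One limitation of this sketch: a free-text answer that itself contains a comma is split into pieces and rejoined with ", ".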


Answer 2

Thanks to lukeA. It isn't elegant at all, but this works:

items <- c("Review of other research", 
           "Original empirical research", 
           "Policy analysis")
ResearchTypeOther <- data.frame((others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "", 
           sub(".*\\|\\s(.*)", "\\1", Private$ResearchType))))
ResearchTypeOther[ResearchTypeOther==""] <- NA
ResearchTypeOther <- na.omit(ResearchTypeOther)


Answer 3

You could try the following (using doc and items from @lukeA's answer):

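library(stringr)
doc[sapply(strsplit(doc, "\\d +\\||,"), function(x) {
  x1 <- str_trim(x)        # strip whitespace around each piece
  x2 <- x1[x1 != ""]       # drop the empty piece left by the ID prefix
  indx <- x2 %in% items    # TRUE for predefined options
  !(any(indx) & tail(indx, 1))
})]
# [1] "2  | Bla"                            "4  | Policy analysis, Theoretical "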
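The test above depends on whether the last piece of each response is a predefined option. A variant that checks every piece, and works directly on a data frame so the ID column is kept, might look like the sketch below. It uses base R only. The `Private` data frame here is a small stand-in built from the table in the question (the real data is not shown), and `has_other()` is a hypothetical helper name.

```r
# Hypothetical stand-in for the Private data frame from the question.
Private <- data.frame(
  ID = 1:5,
  ResearchType = c("Policy analysis, Review of other research",
                   "Bla",
                   "Review of other research, Original empirical research",
                   "Policy analysis, Theoretical ",
                   "Review of other research"),
  stringsAsFactors = FALSE
)

items <- c("Review of other research",
           "Original empirical research",
           "Policy analysis")

# has_other() is a hypothetical helper: TRUE when at least one
# comma-separated answer is not one of the predefined items.
has_other <- function(x, items) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)                    # drop spaces around each answer
    p <- p[!is.na(p) & nzchar(p)]     # drop missing and empty pieces
    !all(p %in% items)
  }, logical(1))
}

# Keep the full original response for every row with an "other" answer.
ResearchTypeOther <- Private[has_other(Private$ResearchType, items),
                             c("ID", "ResearchType")]
# ResearchTypeOther contains the rows with ID 2 and 4
```

For the other questions in the form, the same helper can be called with a different column and a different vector of predefined options.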
