R: The easiest way to clean messy data [closed]

Posted by vuktfyat on 2023-01-18 in Other
Imagine the following dataset:

df1 <- tibble::tribble(~City,   ~Year,  ~Coffee,    ~Tea,   ~Year,  ~Sugar, ~At,    ~Empty,
"NY",   "2020", "", "", "2020", "2",    "", "",
"NY",   "2019", "5",    "3",    "2019", "5",    "", "",
"City", "Year", "Coffee",   "Tea",  "Year", "Sugar",    "", "",
"ATL",  "2020", "", "", "2020", "2",    "", "",
"ATL",  "2019", "5",    "3",    "2019", "5",    "", "",
"Data input by: Alex",  "", "", "", "", "", "", "",
"BOS",  "Year", "", "Coffee",   "", "Tea",  "Sugar",    "",
"BOS",  "2020", "", "7",    "2020", "8",    "3",    "",
"BOS",  "2019", "", "7",    "2019", "7",    "2",    "",
"MS",   "Year", "Frappacino",   "Green Tea",    "", "Coffee",   "Sugar",    "",
"MS",   "2020", "5",    "6",    "2019", "8",    "3",    "",
"MS",   "2019", "5",    "5",    "2020", "8",    "3",    "",
"City", "Year", "Coffee",   "Tea",  "Year", "Sugar",    "At",   "",
"HW",   "2020", "500",  "300",  "2020", "200",  "", "",
"HW",   "2019", "450",  "320",  "2019", "180",  "", "",
"Data input by: Aleksanteri",   "", "", "", "", "", "", "",
"Kaupunki", "Vuosi",    "Kahvi",    "Tee",  "Vuosi",    "At",   "Sokeri",   "",
"HEL",  "2020", "7",    "4",    "2018", "", "4",    "",
"HEL",  "2019", "7",    "4",    "2019", "", "4",    "",
"HEL",  "2018", "6",    "3",    "2020", "", "5",    "")

The same dataset was also shown as a color-coded image for visual reference (not reproduced here).

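Problems:
1. City ATL is erroneous (it is a copy of NY) [but we cannot know whether such a pattern exists elsewhere].
2. Two people entered data into the original database (the red rows; the note is in the first column here, but it could be anywhere in the row).
3. BOS is shifted by one cell starting from Year, and the second "Year" name is missing.
4. MS has Frappacino and Green Tea, but no data for Tea :(
5. MS has the Coffee data flipped by year (2020 is 2019 and 2019 is 2020).
6. HW was entered in the wrong currency (the numbers are too different); the correct value is X/100 (500 for HW equals 5, i.e. 500/100).
7. HEL has Finnish column names in the input (we want to check that the data is not shifted), so we know: Vuosi = Year; Kaupunki = City; Kahvi = Coffee; Tee = Tea; Sokeri = Sugar :)
8. HEL has the years shifted for Sokeri (Sugar) sales compared to Kahvi (Coffee) and Tee (Tea).
9. The Empty column is completely empty (it could have been placed where the data might shift [not our case]).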
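
Two of these problems, the "Data input by" notes and the header rows repeated inside the data, can be flagged before any reshaping. The following is only a minimal base R sketch: `toy` is a hypothetical subset of df1, and the rule "a real data row has a four-digit year in its first Year column" is an assumption that happens to hold for this dataset, not a general solution.

```r
# Hypothetical subset of df1 with the same kinds of junk rows
toy <- data.frame(
  City   = c("NY", "City", "ATL", "Data input by: Alex", "Kaupunki", "HEL"),
  Year   = c("2020", "Year", "2020", "", "Vuosi", "2019"),
  Coffee = c("", "Coffee", "", "", "Kahvi", "7"),
  stringsAsFactors = FALSE
)

# A row is a note if any of its cells starts with "Data input by"
is_note <- apply(toy, 1, function(r) any(grepl("^Data input by", r)))

# Assumption: a row that is not a note and has no four-digit year
# in the Year column is an embedded header row
is_header <- !is_note & !grepl("^[0-9]{4}$", toy$Year)

toy$row_type <- ifelse(is_note, "note", ifelse(is_header, "header", "data"))
print(toy[, c("City", "Year", "row_type")])
```

Labeling the rows first keeps the later steps simple: notes can be dropped, and header rows can be used to rename the values that follow them.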

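Is there a simple way to deal with data problems like these?
I just ran into this and was fixing the issues one by one, then stopped and thought that there should be a simpler way to handle everything.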

Answer #1 (s3fp2yjn)

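This is a bit quick and dirty, and you may need some fine-tuning if your data is even more variable than I assumed.
A key requirement of my approach is that the first row (the column names) is correct: at least City must have the right name, and Year must be named correctly and sit at known positions. You could look the positions up from the table's names, but the Year column must not be shifted to another column.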
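
The requirement above can be checked up front. This is a minimal base R sketch that assumes the column names of df1 from the question; `year_pos`, `left_block` and `right_block` are illustrative names and are not part of the code below.

```r
# Column names of df1 as given in the question
cols <- c("City", "Year", "Coffee", "Tea", "Year", "Sugar", "At", "Empty")

# City must be the first column
stopifnot(cols[1] == "City")

# There must be exactly two Year columns, the first one right after City
year_pos <- which(cols == "Year")
stopifnot(length(year_pos) == 2, year_pos[1] == 2)

# Column indices of the two blocks the table is split into:
# everything before the second Year, and City plus everything from it onward
left_block  <- 1:(year_pos[2] - 1)
right_block <- c(1, year_pos[2]:length(cols))

print(cols[left_block])
print(cols[right_block])
```

If either check fails, the split at the second Year column would pair values with the wrong year, so it is better to stop early than to reshape silently.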
I am using data.table here (plus stringr for str_replace_all).

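library(data.table)
library(stringr)  # provides str_replace_all(), used below

setDT(df1)

# find the Year columns; the table is split at the last one
cols <- names(df1)
ypos <- last(which(cols %in% "Year"))

# split the data at the second Year column and stack both halves, so every
# value is paired with the Year of its own half (needed for HEL and MS)
dt <- rbindlist(list(df1[, 1L:(ypos - 1L), with = FALSE],
                     df1[, c(1L, ypos:length(cols)), with = FALSE]), fill = TRUE)

# melt to long format; the column names from the first row are provisional
# and are fixed later on
# this only works when the Year column is never shifted to another column
dt <- melt(dt, id.vars = c("City", "Year"), variable.factor = FALSE)

# drop rows with neither a year nor a value (includes the "Data input by" notes)
dt <- dt[!(Year == "" & value == ""),]
# embedded header rows: replace the header label with the city of the next row
dt <- dt[, City := fifelse(City %in% c("City", "Kaupunki"), shift(City, type = "lead"), City)]
# drop empty cells and the "At" placeholder
dt <- dt[!(value == "" | value == "At")]

# translate the Finnish header labels to English
v_f <- c("Kahvi", "Tee", "Sokeri")
v_t <- c("Coffee", "Tea", "Sugar")
dt[, value := str_replace_all(value, setNames(v_t, v_f))]

# within each run of the same city, the first value is a header label if the
# block had its own header row; in that case it overrides the variable name
dt[, new_variable := first(value), rleid(City)]
dt[is.na(as.numeric(new_variable)), variable := new_variable][, new_variable := NULL]
# header labels become NA here (with a coercion warning) and are dropped
dt[, value := as.numeric(value)]
dt <- dt[!is.na(value)]
# HW was entered at 100 times the scale of the other cities
dt[, value := fifelse(value > 100, value / 100, value)]

# back to wide format: one row per city and year
dcast(dt, City + Year ~ variable, value.var = "value")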

Result

    City Year Coffee Frappacino Green Tea Sugar Tea
 1:  ATL 2019    5.0         NA        NA   5.0 3.0
 2:  ATL 2020     NA         NA        NA   2.0  NA
 3:  BOS 2019    7.0         NA        NA   2.0 7.0
 4:  BOS 2020    7.0         NA        NA   3.0 8.0
 5:  HEL 2018    6.0         NA        NA   4.0 3.0
 6:  HEL 2019    7.0         NA        NA   4.0 4.0
 7:  HEL 2020    7.0         NA        NA   5.0 4.0
 8:   HW 2019    4.5         NA        NA   1.8 3.2
 9:   HW 2020    5.0         NA        NA   2.0 3.0
10:   MS 2019    8.0          5         5   3.0  NA
11:   MS 2020    8.0          5         6   3.0  NA
12:   NY 2019    5.0         NA        NA   5.0 3.0
13:   NY 2020     NA         NA        NA   2.0  NA
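
Note that the result still lists ATL with the same values as NY, which the question describes as a copy. One way to flag such candidates for manual review is sketched below in base R, using three cities from the result above. `res`, `sig` and `dup_cities` are hypothetical names, and identical values alone do not prove a copy error.

```r
# Three cities taken from the result table above
res <- data.frame(
  City   = c("ATL", "ATL", "BOS", "BOS", "NY", "NY"),
  Year   = c(2019, 2020, 2019, 2020, 2019, 2020),
  Coffee = c(5, NA, 7, 7, 5, NA),
  Sugar  = c(5, 2, 2, 3, 5, 2),
  Tea    = c(3, NA, 7, 8, 3, NA)
)

# One signature string per city, built from all non-City columns
# after sorting each city's rows by year
sig <- sapply(split(res[, -1], res$City), function(d) {
  paste(unlist(d[order(d$Year), ]), collapse = "|")
})

# Cities that share a signature with at least one other city
dup_cities <- names(sig)[duplicated(sig) | duplicated(sig, fromLast = TRUE)]
print(dup_cities)
```

Whether a flagged city is really an error still has to be decided by someone who knows the data, since two cities could legitimately report the same numbers.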
