How to add information from dataframe #1 to dataframe #2 when #1 contains information not found in #2, but ignore information in #1 that differs from #2

ecfsfe2w  posted on 2023-02-01  in Other

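I have one dataframe of nest boxes from 2021 and one from 2020. The 2021 dataframe has fewer fields, and some boxes have NA values in 2021 that are filled in for 2020. However, in some of the shared fields the information for a box differs between the dataframes (e.g. bf2 is "plastic" in 2020 and "wooden" in 2021). I want to treat everything filled in for 2021 as correct, and add to that dataframe:
1. The boxes listed only in Data2020 (rf3, rf4, rf6, rf7), with all of their corresponding fields filled in.
2. The columns that exist in 2020 but not in 2021 (box.age and land.water), appended to the boxes listed in 2021 (bf1, bf2, rf1).
3. Any missing information that is present in 2020 but not in 2021 (boxes.per.post for bf8; habitat.type and box.material for bf9).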

#Sample Code
box.id <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9")
boxes.per.post <- c("single", "single", "single", "single", "duplex", "single", NA, "duplex")
habitat.type <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", NA)
box.material <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA)

Data2021 <- data.frame(box.id, boxes.per.post, habitat.type, box.material)

box.id.1 <- c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9")
boxes.per.post.1 <- c("single", "single", "single", "single", "duplex", "single", "duplex", "single", "single")
habitat.type.1 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",  
                    "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond")
box.material <- c("wooden", "plastic", "plastic", "wooden","wooden","wooden","wooden", "plastic", "wooden")
box.age <- c(1,3,4,5,6,3,7,8, 12)
land.water <- c("land", "water", "land", "water", "water", "water", "water", "land", "water")

Data2020 <- data.frame(box.id.1, boxes.per.post.1, habitat.type.1, box.material, box.age, land.water)

colnames(Data2020) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")

#End Sample Code

I have tried the following:

#Start attempts
cbind(Data2021, Data2020)                                                         #error... different number of rows

Data.merged <- left_join(Data2021, Data2020, by = "box.id")                       #doesn't add rf3-rf7 

Data.merged.1 <- left_join(Data2021, Data2020, 
                           by = c("box.id", "boxes.per.post", "habitat.type", "box.material"))  #only appends rows that match completely

Data.merged.2 <- full_join(Data2021, Data2020, by = "box.id")         
#creates separate columns for matching fields (ie boxes.per.post.x and boxes.per.post.y)

Data.merged.3 <- merge(Data2021, Data2020, by = "box.id", all.x = TRUE, all.y = TRUE) #creates extra columns again
#End attempts

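The main problem I am running into is that the information for individual boxes is inconsistent between the two tables. I assume this will take several steps; I am just struggling to put the necessary steps together.
Here is my expected result: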

box.id.2 <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9", "rf3", "rf4", "rf6", "rf7")
boxes.per.post.2 <- c("single", "single", "single", "single", "duplex", "single", "single", "duplex", "single", "duplex",
                      "single", "duplex")
habitat.type.2 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", "Pond",
                    "Ephemeral Wetland", "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek")
box.material.2 <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", "wooden", "wooden", "wooden",
                    "wooden","wooden")
box.age.2 <- c(1,3,NA, NA,4,NA,8,12,5,6,3,7)
land.water.2 <- c("land", "water", NA, NA, "land", NA, "land", "water", "water", "water", "water", "water")

Data.expected <- data.frame(box.id.2, boxes.per.post.2, habitat.type.2, box.material.2, box.age.2, land.water.2)
colnames(Data.expected) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")
Answer #1 (wgx48brx)

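Data.expected has a different number of rows than either input, but I think what you need after the merge is a coalesce across all of the duplicated columns.
Try this: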

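library(dplyr)

# Data2020 is the left table (.x columns); Data2021 is the right table (.y columns).
left_join(Data2020, Data2021, by = "box.id") %>%
  # For each .x column, prefer the matching .y (2021) value and fall back to .x (2020).
  mutate(across(ends_with(".x"), ~ coalesce(cur_data()[[ sub("\\.x$", ".y", cur_column()) ]], .))) %>%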
  select(-ends_with(".y")) %>%
  rename_with(.fn = ~ sub("\\.x$", "", .))
#   box.id boxes.per.post      habitat.type box.material box.age land.water
# 1    bf1         single Ephemeral Wetland       wooden       1       land
# 2    bf2         single       River/Creek       wooden       3      water
# 3    rf1         duplex              Pond       wooden       4       land
# 4    rf3         single Ephemeral Wetland       wooden       5      water
# 5    rf4         duplex Ephemeral Wetland       wooden       6      water
# 6    rf6         single Ephemeral Wetland       wooden       3      water
# 7    rf7         duplex       River/Creek       wooden       7      water
# 8    bf8         single              Pond      plastic       8       land
# 9    bf9         duplex              Pond       wooden      12      water

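coalesce is a function that returns the first non-NA value among its arguments. It is vectorized and accepts one or more vectors. The following demonstrates how it behaves: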

coalesce(1, NA) # no change
# [1] 1
coalesce(NA, 1)
# [1] 1
coalesce(2, NA, 1) # no change
# [1] 2
coalesce(NA, NA, 1)
# [1] 1

In each case it takes the first non-NA value, element by element.
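The calls above use single values. As a minimal sketch of the vectorized behavior, the two vectors below are hypothetical stand-ins for the 2021 and 2020 versions of one column (they are not taken from the question's data):

```r
library(dplyr)

# Hypothetical vectors standing in for one column as recorded in each year.
material_2021 <- c("wooden", NA, "plastic", NA)
material_2020 <- c("plastic", "wooden", NA, NA)

# Position by position: keep the 2021 value when present, otherwise use 2020.
# A position that is NA in both inputs stays NA.
merged <- coalesce(material_2021, material_2020)
print(merged)
```

Putting the 2021 vector first is what gives it priority; swapping the arguments would prefer 2020 instead.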
The join above may need adjusting, depending on things I don't know:

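  • full_join instead of left_join, which would bring in the additional rows that appear in Data.expected (the boxes listed only in Data2021).
  • The order of the arguments to coalesce. As written, non-NA values from the RHS (Data2021, whose joined columns are named *.y) take priority over the LHS (Data2020, *.x). That ordering is a choice; here it matches treating 2021 as correct.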
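As a sketch of the first adjustment, the version below uses full_join with Data2021 on the left and spells out each coalesce explicitly instead of using across. The `suffix` values `.2021` and `.2020` and the name `merged` are my own choices for readability, and the data is copied from the question so the block runs on its own:

```r
library(dplyr)

Data2021 <- data.frame(
  box.id         = c("bf1", "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9"),
  boxes.per.post = c("single", "single", "single", "single", "duplex", "single", NA, "duplex"),
  habitat.type   = c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond",
                     "Pond", "Ephemeral Wetland", "Pond", NA),
  box.material   = c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA)
)

Data2020 <- data.frame(
  box.id         = c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9"),
  boxes.per.post = c("single", "single", "single", "single", "duplex", "single",
                     "duplex", "single", "single"),
  habitat.type   = c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",
                     "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond"),
  box.material   = c("wooden", "plastic", "plastic", "wooden", "wooden", "wooden",
                     "wooden", "plastic", "wooden"),
  box.age        = c(1, 3, 4, 5, 6, 3, 7, 8, 12),
  land.water     = c("land", "water", "land", "water", "water", "water", "water", "land", "water")
)

# full_join keeps every box from both years; shared columns get the suffixes below.
merged <- full_join(Data2021, Data2020, by = "box.id", suffix = c(".2021", ".2020")) %>%
  mutate(
    # 2021 wins when it is filled in; 2020 only fills the gaps.
    boxes.per.post = coalesce(boxes.per.post.2021, boxes.per.post.2020),
    habitat.type   = coalesce(habitat.type.2021, habitat.type.2020),
    box.material   = coalesce(box.material.2021, box.material.2020)
  ) %>%
  select(box.id, boxes.per.post, habitat.type, box.material, box.age, land.water)

print(merged)
```

With only three shared columns, writing the coalesce calls out by hand is arguably easier to read than the across version; with many shared columns the across approach scales better.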

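across is a good way to apply the same function to several columns. Because it works on one column at a time (the inner expression sees only a vector of values), we use the special function cur_column() (which might return the string "box.material.x", for example), sub to replace the trailing literal ".x" with ".y", and cur_data()[[ ... ]] to retrieve the other column of the c(".x", ".y") pair. (There are other tricks, including .names = "..." in across, but I don't think they add any functionality here.)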
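To make the column pairing concrete, here is the same idea unrolled into an explicit loop. This is only an illustrative sketch: the toy data frame `joined` and its columns are hypothetical and simply mimic the shape of a join result with `.x`/`.y` suffixes.

```r
library(dplyr)

# Hypothetical join result: each shared column appears as a .x/.y pair.
joined <- data.frame(
  id       = c("a", "b"),
  colour.x = c("red", NA),
  colour.y = c(NA, "blue"),
  size.x   = c(1, 2),
  size.y   = c(NA, 5)
)

out <- joined
for (x_name in grep("\\.x$", names(joined), value = TRUE)) {
  # Derive the partner column name, e.g. "colour.x" -> "colour.y".
  y_name <- sub("\\.x$", ".y", x_name)
  # Prefer the .y value and fall back to .x, as in the pipeline above.
  out[[x_name]] <- coalesce(joined[[y_name]], joined[[x_name]])
}

# Drop the .y columns and strip the .x suffix from what remains.
out <- out[, !grepl("\\.y$", names(out)), drop = FALSE]
names(out) <- sub("\\.x$", "", names(out))
print(out)
```

The loop body corresponds to the expression inside across: `x_name` plays the role of cur_column(), and `joined[[y_name]]` plays the role of cur_data()[[ ... ]].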

Answer #2 (9cbw7uwe)

You could also try:

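library(dplyr)

# Tag each table with its year so rows can be ordered within a box.
Data2021$year <- 2021
Data2020$year <- 2020

final <- full_join(Data2021, Data2020) |>
  arrange(year) |>
  group_by(box.id) |>
  # If a column has any NA within a box, take the max of the non-NA values; otherwise take the last (2021) value.
  summarise(across(everything(), ~ifelse(any(is.na(.)), max(., na.rm=TRUE), last(.)))) |>
  # max() over an all-NA numeric column returns -Inf; turn that back into NA.
  mutate(box.age = na_if(box.age, -Inf))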
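A related sketch of the same stack-then-summarise idea that avoids max() and the -Inf cleanup: stack the 2021 rows first, then keep the first non-NA value per box and column. This is a swapped-in variant, not the code above, and the toy tables `d2021` and `d2020` are hypothetical.

```r
library(dplyr)

# Hypothetical miniature versions of the two yearly tables.
d2021 <- data.frame(id = c("a", "b"), material = c("wooden", NA))
d2020 <- data.frame(
  id       = c("a", "b", "c"),
  material = c("plastic", "plastic", "wooden"),
  age      = c(1, 2, 3)
)

combined <- bind_rows(d2021, d2020) %>%   # 2021 rows come first, so they win
  group_by(id) %>%
  # Keep the first non-NA value in each column; indexing an empty vector
  # with [1] yields NA, so all-NA groups stay NA.
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

print(combined)
```

Because priority comes from row order rather than from a year column, no helper column needs to be added or removed afterwards. Note that summarise returns the groups sorted by `id`, not in the original row order.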
