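I have one data frame of nest boxes for 2021 and one for 2020. The 2021 data frame has fewer fields, and some boxes have NA values in the 2021 data frame that are filled in for 2020. However, in some cases a box's information in the shared fields differs between the data frames (e.g. bf2 is "plastic" in 2020 and "wooden" in 2021). I want to treat everything filled in for 2021 as correct, and add the following to that data frame:
1. Boxes listed only in the 2020 data frame (rf3, rf4, rf6, rf7), with all of their corresponding fields filled in.
2. The columns that exist in 2020 but not in 2021 (box.age and land.water), appended to the boxes that are also listed in 2021 (bf1, bf2, rf1).
3. Any missing information that 2020 has but 2021 lacks (boxes.per.post for bf8; habitat.type and box.material for bf9).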
#Sample Code
box.id <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9")
boxes.per.post <- c("single", "single", "single", "single", "duplex", "single", NA, "duplex")
habitat.type <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", NA)
box.material <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA)
Data2021 <- data.frame(box.id, boxes.per.post, habitat.type, box.material)
box.id.1 <- c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9")
boxes.per.post.1 <- c("single", "single", "single", "single", "duplex", "single", "duplex", "single", "single")
habitat.type.1 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",
"Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond")
box.material <- c("wooden", "plastic", "plastic", "wooden","wooden","wooden","wooden", "plastic", "wooden")
box.age <- c(1,3,4,5,6,3,7,8, 12)
land.water <- c("land", "water", "land", "water", "water", "water", "water", "land", "water")
Data2020 <- data.frame(box.id.1, boxes.per.post.1, habitat.type.1, box.material, box.age, land.water)
colnames(Data2020) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")
#End Sample Code
I tried the following:
#Start attempts
library(dplyr) #needed for left_join() and full_join()
cbind(Data2021, Data2020) #error... different number of rows
Data.merged <- left_join(Data2021, Data2020, by = "box.id") #doesn't add rf3-rf7
Data.merged.1 <- left_join(Data2021, Data2020,
by = c("box.id", "boxes.per.post", "habitat.type", "box.material")) #only appends rows that match completely
Data.merged.2 <- full_join(Data2021, Data2020, by = "box.id")
#creates separate columns for matching fields (ie boxes.per.post.x and boxes.per.post.y)
Data.merged.3 <- merge(Data2021, Data2020, by = "box.id", all.x = TRUE, all.y = TRUE) #creates extra columns again
#End attempts
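The main problem I'm running into is that the information for individual boxes is inconsistent between the two data frames. I assume this takes several steps to do; I'm just struggling to put the necessary steps together.
Here is my expected result: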
box.id.2 <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9", "rf3", "rf4", "rf6", "rf7")
boxes.per.post.2 <- c("single", "single", "single", "single", "duplex", "single", "single", "duplex", "single", "duplex",
"single", "duplex")
habitat.type.2 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", "Pond",
"Ephemeral Wetland", "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek")
box.material.2 <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", "wooden", "wooden", "wooden",
"wooden","wooden")
box.age.2 <- c(1,3,NA, NA,4,NA,8,12,5,6,3,7)
land.water.2 <- c("land", "water", NA, NA, "land", NA, "land", "water", "water", "water", "water", "water")
Data.expected <- data.frame(box.id.2, boxes.per.post.2, habitat.type.2, box.material.2, box.age.2, land.water.2)
colnames(Data.expected) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")
2 Answers
#1
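Data.expected has a different number of rows than either of the two input sets, but I think what you need after the merge is a coalesce (on all of the duplicated columns). Try this: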
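A minimal sketch of this approach, assuming dplyr is installed. It is not necessarily the answerer's exact code: it spells out each coalesce() call instead of iterating with across(), and the names merged and result are illustrative. The sample data is repeated in compact form so the block runs on its own.

```r
library(dplyr)

# coalesce() on plain vectors: the first non-NA value at each position
coalesce(c(1, NA, NA), c(9, 2, NA), c(7, 8, 3))  # 1 2 3

# Sample data from the question, in compact form
Data2021 <- data.frame(
  box.id         = c("bf1", "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9"),
  boxes.per.post = c("single", "single", "single", "single", "duplex", "single", NA, "duplex"),
  habitat.type   = c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond",
                     "Pond", "Ephemeral Wetland", "Pond", NA),
  box.material   = c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA),
  stringsAsFactors = FALSE
)
Data2020 <- data.frame(
  box.id         = c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9"),
  boxes.per.post = c("single", "single", "single", "single", "duplex",
                     "single", "duplex", "single", "single"),
  habitat.type   = c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",
                     "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond"),
  box.material   = c("wooden", "plastic", "plastic", "wooden", "wooden",
                     "wooden", "wooden", "plastic", "wooden"),
  box.age        = c(1, 3, 4, 5, 6, 3, 7, 8, 12),
  land.water     = c("land", "water", "land", "water", "water", "water", "water", "land", "water"),
  stringsAsFactors = FALSE
)

# full_join keeps boxes that appear in either year. With suffix = c("", ".y"),
# the shared 2021 columns keep their names and the 2020 copies get ".y".
merged <- full_join(Data2021, Data2020, by = "box.id", suffix = c("", ".y"))

# Prefer the 2021 value; fall back to the 2020 value only where 2021 is NA.
result <- merged %>%
  mutate(
    boxes.per.post = coalesce(boxes.per.post, boxes.per.post.y),
    habitat.type   = coalesce(habitat.type, habitat.type.y),
    box.material   = coalesce(box.material, box.material.y)
  ) %>%
  select(-ends_with(".y"))  # drop the 2020 copies once they have been used

result
```

Listing Data2021 first in each coalesce() call is what makes the 2021 values win where the two years disagree (bf2 stays "wooden"), while the 2020 values fill only the gaps (boxes.per.post for bf8, habitat.type and box.material for bf9).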
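coalesce() is a function that returns the first non-NA value across its vector arguments. It is vectorized and accepts one or more vectors; at each position, it uses the value from the first vector that is not NA there.

This may need adjusting, depending on things I don't know:
1. full_join instead of left_join, which may account for the additional rows in Data.expected.
2. The order of the coalesce()d columns. As written, I prefer non-NA values from Data2021 (join columns named *.x) over those from Data2020 (.y), which is an arbitrary choice.

across() is a good way to apply the same function/process over multiple columns. Because it works on only one column at a time (the inner expression/code sees only that vector of values), we can use the special function cur_column() (which may return, for example, the string "land.water.x"), sub() to remove the trailing literal ".y", and cur_data()[[ ... ]] to retrieve the other column of the c("", ".y") pair. (There are certainly other tricks, including using .names="..." in across(), but I don't think they add *functionality* here.)

#2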
You can also try: