R: How to identify and remove duplicates based on all columns except one?

b5buobof  posted on 2023-11-14  in Other

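I have a data frame like the one below and want to find duplicates based on all columns except the date, using the excluded column (date) to decide which rows to delete (keeping only the most recent date). This should be done without losing any columns.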

   ID    Fn       Ln       date
1   1   Joe   Schmoe 2001-01-01
2   1   Joe   Schmoe 2010-01-01
3   6   Joe   Schmoe 2001-01-01
4   2 Stacy Fakename 2002-02-02
5   2 Stacy Fakename 2020-02-02
6   3 Craig  Collins 2030-03-03
7   3 Craig  Collins 2003-03-03
8   4   Leo     Fern 2040-04-04
9   4   Leo     Fern 2004-04-04
10  5 Penny  Diamond 2005-05-05
11  5 Penny  Diamond 2050-05-05

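So, for the three Joe Schmoe rows, the code should find that only two are duplicates: one row is disqualified because its ID differs, and the remaining two are identical except for the date (2001 vs. 2010).
Ultimately I want to keep the unique rows, such as Joe ID 6, together with the most recent copy of each duplicate (Joe ID 1, date 2010) in the same table, and delete the older copy (Joe ID 1, date 2001).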

Data

data <- data.frame(ID=c(1, 1, 6, 2, 2, 3, 3, 4, 4, 5, 5), 
                   Fn=c("Joe", "Joe", "Joe", "Stacy", "Stacy", "Craig", "Craig", "Leo", "Leo", "Penny", "Penny"), 
                   Ln=c("Schmoe", "Schmoe", "Schmoe", "Fakename", "Fakename", "Collins", "Collins", "Fern", "Fern", "Diamond", "Diamond"), 
                   date=c("2001-01-01", "2010-01-01", "2001-01-01", "2002-02-02", "2020-02-02", "2030-03-03", "2003-03-03", "2040-04-04", "2004-04-04", "2005-05-05", "2050-05-05")
)
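
To make the definition of "duplicate" concrete, here is a minimal sketch that only flags the rows in question without deleting anything. It assumes that a duplicate means "identical in every column except date"; the names key_cols, keys and in_dup_group are illustrative and not part of the original post.

```r
# Same data as above, repeated so the snippet runs on its own.
data <- data.frame(ID=c(1, 1, 6, 2, 2, 3, 3, 4, 4, 5, 5),
                   Fn=c("Joe", "Joe", "Joe", "Stacy", "Stacy", "Craig", "Craig", "Leo", "Leo", "Penny", "Penny"),
                   Ln=c("Schmoe", "Schmoe", "Schmoe", "Fakename", "Fakename", "Collins", "Collins", "Fern", "Fern", "Diamond", "Diamond"),
                   date=c("2001-01-01", "2010-01-01", "2001-01-01", "2002-02-02", "2020-02-02", "2030-03-03", "2003-03-03", "2040-04-04", "2004-04-04", "2005-05-05", "2050-05-05")
)

# Assumption: every column except `date` takes part in the comparison.
key_cols <- setdiff(names(data), "date")
keys <- data[key_cols]

# duplicated() marks only the later copies; combining it with fromLast = TRUE
# marks every member of a group that has more than one row.
in_dup_group <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Rows that share ID, Fn and Ln with at least one other row.
print(data[in_dup_group, ])
```

With this data, only the Joe row with ID 6 falls outside a duplicate group.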

kpbpu008

kpbpu0081#

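For each ID, create a vector giving the order of the underlying integer representation of the dates (negated, so that the most recent date comes first); this is easily done with ave. To make sure you actually have dates, use as.Date(date). Finally, simply subset for the rows that come first.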

> subset(data, ave(-as.integer(as.Date(date)), ID, FUN=order) == 1L)
   ID    Fn       Ln       date
2   1   Joe   Schmoe 2010-01-01
3   6   Joe   Schmoe 2001-01-01
5   2 Stacy Fakename 2020-02-02
6   3 Craig  Collins 2030-03-03
8   4   Leo     Fern 2040-04-04
11  5 Penny  Diamond 2050-05-05
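
One caveat about the call above: order() returns a permutation rather than ranks, so the == 1L test only coincides with "newest row" while each group has at most two rows, which is the case here. It also groups by ID alone, whereas the question asks for all columns except date. A minimal sketch of a variant that handles both, assuming base R only (the names key_cols, newest_first and result are illustrative, and this is a different technique from the ave call, not a restatement of it):

```r
# Same data as in the question, repeated so the snippet runs on its own.
data <- data.frame(ID=c(1, 1, 6, 2, 2, 3, 3, 4, 4, 5, 5),
                   Fn=c("Joe", "Joe", "Joe", "Stacy", "Stacy", "Craig", "Craig", "Leo", "Leo", "Penny", "Penny"),
                   Ln=c("Schmoe", "Schmoe", "Schmoe", "Fakename", "Fakename", "Collins", "Collins", "Fern", "Fern", "Diamond", "Diamond"),
                   date=c("2001-01-01", "2010-01-01", "2001-01-01", "2002-02-02", "2020-02-02", "2030-03-03", "2003-03-03", "2040-04-04", "2004-04-04", "2005-05-05", "2050-05-05")
)

# Assumption: a duplicate is a row that matches another in every column but `date`.
key_cols <- setdiff(names(data), "date")

# Sort newest first, so the first row seen for each key is the most recent one.
newest_first <- data[order(as.Date(data$date), decreasing = TRUE), ]

# Drop every row whose key columns were already seen further up.
result <- newest_first[!duplicated(newest_first[key_cols]), ]

# Restore the original row order (the default row names are the original positions).
result <- result[order(as.integer(rownames(result))), ]
print(result)
```

Because duplicated() keeps exactly one row per key, this also returns a single row when two copies share the same most recent date, and it works for groups of any size.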


Data:
> dput(data)
structure(list(ID = c(1, 1, 6, 2, 2, 3, 3, 4, 4, 5, 5), Fn = c("Joe", 
"Joe", "Joe", "Stacy", "Stacy", "Craig", "Craig", "Leo", "Leo", 
"Penny", "Penny"), Ln = c("Schmoe", "Schmoe", "Schmoe", "Fakename", 
"Fakename", "Collins", "Collins", "Fern", "Fern", "Diamond", 
"Diamond"), date = c("2001-01-01", "2010-01-01", "2001-01-01", 
"2002-02-02", "2020-02-02", "2030-03-03", "2003-03-03", "2040-04-04", 
"2004-04-04", "2005-05-05", "2050-05-05")), class = "data.frame", row.names = c(NA, 
-11L))
