Keeping track of repeated matches in a new column in R

cetgtptt  posted on 2023-03-20  in Other

I have a dataset with more than 200 columns. I am interested in strings that are repeated, based on two columns: Title and Date.
Input:

Title                       Date        Country            
This is                     2007           Afghanistan        
a test to                   1999           Switzerland        
find country                2004           Sweden            
names that share            1987           Algeria            
certain titles.             2004           Afghanistan        
a test to                   1999           Albania            
a test to                   1999           Afghanistan        
names that share            1987           Afghanistan        
a test to                   2010           Algeria

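Specifically, whenever Title AND Date in row i match those in row j, I want to record all unique values that appear in the third column, Country.
For example, consider Title == "a test to" & Date == 1999. The following countries appear in the rows across all matches: Albania, Afghanistan, Switzerland. So for every matching row, we want to enter Albania, Afghanistan, Switzerland (and not Algeria, etc.).
Expected output: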

Title                       Date        Country            Total_Countries
This is                     2007           Afghanistan        Afghanistan
a test to                   1999           Switzerland        Albania, Afghanistan, Switzerland
find country                2004           Sweden             Sweden
names that share            1987           Algeria            Algeria, Afghanistan
certain titles.             2004           Afghanistan        Afghanistan
a test to                   1999           Albania            Albania, Afghanistan, Switzerland
a test to                   1999           Afghanistan        Albania, Afghanistan, Switzerland
names that share            1987           Afghanistan        Algeria, Afghanistan
a test to                   2010           Algeria            Algeria

This seems to me to be a more complicated version of this question: Create new column with matched values in R
My initial solution was

x <- aggregate(Country ~ Title + Date, df, FUN = paste, collapse=", ")
names(x)[3] <- "Total_Countries"
df <- merge(df, x, all.x=T)

But this does not produce the correct result. Any suggestions would be appreciated.


uplii1fm  #1

You can use mutate after grouping (here, grouping is done through the .by argument):

library(dplyr) # version >= 1.1.0
df %>% 
   mutate(Total_Countries = toString(sort(unique(Country))),
    .by = c(Title, Date))
Output:

             Title Date     Country                   Total_Countries
1          This is 2007 Afghanistan                       Afghanistan
2        a test to 1999 Switzerland Afghanistan, Albania, Switzerland
3     find country 2004      Sweden                            Sweden
4 names that share 1987     Algeria              Afghanistan, Algeria
5  certain titles. 2004 Afghanistan                       Afghanistan
6        a test to 1999     Albania Afghanistan, Albania, Switzerland
7        a test to 1999 Afghanistan Afghanistan, Albania, Switzerland
8 names that share 1987 Afghanistan              Afghanistan, Algeria
9        a test to 2010     Algeria                           Algeria

Data:

df <- structure(list(Title = c("This is", "a test to", "find country", 
"names that share", "certain titles.", "a test to", "a test to", 
"names that share", "a test to"), Date = c(2007L, 1999L, 2004L, 
1987L, 2004L, 1999L, 1999L, 1987L, 2010L), Country = c("Afghanistan", 
"Switzerland", "Sweden", "Algeria", "Afghanistan", "Albania", 
"Afghanistan", "Afghanistan", "Algeria")), class = "data.frame", row.names = c(NA, 
-9L))

rt4zxlrg  #2

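Note that your result is correct. Only the row order differs, and the sort argument of merge does not help much. To keep the original order, you can do the following: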

x <- aggregate(cbind(Total_Countries = Country)~., unique(df), toString)
by_cols <- c("Title", "Date")
index <- match(interaction(df[by_cols]), interaction(x[by_cols]))
data.frame(df, x[index, setdiff(names(x), by_cols), drop = FALSE], row.names = NULL)

             Title Date     Country                   Total_Countries
1          This is 2007 Afghanistan                       Afghanistan
2        a test to 1999 Switzerland Switzerland, Albania, Afghanistan
3     find country 2004      Sweden                            Sweden
4 names that share 1987     Algeria              Algeria, Afghanistan
5  certain titles. 2004 Afghanistan                       Afghanistan
6        a test to 1999     Albania Switzerland, Albania, Afghanistan
7        a test to 1999 Afghanistan Switzerland, Albania, Afghanistan
8 names that share 1987 Afghanistan              Algeria, Afghanistan
9        a test to 2010     Algeria                           Algeria
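
As a base R alternative to the aggregate/match steps above, the per-group string can also be written straight back onto each row with ave(), which keeps the original row order because nothing is merged. This is only a minimal sketch: collapse_unique is a hypothetical helper name introduced here, and the data frame is rebuilt from the question's sample data.

```r
# Sample data from the question
df <- data.frame(
  Title = c("This is", "a test to", "find country", "names that share",
            "certain titles.", "a test to", "a test to", "names that share",
            "a test to"),
  Date = c(2007L, 1999L, 2004L, 1987L, 2004L, 1999L, 1999L, 1987L, 2010L),
  Country = c("Afghanistan", "Switzerland", "Sweden", "Algeria", "Afghanistan",
              "Albania", "Afghanistan", "Afghanistan", "Algeria"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique values of a group, joined by ", "
collapse_unique <- function(x) toString(unique(x))

# ave() applies the helper within each Title/Date group and recycles the
# single resulting string to every row of that group, in the original order
df$Total_Countries <- ave(df$Country, df$Title, df$Date, FUN = collapse_unique)

df
```

Within each group the countries appear in order of first occurrence; wrap unique(x) in sort() if an alphabetical order is preferred.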

jdzmm42g  #3

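Assuming you have the columns Title, Date, and Country, I believe from your question that you are trying to create the column Total_Countries.
To do that, you can use group_by and mutate from the tidyverse: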

library(tidyverse)
your_df %>%
  group_by(Title, Date) %>%
  mutate(Total_Countries = toString(unique(Country))) %>%
  ungroup()

93ze6v8z  #4

library(tidyverse)
df <- structure(list(Title = c("This is", "a test to", "find country", 
                               "names that share", "certain titles.", "a test to", "a test to", 
                               "names that share", "a test to"), Date = c(2007L, 1999L, 2004L, 1987L, 2004L, 1999L, 1999L, 1987L, 2010L), Country = c("Afghanistan", "Switzerland", "Sweden", "Algeria", "Afghanistan", "Albania", "Afghanistan", "Afghanistan", "Algeria")), class = "data.frame", row.names = c(NA, -9L))
df %>% 
  group_by(Title, Date) %>% 
  summarize(Total_Countries = paste(Country, collapse = ", "))

#> `summarise()` has grouped output by 'Title'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups:   Title [5]
#>   Title             Date Total_Countries                  
#>   <chr>            <int> <chr>                            
#> 1 This is           2007 Afghanistan                      
#> 2 a test to         1999 Switzerland, Albania, Afghanistan
#> 3 a test to         2010 Algeria                          
#> 4 certain titles.   2004 Afghanistan                      
#> 5 find country      2004 Sweden                           
#> 6 names that share  1987 Algeria, Afghanistan
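
The summarise() call above returns one row per Title/Date group (6 rows), whereas the expected output in the question keeps all 9 original rows. One possible way to bridge the two, sketched here under the assumption that dplyr is installed, is to join the summary back onto the original data. The names totals and result are illustrative only.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Title = c("This is", "a test to", "find country", "names that share",
            "certain titles.", "a test to", "a test to", "names that share",
            "a test to"),
  Date = c(2007L, 1999L, 2004L, 1987L, 2004L, 1999L, 1999L, 1987L, 2010L),
  Country = c("Afghanistan", "Switzerland", "Sweden", "Algeria", "Afghanistan",
              "Albania", "Afghanistan", "Afghanistan", "Algeria"),
  stringsAsFactors = FALSE
)

# One row per Title/Date group; unique() guards against repeated countries,
# and .groups = "drop" returns an ungrouped result without the message
totals <- df %>%
  group_by(Title, Date) %>%
  summarise(Total_Countries = toString(unique(Country)), .groups = "drop")

# left_join() keeps every row of df in its original order and attaches
# the matching group summary to each of them
result <- left_join(df, totals, by = c("Title", "Date"))

result
```

Compared with mutate(), this takes an extra step, but the intermediate totals table can be inspected or reused on its own.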
