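R: How to identify groups of connected values across two columns, where each column contains repeated values that match multiple rows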

jgwigjjp, posted 2023-09-27 in Other

Take the following data. I want to add a column indicating which group of connected values each row belongs to.

library(tidyverse)
df <- tibble(
  fruit = c("apple", "apple", "apple", "pear", "pear",
            "banana", "banana", "peach", "cherry"),
  name  = c("joe", "sally", "steve", "pete", "kate",
            "george", "alex", "alex", "alex")
)
df
# A tibble: 9 × 2
  fruit  name  
  <chr>  <chr> 
1 apple  joe   
2 apple  sally 
3 apple  steve 
4 pear   pete  
5 pear   kate  
6 banana george
7 banana alex  
8 peach  alex  
9 cherry alex

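This is the kind of output I want. Groups 1 and 2 are simple: their rows are connected only through a shared fruit value.
Group 3 is more complicated. george is connected to banana. banana is connected to alex, who is also connected to peach and cherry. So group 3 contains george, alex, banana, peach and cherry.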

# A tibble: 9 × 3
  fruit  name   group 
  <chr>  <chr>  <chr> 
1 apple  joe    group1
2 apple  sally  group1
3 apple  steve  group1
4 pear   pete   group2
5 pear   kate   group2
6 banana george group3
7 banana alex   group3
8 peach  alex   group3
9 cherry alex   group3
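
The chaining rule described above (rows end up in the same group if they share a fruit or a name, directly or through other rows) can be made concrete without a graph library. The following is only a minimal sketch in base R, assuming a plain data.frame copy of the data and a hypothetical working vector `id`; it repeatedly spreads the smallest label across rows that share a value until nothing changes.

```r
df <- data.frame(
  fruit = c("apple", "apple", "apple", "pear", "pear",
            "banana", "banana", "peach", "cherry"),
  name  = c("joe", "sally", "steve", "pete", "kate",
            "george", "alex", "alex", "alex")
)

# Start with one label per row (hypothetical helper vector `id`).
id <- seq_len(nrow(df))

repeat {
  # Rows sharing a fruit take the smallest label among them ...
  new_id <- ave(id, df$fruit, FUN = min)
  # ... and so do rows sharing a name.
  new_id <- ave(new_id, df$name, FUN = min)
  # Stop once a full pass changes nothing.
  if (identical(new_id, id)) break
  id <- new_id
}

# Renumber the surviving labels 1, 2, 3, ... in order of first appearance.
df$group <- paste0("group", match(id, unique(id)))
df
```

This is fine for small tables; for large data a graph library's connected-components routine is likely to be the better choice.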

Essentially, the group field needs to hold a common ID for all values that are connected in a network graph such as this one:

library(ggraph)  # provides ggraph() and the geom_edge_* / geom_node_* layers

tidygraph::as_tbl_graph(df) %>%
  ggraph(layout = "tree") +
  geom_edge_link() + 
  geom_node_point() +
  geom_node_label(aes(label = name))

Answer #1 (zy1mlcev)

You can try components() from igraph:

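library(igraph)
df %>%
    mutate(group = paste0("group", {
        # build a graph with one edge per row (fruit -> name)
        graph_from_data_frame(.) %>%
            components() %>%
            # named vector: vertex name -> component number
            membership() %>%
            # look up each row's component by its fruit
            `[`(fruit)
    }))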

This gives

# A tibble: 9 × 3
  fruit  name   group
  <chr>  <chr>  <chr>
1 apple  joe    group1
2 apple  sally  group1
3 apple  steve  group1
4 pear   pete   group2
5 pear   kate   group2
6 banana george group3
7 banana alex   group3
8 peach  alex   group3
9 cherry alex   group3
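
The pipeline above packs several steps into one expression. As a rough sketch of the same idea unpacked step by step, assuming a plain data.frame copy of the question's data and the hypothetical intermediate names `g` and `comp`:

```r
library(igraph)

df <- data.frame(
  fruit = c("apple", "apple", "apple", "pear", "pear",
            "banana", "banana", "peach", "cherry"),
  name  = c("joe", "sally", "steve", "pete", "kate",
            "george", "alex", "alex", "alex")
)

# Each row becomes one edge between a fruit and a name;
# the vertices are the unique values found in the two columns.
g <- graph_from_data_frame(df, directed = FALSE)

# Named vector mapping every vertex (fruit or name) to its component number.
comp <- components(g)$membership

# A row's fruit and name are always in the same component,
# so looking up either column gives the row's group.
df$group <- paste0("group", comp[df$fruit])
df
```

One thing to watch: both columns share a single vertex namespace, so a value that occurs in both columns (for example, a person named "cherry") would be treated as one vertex and could merge groups unintentionally. If that can happen in your data, prefixing one of the columns before building the graph is one possible workaround.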
