Finding specific combinations across columns in R

k2fxgqgv  posted on 2023-03-27  in Other

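I have an R data frame containing specific combinations of characters (df1 below) and another data frame with a list of sampleIDs and whether each sample has each character (df2). For each row in df1, I am trying to add every sampleID from df2 that contains all of that row's entries. A sampleID should be recorded whenever it has all the entries of a given combination; it does not matter if it has additional entries as well. The same sampleID can therefore appear in multiple rows of df1. (In the real dataset, not every row of df1 has exactly 3 entries.)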

df1 <- data.frame(entry1 = c("A","B","C"),
                  entry2 = c("D","E","F"),
                  entry3 = c("G","H","I"))

df2 <- data.frame(sampleID = c("1001","1002","1003","1004","1005"),
                  "A" = c("A","0","0","A","A"),
                  "B" = c("B","B","B","0","0"),
                  "C" = c("0","0","0","C","C"),
                  "D" = c("D","0","D","0","0"),
                  "E" = c("E","E","0","0","0"),
                  "F" = c("0","0","0","F","F"),
                  "G" = c("G","0","0","G","0"),
                  "H" = c("H","H","H","H","0"),
                  "I" = c("0","0","I","0","0"))

The desired output looks like this:

df1.2 <- data.frame(entry1 = c("A","B","C"),
                    entry2 = c("D","E","F"),
                    entry3 = c("G","H","I"),
                    sampleID.1 = c("1001","1001",""),
                    sampleID.2 = c("","1002",""))

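so that row/combination 1 is fulfilled by sampleID 1001, row/combination 2 by sampleIDs 1001 and 1002, and no sampleID has combination 3.
I tried a for loop that iterates over the rows of df2, but I could not get it to add the sampleIDs to df1 correctly. There may be a better strategy. I am also open to reshaping the data frames.
Thanks for any suggestions.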


nwlls2ji1#

One approach is to

  • cross_join the data
  • check the condition
  • pivot_wider to create the sampleID table
  • then finish by cleaning up the names with rename_with

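library(dplyr)  # cross_join() and .by need dplyr >= 1.1.0
library(tidyr)

# Pair every combination (df1 row) with every sample (df2 row)
cross_join(df1, df2) %>% 
  rowwise() %>% 
  # res is TRUE when all entries of the combination occur in the sample's row
  mutate(res = all(across(starts_with("entry")) %in% across(A:I))) %>% 
  select(1:4, res) %>% 
  ungroup() %>% 
  # Keep the matches; a combination without any match keeps one placeholder row
  filter(res | (row_number() == 1 & !any(res)), 
    .by = c(entry1, entry2, entry3)) %>% 
  pivot_wider(names_from = sampleID, values_from = sampleID,
    names_glue = "{.value}_{.name}", values_fill = "") %>% 
  rowwise() %>% 
  # Blank out the sampleID carried by placeholder rows
  mutate(replace(across(starts_with("sample")), !res, "")) %>% 
  # sampleID_1001 -> sampleID.1 (relies on the IDs starting with "100")
  rename_with(function(x) sub("_100", ".", x)) %>% 
  ungroup() %>% 
  select(-res)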
# A tibble: 3 × 5
  entry1 entry2 entry3 sampleID.1 sampleID.2
  <chr>  <chr>  <chr>  <chr>      <chr>     
1 A      D      G      "1001"     ""        
2 B      E      H      "1001"     "1002"    
3 C      F      I      ""         ""
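
The question notes that real combinations do not always have exactly 3 entries, and the rename step above depends on the IDs starting with "100". As a rough base R sketch of the same matching idea (not the answerer's method), the version below numbers the sampleID columns by position and skips missing entries. It assumes that every entry in df1 is also a column name in df2, that a cell in df2 marks presence by holding its own column's letter, and that shorter combinations are padded with NA or "". The names present, matches, ids and result are made up for this illustration.

```r
# Data from the question
df1 <- data.frame(entry1 = c("A", "B", "C"),
                  entry2 = c("D", "E", "F"),
                  entry3 = c("G", "H", "I"))
df2 <- data.frame(sampleID = c("1001", "1002", "1003", "1004", "1005"),
                  A = c("A", "0", "0", "A", "A"),
                  B = c("B", "B", "B", "0", "0"),
                  C = c("0", "0", "0", "C", "C"),
                  D = c("D", "0", "D", "0", "0"),
                  E = c("E", "E", "0", "0", "0"),
                  F = c("0", "0", "0", "F", "F"),
                  G = c("G", "0", "0", "G", "0"),
                  H = c("H", "H", "H", "H", "0"),
                  I = c("0", "0", "I", "0", "0"))

# Logical presence matrix: a cell is TRUE when it holds its own column's letter
m <- as.matrix(df2[, -1])
present <- t(t(m) == colnames(m))
rownames(present) <- df2$sampleID

# For each combination, collect the sampleIDs that carry every non-missing entry
matches <- lapply(seq_len(nrow(df1)), function(i) {
  entries <- unlist(df1[i, ], use.names = FALSE)
  entries <- entries[!is.na(entries) & entries != ""]  # assumed padding values
  ok <- rowSums(present[, entries, drop = FALSE]) == length(entries)
  names(ok)[ok]
})

# Pad each match vector with "" to a common width and bind it onto df1
width <- max(lengths(matches), 1)
ids <- do.call(rbind, lapply(matches, function(x) {
  c(x, rep("", width - length(x)))
}))
colnames(ids) <- paste0("sampleID.", seq_len(width))
result <- cbind(df1, ids)
result
```

Because the columns are numbered by position rather than derived from the ID text, a match on, say, sample 1004 would land in the next free sampleID column instead of producing a column named after the ID.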

vtwuwzda2#

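If you are interested in using a loop, you could try the following.
You can construct a nested loop that iterates over the rows of both data frames and stores the row number and sampleID in a list whenever all values of a df1 row are found in a df2 row.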

lst <- list()
ctr <- 1

for (i in seq_len(nrow(df1))) {
  for (j in seq_len(nrow(df2))) {
    # Record the pair only when every entry of df1 row i occurs in df2 row j
    if (all(df1[i, ] %in% df2[j, names(df2) != "sampleID"])) {
      lst[[ctr]] <- list(i, df2[j, "sampleID"])
      ctr <- ctr + 1
    }
  }
}

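You can then convert the list into a matrix with 2 columns, holding the matching row number rn and the sampleID. Use pivot_wider to reshape it to wide format, then join it back onto the original df1 data.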

library(tidyverse)

data.frame(matrix(unlist(lst), ncol = 2, byrow = TRUE, dimnames = list(NULL, c("rn", "sampleID")))) %>%
  mutate(x = row_number(), .by = "rn") %>%
  pivot_wider(id_cols = rn, values_from = sampleID, names_from = x, names_prefix = "sampleID.") %>%
  right_join(mutate(df1, rn = as.character(seq.int(nrow(df1)))), by = "rn") %>%
  relocate(starts_with("sampleID"), .after = last_col())

Output

  rn    entry1 entry2 entry3 sampleID.1 sampleID.2
  <chr> <chr>  <chr>  <chr>  <chr>      <chr>     
1 1     A      D      G      1001       NA        
2 2     B      E      H      1001       1002      
3 3     C      F      I      NA         NA
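
This result has NA where the question's desired output has empty strings, plus a helper column rn. A small base R sketch of one way to close that gap is given below. The data frame res is a hand-built stand-in for the printed result (a plain data frame, not the tibble from the pipeline), and id_cols is a name chosen only for this illustration.

```r
# Stand-in for the joined result printed above
res <- data.frame(rn = c("1", "2", "3"),
                  entry1 = c("A", "B", "C"),
                  entry2 = c("D", "E", "F"),
                  entry3 = c("G", "H", "I"),
                  sampleID.1 = c("1001", "1001", NA),
                  sampleID.2 = c(NA, "1002", NA))

# Replace NA with "" in the sampleID columns only
id_cols <- startsWith(names(res), "sampleID")
res[id_cols][is.na(res[id_cols])] <- ""

# Drop the helper row-number column
res$rn <- NULL
res
```

Restricting the replacement to the sampleID columns leaves any genuine NA values in the entry columns untouched.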
