How to remove strings that are detected/contained within other strings, but only within the groups specified by group_by()

7gcisfzg posted on 2023-05-26 in Other

Suppose I have:

> w
   digest    gene          seq
1     InS  AB0583          AAB
2     InS  AB0583        AABKR
3     InS  AB0583      GFHGHGG
4     PAC PU83022          EUT
5     PAC PU83022      HSFSFJF
6     PAC PU83022        EUTCK
7     PAC PU83022       EUTCKJ
8     InS PO93853         HDGJ
9     InS PO93853        HDGJU
10    InS PO93853       YTYEYD
11    InS PO93853 YTYEYDJHSGSG
12    InS PO93853   SALGHAGGEE

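I applied two different methods to identify proteins (encoded by their gene names in w$gene). The methods are encoded in w$digest. As you can see, within each w$gene inside each w$digest there can be overlapping sequences in w$seq: for example, EUT is also contained in EUTCK, and EUTCK in EUTCKJ.
I want to know how many unique amino acids (each letter in w$seq) were identified. To do that, I need to remove every string that is detected within another string, but only within each group defined by group_by(digest, gene), keeping the string with the most characters.
I am looking for a solution in the tidyverse.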

Help needed:

(1) Count the number of characters and arrange the rows as follows:

w <- w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count)

which gives

> w
   digest    gene          seq count
1     InS  AB0583          AAB     3
2     InS  AB0583        AABKR     5
3     InS  AB0583      GFHGHGG     7
4     InS PO93853         HDGJ     4
5     InS PO93853        HDGJU     5
6     InS PO93853       YTYEYD     6

(2) group_by(digest, gene), then remove every row whose w$seq is detected within another w$seq of the same group, keeping the row whose w$seq has the most characters.

Output, with the overlapping sequences marked:
> w
   digest    gene          seq count
1     InS  AB0583          AAB     3 #* found within:
2     InS  AB0583        AABKR     5 #*
3     InS  AB0583      GFHGHGG     7
4     InS PO93853         HDGJ     4 #** found within:
5     InS PO93853        HDGJU     5 #**
6     InS PO93853       YTYEYD     6 #***
7     InS PO93853   SALGHAGGEE    10
8     InS PO93853 YTYEYDJHSGSG    12 #***
9     PAC PU83022          EUT     3 #****
10    PAC PU83022        EUTCK     5 #****
11    PAC PU83022       EUTCKJ     6 #****
12    PAC PU83022      HSFSFJF     7

So the expected output is:

> w
   digest    gene          seq count
1     InS  AB0583        AABKR     5 
2     InS  AB0583      GFHGHGG     7
3     InS PO93853        HDGJU     5 
4     InS PO93853   SALGHAGGEE    10
5     InS PO93853 YTYEYDJHSGSG    12 
6     PAC PU83022       EUTCKJ     6 
7     PAC PU83022      HSFSFJF     7

Data:
w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)
Answer 1 (ibrsph3r)

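For each group defined by group_by(), you can create a new list column in which every row holds all the seq values of that group. You can then run a rowwise operation that counts, for each seq, how many values in the group contain it. A sequence always matches itself, so keeping the rows whose count is exactly one gives the result you want.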

library(dplyr)
library(stringr)
w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w <- w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) 

w %>% group_by(digest, gene) %>%
  mutate(all_vals = list(seq)) %>% 
  rowwise() %>% 
  mutate(win = sum(grepl(seq, all_vals))) %>% 
  filter(win == 1) %>% 
  dplyr::select(-c(win, all_vals))
#> # A tibble: 7 × 4
#> # Rowwise:  digest, gene
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

Created on 2023-05-25 with reprex v2.0.2
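
To make the counting step concrete, here is a minimal base R sketch on a single group from the question (digest PAC, gene PU83022). The variable names `seqs` and `hits` are only illustrative, and `fixed = TRUE` is an assumption that is not part of the answer above: it makes grepl() match the sequence as literal text rather than as a regular expression.

```r
# Sequences of one group (digest == "PAC", gene == "PU83022") from the question
seqs <- c("EUT", "EUTCK", "EUTCKJ", "HSFSFJF")

# For each sequence, count how many sequences in the group contain it.
# fixed = TRUE (an assumption, not in the answer) matches literal text.
hits <- vapply(seqs, function(s) sum(grepl(s, seqs, fixed = TRUE)), integer(1))
hits
#>     EUT   EUTCK  EUTCKJ HSFSFJF
#>       3       2       1       1

# A sequence that matches only itself is not contained in any other one
names(hits)[hits == 1]
#> [1] "EUTCKJ"  "HSFSFJF"
```

One caveat of this counting rule: if a group contains the same sequence twice, both copies get a count of two and both are dropped, so removing exact duplicates first (for example with distinct()) may be worth considering.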

Answer 2 (mutmk8jj)

Edit:

Here is one possible solution:

library(tidyverse)

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) %>%
  group_by(digest, gene) %>%
  filter(str_count(paste0(seq, collapse = "_"), seq) == 1)
#> # A tibble: 7 × 4
#> # Groups:   digest, gene [3]
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

Created on 2023-05-25 with reprex v2.0.2
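
As a small illustration of why this filter works, the sketch below reproduces what happens inside one group (digest PAC, gene PU83022). The names `seqs` and `joined` are only illustrative. str_count() is vectorised over the pattern, so the single joined string is searched once per sequence.

```r
library(stringr)

# Sequences of one group from the question
seqs <- c("EUT", "EUTCK", "EUTCKJ", "HSFSFJF")

# Join the whole group into one string, as paste0(seq, collapse = "_") does
joined <- paste0(seqs, collapse = "_")
joined
#> [1] "EUT_EUTCK_EUTCKJ_HSFSFJF"

# Count how often each sequence occurs in the joined string
str_count(joined, seqs)
#> [1] 3 2 1 1
```

Only the sequences with a count of one survive the filter. The "_" separator matters here: it does not occur in the sequences, so a match can never span the boundary between two neighbouring sequences.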

Original answer:

This is a bit clumsy, but it "works":

library(tidyverse)

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) %>%
  group_by(digest, gene) %>%
  mutate(strings = paste0(seq, collapse = "|")) %>%
  rowwise() %>%
  mutate(strings = gsub(paste0("\\b", seq, "\\b"), "", strings)) %>%
  filter(!grepl(seq, strings)) %>%
  select(-strings) %>%
  ungroup()
#> # A tibble: 7 × 4
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

Created on 2023-05-25 with reprex v2.0.2

Answer 3 (5jdjgkvh)

Using base R:

# add count and sort
w$count <- nchar(w$seq)
w <- w[ with(w, order(digest, gene, count)), ]

# subset
w[ sapply(Map(grepl, w$seq, ave(w$seq, w[ 1:2 ], FUN = list)), sum) == 1, ]
#    digest    gene          seq count
# 2     InS  AB0583        AABKR     5
# 3     InS  AB0583      GFHGHGG     7
# 9     InS PO93853        HDGJU     5
# 12    InS PO93853   SALGHAGGEE    10
# 11    InS PO93853 YTYEYDJHSGSG    12
# 7     PAC PU83022       EUTCKJ     6
# 5     PAC PU83022      HSFSFJF     7
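
The same idea can also be wrapped in a small helper, sketched below under a few assumptions. `is_maximal()` is a hypothetical name, not a function from any package; it uses literal matching (`fixed = TRUE`) and compares each sequence only against the other sequences of its group. Exact duplicates within a group are removed first, which is an addition of this sketch and not part of the answers above.

```r
# Hypothetical helper: TRUE for each element of x that is not contained
# in any *other* element of x (literal matching, no regular expressions)
is_maximal <- function(x) {
  vapply(seq_along(x), function(i) {
    !any(grepl(x[i], x[-i], fixed = TRUE))
  }, logical(1))
}

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

# Add the count and sort, as in the answers above
w$count <- nchar(w$seq)
w <- w[order(w$digest, w$gene, w$count), ]

# Assumption: drop exact duplicates within a group so one copy survives
w <- w[!duplicated(w[c("digest", "gene", "seq")]), ]

# Apply the helper group by group, using row indices per (digest, gene)
keep <- logical(nrow(w))
groups <- split(seq_len(nrow(w)), list(w$digest, w$gene), drop = TRUE)
for (idx in groups) {
  keep[idx] <- is_maximal(w$seq[idx])
}

result <- w[keep, ]
result$seq  # the same seven sequences as in the outputs above
```

Comparing against `x[-i]` instead of counting matches against the whole group avoids relying on every sequence matching itself exactly once, at the cost of a slightly longer helper.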
