R语言具有字母注解的序列的“位置感知”比对

bcs8qyzn 于 2023-05-26 发布在其他

关注(0)|答案(3)|浏览(167)

我们有两个DNA序列（字符串）：

>1
ATGCAT
135198
>2
ATCAT

预期输出：首先，我们需要对齐这两个字符串，然后通过索引获得相关的注解：

ATGCAT
AT-CAT
13-198

第一部分可以使用 * Biosrings * 包完成：

library(Biostrings)

p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"

x <- pairwiseAlignment(pattern = p, subject = s)

aligned(x)
# A DNAStringSet instance of length 1
#     width seq
# [1]     6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "T"  "-"  "C"  "A"  "T"

第2部分的当前解决方法：

annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"

工作正常，但我想知道是否有一种 Biostrings 的方法（或任何其他包），也许通过将注解保留在 metadata 插槽中，然后在对齐后，我们在 metadata 中获得匹配碱基的匹配注解，如下所示：

getSlots("DNAString")
#      shared              offset              length     elementMetadata            metadata 
# "SharedRaw"           "integer"           "integer" "DataTable_OR_NULL"              "list" 

# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"

注：

BioStars - https://www.biostars.org/p/415896/
作者想要 biopython 解决方案，所以如果可能的话，也添加标签post python 解决方案。

来源：https://stackoverflow.com/questions/59682107/position-aware-aligning-of-sequences-with-letter-annotations

3条答案

按热度按时间

lp0sw83n1#

可能的解决方案：

dna_fun <- function(s, p, a) {
  s <- strsplit(s, "")[[1]]
  p <- strsplit(p, "")[[1]]
  a <- strsplit(a, "")[[1]]
  ls <- length(s)
  lp <- length(p)

  r <- lapply(c(1,seq(lp)), function(x) {
    v <- rep(1, 5)
    v[x] <- 2
    v
  })

  mat <- sapply(r, rep, x = p)
  tfm <- mat == matrix(rep(s, ls), ncol = ls)
  m <- which.max(colSums(tfm))

  p2 <- mat[, m]
  p2[!tfm[,m]] <- "-"

  a[!tfm[,m]] <- "-"

  p2 <- paste(p2, collapse = "")
  a <- paste(a, collapse = "")

  return(list(p2, a))
}

其中：

dna_fun(s1, s2, annot)

你会得到：

[[1]]
[1] "AT-CAT"

[[2]]
[1] "13-198"

如果你有对应的向量，你可以使用Map和dna_fun-函数：

s11 <- c("ATGCAT","ATCGAT")
s22 <- c("ATCAT","ATCAT")
annot2 <- c("135198","145892")

lm <- Map(dna_fun, s11, s22, annot2)

data.table::rbindlist(lm, idcol = "dna")

这给出：

dna     V1     V2
1: ATGCAT AT-CAT 13-198
2: ATCGAT ATC-AT 145-92

数据：

s1 <- "ATGCAT"
s2 <- "ATCAT"
annot <- "135198"

赞(0）回复(0）举报 2023-05-26

gfttwv5a2#

根据要求，Biopython解决方案：

from Bio import Align

p = "ATCAT"
s = "ATGCAT"
s_annot = "135198"

aligner = Align.PairwiseAligner()
alignment = str(aligner.align(p, s)[0]).split()
middle = alignment.pop(1)
alignment.append("".join(c if m == "|" else m for c, m in zip(s_annot, middle)))

print("\n".join(alignment))

输出：

AT-CAT
ATGCAT
13-198

赞(0）回复(0）举报 2023-05-26

g2ieeal73#

答案来自-https://www.biostars.org/p/415896/#9564298
这个问题又出现在我的收件箱里了，所以我想我会带着我想出的解决方案回来。
我相信我最终的解决方案是基于德文郡瑞安的一个想法。
它被嵌入到一个更复杂的代码库中，但总体思路是这样的：
1.正常创建序列的成对比对。
1.迭代比对并创建一种捕获各种序列“移动”的CIGAR字符串。
1.将这组相同的移动应用/Map到对应于每个位置的注解列表，从而重复这组操作
代码：
https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103

赞(0）回复(0）举报 2023-05-26

我来回答

R语言具有字母注解的序列的“位置感知”比对

3条答案

相关问题

热门标签

最新问答

R语言 具有字母注解的序列的“位置感知”比对

3条答案

相关问题

热门标签

最新问答

R语言具有字母注解的序列的“位置感知”比对