I have two data frames containing short (length == 20) sequences that I want to compare with string-distance techniques, returning the pairs that are highly similar, i.e. with a Hamming distance of at most 3 (no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() does the job nicely, but it cannot handle the number of sequences I want to compare (tens of thousands to hundreds of thousands per data frame) unless I chunk through the query sequences. When my data frames are on the larger side, that strategy starts to take a full day to run the code below.
Is there any way to speed this up and conserve memory, whether with fuzzyjoin, the stringdist package, or data.table? I have been experimenting with various approaches, but they all ended up running even slower.
library(tidyverse)
library(fuzzyjoin)

### simulate data ###
chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq),
                    seq = replicate(nq, sample(chars, 20, replace = T) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = T) %>% paste0(collapse = "")))

### return seqs with 3 or less mismatches ###
# NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query values to preserve memory
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100) / 100) + 1)

# get a variable of all groups
var.grps <- unique(query$grp)

# create an output list
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})
1 Answer
I figured it out: stringdist_join() uses stringdistmatrix() under the hood. Calling stringdistmatrix() directly and collecting the needed information from it is much faster. To get around the memory problem, I still chunk through the query sequences, starting from an initially empty results matrix.
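Below is a minimal sketch of that idea, with a few assumptions: it reuses the query and subject data frames simulated in the question, the chunk size of 1000 is an arbitrary value to tune against available memory, and purrr::map_df() is used to collect the per-chunk hits rather than filling a preallocated matrix. The output columns are named to mirror what stringdist_join() would return.

library(stringdist)
library(purrr)

chunk_size <- 1000                              # assumed value; tune to available memory
starts <- seq(1, nrow(query), by = chunk_size)

df_out <- map_df(starts, function(start) {
  end <- min(start + chunk_size - 1, nrow(query))
  q <- query[start:end, ]

  # Hamming distance matrix for this chunk: rows = query seqs, cols = subject seqs
  m <- stringdistmatrix(q$seq, subject$seq, method = "hamming")

  # keep only pairs with at most 3 mismatches
  hits <- which(m <= 3, arr.ind = TRUE)
  if (nrow(hits) == 0) return(NULL)

  data.frame(name.x   = q$name[hits[, "row"]],
             seq.x    = q$seq[hits[, "row"]],
             name.y   = subject$name[hits[, "col"]],
             seq.y    = subject$seq[hits[, "col"]],
             mismatch = m[hits])
})

Most of the speed-up comes from stringdistmatrix() computing all pairwise distances for a chunk in compiled (and, by default, multithreaded) code, and from only materialising the pairs that pass the <= 3 filter instead of building join output for every candidate pair.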