How to maximize the speed and memory efficiency of R fuzzyjoin/stringdist

jutyujz0  posted on 2024-01-03  in  Other

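I have 2 data frames containing short (length == 20) sequences that I want to compare using string-distance techniques, returning highly similar sequences with a Hamming distance of no more than 3 (i.e. no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() does this job well, but it cannot handle the number of sequences I want to compare (tens to hundreds of thousands in each data frame) unless I chunk the query sequences. When my data frames are on the larger side, this strategy starts taking a full day to run the code below.
Is there any way to use the fuzzyjoin or stringdist packages together with data.table to speed this up while conserving memory? I have tried various approaches, but they all ran slower.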

library(tidyverse)
library(fuzzyjoin)

### simulate data ###

chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq), 
                    seq = replicate(nq, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))

### return seqs with 3 or less mismatches ###

# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query values to preserve memory
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100)/100)+1)

# get a variable of all groups
var.grps <- unique(query$grp)

# run the join one chunk at a time and row-bind the results into one data frame
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})


oyxsuwqo  #1

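I figured it out: stringdist_join() uses stringdistmatrix() under the hood. Calling stringdistmatrix() directly and collecting the needed information from the result is much faster. To get around the memory problem, I process the query sequences in chunks and write each chunk into a pre-allocated matrix.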
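As a toy illustration of what stringdistmatrix() returns (the four-letter sequences below are made up for the example, not taken from the data above): the result has one row per element of the first argument and one column per element of the second, and each cell holds the Hamming distance for that pair.

```r
library(stringdist)

# Made-up toy sequences, all of equal length (Hamming distance requires this)
q <- c("ACGT", "AAAA")
s <- c("ACGA", "TTTT", "ACGT")

# Rows correspond to q, columns correspond to s
m <- stringdistmatrix(q, s, method = "hamming")
dim(m)   # 2 rows, 3 columns
m[1, ]   # distances from "ACGT": 1 3 0
m[2, ]   # distances from "AAAA": 2 4 3
```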

library(stringdist)

# build the stringdist matrix in chunks of query rows
chunk_size <- 1000
num_rows <- nrow(query)

# Initialize an empty matrix
sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))

# Loop through the rows in chunks
for (start_row in seq(1, num_rows, by = chunk_size)) {
  end_row <- min(start_row + chunk_size - 1, num_rows)
  
  # Subset the rows for the current chunk
  chunk_query <- query$seq[start_row:end_row]
  
  # Compute stringdist matrix for the current chunk
  chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
  
  # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
  sdm[start_row:end_row, ] <- chunk_sdm
}
rownames(sdm) <- query$name
colnames(sdm) <- subject$name

# find the indices where the values are 3 or less
indices <- which(sdm <= 3, arr.ind = TRUE)

# extract row names, col names, and values based on the indices
result <- data.frame(query = rownames(sdm)[indices[, 1]],
                     subject = colnames(sdm)[indices[, 2]],
                     mismatch = sdm[indices])
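
One possible refinement, offered as a sketch rather than as part of the approach above: the full matrix has nrow(query) x nrow(subject) cells, which for the sizes in the question (50,051 x 54,277) is about 2.7 billion doubles, or roughly 22 GB. Filtering each chunk right after it is computed and keeping only the matching pairs avoids holding that matrix at all. The small data sizes, the planted match, and the names max_dist, starts, and hits are assumptions made for this example.

```r
library(stringdist)

# Small simulated data so the sketch runs quickly (sizes are illustrative only)
set.seed(1)
chars <- c("A", "C", "G", "T")
sim_seqs <- function(n, len = 20) {
  replicate(n, paste0(sample(chars, len, replace = TRUE), collapse = ""))
}
query   <- data.frame(name = paste0("q", 1:300), seq = sim_seqs(300),
                      stringsAsFactors = FALSE)
subject <- data.frame(name = paste0("s", 1:400), seq = sim_seqs(400),
                      stringsAsFactors = FALSE)
subject$seq[1] <- query$seq[1]  # plant one exact match so the result is not empty

max_dist   <- 3
chunk_size <- 100
starts     <- seq(1, nrow(query), by = chunk_size)

# For each chunk: compute distances, keep only pairs within max_dist,
# then let the chunk matrix be discarded
hits <- lapply(starts, function(start_row) {
  end_row   <- min(start_row + chunk_size - 1, nrow(query))
  chunk_sdm <- stringdistmatrix(query$seq[start_row:end_row], subject$seq,
                                method = "hamming")
  idx <- which(chunk_sdm <= max_dist, arr.ind = TRUE)
  data.frame(query    = query$name[start_row + idx[, 1] - 1],  # chunk row -> global row
             subject  = subject$name[idx[, 2]],
             mismatch = chunk_sdm[idx],
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, hits)
```

With this layout, peak memory is governed by chunk_size x nrow(subject) rather than by the full matrix, so chunk_size can be tuned to the available RAM.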

