How to maximize R fuzzyjoin/stringdist speed and memory efficiency

Posted by jutyujz0 on 2024-01-03 in Other

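I have two data frames containing short sequences (length == 20) that I want to compare by string distance, returning highly similar sequences with a Hamming distance of at most 3 (i.e., no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() does this job well, but it cannot handle the number of sequences I want to compare (tens to hundreds of thousands per data frame) unless I chunk the query sequences. When my data frames are on the larger side, this strategy starts to take a full day to run the code below.
Is there a way to use the fuzzyjoin or stringdist packages together with data.table to speed this up while conserving memory? I have tried various approaches, but they all resulted in slower execution.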

library(tidyverse)
library(fuzzyjoin)

### simulate data ###

chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq), 
                    seq = replicate(nq, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))

### return seqs with 3 or less mismatches ###

# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query values to preserve memory
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100)/100)+1)

# get a variable of all groups
var.grps <- unique(query$grp)

# create an output list
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})
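
To make the matching criterion concrete, here is a minimal sketch of what "Hamming distance of at most 3" means for two 20-character sequences. The strings `a` and `b` are made up for illustration; `stringdist()` with `method = "hamming"` is the stringdist package function that computes the distance.

```r
library(stringdist)

# Hamming distance counts the positions at which two equal-length strings differ.
a <- "ACGTACGTACGTACGTACGT"
b <- "ACGTACGAACGTACGTACCT"  # differs from `a` at positions 8 and 19

d <- stringdist(a, b, method = "hamming")
d  # 2, so this pair would pass a max_dist = 3 filter

# Strings of unequal length have no Hamming distance; stringdist returns Inf.
d_unequal <- stringdist("ACGT", "ACG", method = "hamming")
```

Because all sequences here have the same length, every pair has a finite distance, so the filter reduces to counting substitutions.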

Tags: r

Source: https://stackoverflow.com/questions/77688123/how-to-maximize-r-fuzzyjoin-stringdist-speed-and-memory-efficiency

1 Answer

oyxsuwqo1#

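I figured it out: stringdist_join() uses stringdistmatrix() under the hood. Calling stringdistmatrix() directly and collecting the needed information from its result is much faster. To get around the memory issue, I process the query sequences in chunks and write each chunk into an initially empty matrix.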

library(stringdist)

# make stringdist matrix 
chunk_size <- 1000
num_rows <- nrow(query)

# Initialize an empty matrix
sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))

# Loop through the rows in chunks
for (start_row in seq(1, num_rows, by = chunk_size)) {
  end_row <- min(start_row + chunk_size - 1, num_rows)
  
  # Subset the rows for the current chunk
  chunk_query <- query$seq[start_row:end_row]
  
  # Compute stringdist matrix for the current chunk
  chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
  
  # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
  sdm[start_row:end_row, ] <- chunk_sdm
}
rownames(sdm) <- query$name
colnames(sdm) <- subject$name

# find the indices where the values are 3 or less
indices <- which(sdm <= 3, arr.ind = TRUE)

# extract row names, col names, and values based on the indices
result <- data.frame(query = rownames(sdm)[indices[, 1]],
                     subject = colnames(sdm)[indices[, 2]],
                     mismatch = sdm[indices])
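
One caveat with the loop above: the full `sdm` matrix still holds nrow(query) x nrow(subject) doubles, which for the simulated sizes (50051 x 54277) is about 2.7 billion values, on the order of 20 GB. A possible variation, sketched below, filters each chunk immediately and keeps only the matching pairs, so the full matrix is never allocated. `hamming_pairs()` and the toy data frames are hypothetical names introduced for this sketch; they are not part of stringdist or fuzzyjoin, and this is not the answer author's code.

```r
library(stringdist)

# Hypothetical helper: returns all query/subject pairs within `max_dist`
# Hamming mismatches, holding only one chunk's distance matrix at a time.
# Assumes `query` and `subject` each have `name` and `seq` columns and
# that `query` has at least one row.
hamming_pairs <- function(query, subject, max_dist = 3, chunk_size = 1000) {
  starts <- seq(1, nrow(query), by = chunk_size)
  hits <- lapply(starts, function(start_row) {
    end_row <- min(start_row + chunk_size - 1, nrow(query))
    # Distance matrix for this chunk only: (chunk rows) x nrow(subject)
    chunk_sdm <- stringdistmatrix(query$seq[start_row:end_row], subject$seq,
                                  method = "hamming")
    # Keep only the pairs that pass the threshold
    idx <- which(chunk_sdm <= max_dist, arr.ind = TRUE)
    data.frame(query = query$name[start_row + idx[, 1] - 1],
               subject = subject$name[idx[, 2]],
               mismatch = chunk_sdm[idx])
  })
  do.call(rbind, hits)
}

# Toy data: only q1/s1 are within 3 mismatches (they differ at position 20).
toy_query <- data.frame(name = c("q1", "q2"),
                        seq = c("ACGTACGTACGTACGTACGT", "TTTTTTTTTTTTTTTTTTTT"))
toy_subject <- data.frame(name = c("s1", "s2"),
                          seq = c("ACGTACGTACGTACGTACGA", "GGGGGGGGGGGGGGGGGGGG"))
res <- hamming_pairs(toy_query, toy_subject, max_dist = 3, chunk_size = 1)
res
```

Peak memory is then governed by `chunk_size * nrow(subject)` plus the number of hits, so `chunk_size` can be tuned to the available RAM. `stringdistmatrix()` also accepts an `nthread` argument, which may be worth setting explicitly on multi-core machines.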
