R: How to find the closest parent directory that distinguishes duplicated file names?

Posted by fykwrbwg on 2023-09-27 in Other


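I have a list of full path names. For each one I want to return the shortest trailing part of the path, starting at the closest parent directory, that distinguishes it from the other paths.
Here is an example (duplicated file names are grouped on the same line):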

have <- c("/A/B/C/D", "/A/B/D", "/A/C/D",
          "/path/to/unique_file",
          "/path/to/another_unique_file",
          "/path/diverges/here/file", "/path/diverges/here1/file")

What I want to get is:

want <- c("B/C/D", "B/D", "A/C/D",
          "unique_file",
          "another_unique_file",
          "here/file", "here1/file")

> length(unique(want)) == length(have)
[1] TRUE

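A few parameters:
1. In this example I have 3 D's, 2 file's, 1 unique_file and 1 another_unique_file, but in my real problem there may be any number of duplicated file names.
2. At a minimum, the full path names will be unique (as you would expect). Returning the full path name is fine if it is the shortest unique path (counting from the right). /A/B/C/D would distinguish the first file, but B/C/D also distinguishes it and is shorter, which is what I need.
3. As the first group of file names shows, files with the same name can be nested at different depths. For example, D sits at the third, second, and second level, respectively.
4. I lean toward base R functions, unless a package makes the algorithm easier to read and the code shorter.

Grouping them is easy with split: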

> split(have, basename(have))
$another_unique_file
[1] "/path/to/another_unique_file"

$D
[1] "/A/B/C/D" "/A/B/D"   "/A/C/D"  

$file
[1] "/path/diverges/here/file"  "/path/diverges/here1/file"

$unique_file
[1] "/path/to/unique_file"

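From here, I would appreciate some help working backward from the file name until the path is unique. I have considered a recursive function, but I am not sure it would work with different depths.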

Source: https://stackoverflow.com/questions/77130359/how-to-find-the-closest-parent-directory-that-distinguishes-duplicated-file-name

2 Answers


tktrz96b1#

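Here is a non-recursive approach. It works from the tail of each path, looking for duplicated tails. As soon as a tail is unique, it is included in the result.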

have <- c("/A/B/C/D", "/A/B/D", "/A/C/D",
          "/path/to/unique_file",
          "/path/to/another_unique_file",
          "/path/diverges/here/file", "/path/diverges/here1/file")

# This isn't necessary from your description, but it might be 
# a good idea
have <- normalizePath(have, mustWork = FALSE)

# Split each path
splits <- strsplit(have, "/", fixed=TRUE)

# Reverse each path, dropping the initial empty entry
revsplits <- lapply(splits, function(x) rev(x[-1]))

# Find the length of each
lens <- sapply(revsplits, length)

# Assume all are duplicates to start
dup <- rep(TRUE, length(have))

# Initialize the results with NA to signal not done yet
result <- rep(NA, length(have))

# Loop through from last entry to first entry
for (i in seq_len(max(lens))) {
  entry <- sapply(revsplits, function(s) s[i])
  
  # It's a dup if not already non-dup, and this entry is duplicated
  dup <- dup & (duplicated(entry) | duplicated(entry, fromLast = TRUE))
  
  # If we have any new non-dups, record them
  done <- is.na(result) & !dup
  if (any(done))
    result[done] <- sapply(revsplits[done], function(x) paste(rev(x[1:i]), collapse = "/"))
  
  # If all are done, quit
  if (all(!dup)) break
}

# If any are left just copy them
result[dup] <- have[dup]

result
#> [1] "B/C/D"               "B/D"                 "/A/C/D"             
#> [4] "unique_file"         "another_unique_file" "here/file"          
#> [7] "here1/file"

Created on 2023-09-18 with reprex v2.0.2
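
Note that in the output above the third path comes back as "/A/C/D" rather than the requested "A/C/D". The loop compares one path component at a time across all paths, so the "A" of "/A/C/D" clashes with the "A" of "/A/B/D" even though their tails already differ. A minimal sketch of a variant that compares whole tails instead follows; `tail_of` and `shortest_unique_tail` are hypothetical names that are not part of the answer above, and the sketch assumes "/"-separated paths.

```r
# Hypothetical helper: the tail of each path made of its last k components
# (or the whole path if it has fewer than k components).
tail_of <- function(parts, k) {
  vapply(parts, function(p) {
    paste(utils::tail(p, min(k, length(p))), collapse = "/")
  }, character(1))
}

# Hypothetical function: shortest trailing part that is unique per path.
shortest_unique_tail <- function(paths) {
  # Split on "/" and drop empty components (e.g. the leading one)
  parts <- lapply(strsplit(paths, "/", fixed = TRUE), function(p) p[nzchar(p)])
  result <- rep(NA_character_, length(paths))
  for (k in seq_len(max(lengths(parts)))) {
    tails <- tail_of(parts, k)
    # A tail is unique if no other path yields the same k-component tail
    unique_now <- !(duplicated(tails) | duplicated(tails, fromLast = TRUE))
    newly_done <- is.na(result) & unique_now
    result[newly_done] <- tails[newly_done]
    if (!anyNA(result)) break
  }
  # Anything still unresolved falls back to the full path
  result[is.na(result)] <- paths[is.na(result)]
  result
}

have <- c("/A/B/C/D", "/A/B/D", "/A/C/D",
          "/path/to/unique_file",
          "/path/to/another_unique_file",
          "/path/diverges/here/file", "/path/diverges/here1/file")
shortest_unique_tail(have)
```

For the example paths this should reproduce `want` from the question, in input order, including "A/C/D" for the third entry.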


s1ag04yj2#

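You can use a recursive approach, with one subroutine that finds the unique paths and another that extracts and formats them. Both can be wrapped in a single function: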

unique_filenames <- function(filepaths) {
  
  split_paths <- function(filepaths) {
    subgroups <- split(dirname(filepaths), basename(filepaths))
    clash <- lengths(subgroups) > 1
    if(any(clash)) subgroups[clash] <- lapply(subgroups[clash], split_paths)
    return(subgroups)
  }
  
  read_list <- function(li) {
    final <- !sapply(li, is.list)
    li[final] <- names(li)[final]
    # Recurse whenever some entries are still nested lists (unresolved clashes)
    if(any(!final)) {
      li[!final] <- Map(\(x, nm) {
        names(x) <- paste(names(x), nm, sep = "/")
        read_list(x)
      }, li[!final], names(li)[!final])
    }
    return(unlist(unname(li)))
  }
  
  read_list(split_paths(filepaths))
}
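
To see why the recursion copes with different depths, a single step of `split_paths` can be traced by hand. The sketch below takes only the three "D" paths from the question and applies the same `split(dirname(...), basename(...))` call one level up; the variable names `d_group`, `parents` and `one_step` are illustrative only.

```r
# The three paths that share the file name "D"
d_group <- c("/A/B/C/D", "/A/B/D", "/A/C/D")

# First level: strip the file name, leaving the parent directories
parents <- dirname(d_group)

# Second level: group the grandparents by the name of the parent directory
one_step <- split(dirname(parents), basename(parents))
one_step
```

Here the group "B" holds a single entry, so "B/D" is already unique, while the group "C" still holds two entries and is split again on the next call. Each branch stops as soon as it is alone in its group, whatever its depth.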

Testing, we get

unique_filenames(have)
#> [1] "another_unique_file" "B/D"                 "A/C/D"              
#> [4] "B/C/D"               "here/file"           "here1/file"         
#> [7] "unique_file"

Created on 2023-09-18 with reprex v2.0.2
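
The result above follows the grouping order of `split` rather than the order of the input. If the input order matters, one possible way to realign the labels is to match each one back to the path it ends. This is a sketch only: `match_to_input` is a hypothetical helper, and it assumes every label is the tail of exactly one input path.

```r
# Hypothetical helper: put each label at the position of the path it ends
match_to_input <- function(paths, labels) {
  out <- rep(NA_character_, length(paths))
  for (lab in labels) {
    # Prepend "/" so that "unique_file" does not match "another_unique_file"
    out[endsWith(paths, paste0("/", lab))] <- lab
  }
  out
}

have <- c("/A/B/C/D", "/A/B/D", "/A/C/D",
          "/path/to/unique_file",
          "/path/to/another_unique_file",
          "/path/diverges/here/file", "/path/diverges/here1/file")

# Labels copied from the output shown above
labels <- c("another_unique_file", "B/D", "A/C/D", "B/C/D",
            "here/file", "here1/file", "unique_file")

match_to_input(have, labels)
```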
