Find thousands of files efficiently, with exact match, in a directory containing millions of files (bash/python/perl)

Posted by sg24os4d on 2023-11-22 in Perl


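I am on Linux and I am trying to find thousands of files in a directory (SOURCE_DIR) that contains millions of files. I have a list of the file names I need to find, stored in a single text file (FILE_LIST). Each line of that file contains one name corresponding to a file in SOURCE_DIR, and the file has thousands of lines.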

## FILE_LIST contain single word file names, each per line
#Name0001
#Name0002
#..
#Name9999

I want to copy the files to another directory (DESTINATION_DIR). I wrote the loop below, which contains a nested loop that finds the files one by one.

#!/bin/bash
FILE_LIST='file.list'
## FILE_LIST contain single word file names, each per line
#Name0001
#Name0002
#..
#Name9999

SOURCE_DIR='/path/to/source/files' # Contain millions of files in sub-directories
DESTINATION_DIR='/path/to/destination/files' # Files will be copied to here

while read FILE_NAME
do
    echo $FILE_NAME
    for FILE_NAME_WITH_PATH in `find "$SOURCE_DIR" -maxdepth 3 -name "$FILE_NAME*" -type f -exec readlink -f {} \;`; 
    do 
        echo $FILE_NAME_WITH_PATH
        cp -pv $FILE_NAME_WITH_PATH $DESTINATION_DIR; 
    done
done < $FILE_LIST

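This loop takes a very long time, and I was wondering whether there is a better way to achieve my goal. I searched but did not find a solution to my problem. If a solution already exists, please point me to it, or suggest any tweak to the code above. I am also fine with a different approach, or even a python/perl solution. Thanks for your time and help!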

perl

Source: https://stackoverflow.com/questions/61843060/find-thousands-of-files-efficiently-with-exact-match-from-a-directory-containing

5 Answers


35g0bw71 #1

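The files to copy have to be found, since they are given without a path (we don't know which directories they are in), but searching anew for each one is extremely wasteful and greatly increases the complexity.
Instead, first build a hash that maps each file name to its full path.
One way to do this is with Perl, using the fast core module File::Find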

use warnings;
use strict;
use feature 'say';

use File::Find;
use File::Copy qw(copy);

my $source_dir = shift // '/path/to/source';  # give at invocation or default

my $copy_to_dir = '/path/to/destination';

my $file_list = 'file_list_to_copy.txt';  
open my $fh, '<', $file_list or die "Can't open $file_list: $!";
my @files = <$fh>;
chomp @files;

my %fqn;    
find( sub { $fqn{$_} = $File::Find::name  unless -d }, $source_dir );

# Now copy the ones from the list to the given location        
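foreach my $fname (@files) { 
    # skip names that were not found anywhere under $source_dir
    if (not exists $fqn{$fname}) {
        warn "No file named $fname found under $source_dir\n";
        next;
    }
    copy $fqn{$fname}, $copy_to_dir  
        or do { 
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}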

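The remaining problem is that a file name may exist in more than one directory, in which case we need to be given a rule for what to do.
I disregard the maximum depth used in the question, since it is unexplained and looked to me like a fix related to the extreme runtimes (?). Also, the files are copied into a "flat" structure (without restoring their original hierarchy), taking the cue from the question.
Finally, I skip only directories, while the various other file types come with their own issues (copying links needs care).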
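Since the question also allows Python, here is a minimal sketch of the same "index once, then look up" idea. It is not the code from this answer: the function names build_index and copy_listed are made up for illustration, and it assumes file names are unique (a later path with the same name silently replaces an earlier one).

```python
import os
import shutil

def build_index(source_dir):
    """Walk source_dir once and map each file name to one full path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            # Assumption: names are unique; a later path replaces an earlier one.
            index[name] = os.path.join(dirpath, name)
    return index

def copy_listed(names, index, dest_dir):
    """Copy every listed name found in the index; return the names not found."""
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    for name in names:
        path = index.get(name)
        if path is None:
            missing.append(name)
            continue
        # Flat copy into dest_dir, keeping timestamps and permissions.
        shutil.copy2(path, os.path.join(dest_dir, name))
    return missing
```

The tree is walked exactly once, and each of the thousands of listed names then costs a single dictionary lookup rather than a fresh search.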
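The question was later clarified: files with the same name may exist in different directories, and these should be copied under the same name with a sequence number added before the extension.
For this we need to check whether a name already exists and keep track of the duplicate names while building the hash, so this will take a little longer. How, then, to handle the duplicate names? I use another hash that holds only the duplicate names, in arrayrefs; this simplifies and speeds up both parts of the job.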

my (%fqn, %dupe_names);
find( sub {
    return if -d;
    (exists $fqn{$_})
        ? push( @{ $dupe_names{$_} }, $File::Find::name )
        : ( $fqn{$_} = $File::Find::name );
}, $source_dir );

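To my surprise, on a quarter of a million files spread over a sprawling hierarchy this runs only slightly slower than the code that pays no attention to duplicate names, even though a test now runs for every item.
The parentheses around the assignment in the ternary operator are needed because the operator itself can be assigned to (if its last two arguments are valid "lvalues", as they are here), so one has to be careful with assignments inside the branches.
Then, after copying everything in %fqn as in the main part of this post, also copy the other files that have the same name. We need to split the file name so that the number can be added before .ext; I use the core module File::Basename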

use File::Basename qw(fileparse);

foreach my $fname (@files) { 
    next if not exists $dupe_names{$fname};  # no dupe (and copied already)
    my $cnt = 1;
    foreach my $fqn (@{$dupe_names{$fname}}) { 
        my ($name, $path, $ext) = fileparse($fqn, qr/\.[^.]*/); 
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do { 
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}

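(Basic testing done, but not much more.)
I would perhaps use undef instead of $path above, to indicate that the path is unused (which also avoids allocating and populating a scalar), but I left it this way for clarity, for those unfamiliar with what the module's sub returns.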

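Note: if you would rather have all of them indexed, first rename fname.ext (in the destination, where it has already been copied via %fqn) to fname_1.ext, and change the counter initialization to my $cnt = 2;.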

Note that files sharing a name are not necessarily the same file.
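The numbering rule can also be sketched in Python. This is an illustration only, not the answer's code: index_with_dupes and numbered_targets are hypothetical names, and which path counts as "first" depends on the order in which os.walk visits the directories.

```python
import os
from collections import defaultdict

def index_with_dupes(source_dir):
    """Map each file name to the list of every path that carries it."""
    paths = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            paths[name].append(os.path.join(dirpath, name))
    return paths

def numbered_targets(name, sources):
    """Pair each source path with a target file name.

    The first path keeps the plain name; later ones get _1, _2, ...
    inserted before the extension, as described above.
    """
    stem, ext = os.path.splitext(name)
    pairs = [(sources[0], name)]
    for count, src in enumerate(sources[1:], start=1):
        pairs.append((src, f"{stem}_{count}{ext}"))
    return pairs
```

Keeping the naming rule in a pure function separates it from the copying itself, so it can be checked without touching the filesystem.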


jgzswidk #2

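I suspect the speed problem comes (at least partly) from the nested loops: for every FILE_NAME you run find and loop over its results. The Perl solution below instead builds one regular expression dynamically from the list of names (this works for large lists; I have tested it on lists of more than 100k words). That way you loop over the files only once and let the regular expression engine handle the rest; it is quite fast.
Note that I have made a couple of assumptions based on my reading of your script: that you want the patterns to match case-sensitively at the beginning of the file name, and that you want to recreate the source directory structure in the destination (set $KEEP_DIR_STRUCT=0 if you don't want that). Also, rather than following best practice and using Perl's own File::Find, I shell out to the external find, because that makes it easier to use the same options you are using (such as -maxdepth 3). It should work fine unless any file has a newline in its name.
The script uses only core modules, so you should already have them installed.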

#!/usr/bin/env perl
use warnings;
use strict;
use File::Basename qw/fileparse/;
use File::Spec::Functions qw/catfile abs2rel/;
use File::Path qw/make_path/;
use File::Copy qw/copy/;

# user settings
my $FILE_LIST='file.list';
my $SOURCE_DIR='/tmp/source';
my $DESTINATION_DIR='/tmp/dest';
my $KEEP_DIR_STRUCT=1;
my $DEBUG=1;

# read the file list
open my $fh, '<', $FILE_LIST or die "$FILE_LIST: $!";
chomp( my @files = <$fh> );
close $fh;

# build a regular expression from the list of filenames
# explained at: https://www.perlmonks.org/?node_id=1179840
my ($regex) = map { qr/^(?:$_)/ } join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } @files;

# prep dest dir
make_path($DESTINATION_DIR, { verbose => $DEBUG } );

# use external "find"
my @cmd = ('find',$SOURCE_DIR,qw{ -maxdepth 3 -type f -exec readlink -f {} ; });
open my $cmd, '-|', @cmd or die $!;
while ( my $srcfile = <$cmd> ) {
    chomp($srcfile);
    my $basename = fileparse($srcfile);
    # only interested in files that match the pattern
    next unless $basename =~ /$regex/;
    my $newname;
    if ($KEEP_DIR_STRUCT) {
        # get filename relative to the source directory
        my $relname = abs2rel $srcfile, $SOURCE_DIR;
        # build new filename in destination directory
        $newname = catfile $DESTINATION_DIR, $relname;
        # create the directories in the destination (if necessary)
        my (undef, $dirs) = fileparse($newname);
        make_path($dirs, { verbose => $DEBUG } );
    }
    else {
        # flatten the directory structure
        $newname = catfile $DESTINATION_DIR, $basename;
        # warn about potential naming conflicts
        warn "overwriting $newname with $srcfile\n" if -e $newname;
    }
    # copy the file
    print STDERR "cp $srcfile $newname\n" if $DEBUG;
    copy($srcfile, $newname) or die "copy('$srcfile', '$newname'): $!";
}
close $cmd or die "external command failed: ".($!||$?);
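For readers who prefer Python, the following sketch shows the same single-pass idea: escape every name, join the names into one alternation, and test each base name once. The helper names are invented for this example, and it walks the tree with os.walk instead of calling find, so -maxdepth 3 is not reproduced. Whether Python's re module handles a very large alternation as efficiently as Perl's engine is something I have not measured, so treat the speed as an open question.

```python
import os
import re

def build_prefix_regex(names):
    """Compile one alternation matching any listed name at the start of a string."""
    names = [n for n in names if n]          # ignore blank lines in the list
    if not names:
        raise ValueError("no file names given")
    # Longest names first, so a longer name is tried before any of its prefixes.
    ordered = sorted(names, key=lambda n: (-len(n), n))
    return re.compile("(?:" + "|".join(re.escape(n) for n in ordered) + ")")

def matching_files(source_dir, regex):
    """Yield the path of every file whose base name starts with a listed name."""
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            if regex.match(name):            # match() anchors at the start
                yield os.path.join(dirpath, name)
```

re.escape plays the role of quotemeta in the Perl script: a dot in a listed name matches only a literal dot.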

You might also want to consider using hard links instead of copying the files.


iecba09b #3

Using `rsync`

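I don't know how fast this will be with millions of files, but here is an approach that uses rsync.
Reformat your file.list so that it looks like the following (for example with $ cat file.list | awk '{print "+ *" $0}').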

+ *Name0001
+ *Name0002
...
+ *Name9999

Pass file.list to the rsync command with the --include-from option:

$ rsync -v -r --dry-run --filter="+ **/" --include-from=/tmp/file.list --filter="- *" /path/to/source/files /path/to/destination/files

Explanation of the options:

-v                  : Show verbose info.
-r                  : Traverse directories when searching for files to copy.
--dry-run           : Remove this if preview looks okay
--filter="+ **/"    : Pattern to include all directories in search
--include-from=/tmp/file.list  : Include patterns from file.
--filter="- *"      : Exclude everything that didn't match previous patterns.

Option order matters.
If the verbose output looks acceptable, remove --dry-run.
Tested with rsync version 3.1.3.
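If you would rather produce the rule file from Python than from awk, a small sketch follows. The function names are hypothetical, and it assumes the listed names contain no characters that rsync treats as wildcards (such as *, ? or [), because the names are written into the rules unescaped.

```python
def rsync_include_rules(names):
    """Turn plain file names into '+ *NAME' include rules, skipping blank lines."""
    return [f"+ *{name}" for name in names if name]

def write_rules(names, rules_path):
    """Write one include rule per line to rules_path."""
    with open(rules_path, "w") as out:
        out.write("\n".join(rsync_include_rules(names)) + "\n")
```

The resulting file is used exactly as above, via --include-from.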


xam8gpfp #4

Here is a bash v4+ solution using find, though I am not sure about its speed.

#!/usr/bin/env bash

files=file.list
sourcedir=/path/to/source/files
destination=/path/to/destination/files
mapfile -t lists < "$files"
total=${#lists[*]}

while IFS= read -rd '' files; do
  counter=0
  while ((counter < total)); do
    if [[ $files == *"${lists[counter]}" ]]; then
      echo cp -v "$files" "$destination" && unset 'lists[counter]' && break
    fi
    ((counter++))
  done
  lists=("${lists[@]}")
  total=${#lists[*]}
  (( ! total )) && break  ##: if the lists is already emtpy/zero, break.
done < <(find "$sourcedir" -type f -print0)


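The inner break leaves the inner loop as soon as a file under the source directory matches an entry from file.list, so the list is not scanned to its end every time. The unset removes the matched entry from "${lists[@]}" (which is an array), so later passes of the inner loop skip names that have already been matched.
File name collisions should not be a problem; the unset and the inner break make sure of that. The downside shows up when the same name has to be matched in several different subdirectories, because only the first match is handled.
If speed is what you are after, use a general-purpose scripting language such as python, perl and friends.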
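As a rough illustration of that last point, here is a Python sketch of the same "remove each name once it is found, stop when none are left" logic, with a set lookup in place of the inner loop. It is not a translation of the bash script: find_listed is a made-up name, and it matches whole file names exactly, whereas the bash version matches the end of the full path.

```python
import os

def find_listed(source_dir, names):
    """Walk the tree once and yield (name, path) for the first hit of each name.

    Stops walking as soon as every listed name has been found.
    """
    wanted = set(names)
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            if name in wanted:
                wanted.discard(name)      # later duplicates are ignored
                yield name, os.path.join(dirpath, name)
                if not wanted:
                    return                # everything found, stop early
```

Each file costs one set lookup instead of a scan over the remaining list, which is where most of the time goes in the shell loop.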

An alternative to the (excruciatingly slow) pattern matching inside the loop is grep

#!/usr/bin/env bash

files=file.list
source_dir=/path/to/source/files
destination_dir=/path/to/destination/files

while IFS= read -rd '' file; do
  cp -v "$file" "$destination_dir"
done < <(find "$source_dir" -type f -print0 | grep -Fzwf "$files")


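The -z option of grep is a GNU extension.
In the first script, remove the echo in front of cp once you are satisfied that the output is correct.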


j1dl9f46 #5

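Try locate with grep instead of find. locate uses a file index database, so it should be pretty fast. Remember to run sudo updatedb beforehand to bring the database up to date.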
