Perl:与大型数组匹配

a14dhokn  于 2022-12-13  发布在  Perl
关注(0)|答案(5)|浏览(131)

假设我有$sentence = "The quick brown fox jumps over the lazy dog"和一个数组@targets = qw(brown red black lazy quick);
我们的目标是找出@targets中的所有单词,即包含在$sentence中的单词,在本例中,就是这三个单词:brown lazy quick .
注意:@targets数组可能非常大,比如有10,000个元素,也可能有10,000个$sentence需要处理。
有没有一种高效而优雅的方法来做到这一点?

vhipe2zx

vhipe2zx1#

看看这个逻辑是否有助于您:

#!/usr/bin/perl

use strict;
use warnings;

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets = qw/brown red black lazy quick/;
my %targets = map { $_ => 1 } @targets;

my @words = split(' ', $sentence);

foreach my $each_word ( @words ){
    print $each_word."\n" if $targets{$each_word};
}

输出

quick
brown
lazy
56lgkhnf

56lgkhnf2#

使用Algorithm::AhoCorasick::XS可以非常高效地搜索多个字符串:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;

my @targets = qw(brown red black lazy quick);
my $matcher = Algorithm::AhoCorasick::XS->new(\@targets);

my @sentences = (
    "The quick brown fox jumps over the lazy dog",
    "The man in black fled across the desert and the gunslinger followed."
);

for my $sentence (@sentences) {
    my @matches = $matcher->matches($sentence); # Or unique_matches
    say "@matches" if @matches;
}

示例:

$ perl demo.pl
quick brown lazy
black
3pvhb19x

3pvhb19x3#

以下是使用Regexp::Assemble的示例:

use feature qw(say);
use strict;
use warnings;
use Regexp::Assemble;

my $str = "The quick brown fox jumps over the lazy dog";
my @targets = qw(brown lazy quick);
my $ra = Regexp::Assemble->new;
$ra->add($_) for @targets;
my $regex = $ra->re;
my @matches = $str =~ /($regex)/g;
say for @matches;  # optionally add List::Util::uniq if you want the list 
                     to be unique

输出

quick
brown
lazy
wribegjk

wribegjk4#

我对其他答案中的方法进行了快速比较:

use feature qw(say);
use strict;
use warnings;
use Algorithm::AhoCorasick::XS;
use Benchmark qw(cmpthese);
use Regexp::Assemble;

{
    my @words = ("barbarous", "blandit", "celery", "gravida", "pencil", "move", "rain",
            "diligent", "letter", "staking", "enjoy", "salty", "abnormal", "volatile",
            "scared", "unsightly", "curabitor", "possible", "morbi", "open", "road");

    my $str = <<'END';
Lorem ipsum dolor sit amet consectetur adipiscing elit Vivamus tempor nulla condimentum ex posuere sed vulputate purus laoreet Etiam ut justo pulvinar lobortis leo in dictum ligula Pellentesque porta et justo a vestibulum Fusce et nulla vitae nisi scelerisque iaculis Donec non mi ligula Donec et ullamcorper eros Duis tincidunt ante dui vel vestibulum risus gravida molestie Nam eleifend placerat enim ultricies eleifend Pellentesque dapibus orci quis nulla faucibus ut rhoncus urna interdum Nam lobortis vitae nunc vel bibendum Nulla sed condimentum justo
Donec mauris felis fermentum eu libero sed cursus fermentum nunc Fusce dolor sapien ultrices nec egestas vel aliquam in mi Duis ullamcorper elit ut ornare tincidunt urna velit ultrices est at molestie urna risus faucibus libero Cras nec quam in orci efficitur tristique venenatis sit amet odio Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas Pellentesque pretium metus eu eros vulputate non ornare orci imperdiet Ut pharetra efficitur urna sit amet sodales Donec pharetra eros sit amet dolor ullamcorper aliquet Cras ac ante sit amet erat fermentum consequat eget a turpis Donec a tellus ex Sed suscipit posuere hendrerit Praesent suscipit ante et tortor tempor congue Lorem ipsum dolor sit amet consectetur adipiscing elit Nunc tincidunt erat sit amet venenatis finibus ante massa ornare lacus at euismod tortor neque at velit
Sed ultricies velit ante vel convallis massa accumsan a Curabitur mauris purus tempus eu leo vitae fringilla porttitor augue Pellentesque mattis sed turpis quis viverra Aenean a mauris nec urna sollicitudin bibendum Etiam elementum efficitur lacus nec finibus Morbi ut dolor iaculis porttitor magna quis posuere arcu Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Duis at luctus sem Suspendisse vel erat porttitor lobortis tortor ac suscipit nulla Cras maximus dapibus odio sit amet aliquet Curabitur iaculis odio at aliquet tincidunt Suspendisse potenti Vivamus maximus est id rhoncus consectetur nisi purus rutrum massa nec ullamcorper risus nisi ut metus
Phasellus lobortis mauris turpis ut sollicitudin felis ultrices vel Ut tristique felis gravida est varius rutrum Sed porttitor pharetra sapien ac gravida urna Curabitur quis dignissim turpis Phasellus turpis sapien scelerisque vitae turpis tincidunt faucibus tempus tortor Proin bibendum elementum libero fermentum rhoncus Ut scelerisque sem eget rutrum scelerisque ex felis scelerisque augue vitae elementum lorem enim sit amet erat Sed congue volutpat mi pellentesque dapibus mi tempor quis Mauris nibh lorem mattis in nibh sed lacinia sagittis urna Etiam posuere odio non nisi placerat fermentum Mauris elit massa fermentum quis hendrerit ut condimentum a risus
Cras ut leo rhoncus laoreet turpis quis consectetur sem Fusce gravida neque et mauris molestie fermentum Proin non vehicula tellus vel volutpat est Nullam quis ornare ligula Nunc sed quam orci Nullam enim dolor cursus interdum fermentum sit amet gravida eu ex Aliquam nec porta mi Ut varius ut dolor vel convallis Curabitur massa purus varius nec diam eget sollicitudin finibus nisi Mauris pharetra in odio in rutrum Vestibulum vel eleifend ex Mauris tellus ex tempor ut dolor luctus facilisis pharetra justo
Cras elementum enim a egestas bibendum sapien ligula convallis tellus in dignissim augue massa vel diam Fusce quis mi diam Phasellus mollis est non ultricies dictum Nullam ullamcorper leo quis lacinia scelerisque libero enim faucibus dui placerat sodales dolor lectus sed ligula Nunc interdum justo vitae faucibus faucibus neque dui cursus nunc ut gravida dui urna sed ipsum Ut luctus lacus nulla id imperdiet elit fermentum sit amet Mauris nulla erat gravida ac enim ut blandit blandit felis Ut feugiat consequat vulputate Mauris erat neque vehicula ac purus ut tristique fringilla nulla
END
    
    cmpthese(
        -1,
        {
            regexp_assemble => sub { regexp_assemble( $str, \@words ) },
            aho_corasick    => sub { aho_corasick( $str, \@words ) },
            pp_trie         => sub { pp_trie( $str, \@words ) },
            hash_lookup     => sub { hash_lookup( $str, \@words ) },
        }
    );
}

sub hash_lookup {
    my ( $str, $targets ) = @_;

    my %targets = map { $_ => 1 } @$targets;
    my @words = split ' ', $str;
    my @matches;
    foreach my $each_word ( @words ){
        push @matches, $each_word if $targets{$each_word};
    }
    return \@matches;
}

sub pp_trie {
    my ( $str, $targets ) = @_;

    my $filter    = join '|', map {quotemeta} @$targets;
    my $re_filter = qr/\b($filter)\b/;
    my @matches = $str =~ /$re_filter/g;
    return \@matches;
}

sub aho_corasick {
    my ( $str, $targets ) = @_;

    my $matcher = Algorithm::AhoCorasick::XS->new($targets);
    my @matches = $matcher->matches($str);
    return \@matches;
}

sub regexp_assemble {
    my ( $str, $targets ) = @_;
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @$targets;
    my $regex = $ra->re;
    my @matches = $str =~ /\b($regex)\b/g;
    return \@matches;
}

输出

Rate regexp_assemble   hash_lookup       pp_trie aho_corasick
regexp_assemble  1935/s              --          -86%          -90%         -90%
hash_lookup     14221/s            635%            --          -24%         -28%
pp_trie         18618/s            862%           31%            --          -5%
aho_corasick    19651/s            916%           38%            6%           --

因此,对于此设置,aho_corasick是最快的。

编辑

如果我们提前做好第一步准备,我们会发现aho_corasick甚至更快:

# ... (previous part as before)

    my $regexp_assemble = RegexpAssemble->new();
    my $aho_corasick = AhoCorasick->new();
    my $pp_trie = PPTrie->new();
    my $hash_lookup = HashLookUp->new();
    $regexp_assemble->prepare($str, \@words);
    $aho_corasick->prepare($str, \@words);
    $pp_trie->prepare($str, \@words);
    $hash_lookup->prepare($str, \@words);
    cmpthese(
        -1,
        {
            regexp_assemble => sub { $regexp_assemble->get_matches( $str, \@words ) },
            aho_corasick    => sub { $aho_corasick->get_matches( $str, \@words ) },
            pp_trie         => sub { $pp_trie->get_matches( $str, \@words ) },
            hash_lookup     => sub { $hash_lookup->get_matches( $str, \@words ) },
        }
    );
}

package HashLookUp;
use strict;
use warnings;
use experimental qw(declared_refs refaliasing signatures);

sub new( $class, %args ) { bless \%args, $class }

sub prepare( $self, $str, $targets ) {
    my %targets = map { $_ => 1 } @$targets;
    my @words = split ' ', $str;
    $self->{targets} = \%targets;
    $self->{words} = \@words;
}

sub get_matches( $self, $str, $targets ) {
    my \%targets = $self->{targets};
    my \@words = $self->{words};
    my @matches;
    foreach my $each_word ( @words ){
        push @matches, $each_word if $targets{$each_word};
    }
    return \@matches;
}

package PPTrie;
use strict;
use warnings;
use experimental qw(declared_refs refaliasing signatures);

sub new( $class, %args ) { bless \%args, $class }

sub prepare($self, $str, $targets ) {
    my $filter    = join '|', map {quotemeta} @$targets;
    $self->{re_filter} = qr/\b($filter)\b/;
}

sub get_matches( $self, $str, $targets ) {
    my $re_filter = $self->{re_filter};
    my @matches = $str =~ /$re_filter/g;
    return \@matches;
}

package AhoCorasick;
use strict;
use warnings;
use experimental qw(declared_refs refaliasing signatures);
use Algorithm::AhoCorasick::XS;

sub new( $class, %args ) { bless \%args, $class }

sub prepare( $self, $str, $targets ) {
    $self->{matcher} = Algorithm::AhoCorasick::XS->new($targets);
}

sub get_matches( $self, $str, $targets ) {
    my $matcher = $self->{matcher};
    my @matches = $matcher->matches($str);
    return \@matches;
}

package RegexpAssemble;
use strict;
use warnings;
use experimental qw(declared_refs refaliasing signatures);
use Regexp::Assemble;

sub new( $class, %args ) { bless \%args, $class }

sub prepare( $self, $str, $targets ) {
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @$targets;
    $self->{regex} = $ra->re;
}

sub get_matches($self, $str, $targets ) {
    my $regex = $self->{regex};
    my @matches = $str =~ /\b($regex)\b/g;
    return \@matches;
}

现在的结果是:

regexp_assemble  16440/s              --          -51%         -69%         -86%
pp_trie          33810/s            106%            --         -37%         -71%
hash_lookup      53592/s            226%           59%           --         -54%
aho_corasick    117029/s            612%          246%         118%           --
wj8zmpe1

wj8zmpe15#

仅使用纯Perl

use strict;
use warnings;
use feature 'say';

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets   = qw/brown red black lazy quick/;
my $filter    = join('|',@targets);
my $re_filter = qr/\b($filter)\b/;

my @found = $sentence =~ /$re_filter/g;

say for @found;

输出量

quick
brown
lazy

相关问题