哪些情况可以从Perl的研究中获益?

uhry853o  于 2022-12-19  发布在  Perl
关注(0)|答案(4)|浏览(123)

我正在研究study,这是一个Perl特性,用于检查字符串,以使后续正则表达式可能更快:

while( <> ) {
    study;
    $count++ if /PATTERN/;
    $count++ if /OTHER/;
    $count++ if /PATTERN2/;
    }

关于哪些情况会从中受益,我们没有太多的论述,您可以从the docs中梳理出以下几点:

  • 具有常量字符串的模式
  • 多种模式
  • 目标字符串越短越好(学习时间越少)

我在寻找一些具体的案例,不仅可以证明它有很大的优势,还可以稍微调整一下,使其失去优势。the docs中的一个警告是,你应该对个别案例进行基准测试。我想找到一些边缘案例,在这些案例中,字符串(或模式)的微小差异会导致性能的巨大差异。
如果你没有使用过study,请不要回答。我宁愿得到格式正确的答案,而不是快速的猜测。这里没有紧急情况,也没有耽误任何工作。
作为奖励,我一直在使用一个基准测试工具来比较NYTProf的两次运行,我宁愿使用它,而不是通常的基准测试工具。如果我想出了一个自动化的方法,我也会分享它。

txu3uszq

txu3uszq1#

谷歌找到了这个lovely test scenario

#!/usr/bin/perl
# 
#  Exercise 7.8 
# 
# This is a more difficult exercise. The study function in Perl may speed up searches 
# for motifs in DNA or protein. Read the Perl documentation on this function. Its use 
# is simple: given some sequence data in a variable $sequence, type:
# 
# study $sequence;
# 
# before doing the searches. Do you think study will speed up searches in DNA or 
# protein, based on what you've read about it in the documentation?
# 
# For lots of extra credit! Now read the Perl documentation on the standard module 
# Benchmark. (Type perldoc Benchmark, or visit the Perl home page at http://www.
# perl.com.) See if your guess is right by writing a program that benchmarks motif 
# searches of DNA and of protein, with and without study.
#
# Answer to Exercise 7.8

use strict;
use warnings;

use Benchmark;

my $dna = join ('', qw(
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
));

my $protein = join('', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
));

my $count = 1000;

print "DNA pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nDNA pattern matches with 'study' function:\n";
timethis($count,
    ' study $dna;
    for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);

print "\nProtein pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

print "\nProtein pattern matches with 'study' function:\n";
timethis($count,
    ' study $protein;
    for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);

请注意,对于利润最高的情况(蛋白质匹配),报告的收益仅约为2%:

#  $ perl exer07.08
# On my computer, this is the output I get: your results probably vary.

#  DNA pattern matches without 'study' function:
#  timethis 1000: 29 wallclock secs (29.25 usr +  0.00 sys = 29.25 CPU) @ 34.19/s (n=1000)
#  
#  DNA pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (29.21 usr +  0.15 sys = 29.36 CPU) @ 34.06/s (n=1000)
#  
#  Protein pattern matches without 'study' function:
#  timethis 1000: 32 wallclock secs (29.47 usr +  0.04 sys = 29.51 CPU) @ 33.89/s (n=1000)
#  
#  Protein pattern matches with 'study' function:
#  timethis 1000: 30 wallclock secs (28.97 usr +  0.02 sys = 28.99 CPU) @ 34.49/s (n=1000)
#
ecbunoof

ecbunoof2#

我会留下笔记作为答案,稍后我会把它发展成一个实际的答案:
pp.cPP(pp_study)中,它有以下几行奇怪的代码(没有注解):

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
RETPUSHNO;
}

看起来带UTF8标志的标量根本没有被研究过。

6tr1vspr

6tr1vspr3#

不完全是。如果你搜索,大多数结果是在Perl测试套件,这意味着没有人使用它。而且,由于bug,你只能notice speed benefits on global variables。它实际上带来了一些速度增强时,处理英语(有时甚至快2倍),但你必须使变量全局。
它有时也会导致infinite loopsfalse positivesstudy可能会给你的程序增加bug,即使它只是为了让程序更快),因此在Perl 5.16中它被删除了(或者更确切地说,使之成为空操作)--没有人想维护一个没有人关心的部分。

hgqdbh6s

hgqdbh6s4#

无。自2012年以来,study does nothing
目前代码有

if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    /* Historically, study was skipped in these cases. */
    SETs(&PL_sv_no);
    return NORMAL;
}

/* Make study a no-op. It's no longer useful and its existence
   complicates matters elsewhere. */
SETs(&PL_sv_yes);
return NORMAL;

这意味着study在它以前已经做了一些事情的情况下返回true,否则返回false--但是它实际上从来没有做过任何事情。

相关问题