Perl - Regex to extract only comma-separated strings

Posted by swvgeqrz on 2022-11-15 in Perl

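I have a question that I hope someone can help with.
I have a variable containing the content of a web page (scraped with WWW::Mechanize).
The variable contains data such as: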

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only things I am interested in from the examples above are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

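The problem I am having:

I am trying to extract only the comma-separated strings from the variable and store them in an array for later use.
What is the best way to make sure I also get the strings at the start (i.e. cat_dog) and at the end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, because the variable contains web page content, there will inevitably be cases where a comma is immediately followed by a space and then another word, since that is the correct way to use commas in paragraphs and sentences.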

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

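I am not interested in cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma is NOT followed by a space (i.e. cat_dog,horse,rabbit,chicken-pig).

I have tried a number of ways of doing this, but cannot work out the best way to construct the regular expression.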


pod7payv #1

How about:

[^,\s]+(,[^,\s]+)+

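This matches one or more characters that are not whitespace or a comma ([^,\s]+), followed by a comma and one or more characters that are not whitespace or a comma, with that comma-plus-item group repeated one or more times.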
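As a quick check of that description, here is a minimal sketch that wraps the pattern in a helper and runs it on one sample string and on a fragment of the prose sentence from the question. The helper name `extract_items` is an assumption made for this illustration and is not part of the answer; the sketch also uses a capture group and `$1` where the answer later uses `$&`.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption for this sketch):
# returns the items of the first space-free comma-separated run in $text.
sub extract_items {
    my ($text) = @_;
    # (?: ... ) groups without capturing; the outer parentheses capture the whole run.
    if ($text =~ /([^,\s]+(?:,[^,\s]+)+)/) {
        return split /,/, $1;
    }
    return;    # empty list when nothing matches
}

my @animals = extract_items("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
my @prose   = extract_items("the only ringed planet, however, this is now known");

print join("|", @animals), "\n";    # cat_dog|horse|rabbit|chicken-pig
print scalar(@prose), "\n";         # 0
```

The prose fragment yields nothing because every comma in it is followed by a space, so the `,[^,\s]+` part can never match.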

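Additional comments:

To match more than one sequence, add the g modifier for global matching.
The code below splits each match ($&) on , and pushes the results onto @matches.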

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}   

print join("\n",@matches),"\n";

fzwojiic #2

While you could probably construct a single regex, a combination of regexes, split, grep and map looks decent:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;    # bare split works on $_

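Going from right to left:
1. Split the line on whitespace (split).
2. Keep only the elements that have no comma at either end but do have a comma inside (grep).
3. Split each such element into its parts (map and split).
This way you can easily alter the individual parts; for example, to eliminate elements with two consecutive commas, add && !/,,/ inside the grep.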
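The three steps above can be sketched as a small runnable example. The wrapper name `list_items` and the test line are assumptions made for this illustration; the sketch passes the text to `split` explicitly instead of relying on `$_`, and includes the optional `!/,,/` check.

```perl
use strict;
use warnings;

# Hypothetical wrapper (the name is an assumption for this sketch).
sub list_items {
    my ($text) = @_;
    return map  { split /,/ }                          # 3. break each kept token into items
           grep { /,/ && !/^,/ && !/,$/ && !/,,/ }     # 2. comma inside, none at the ends, no empty item
           split ' ', $text;                           # 1. split the text on whitespace
}

my $line  = "dffg elephant,MOUSE_RAT,spider,lion-tiger planet, however, ,x a,,b hdsfds";
my @items = list_items($line);

print join("|", @items), "\n";    # elephant|MOUSE_RAT|spider|lion-tiger
```

The tokens `planet,` and `however,` are dropped because they end in a comma, `,x` because it starts with one, and `a,,b` because of the extra `!/,,/` condition.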


0tdrvxhp #3

I hope this is clear and suits your needs:

#!/usr/bin/perl
    use warnings;
    use strict;

    my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf", 
     "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew", 
     "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
     "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");

    my $regex = qr/
                \s #From your examples, it seems as if every
                   #comma separated list is preceded by a space.
                (
                    (?:
                        [^,\s]+ #Now, not a comma or a space for the
                                 #terms of the list

                        ,        #followed by a comma
                    )+
                    [^,\s]+     #followed by one last term of the list
                )
                /x;

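    # One array ref per input string; it is empty when no list was found.
    my @matches = map {
                    # Test the match itself: $1 is only meaningful after a successful match.
                    if ($_ =~ /$regex/) {
                        [split ',', $1];
                    }
                    else {
                        []
                    }
                } @strs;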
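The snippet above builds `@matches` but never prints it. Since the result is an array of array references (one per input string), here is a minimal sketch of how such a structure might be read. The hand-written `@matches` below is stand-in data shaped like the answer's result, and the output format is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Stand-in data with the same shape as the answer's @matches (assumed values).
my @matches = (
    [qw(cat_dog horse rabbit chicken-pig)],
    [],                                   # an input string that contained no list
    [qw(a b c d)],
);

my @lines;
for my $i (0 .. $#matches) {
    my @items = @{ $matches[$i] };        # dereference the array ref
    push @lines, @items
        ? "string $i: " . join(", ", @items)
        : "string $i: (no list found)";
}

print "$_\n" for @lines;
```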

zqry0prt #4

my @arr;
$var =~ tr/ //s;    # squeeze runs of spaces into a single space
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push(@arr, $&);    # $& holds the text of the current match
}

The regex matches three cases:

(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,)      : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
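To see the three alternatives working together, here is a minimal sketch that runs the same pattern over one of the question's sample strings and over the prose sentence. It relies on the fact that a `/g` match in list context with no capture groups returns the list of whole matches, so `$&` is not needed; the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Same three-branch pattern as above, precompiled with qr//.
my $re = qr/(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/;

my $var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew";
$var =~ tr/ //s;                      # squeeze repeated spaces, as in the answer
my @arr = $var =~ /$re/g;             # list context: every whole match

my $sentence = "Saturn was long thought to be the only ringed planet, however, "
             . "this is now known not to be the case.";
my @none = $sentence =~ /$re/g;       # commas followed by spaces do not qualify

print join("|", @arr), "\n";          # ANTELOPE-GIRAFFE|frOG|fish|crab|kangaROO-KOALA
print scalar(@none), "\n";            # 0
```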
