Perl: line-wrap delimited lines, retaining the first column, with a minimum final length

Posted by ccrfmcuu on 2022-11-15 in Perl

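I'm looking to split long lines of content while retaining a header.
I do a lot of text processing, and I like using Unix one-liners because they are easy for me to organize over time (compared with lots of scripts), I can easily chain them together, and I like (re)learning how to use classic Unix tools. Usually I will use a short awk, perl, or ruby one-liner, depending on which is the most elegant.
Here I have X lines of comma-delimited items. I want to split these up, retaining the header.
Input: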

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab

Output:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

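Algorithm details:

  • An input line consists of a header word, an equals sign, and a comma-separated list of at least 1 item.
  • In this example most items are single words, but an item can contain spaces (e.g. "horseshoe crab" at the end).
  • Split into lines of 9 items, unless the last line would have fewer than 3, in which case the final split may yield up to 12 items on one line.
  • There are multiple lines. For example, the next line might be planets.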
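
To make the rule concrete, here is a minimal Ruby sketch that only computes how many items land on each output line. The method name group_sizes and its keyword parameters are hypothetical, and the sketch reads "fewer than 3" literally, so a trailing group of exactly 3 items stays on its own line.

```ruby
# Hypothetical helper: sizes of the output lines for n items.
# Assumption: "fewer than min_last" is meant literally.
def group_sizes(n, per_line: 9, min_last: 3)
  full, rest = n.divmod(per_line)
  sizes = Array.new(full, per_line)
  if rest.zero?
    sizes
  elsif rest < min_last && full > 0
    # Too few leftovers: fold them into the last full line.
    sizes[-1] += rest
    sizes
  else
    sizes << rest
  end
end

p group_sizes(50)  # the animals line => [9, 9, 9, 9, 9, 5]
p group_sizes(11)  # => [11]
```

Under this reading the longest possible line has 11 items; a solution that also merges a leftover of exactly 3 would reach 12.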

My idea was to first escape the spaces, then use Unix fold, and then use awk to fill the first column down.

echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _'  '=, '

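But it only considers character length (not item count), and it does not handle my special case: I don't want fewer than 3 items left hanging on the last line.
I think this is an elegant little puzzle. Any ideas?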


dxxyhpgq1#

With Perl, one way:

perl -wlnE'
    ($head, @items) = split /\s*[,=]\s*/; 
    while (@items) { 
        @elems = splice @items, 0, 9;
        if (@lines and @elems < 3) { $lines[-1] .= ", " . join ", ", @elems }
        else            { push @lines, join ", ", @elems }
    }
    say "$head = $_" for @lines; @lines = ()
' file

perl -wlnE'
    ($head, @items) = split /\s*[,=]\s*/; 
    push @lines, join ", ", splice @items, 0, 9  while @items; 
    $lines[-2] .= ", " . pop @lines  if @lines > 1 and 2 > $lines[-1] =~ tr/,//;
    say "$head = $_" for @lines; @lines = ()
' file

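Shown on multiple lines for readability; it can be copy-pasted into a bash terminal as is, or entered on a single line. Tested with an additional line of 11 (9+2) items.
Notes

  • Splitting on "," or "=" extracts the head word first, followed by the items on the line
  • splice removes and returns the first 9 elements; joining them with ", " forms a line to print, and this repeats until all elements are processed. If the last group has fewer than 3 elements, it is appended to the previous line to print instead
  • In the second version, all elements are processed first, and then the last line to print is checked, by counting its commas, to see whether it needs to be appended to the previous one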
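
For comparison, the same splice-and-merge idea can be sketched in Ruby, where Array#shift(n) plays the role of splice @items, 0, 9. The method name wrap_line is hypothetical, and the check on lines.empty? is an added assumption to cover a line that holds fewer than 3 items in total.

```ruby
# Hypothetical Ruby counterpart of the first Perl version.
def wrap_line(line, per_line = 9, min_last = 3)
  head, *items = line.chomp.split(/\s*[,=]\s*/)
  lines = []
  until items.empty?
    elems = items.shift(per_line)            # remove and return up to 9 items
    if elems.size < min_last && !lines.empty?
      lines[-1] += ", " + elems.join(", ")   # append a short tail to the previous line
    else
      lines << elems.join(", ")
    end
  end
  lines.map { |l| "#{head} = #{l}" }
end

puts wrap_line("numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11")
```

With 11 items the two leftovers are appended to the first group, so a single line is printed.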

eblbsuwk2#

You may consider this awk solution:

awk 'BEGIN {FS=OFS=" = "} {
   s = $2
   while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
      v = substr(s, RSTART, RLENGTH)
      sub(/, $/, "", v)
      print $1, v
      s = substr(s, RLENGTH+1)
   }
}' file

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

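Pay special attention to the regular expression used here: /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/
It matches 1 to 9 items, each followed by the ", " delimiter. It also has an optional part that matches exactly 3 more items anchored at the end of the line, so a final line can hold up to 12 items. As written, a line needs at least 4 items for every item to be printed.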
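
One detail worth spelling out: awk follows POSIX leftmost-longest matching, which is what lets the optional tail win over a shorter match. In a backtracking engine such as Ruby's, the same pattern would stop after the greedy {1,9} part, so a port has to try the end-anchored alternative first. A minimal sketch under that assumption follows; the names TAIL_FIRST and chunks are hypothetical, and like the awk pattern it lets a final line hold up to 12 items.

```ruby
# Hypothetical port: the end-anchored alternative is listed first, so a
# remainder of up to 12 items is taken whole before 9-item chunks are tried.
TAIL_FIRST = /\A(?:[^,]+, ){0,11}[^,]+\z|\A(?:[^,]+, ){9}/

def chunks(list)
  s = list.dup
  out = []
  while (m = s.match(TAIL_FIRST))
    out << m[0].sub(/, \z/, "")   # drop the trailing ", " of a full chunk
    s = m.post_match              # continue with the unmatched remainder
  end
  out
end

p chunks("a, b, c, d, e, f, g, h, i, j, k")
```

Because the first alternative accepts as few as one item, this version also prints lines with fewer than 4 items in full.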


rqcrx0a63#

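Using only your shown samples, please try the following awk program. Written and tested in GNU awk; it should work in any awk.
Here I have created an awk variable named numberOfFields, which holds the number of fields you want to print per line (separated by newlines, as in the shown samples).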

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

OR: the code above splits the printf statement across two lines for readability. If you would rather keep it on a single line, try the following:

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

Explanation: a detailed, annotated version of the above.

awk  -v numberOfFields="9" '            ##Start the awk program and set the variable numberOfFields to 9.
BEGIN{                                  ##Start of the BEGIN section.
  FS=", ";OFS=", "                      ##Set FS and OFS to comma followed by space.
}
{
  line=$0                               ##Copy the current record into line.
  sub(/ = .*/,"",line)                  ##Remove everything from " = " to the end of line, leaving only the header word.
  sub(/^[^ ]* =[^ ]* /,"")              ##Remove the header word and " = " from the start of the current record.
  for(i=1;i<=NF;i++){                   ##Loop over all fields.
                                        ##The printf conditions are explained below the code.
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{                                    ##Start of the END block.
  print ""                              ##Print a final newline.
}
'  Input_file                           ##Name of the input file.

Explanation of the printf conditions above:

(
  i%numberOfFields==0                   ##Check whether i modulo numberOfFields is 0; if TRUE:
    ?OFS $i ORS line" = "               ##print OFS $i ORS line " = " (comma-space, the field, a newline, then the header and " = ").
    :(i==1                              ##Otherwise, check whether i==1; if TRUE:
       ?line " = " $i                   ##print the header followed by " = " and then $i.
       :(i%numberOfFields>1?OFS $i:$i)  ##Otherwise, if i modulo numberOfFields is greater than 1 print OFS $i, else print just $i.
     )
)
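
The nested conditional may be easier to follow when written as one decision per item. The Ruby sketch below, with the hypothetical name wrap_by_modulo, produces the same grouping by 9 and, like the awk program, ignores the short-last-line rule. It assumes the line contains " = " and decides what goes in front of each item rather than what follows it, so a header is only written when an item comes after it.

```ruby
# Hypothetical restatement of the modulo logic.
def wrap_by_modulo(line, number_of_fields = 9)
  head, list = line.chomp.split(" = ", 2)
  out = +""
  list.split(", ").each_with_index do |item, idx|
    if (idx % number_of_fields).zero?
      out << "\n" unless idx.zero?   # close the previous output line
      out << head << " = " << item   # items 0, 9, 18, ... start a line
    else
      out << ", " << item
    end
  end
  out
end

puts wrap_by_modulo("planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan")
```

Here the tenth item, vulcan, ends up alone on a second line, which is exactly the case the original question wants to avoid.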

ecbunoof4#

One awk idea:

awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
      if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
         if ( i > max ) print newline
         newline=$1 "="
         pfx=""
      }
      newline=newline pfx $i
      pfx=","
  }
  print newline
}
' raw.dat

Sample data:

$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

With -v min=3 -v max=9 we get:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13

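Addressing OP's comment about using one-liners ...
While this awk script could certainly be jammed into a one-liner, I'm guessing OP would a) find it hard to edit/maintain and b) find it too easy to mess up if it has to be (re)typed over and over.
One (obvious?) idea is to wrap the awk code in a function, e.g.: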

splitme() {
    awk -F'[=,]' -v min="$1" -v max="$2" '
    { for (i=2; i<=NF; i++) {
          if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
             if ( i > max ) print newline
             newline=$1 "="
             pfx=""
          }
          newline=newline pfx $i
          pfx=","
      }
      print newline
    }' "${3:--}"
}

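Notes:

  • the min and max values are parameterized so they can be taken from the command line
  • the file reference is parameterized so input comes from a file named on the command line ($3) or from stdin (-)
  • OP can add more logic as needed to validate the input parameters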
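
The look-ahead test (NF-i+1 > min) is the heart of this answer: a new output line starts only when more than min items remain. Below is a Ruby sketch of the same single pass; the name split_line is hypothetical, and the rows.empty? check is an added assumption so that a very short line still gets its header. With min = 3, a leftover of exactly three items also stays on the previous line, as in the numbers example.

```ruby
# Hypothetical Ruby rendering of the awk loop: start a new line at every
# max-th item, but only if more than min items are still left.
def split_line(line, min: 3, max: 9)
  head, *items = line.chomp.split(/\s*[=,]\s*/)
  rows = []
  items.each_with_index do |item, k|   # k is 0-based
    remaining = items.size - k         # items left, including this one
    if rows.empty? || (k % max == 0 && remaining > min)
      rows << [item]
    else
      rows.last << item
    end
  end
  rows.map { |r| "#{head} = #{r.join(', ')}" }
end

puts split_line("numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12")
puts split_line("numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13")
```

The first call prints one line of 12 items; the second prints 9 items followed by 4.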

Whether called stand-alone against a file:

$ splitme 3 9 raw.dat

or called in a pipeline:

$ cat raw.dat | splitme 3 9

both generate:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13

ggazkfy85#

awk -F"[=,]" -v max="9" '{
        for(i=2; i<=NF; i+=max){
                row = ""
                for(j=i; j<=i+max-1; j++){
                        row=row $(j) ","
                }
                gsub(/,+$/, "", row)
                printf "%s=%s \n", $1, row
        }
    }' input_file

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers = 10, 11, 12, 13, 14, 15, 16
cars = mercedes benz, bmw, audi, vw, porsche, seat, skoda, opel, renault
cars = mazda, toyota, honda

svmlkihl6#

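Here are two Ruby solutions that process a single line; the variable str holds that line (in the example, the one beginning with 'animals = ...').

#1 Use a regular expression

The regular expression can be written in free-spacing mode to make it self-documenting.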

RGX =
/
\A         # match beginning of string
\w+        # match one or more word chars (e.g., "animals")
|          # or
[ ]*=[ ]*  # "=" preceded and followed by zero or more spaces
|          # or         
(?:        # begin a non-capture group
  [^,]+    # match one or more chars other than a comma
  ,[ ]*    # match a comma and zero or more spaces
){0,10}    # end non-capture group and execute 0-10 times
[^,]+      # match one or more chars other than a comma
\z         # match end of string
|          # or
(?:        # begin a non-capture group
 [^,]+     # match one or more chars other than a comma
 ,[ ]*     # match a comma and zero or more spaces
){9}       # end non-capture group and execute exactly 9 times
/x         # invoke free-spacing regex definition mode
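
One way the pattern might be applied, offered as an assumption rather than as the answer's own code, is a single String#scan whose first two matches are the head word and the separator. The sketch repeats the pattern in compact form so that it runs on its own, and uses a shorter sample line.

```ruby
# Compact copy of the pattern above; the scan-based usage is an assumption.
RGX = /\A\w+|[ ]*=[ ]*|(?:[^,]+,[ ]*){0,10}[^,]+\z|(?:[^,]+,[ ]*){9}/

str = "numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13"
headword, _, *lines = str.scan(RGX)

# Full chunks keep their trailing ", ", so strip it before printing.
lines.each { |l| puts "#{headword} = #{l.sub(/,[ ]*\z/, '')}" }
```

Alternation order matters here: the end-anchored alternative is tried before the fixed group of nine, so a remainder of at most 11 items is taken as one final line.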

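When executed on the example str, we find the following. (The listing shows groups of seven, which corresponds to a {7} quantifier rather than the {9} in the pattern above.)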

headword
  #=> "animals"
_
  #=> "="
lines
  #=> ["lizard, bird, bee, snake, whale, eagle, beetle, ",
       "mule, hare, goose, horse, mouse, pig, dog, ",
       "frog, bug, fish, duck, camel, squirrel, owl, ",
       "chicken, pigeon, lion, sheep, bear, spider, deer, ",
       "tiger, lobster, dinosaur, cat, goat, rat, cricket, ",
       "rabbit, elephant, crow, fox, donkey, monkey, butterfly, ",
       "crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab"]

Ruby has a convention of using the variable _ when its value is not used in subsequent calculations, mainly to signal this to the reader.

#2 Extract and group words

Stepping through the calculation, we obtain the following for the example:

headword
  #=> "animals"
words
  #=> ["lizard", "bird", ..., "horseshoe crab"]
groups
  #=> [["lizard", "bird", "bee", "snake", "whale", "eagle",
        "beetle", "mule", "hare"],
       ["goose", "horse", "mouse", "pig", "dog", "frog",
        "bug", "fish", "duck"],
       ["camel", "squirrel", "owl", "chicken", "pigeon", "lion",
        "sheep", "bear", "spider"],
       ["deer", "tiger", "lobster", "dinosaur", "cat", "goat",
        "rat", "cricket", "rabbit"],
       ["elephant", "crow", "fox", "donkey", "monkey", "butterfly",
        "crab", "leopard", "moth"],
       ["shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]]
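
The intermediate values above could be produced along these lines. This is a sketch under assumptions, not the answer's own code: split off the head word, slice the remaining words into groups of nine with each_slice, and merge the last group into the one before it when it has fewer than three elements. The method name group_words is hypothetical.

```ruby
# Hypothetical implementation; variable names follow the walk-through above.
def group_words(str, per_line = 9, min_last = 3)
  headword, *words = str.chomp.split(/[ ]*[=,][ ]*/)
  groups = words.each_slice(per_line).to_a
  # Merge a too-short final group into its predecessor, if there is one.
  groups[-2].concat(groups.pop) if groups.size > 1 && groups.last.size < min_last
  [headword, groups]
end

headword, groups = group_words("planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis")
groups.each { |g| puts "#{headword} = #{g.join(', ')}" }
```

With 11 planets the two leftovers are merged, so one line is printed.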

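Since the last element of groups contains more than two elements (it contains five), groups is not modified. If the last line were allowed to contain up to 14 elements (rather than 11), the last two groups would be merged into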

["elephant", "crow", "fox", "donkey", "monkey", "butterfly", "crab",
 "leopard", "moth", "shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]

uyto3xhc7#

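I spent some time revising my solution so that it works on both gawk and mawk, by performing the equivalent of $1 = $1 at the tail end of the regex chain.
$(NF!=NF=NF) expands internally to NF != (NF=NF), which is always false, so the whole expression is just $0, but with the effect of $1=$1 embedded in it.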

input ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
     2  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

 command ::

 [mg]awk '
 BEGIN {
     FS = (OFS = " = ") "*" 
   _=__ = (___="[^,]+")"[,]"
           gsub(".",_,__)
     __ = (__)_ "(("_")?("_")?"___"$)?"
 
      _ = ORS } gsub(__,"&"_ $1 OFS)+gsub("[,]"_,_)+sub((_)"+([^,]*)$","", $(NF!=NF=NF))' 

 output (mawk 1.3.4) ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
     2  animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
     3  animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
     4  animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
     5  animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
     6  animals = shark, salmon, shrimp, mosquito, horseshoe crab
     7  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

 output (gawk 5.1.1) ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
     2  animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
     3  animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
     4  animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
     5  animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
     6  animals = shark, salmon, shrimp, mosquito, horseshoe crab
     7  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
