VLOOKUP使用AWK在CSV文件中追加多个值作为新列

rqcrx0a6  于 2023-09-28  发布在  其他
关注(0)|答案(4)|浏览(99)

我有两个不超过1 GB的头的CSV文件。举例来说:

File1.csv,4列:

1,75.10,"1","1"
2,12.01,"0","4"
4,26.57,"1","1"
3,100.94,"1","4"

File2.csv,4列:

1,"D",0,5
1,"F",0,5
1,"N",1,3
2,"D",1,8
2,"F",0,6
2,"N",3,0
4,"I",,5
4,"F",2,9

只要File2.csv的第1列和File1.csv的第1列匹配,我就想将File1.csv的第2、3和4列追加到File2.csv -以创建具有7列的New.csv
预期输出(New.csv):

1,"D",0,5,75.10,"1","1"
1,"F",0,5,75.10,"1","1"
1,"N",1,3,75.10,"1","1"
2,"D",1,8,12.01,"0","4"
2,"F",0,6,12.01,"0","4"
2,"N",3,0,12.01,"0","4"
4,"I",,5,26.57,"1","1"
4,"F",2,9,26.57,"1","1"

解释每一步背后的概念对我很有帮助,因为到目前为止我都是自学成才的。
我对AWK非常陌生,到目前为止还很难找到解决方案。我会使用Excel,但我的文件太大了,我想继续学习AWK。
下面是我通过试验/错误得到的,我只将File1.csv的第二列追加到File2.csv的末尾。我不明白三进制部分是如何工作的,所以我真的很感激你的解释。

awk -F "," 'FNR==NR {a[$1]=$2; next} {print $0","(($1 in a)?a[$1]:"NA")}' File1.csv File2.csv > New.csv
tzcvj98z

tzcvj98z1#

由于您的文件很大,如果它们还没有排序,最好是先对它们进行排序(GNU sort实用程序足够智能,可以在内存有限的计算机上对大文件进行排序)。一旦文件被排序,你可以使用join(输出前缀为-|):

join -t, <(sort -nst, -k1,1 File2.csv) <(sort -nst, -k1,1 File1.csv)
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"

如果您绝对需要awk,那么一旦文件被排序,您将能够并行读取它们(getline),并避免在内存中存储大量数据。示例(使用GNU awk和macOS附带的awk进行测试,输出前缀为-|):

sort -nst, -k1,1 File1.csv > File1.sorted.csv
sort -nst, -k1,1 File2.csv | awk -v r=1 '
  {
    f2 = t2 = $0; sub(/,.*/, "", f2); sub(/[^,]*,/, ",", t2)
    while(r == 1 && f1 < f2 && (r = (getline < "File1.sorted.csv")) == 1) {
      f1 = t1 = $0; sub(/,.*/, "", f1); sub(/[^,]*,/, ",", t1)
    }
    print f2 t2 ((f1 == f2) ? t1 : "")
  }'
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"
  • 读取排序后的File2.csv的下一行,将第一个字段(第一个逗号之前)存储在f2中,将该行的其余部分(从第一个逗号开始)存储在t2中。
  • f1(初始化时为0)严格小于f2时,读取下一行已排序的File1.csv。如果读取成功,则将第一个字段存储在f1中,并将该行的其余部分存储在t1中。
  • 打印File2.csv的当前行,如果前两个字段匹配,则附加File1.csvt1)行的尾部。

注意:我们将getline结果存储在变量r(初始化为1)中,如果它不等于1,则跳过后续调用,以避免阅读超过文件末尾,这是低效的。

dl5txlt9

dl5txlt92#

这就是问题所在:

FNR==NR {a[$1]=$2; next}

File1中的记录具有2个以上的字段。
我会这样写:

awk -F, -v OFS=, '
  FNR == NR {t=$0; sub(/^[^,]+,/, "", t); f1[$1] = t; next}
  !($1 in f1) {f1[$1] = "NA"}
  {print $0, f1[$1]}
' File{1,2}.csv

我让t变量保存整个记录,然后删除第一个字段,让该变量保存所有其他字段,这些字段仍然用逗号连接。

njthzxwz

njthzxwz3#

鼓励的意见,为改进我的解决方案,我能够找到使用提示,在评论到目前为止。这也处理不匹配。

awk -F "," 'FNR==NR {a[$1]=$2","$3","$4; next} {print $0","(($1 in a)?a[$1]:"NA")}' File1.csv File2.csv > New.csv
628mspwn

628mspwn4#

mawk 'FNR < NR ? ORS = "," __[+$_] RS : NF = _*(__[+$_] = $NF)' FS='^[^,]+,' \
 \
<( echo '1,75.10,"1","1"
         2,12.01,"0","4"                       
         4,26.57,"1","1"                       
         3,100.94,"1","4"' ) <( echo '

1,"D",0,5
1,"F",0,5
1,"N",1,3
2,"D",1,8
2,"F",0,6
2,"N",3,0   
4,"I",,5    
4,"F",2,9' )
1,"D",0,5,75.10,"1","1"
1,"F",0,5,75.10,"1","1"
1,"N",1,3,75.10,"1","1"
2,"D",1,8,12.01,"0","4"
2,"F",0,6,12.01,"0","4"
2,"N",3,0,12.01,"0","4"
4,"I",,5,26.57,"1","1"
4,"F",2,9,26.57,"1","1"

相关问题