unix: list all pairs of names that appear together on a line, and count how often each pair occurs

Posted by 9jyewag0 on 2022-11-04 in Unix

I have the following file (2016.csv; the first few lines are shown below):

Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi

I use an awk loop to find all possible pairs of names that appear together on a line of the file above.
The output of this awk loop looks like this:

Zhichen Gong , Huanhuan Chen 
Zhichuan Huang , Tiantian Xie
Zhichuan Huang , Ting Zhu
Zhichuan Huang , Jianwu Wang
Zhichuan Huang , Qingquan Zhang
Zhifei Zhang,Yang Song
Zhifei Zhang,Wei Wang 0063
Zhifei Zhang,Hairong Qi
etc
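
The loop itself is not reproduced in this post. A minimal sketch that would produce output of this shape, assuming a nested loop over the comma-separated fields (the function name `list_pairs` is made up for illustration and is not the original code), might look like this:

```shell
# Hypothetical reconstruction, not the original loop:
# print every pair of names that share a line.
list_pairs() {
    awk -F, '
    {
        for (i = 1; i < NF; i++)            # first name of the pair
            for (j = i + 1; j <= NF; j++)   # every later name on the same line
                print $i " , " $j
    }' "$@"
}

# Usage on two lines taken from the sample file
pairs=$(printf '%s\n' \
    'Zhichen Gong,Huanhuan Chen' \
    'Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi' | list_pairs)
printf '%s\n' "$pairs"
```

A line with n names yields n*(n-1)/2 pairs, so the four-name line above contributes six of them.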

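This loop works well and finds every pair that appears together on a line of the initial file. The only thing I want to add is a counter next to each line of the awk output, showing how many times that pair occurs in the initial file.
For example, for the awk output above, I would like it to look like this: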

Zhichen Gong , Huanhuan Chen, 1
Zhichuan Huang , Tiantian Xie, 1
Zhichuan Huang , Ting Zhu, 2
Zhichuan Huang , Jianwu Wang, 1
Zhichuan Huang , Qingquan Zhang, 1
Zhifei Zhang,Yang Song, 1
Zhifei Zhang,Wei Wang 0063,1 
Zhifei Zhang,Hairong Qi,1

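Here the 1 in the first line (Zhichen Gong , Huanhuan Chen, 1) means that this pair of names appears once in the initial file.
I assume I only need to add a counter to the awk loop, but so far I have not been able to get it right.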

gblwokeq1#

Using OP's 11-line sample as our input:

$ cat 2016.csv
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi
Zhihao Huang,Hui Li,Xin Li,Wei He
Zhijun Yin,You Chen,Daniel Fabbri,Jimeng Sun,Bradley A. Malin
Zhipeng Huang 0001,Bogdan Cautis,Reynold Cheng,Yudian Zheng
Zhipeng Huang 0001,Yudian Zheng,Reynold Cheng,Yizhou Sun,Nikos Mamoulis,Xiang Li 0067
Zhiqiang Tao,Hongfu Liu,Sheng Li 0001,Yun Fu 0001
Zhiqiang Xu,Yiping Ke
Zhiyuan Chen 0001,Estevam R. Hruschka Jr.,Bing Liu 0001

A few tweaks to OP's current code keep track of the counts, then sort the output first by count and then by name:

awk '
BEGIN { FS=","; OFS=" , " }
      { for (i=1;i<NF;i++)
            for(j=i+1;j<=NF;j++)
                if   ($i > $j) k[$i][$j]++            # increment counter
                else           k[$j][$i]++            # increment counter
      }
END   { # to sort by count we will create a new 3-dimensional array with the count as the 1st dimension
        for (i in k)
            for (j in k[i]) {
                arr[k[i][j]][i][j]                    # arr[count][i][j]
                delete k[i][j]                        # delete old array entry to limit memory usage
            }
        PROCINFO["sorted_in"]="@ind_num_desc"         # sort 1st index by count/descending
        for (cnt in arr) {
            PROCINFO["sorted_in"]="@ind_str_asc"      # sort 2nd/3rd indices by name/ascending
            for (i in arr[cnt])
                for (j in arr[cnt][i])
                    print i,j,cnt
        }
      }
' 2016.csv

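Notes:

  • Requires GNU awk (gawk 4.0 or later) for arrays of arrays (k[i][j]) and PROCINFO["sorted_in"].
  • Assumes we have enough memory to hold the 3-dimensional array; then again ...
  • ... memory usage for these 2-/3-dimensional arrays should be considerably lower than in answers that use a compound index into a 1-dimensional array, i.e. ...
  • ... [bob][smith] and [bob][jones] require bob to be stored once in memory, while [bob,smith] and [bob,jones] require bob to be stored twice
  • OP's expected output mixes output separators; OFS=" , " matches OP's earlier edit; OP can change OFS as needed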
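
For comparison, the compound-index approach that the notes contrast with can be sketched as follows. It trades the memory savings for portability, since `k[$i, $j]` and `SUBSEP` are available in POSIX awk. The function name `count_pairs` and the small input are made up for illustration.

```shell
# Sketch: count pairs with a compound (1-dimensional) index; no gawk extensions needed.
count_pairs() {
    awk -F, -v OFS=' , ' '
    {
        for (i = 1; i < NF; i++)
            for (j = i + 1; j <= NF; j++)
                if ($i > $j) k[$i, $j]++    # key is $i SUBSEP $j
                else         k[$j, $i]++    # same key whichever order the names appear in
    }
    END {
        for (key in k) {
            split(key, name, SUBSEP)        # recover the two names from the key
            print name[1], name[2], k[key]
        }
    }' "$@"
}

# Illustrative input; the for-in order is unspecified, so sort for stable output
counts=$(printf '%s\n' 'a,b,c' 'c,a' 'e,d' | count_pairs | LC_ALL=C sort)
printf '%s\n' "$counts"
```

This assumes the names never contain the `SUBSEP` character (by default `\034`), which is a safe bet for ordinary text.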

This generates the following 61 lines of output:

Yudian Zheng , Reynold Cheng , 2
Zhichuan Huang , Ting Zhu , 2
Zhipeng Huang 0001 , Reynold Cheng , 2
Zhipeng Huang 0001 , Yudian Zheng , 2
Daniel Fabbri , Bradley A. Malin , 1
Estevam R. Hruschka Jr. , Bing Liu 0001 , 1
Jimeng Sun , Bradley A. Malin , 1
Jimeng Sun , Daniel Fabbri , 1
Qingquan Zhang , Jianwu Wang , 1
Reynold Cheng , Bogdan Cautis , 1
Reynold Cheng , Nikos Mamoulis , 1
Sheng Li 0001 , Hongfu Liu , 1
Tiantian Xie , Jianwu Wang , 1
Tiantian Xie , Qingquan Zhang , 1
Ting Zhu , Jianwu Wang , 1
Ting Zhu , Qingquan Zhang , 1
Ting Zhu , Tiantian Xie , 1
Wei He , Hui Li , 1
Wei Wang 0063 , Hairong Qi , 1
Xiang Li 0067 , Nikos Mamoulis , 1
Xiang Li 0067 , Reynold Cheng , 1
Xin Li , Hui Li , 1
Xin Li , Wei He , 1
Yang Song , Hairong Qi , 1
Yang Song , Wei Wang 0063 , 1
Yizhou Sun , Nikos Mamoulis , 1
Yizhou Sun , Reynold Cheng , 1
Yizhou Sun , Xiang Li 0067 , 1
You Chen , Bradley A. Malin , 1
You Chen , Daniel Fabbri , 1
You Chen , Jimeng Sun , 1
Yudian Zheng , Bogdan Cautis , 1
Yudian Zheng , Nikos Mamoulis , 1
Yudian Zheng , Xiang Li 0067 , 1
Yudian Zheng , Yizhou Sun , 1
Yun Fu 0001 , Hongfu Liu , 1
Yun Fu 0001 , Sheng Li 0001 , 1
Zhichen Gong , Huanhuan Chen , 1
Zhichuan Huang , Jianwu Wang , 1
Zhichuan Huang , Qingquan Zhang , 1
Zhichuan Huang , Tiantian Xie , 1
Zhifei Zhang , Hairong Qi , 1
Zhifei Zhang , Wei Wang 0063 , 1
Zhifei Zhang , Yang Song , 1
Zhihao Huang , Hui Li , 1
Zhihao Huang , Wei He , 1
Zhihao Huang , Xin Li , 1
Zhijun Yin , Bradley A. Malin , 1
Zhijun Yin , Daniel Fabbri , 1
Zhijun Yin , Jimeng Sun , 1
Zhijun Yin , You Chen , 1
Zhipeng Huang 0001 , Bogdan Cautis , 1
Zhipeng Huang 0001 , Nikos Mamoulis , 1
Zhipeng Huang 0001 , Xiang Li 0067 , 1
Zhipeng Huang 0001 , Yizhou Sun , 1
Zhiqiang Tao , Hongfu Liu , 1
Zhiqiang Tao , Sheng Li 0001 , 1
Zhiqiang Tao , Yun Fu 0001 , 1
Zhiqiang Xu , Yiping Ke , 1
Zhiyuan Chen 0001 , Bing Liu 0001 , 1
Zhiyuan Chen 0001 , Estevam R. Hruschka Jr. , 1

If the order of the output does not matter, the END{...} block can be reduced to:

END   { for (i in k)
            for (j in k[i])
                print i,j,k[i][j]
      }
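
With the reduced END block the output order is arbitrary. If an ordering like the one above (count descending, then names) is still wanted, one option is to pipe the result through sort(1). This is a sketch that assumes names never contain a comma, so the count is always the third comma-separated field; `sort_pairs` is a made-up helper name.

```shell
# Sketch: order "name , name , count" lines by count (descending), then by the two names.
sort_pairs() {
    # -t,       split fields on commas
    # -k3,3nr   numeric, reversed, on the count (leading blanks are ignored for -n)
    # -k1,1 -k2,2  break ties by first name, then second name
    LC_ALL=C sort -t, -k3,3nr -k1,1 -k2,2 "$@"
}

sorted=$(printf '%s\n' \
    'Zhichen Gong , Huanhuan Chen , 1' \
    'Zhichuan Huang , Ting Zhu , 2' \
    'Zhichuan Huang , Jianwu Wang , 1' | sort_pairs)
printf '%s\n' "$sorted"
```

In practice the awk command would feed it directly, for example `awk '...' 2016.csv | sort_pairs`.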

kx5bkwkv2#

To find all possible pairs of names together with their counts, you can use this awk solution:

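awk -F, -v OFS=" , " '
{
   for (i=1; i<NF; i++)
      for (j=i+1; j<=NF; j++)    # pair each name with every later name on the line
         ++fq[$i OFS $j]         # note: "A , B" and "B , A" are counted separately
}
END {
   for (i in fq) print i, fq[i]
}' file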

v6ylcynt3#

Using a sensible sample input file, so we can tell at a glance whether the script works because the expected output is obvious:

$ cat file
a,b,c
c,a
e,d

This will do what you want using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    for (i=1; i<NF; i++) {
        for (j=i+1; j<=NF; j++) {
            cnt[( $i < $j ? $i FS $j : $j FS $i )]++
        }
    }
}
END {
    for ( pair in cnt ) {
        print pair, cnt[pair]
    }
}
$ awk -f tst.awk file
a,b,1
a,c,2
d,e,1
b,c,1

Or, if you want it sorted:

$ awk -f tst.awk file | sort
a,b,1
a,c,2
b,c,1
d,e,1

With the sample input OP provided:

$ cat file2
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi

we get:

$ awk -f tst.awk file2 | sort
Hairong Qi,Wei Wang 0063,1
Hairong Qi,Yang Song,1
Hairong Qi,Zhifei Zhang,1
Huanhuan Chen,Zhichen Gong,1
Jianwu Wang,Qingquan Zhang,1
Jianwu Wang,Tiantian Xie,1
Jianwu Wang,Ting Zhu,1
Jianwu Wang,Zhichuan Huang,1
Qingquan Zhang,Tiantian Xie,1
Qingquan Zhang,Ting Zhu,1
Qingquan Zhang,Zhichuan Huang,1
Tiantian Xie,Ting Zhu,1
Tiantian Xie,Zhichuan Huang,1
Ting Zhu,Zhichuan Huang,2
Wei Wang 0063,Yang Song,1
Wei Wang 0063,Zhifei Zhang,1
Yang Song,Zhifei Zhang,1
