How to print lines that fall within a given range

0pizxfdo  posted on 2021-07-14  in  Java

I have two large files, as shown below:
f1:

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

f2:

chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025

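I want to print the lines from file1 if the start and end columns of file2 fall within the range given in file1 (columns 2 and 3) and chr is the same. So, for the dummy example files above, the desired output should be (only linc1320 falls within the range of the first line of file1, and linc2245 falls within the range of the third line of file1):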

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

I'm not a professional coder, but I have been using this code, manually changing the range based on file2:

awk -F ',' '$2<=3073300 && $3>=3074300 {print $1,$2,$3,$4,$5,$6,$7}' f1.csv

I have no particular preference for a programming language; both Python and awk would be helpful. Thanks for your help.


kmpatx3s1#

You can use this awk:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

awk '
BEGIN { FS = OFS = "," }
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1
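
Since the question also asks for Python, here is a minimal sketch of the same two-pass logic using only the standard library. It is a sketch rather than a tested replacement for the awk program: the function name `annotate` is made up for this illustration, the sample rows are copied from the question, and it assumes file2 has a header line followed by `chr,name,start,end` records. It keeps the strict comparisons (`>` and `<`) used in the awk code above.

```python
def annotate(file1_lines, file2_lines):
    """Append the first matching file2 record to each file1 line.

    `annotate` is a hypothetical helper name used only for this sketch.
    """
    # First pass: remember every file2 record, skipping the header row.
    intervals = []
    for number, line in enumerate(file2_lines):
        if number == 0:
            continue
        fields = line.rstrip("\n").split(",")
        intervals.append((fields[0], fields[1], int(fields[2]), int(fields[3])))

    # Second pass: extend a file1 line when a file2 interval on the same
    # chromosome lies strictly inside its [column 2, column 3] range.
    result = []
    for line in file1_lines:
        line = line.rstrip("\n")
        fields = line.split(",")
        chrom, low, high = fields[0], int(fields[1]), int(fields[2])
        for i_chrom, name, start, end in intervals:
            if i_chrom == chrom and start > low and end < high:
                line = f"{line},{name},{start},{end}"
                break  # first match wins, like `break` in the awk loop
        result.append(line)
    return result


F1 = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
    'chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]
F2 = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
    "chr1,linc8956,4410501,4406025",
]

annotated = annotate(F1, F2)
for row in annotated:
    print(row)
```

With real files, the two arguments could be open file objects, for example `annotate(open("file1"), open("file2"))`, since the function only iterates over lines.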

w8ntj3qf2#

EDIT: With the OP's edited input, you can try the following. It also works if file2 has more than 4 fields.

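awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  if(FNR>1){                          ##Skipping the header line of file2.
    chr[++count]=$1
    start[count]=$3
    end[count]=$4
    match($0,/,.*/)                   ##Matching from the first comma to the end of the line.
    val[count]=substr($0,RSTART+1,RLENGTH-1)  ##Keeping everything after that first comma.
  }
  next
}
{
  for(i=1;i<=count;i++){
    if($1==chr[i] && start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1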

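For the samples as originally shown, could you please try the following, which reads start and end from columns 2 and 3 of file2. Written and tested in GNU awk; it should work in any awk. See also anubhava's answer.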

awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  start[++count]=$2
  end[count]=$3
  val[count]=$0
  next
}
{
  for(i=1;i<=count;i++){
    if(start[i]>$2 && end[i]<$3){
      print $0 OFS val[i]
      next
    }
  }
}
1' file2 file1

Explanation: adding a detailed explanation of the above.

awk '                                  ##Starting awk program from here.
BEGIN{                                 ##Starting BEGIN section of this program from here.
  FS=OFS=","                           ##Setting FS and OFS as comma here. 
}
FNR==NR{                               ##Checking condition which will be true when file2 is being read.
  start[++count]=$2                    ##Creating start array with count variable as index and has $2 value in it.
  end[count]=$3                        ##Creating end array with count as index and value is $3.
  val[count]=$0                        ##Creating val array with index of count and value as $0.
  next                                 ##next will skip all further statements from here.
}
{
  for(i=1;i<=count;i++){               ##Running for loop till value of count here.
    if(start[i]>$2 && end[i]<$3){      ##Checking condition if start[i]>$2 AND end[i]<$3.
      print $0 OFS val[i]              ##Then printing current line with OFS, val here.
      next                             ##next will skip all further statements from here.
    }
  }
}
1                                      ##1 will print current line here.
' file2 file1                          ##Mentioning Input_file names here.
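
The program explained above compares only start and end, and it checks every stored file2 record for each line of file1. As a rough Python sketch of the same two-pass idea that also requires the chromosome to match, the file2 records can be grouped per chromosome so that each file1 line is compared only against intervals on its own chromosome. The names `load_intervals` and `annotate_lines` are invented for this illustration, and the sketch assumes file2 starts with a header line and has `chr,name,start,end` as its first four columns.

```python
from collections import defaultdict


def load_intervals(file2_lines):
    """Group file2 records by chromosome (hypothetical helper for this sketch)."""
    by_chrom = defaultdict(list)
    lines = iter(file2_lines)
    next(lines, None)  # skip the header row
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Keep start/end as integers plus the text to append (all columns after chr).
        by_chrom[fields[0]].append(
            (int(fields[2]), int(fields[3]), ",".join(fields[1:]))
        )
    return by_chrom


def annotate_lines(file1_lines, by_chrom):
    """Yield file1 lines, extended with the first enclosed same-chromosome record."""
    for line in file1_lines:
        line = line.rstrip("\n")
        fields = line.split(",")
        low, high = int(fields[1]), int(fields[2])
        for start, end, rest in by_chrom.get(fields[0], ()):
            if start > low and end < high:
                line = f"{line},{rest}"
                break
        yield line


F1 = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]
F2 = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
]

rows = list(annotate_lines(F1, load_intervals(F2)))
for row in rows:
    print(row)
```

In the sample data, linc2245 lies numerically inside the second line used here (3077253 to 3078322) but is on chr3, so that line is left unchanged. Because `annotate_lines` is a generator, a large file1 can be streamed line by line instead of being held in memory.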

fdbelqdn3#

Let's try solving this with pandas. First, read the csv files into pandas DataFrames:

import pandas as pd

f1 = pd.read_csv('file1.csv', header=None)
f2 = pd.read_csv('file2.csv')

>>> f1

      0        1        2        3                     4              5                     6
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1

>>> f2

    chr      name    start      end
0  chr1  linc1320  3073300  3074300
1  chr3  linc2245  3077270  3078250
2  chr1  linc8956  4410501  4406025

Now we can merge and filter the rows that satisfy the interval-inclusion condition, then join the filtered rows back onto f1:

m = (
    f1.reset_index()
    .merge(f2, left_on=0, right_on='chr')
    .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))
    .set_index('index')[['name', 'start', 'end']]
)

f3 = f1.join(m)

>>> f3

      0        1        2        3                     4              5                     6      name      start        end
0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: you can also save the resulting DataFrame `f3` with `f3.to_csv('file3.csv')`.
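
For reference, the pandas steps can be put together as one self-contained script. This is a sketch under a few assumptions that are not in the answer: the sample rows are read from in-memory strings instead of `file1.csv` and `file2.csv`, a boolean mask is used in place of `.where` so that non-matching pairs are dropped outright, only the first match per file1 row is kept, and the `start`/`end` columns are converted to pandas' nullable `Int64` type so they are written as `3073300` rather than `3073300.0`.

```python
import io

import pandas as pd

# Sample rows copied from the question; real code would pass file paths instead.
F1 = '''chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
'''
F2 = '''chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025
'''

f1 = pd.read_csv(io.StringIO(F1), header=None)
f2 = pd.read_csv(io.StringIO(F2))

# Pair every file1 row with every file2 row on the same chromosome.
pairs = f1.reset_index().merge(f2, left_on=0, right_on="chr")

# Keep only the pairs where the file2 interval lies inside the file1 interval.
inside = pairs[(pairs[1] <= pairs["start"]) & (pairs[2] >= pairs["end"])]

# Assumption: one annotation per file1 row is enough, so keep the first match.
m = inside.drop_duplicates("index").set_index("index")[["name", "start", "end"]]

f3 = f1.join(m)
# Nullable integers avoid the float formatting that NaN would otherwise force.
f3 = f3.astype({"start": "Int64", "end": "Int64"})

csv_text = f3.to_csv(index=False, header=False)
print(csv_text)
```

Two things differ from the awk output: `read_csv` consumes the double quotes around the IDs, and rows without a match end in three empty fields. Note also that the merge builds one row per file1 line and same-chromosome file2 record before filtering, so for very large inputs the intermediate frame may become much bigger than either file.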
