How to print a line if it falls within a given range

Posted by 0pizxfdo on 2021-07-14 in Java

I have two large files, as shown below:
f1:

    chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
    chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
    chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

f2:

    chr,name,start,end
    chr1,linc1320,3073300,3074300
    chr3,linc2245,3077270,3078250
    chr1,linc8956,4410501,4406025

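I want to append a file2 record to a file1 line if the start and end columns of file2 fall within the range given in file1 (columns 2 and 3) and the chr is the same. So, based on the dummy example files above, the desired output is as follows (only linc1320 qualifies: it falls within the range of the first line of file1; linc2245 falls within the range of the third line of file1, but its chr differs):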

    chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
    chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
    chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

I am not a professional coder, but I have been using this code, manually changing the range each time based on file2:

    awk -F ',' '$2<=3073300 && $3>=3074300 {print $1,$2,$3,$4,$5,$6,$7}' f1.csv

I have no particular preference for a specific programming language; both Python and awk would be helpful. Thanks for your help.

kmpatx3s #1

You can use this awk command:

    awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

    chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
    chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
    chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
    chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"

A more readable form:

    awk '
    BEGIN { FS = OFS = "," }
    FNR == NR {
       if (FNR > 1) {
          chr[++n] = $1
          id[n] = $2
          r1[n] = $3
          r2[n] = $4
       }
       next
    }
    {
       for (i=1; i<=n; ++i)
          if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
             $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
             break
          }
    } 1' file2 file1
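
Since the question also asks for Python, the same logic can be sketched in plain Python. This is only a rough equivalent, not a tested drop-in replacement: the function name `annotate` is made up for illustration, lines are split on bare commas (so it assumes no field contains a quoted comma), and it keeps the strict `>` / `<` comparison used in the awk condition.

```python
def annotate(file2_lines, file1_lines):
    """Append the first enclosed file2 record to each file1 line."""
    records = []
    for lineno, line in enumerate(file2_lines):
        line = line.rstrip("\n")
        if lineno == 0 or not line:      # skip the header and blank lines
            continue
        chrom, name, start, end = line.split(",")[:4]
        records.append((chrom, name, int(start), int(end)))

    out = []
    for line in file1_lines:
        line = line.rstrip("\n")
        if not line:                     # ignore blank lines
            continue
        fields = line.split(",")
        chrom, lo, hi = fields[0], int(fields[1]), int(fields[2])
        for r_chrom, name, start, end in records:
            # same test as the awk condition: r1[i] > $2 && r2[i] < $3
            if r_chrom == chrom and start > lo and end < hi:
                line = f"{line},{name},{start},{end}"
                break                    # keep only the first match
        out.append(line)
    return out


file1 = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]
file2 = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
]
result = annotate(file2, file1)
for row in result:
    print(row)
```

With real files, the two arguments could be open file objects, since the function only iterates over lines.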

w8ntj3qf #2

EDIT: Going by the OP's edited input, you could try the following. It also works when file2 has more than 4 fields.

    awk '
    BEGIN{
      FS=OFS=","
    }
    FNR==NR{
      chr[++count]=$1                   # file2 columns are chr,name,start,end
      start[count]=$3
      end[count]=$4
      match($0,/,.*/)                   # from the first comma to the end of the line
      val[count]=substr($0,RSTART+1)    # file2 record without its chr field
      next
    }
    {
      for(i=1;i<=count;i++){
        if($1==chr[i] && start[i]>$2 && end[i]<$3){
          print $0 OFS val[i]
          next
        }
      }
    }
    1' file2 file1

With your shown samples, could you please try the following. Written and tested in GNU awk; it should work in any awk. See also anubhava's answer.

    awk '
    BEGIN{
      FS=OFS=","
    }
    FNR==NR{
      chr[++count]=$1       # file2 columns are chr,name,start,end
      start[count]=$3
      end[count]=$4
      val[count]=$0
      next
    }
    {
      for(i=1;i<=count;i++){
        if($1==chr[i] && start[i]>$2 && end[i]<$3){
          print $0 OFS val[i]
          next
        }
      }
    }
    1' file2 file1

Explanation: a detailed, line-by-line explanation of the above.

    awk '                                            ##Starting awk program from here.
    BEGIN{                                           ##Starting BEGIN section of this program from here.
      FS=OFS=","                                     ##Setting FS and OFS as comma here.
    }
    FNR==NR{                                         ##Checking condition which will be true when file2 is being read.
      chr[++count]=$1                                ##Creating chr array with count variable as index and $1 as value.
      start[count]=$3                                ##Creating start array with count as index and value is $3.
      end[count]=$4                                  ##Creating end array with count as index and value is $4.
      val[count]=$0                                  ##Creating val array with index of count and value as $0.
      next                                           ##next will skip all further statements from here.
    }
    {
      for(i=1;i<=count;i++){                         ##Running for loop till value of count here.
        if($1==chr[i] && start[i]>$2 && end[i]<$3){  ##Checking condition if chr matches AND start[i]>$2 AND end[i]<$3.
          print $0 OFS val[i]                        ##Then printing current line with OFS, val here.
          next                                       ##next will skip all further statements from here.
        }
      }
    }
    1                                                ##1 will print current line here.
    ' file2 file1                                    ##Mentioning Input_file names here.
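
The loop above compares every file1 line against every stored file2 record, which may become slow for the large files mentioned in the question. One possible refinement, sketched here in Python rather than awk, is to group the file2 records by chromosome so that each line is checked only against records on its own chromosome. The helper names `index_by_chrom` and `first_enclosed` are hypothetical, the records are assumed to be already parsed into tuples, and the comparison is the same strict one as in the awk code.

```python
from collections import defaultdict


def index_by_chrom(records):
    """Group (chrom, name, start, end) tuples by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, name, start, end in records:
        by_chrom[chrom].append((name, start, end))
    return by_chrom


def first_enclosed(by_chrom, chrom, lo, hi):
    """Return the first record strictly inside (lo, hi) on chrom, else None."""
    for name, start, end in by_chrom.get(chrom, ()):
        if start > lo and end < hi:
            return name, start, end
    return None


# The three data rows of file2 from the question, already parsed.
records = [
    ("chr1", "linc1320", 3073300, 3074300),
    ("chr3", "linc2245", 3077270, 3078250),
    ("chr1", "linc8956", 4410501, 4406025),
]
by_chrom = index_by_chrom(records)

hit = first_enclosed(by_chrom, "chr1", 3073253, 3074322)   # first line of file1
miss = first_enclosed(by_chrom, "chr1", 3077253, 3078322)  # third line of file1
print(hit)
print(miss)
```

This only reduces the number of comparisons per line; if many records share a chromosome, sorting them or using an interval tree would be the next step, which is beyond this sketch.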

fdbelqdn #3

Let's try solving this with pandas. First, read the csv files into pandas DataFrames:

    import pandas as pd

    f1 = pd.read_csv('file1.csv', header=None)
    f2 = pd.read_csv('file2.csv')

    >>> f1
          0        1        2        3                     4              5                     6
    0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC
    1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
    2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1
    3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA
    4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1

    >>> f2
        chr      name    start      end
    0  chr1  linc1320  3073300  3074300
    1  chr3  linc2245  3077270  3078250
    2  chr1  linc8956  4410501  4406025

Now we can merge the two frames, filter the rows that satisfy the interval-inclusion condition, and then join the filtered rows back onto f1:

    m = (
        f1.reset_index()
          .merge(f2, left_on=0, right_on='chr')
          .where(lambda x: x[1].le(x['start']) & x[2].ge(x['end']))
          .set_index('index')[['name', 'start', 'end']]
    )

    f3 = f1.join(m)

    >>> f3
          0        1        2        3                     4              5                     6      name      start        end
    0  chr1  3073253  3074322  gene_id  ENSMUSG00000102693.1      gene_type                   TEC  linc1320  3073300.0  3074300.0
    1  chr1  3074253  3075322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
    2  chr1  3077253  3078322  gene_id  ENSMUSG00000102693.1  transcript_id  ENSMUST00000193812.1       NaN        NaN        NaN
    3  chr1  3102916  3103025  gene_id  ENSMUSG00000064842.1      gene_type                 snRNA       NaN        NaN        NaN
    4  chr1  3105016  3106025  gene_id  ENSMUSG00000064842.1  transcript_id  ENSMUST00000082908.1       NaN        NaN        NaN

PS: you can also save the resulting DataFrame `f3` with `f3.to_csv('file3.csv')`.