Python: how to run only certain lines of a text file through a dictionary while keeping everything else the same

cotxawn7  posted on 2023-01-16 in Python

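For my computational biology final project, I need to take a DNA sequence, transcribe it to RNA, and then translate that into a protein. Here is an example (2dna.fasta, the file my code currently runs on):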

>ENST00000632684.1
GGGACAGGGGGC
>ENST00000434970.2
CCTTCCTAC

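Every line that starts with '>' is metadata, and the other lines are the sequence itself. I can handle this file by translating every other line, but for the second part of the final project the file looks like this: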

>ENST00000651352.1 cds chromosome:GRCh38:3:3126963:3148101:1 gene:ENSG00000072756.17 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay gene_symbol:TRNT1 description:tRNA nucleotidyl transferase 1 [Source:HGNC Symbol;Acc:HGNC:17341]
ATGCTGAGGTGCCTGTATCATTGGCACAGGCCAGTGCTGAACCGTAGGTGGAGTAGGCTG
TGCCTTCCGAAGCAGTATCTATTCACAATGAAGTTGCAGTCTCCCGAATTCCAGTCACTT
TTCACAGAAGGACTGAAGAGTCTGACAGAATTATTTGTCAAAGAGAATCACGAATTAAGA
ATAGCAGGAGGAGCAGTGAGGGATTTATTAAATGGAGTAAAGCCTCAGGATATAGATTTT
GCCACCACTGCTACCCCTACTCAAATGAAGGAGATGTTTCAGTCGGCTGGGATTCGGATG
ATAAACAACAGAGGAGAAAAGCACGGAACAATTACTGCCAGGGTTTTGATGGCACTTTAT
TTGACTACTTTAATGGTTATGAAGATTTAA
>ENST00000434583.5 cds chromosome:GRCh38:3:3126965:3150879:1 gene:ENSG00000072756.17 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay gene_symbol:TRNT1 description:tRNA nucleotidyl transferase 1 [Source:HGNC Symbol;Acc:HGNC:17341]

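My first solution was to delete the lines that start with '>', but that drops the metadata I need. Is it possible to separate the metadata from the DNA, run only the DNA through the dictionaries, and then put the result back between the metadata lines, in the same positions the DNA occupied before?
As mentioned above, I tried deleting the lines that start with '>', but that only works if I don't need the metadata, and I do need it. I also tried reading only the lines that start with 'ATG', since most DNA sequences start with ATG, but in the second part of the project roughly 100 of the DNA lines do not start with ATG.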

import sys
file = open('2dna.fasta' , 'r')

DNASequence = ''
for lines in file.readlines():
    if not (lines.startswith('>')):
        DNASequence = DNASequence +  lines 
    
DNASequence = DNASequence.replace('\n', '')
print('The original DNA sequence is', DNASequence)

CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
final = ""

for letter in DNASequence:    
    final += CompletmentDict[letter]
    
print ("Your completement is: ", final)

final2 = "" 
    
DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

for letters in final:
    final2 +=  DNATORNADICT[letters]

print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

final3 = ""

for p in range(0,len(final2),3):
    myKey = final2[p:p+3]
    final3 += rna2protein.get(myKey)
    
print("Resulting protein is: ", final3)

proteinSeq = open('proteinSeq.txt', 'w')
proteinSeq.write(final3)
proteinSeq.close()

Its output is:

The original DNA sequence is GGGACAGGGGGCCCTTCCTAC
Your completement is:  CCCTGTCCCCCGGGAAGGATG
Your Final DNA TO RNA TRANSCRIPTION IS: GGGACAGGGGGCCCUUCCUAC
Resulting protein is:  GTGGPSY

and my result file contains:

GTGGPSY

But I want it to look like this:

>ENST00000632684.1
GTGG
>ENST00000434970.2
PSY

How can I do this? If you need any clarification on what I mean, let me know.

ttp71kqs1#

I made a few adjustments to your code:

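file = open('2dna.fasta' , 'r')
# 'w' so that a rerun does not append to the results of an earlier run
proteinSeq = open('proteinSeq.txt', 'w')

metaData = []
DNAs = []

DNASequence = ''
for lines in file.readlines():
    if not (lines.startswith('>')):
        # strip the newline so multi-line sequences join into one string
        DNASequence = DNASequence + lines.strip()
    else:
        # a new header closes the previous record, if there was one
        if metaData:
            DNAs.append(DNASequence)
        metaData.append(lines.split(" ")[0].strip())
        DNASequence = ""
# the last record has no header after it, so store it here
if metaData:
    DNAs.append(DNASequence)
file.close()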

def showData(metaData,DNASequence):
    print('The original DNA sequence is', DNASequence)

    CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
    final = ""

    for letter in DNASequence:    
        final += CompletmentDict[letter]
        
    print ("Your completement is: ", final)

    final2 = "" 
        
    DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

    for letters in final:
        final2 +=  DNATORNADICT[letters]

    print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

    rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
    'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
    'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
    'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
    'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
    'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
    'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
    'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

    final3 = ""

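    for p in range(0,len(final2),3):
        myKey = final2[p:p+3]
        # the '' default avoids a TypeError on a trailing partial codon
        final3 += rna2protein.get(myKey, '')
        
    print("Resulting protein is: ", final3)
    # header and protein each go on their own line, with no leading spaces
    proteinSeq.write(metaData.strip() + "\n" + final3 + "\n")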

for i in range(len(DNAs)):
    showData(metaData[i],DNAs[i])

proteinSeq.close()

sq1bmfud2#

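I don't have any domain-specific knowledge of what you need to accomplish, so let me know if I have misunderstood something.
It looks like each line of metadata is followed by the DNA sequence that belongs to it. If so, I think it would help to split these into separate entries that you can process in order.
Assuming the '>' character appears only at the start of a metadata line and nowhere else in the file, you can use it as a delimiter to split the string: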

my_sequences = []
with open('myfile.txt' , 'r') as file:
  # split along the metadata delimiter
  for entry in file.read().split('>')[1:]:
    # split the entry at the first newline character to separate the metadata and sequence
    [meta, seq] = (entry.split("\n", 1))
    my_sequences.append({"meta":meta, "seq": seq})

with open('proteinSeq.txt', 'w') as proteinSeq:
  for entry in my_sequences:
    DNASequence = entry["seq"].replace("\n", "")
    print('The original DNA sequence is', DNASequence)

    CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
    final = ""

    for letter in DNASequence:    
        final += CompletmentDict[letter]
    
    print ("Your completement is: ", final)

    final2 = "" 
        
    DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}

    for letters in final:
        final2 +=  DNATORNADICT[letters]

    print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)

    rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
    'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
    'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
    'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
    'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
    'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
    'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
    'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
    'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
    'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
    'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
    'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
    'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
    'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
    'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

    final3 = ""

    for p in range(0,len(final2),3):
        myKey = final2[p:p+3]
        # the '' default avoids a TypeError on a trailing partial codon
        final3 += rna2protein.get(myKey, '')
        
    print("Resulting protein is: ", final3)
    proteinSeq.write(f">{entry['meta']}\n")
    proteinSeq.write(f"{final3}\n")
  # no explicit close() is needed; the with block closes the file

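You end up with a list of dictionaries that you can loop over, where each dictionary has one key/value pair for the metadata and one for the sequence that follows it.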
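
To make the shape of that list concrete, here is a minimal sketch of the splitting step on its own, run on the two-record sample from the question instead of a file. The function name split_records is hypothetical, and using str.partition in place of split("\n", 1) is an assumption made here so that a header with no sequence after it does not raise an error.

```python
def split_records(text):
    """Split FASTA-style text into a list of {"meta", "seq"} dictionaries."""
    records = []
    # anything before the first '>' is dropped, like the [1:] slice above
    for entry in text.split(">")[1:]:
        # partition never raises, even when no newline follows the header
        meta, _, seq = entry.partition("\n")
        # join a multi-line sequence into a single string
        records.append({"meta": meta, "seq": seq.replace("\n", "")})
    return records


# the two records from the question's 2dna.fasta example
sample = ">ENST00000632684.1\nGGGACAGGGGGC\n>ENST00000434970.2\nCCTTCCTAC\n"
records = split_records(sample)
for record in records:
    print(record["meta"], record["seq"])
```

Each dictionary can then be translated independently and written back as a header line followed by the protein line.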

vyswwuz23#

If I understand correctly what you want (based on your expected output), this does the job:

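  • I wrapped the DNA conversion code in a function.
  • The main part of the code checks for metadata lines; when it finds one, it converts the previous DNA sequence (if there is one) and writes the protein to the file, then writes the metadata line and moves on to the next DNA block.

def treat_DNA(seq):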
        print('The original DNA sequence is', seq)
      
        CompletmentDict = {'A':'T', 'T':'A', 'G':'C', 'C' : 'G'}
        final = ""
        for letter in seq:    
            final += CompletmentDict[letter]  
        print ("Your completement is: ", final)
      
        final2 = ""   
        DNATORNADICT = {'A':'U', 'T':'A', 'G':'C', 'C' : 'G'}
        for letters in final:
            final2 +=  DNATORNADICT[letters]
        print("Your Final DNA TO RNA TRANSCRIPTION IS: " + final2)
      
        rna2protein = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L',
        'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S',
        'UAU':'Y', 'UAC':'Y', 'UAA':'', 'UAG':'',
        'UGU':'C', 'UGC':'C', 'UGA':'', 'UGG':'W',
        'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
        'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P',
        'CAU':'H', 'CAC':'H', 'CAA':'Q', 'CAG':'Q',
        'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R',
        'AUU':'I', 'AUC':'I', 'AUA':'I', 'AUG':'M',
        'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
        'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K',
        'AGU':'S', 'AGC':'S', 'AGA':'R', 'AGG':'R',
        'GUU':'V', 'GUC':'V', 'GUA':'V', 'GUG':'V',
        'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A',
        'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
        'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}
        final3 = ""
        for p in range(0,len(final2),3):
            myKey = final2[p:p+3]
            # the '' default avoids a TypeError on a trailing partial codon
            final3 += rna2protein.get(myKey, '')
        print("Resulting protein is: ", final3)
      
        with open('proteinSeq.txt', 'a') as file:
            file.write(final3+'\n')
    
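# read the input once and close it before writing any output
with open('2dna.fasta' , 'r') as fasta:
    lines = fasta.readlines()

DNASequence = ''
for line in lines:
    if line.startswith('>'):
        # a header ends the previous DNA block, so translate that block first
        if DNASequence:
            treat_DNA(DNASequence)
        DNASequence = ''
        with open('proteinSeq.txt', 'a') as out:
            out.write(line)
    else:
        DNASequence += line.strip()
# translate the final block, which has no header after it
if DNASequence:
    treat_DNA(DNASequence)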
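
The "flush the previous block when a header appears" loop described above can also be isolated as a generator, which may be easier to test than code that writes to a file as it goes. This is only a sketch: iter_records is a hypothetical name, and the sample lines are the question's two records with the second sequence split across two lines to show the joining.

```python
def iter_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # a new header closes the previous record, if there is one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # sequence lines are collected until the next header
            chunks.append(line.strip())
    # the final record has no header after it, so emit it here
    if header is not None:
        yield header, "".join(chunks)


sample_lines = [
    ">ENST00000632684.1\n",
    "GGGACAGGGGGC\n",
    ">ENST00000434970.2\n",
    "CCTT\n",
    "CCTAC\n",
]
pairs = list(iter_records(sample_lines))
for header, sequence in pairs:
    print(header, sequence)
```

Because it accepts any iterable of lines, the same function should also work when passed an open file object, without reading the whole file into memory first.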

2ic8powd4#

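If I understood correctly, you could keep two versions of the string, one with the metadata and one without (if you really need one without), keeping the newline characters "\n" in both. Then go through each line and check whether it starts with '>'. If it does, add the line as it is, without running it through the dictionary; if not, go through each character in the line.
Also, consider giving your variables better names ;)
This does not exactly match the example you gave, but it should point you in the right direction: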

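# assumes `dna` holds the whole file as one string, and that DNATORNADICT
# and rna2protein are the dictionaries from the question
rna = ""

# read through the dna string and replace each character with its rna counterpart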
for line in dna.split("\n"):
    if not line.startswith(">"):
        for char in line:
            rna += DNATORNADICT[char]
        rna += "\n"
    else:
        rna += line
        rna += "\n"

protein = ""

# read through the rna string and replace each codon with its protein counterpart
for line in rna.split("\n"):
    if line.startswith(">"):
        protein += line + "\n"
    else:
        for i in range(0, len(line), 3):
            codon = line[i:i+3]
            protein += rna2protein.get(codon, "")  # "" default: .get alone returns None for a partial codon
        protein += "\n"

print(protein)
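
For reference, a self-contained sketch that combines the ideas above and reproduces the output the question asks for. Several things here are assumptions and not part of the original code: the names translate_dna and translate_fasta are hypothetical, the codon table is generated from the standard genetic code in UCAG order instead of being typed out, and the DNA is transcribed with a direct T -> U replacement, which gives the same RNA as the question's complement-then-transcribe steps. Stop codons map to '' as in the question's rna2protein, and each record's lines are joined before translation so a codon is never split across a line break.

```python
from itertools import product

# Standard genetic code in UCAG order; '*' marks the three stop codons,
# which are mapped to '' to mirror the question's rna2protein dictionary.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa.replace("*", "")
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}


def translate_dna(dna):
    """Transcribe DNA to RNA (T -> U) and translate it codon by codon."""
    rna = dna.upper().replace("T", "U")
    # .get(..., "") skips a trailing partial codon or an unknown base
    return "".join(CODON_TABLE.get(rna[i:i + 3], "") for i in range(0, len(rna), 3))


def translate_fasta(text):
    """Return FASTA text with headers kept and each sequence block translated."""
    out, chunks = [], []

    def flush():
        # translate the lines collected since the last header, if any
        if chunks:
            out.append(translate_dna("".join(chunks)))
            chunks.clear()

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line)  # metadata lines pass through unchanged
        elif line.strip():
            chunks.append(line.strip())
    flush()  # the last block has no header after it
    return "\n".join(out) + "\n"


sample = ">ENST00000632684.1\nGGGACAGGGGGC\n>ENST00000434970.2\nCCTTCCTAC\n"
result = translate_fasta(sample)
print(result, end="")
```

To run it on a real file, read 2dna.fasta into a string, pass it to translate_fasta, and write the returned text to proteinSeq.txt.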
