问题描述
我有这个问题,我们需要编写一个代码,它需要一个蛋白质 fasta 文件和蛋白质序列标识符,并计算 fasta 文件中序列的所有可能的 RNA 组合,条件是组合总数应该小于 5000。
我首先制作了一个 RNA 密码子字典,然后我制作了一个函数,将 fasta 文件的元素(氨基酸)放入一个列表中,然后我尝试从该列表中进行组合,但是我得到一个错误,我尝试过但不知道问题出在哪里,如果有人可以检查代码并告诉我出了什么问题,我将不胜感激。
import itertools
from Bio import SeqIO
RNA_codon_table = {
'A': ('GCU','GCC','GCA','GCG'),'C': ('UGU','UGC'),'D': ('GAU','GAC'),'E': ('GAA','GAG'),'F': ('UUU','UUC'),'G': ('GGU','GGC','GGA','GGG'),'H': ('CAU','CAC'),'I': ('AUU','AUC','AUA'),'K': ('AAA','AAG'),'L': ('UUA','UUG','CUU','CUC','cua','CUG'),'M': ('AUG',),'N': ('AAU','AAC'),'P': ('CCU','CCC','CCA','CCG'),'Q': ('CAA','CAG'),'R': ('CGU','CGC','CGA','CGG','AGA','AGG'),'S': ('UCU','UCC','UCA','UCG','AGU','AGC'),'T': ('ACU','ACC','ACA','ACG'),'V': ('GUU','GUC','GUA','GUG'),'W': ('UGG','Y': ('UAU','UAC'),}
def protein_fasta (protein_file):
protein_sequence = []
protein = SeqIO.parse(protein_file,format = 'fasta')
for Seqrecord in protein:
protein_sequence.append(Seqrecord.seq)
print (protein_sequence)
for seq in protein_sequence:
codons = [ list(RNA_codon_table[key]) for key in protein_sequence ]
print(list(itertools.product(codons)))
对不起,我不知道如何附加 fasta 文件,但这是里面的序列:
>seq_compl complete sequence
IEEATHMTPCYELHglrWVQIQDYAINVMQCL
这是我得到的错误:
KeyError Traceback (most recent call last)
<ipython-input-65-3dd46947c505> in <module>
----> 1 all_combinations ('short_protein.fasta')
<ipython-input-64-45a50fffc1d9> in all_combinations(protein_file)
5 protein_sequence.append(Seqrecord.seq)
6
----> 7 codons = [ list(RNA_codon_table[key]) for key in protein_sequence
]
8 print(list(itertools.product(codons)))
<ipython-input-64-45a50fffc1d9> in <listcomp>(.0)
5 protein_sequence.append(Seqrecord.seq)
6
----> 7 codons = [ list(RNA_codon_table[key]) for key in protein_sequence
]
8 print(list(itertools.product(codons)))
KeyError: Seq('IEEATHMTPCYELHglrWVQIQDYAINVMQCL')
解决方法
根据您的示例,protein_sequence 变量当前仅在 protein_fasta 函数的局部范围内声明。
您需要先将此函数的结果分配给一个变量,然后才能对其进行迭代。
例如,将您的印刷品切换为退货:
def protein_fasta (protein_file):
protein_sequence = []
protein = SeqIO.parse(protein_file,format = 'fasta')
for Seqrecord in protein:
protein_sequence.append(Seqrecord.seq)
return protein_sequence
并确保调用并分配函数的结果:
protein_sequence = protein_fasta(protein_file)
现在你有了可以迭代的东西。
我发现你的 for 循环还有一个问题。你没有对 seq 做任何事情。在这种情况下,可能应该将 protein_sequence 换成 seq。我还取出了包装 RNA_codon_table 的列表,因为我认为在这种情况下不需要它:
for seq in protein_sequence:
codons = [ RNA_codon_table[key] for key in seq ]
print(list(itertools.product(*codons)))
,
您的蛋白质串将产生数十亿种组合:
from itertools import product,islice
def protGen(proteins):
for codons in product(*(RNA_codon_table[P] for P in proteins)):
yield "".join(codons)
计数组合:
proteins = "IEEATHMTPCYELHGLRWVQIQDYAINVMQCL"
count = 1
for P in proteins: count *= len(RNA_codon_table[P])
print(count) # 37,572,373,905,408 combinations
输出:
for protSeq in islice(protGen(proteins),500): # first 500
print(protSeq)
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUA