Problem description
I have a protein sequence:
sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'
and a codon table that lists the possible codons for each amino acid:
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'), 'C': ('TGT', 'TGC'), 'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'), 'F': ('TTT', 'TTC'), 'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'H': ('CAT', 'CAC'), 'I': ('ATT', 'ATC', 'ATA'), 'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'), 'M': ('ATG',),
    'N': ('AAT', 'AAC'), 'P': ('CCT', 'CCC', 'CCA', 'CCG'), 'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'), 'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',), 'Y': ('TAT', 'TAC'),
}
Then I wrote a loop that collects, for each amino acid in the sequence, a tuple of its possible codons:
tRNA = []
for residue in sequence_protein:
    # One tuple of synonymous codons per residue
    tRNA.append(codon_table[residue])
which gives this output:
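[('ATT', 'ATC', 'ATA'), ('GAA', 'GAG'), ('GAA', 'GAG'), ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ACT', 'ACC', 'ACA', 'ACG'), ('CAT', 'CAC'), ('ATG',), ('ACT', 'ACC', 'ACA', 'ACG'),
 ('CCT', 'CCC', 'CCA', 'CCG'), ('TGT', 'TGC'), ('TAT', 'TAC'), ('GAA', 'GAG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'), ('CAT', 'CAC'), ('GGT', 'GGC', 'GGA', 'GGG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'), ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
 ('TGG',), ('GTT', 'GTC', 'GTA', 'GTG'), ('CAA', 'CAG'), ('ATT', 'ATC', 'ATA'),
 ('CAA', 'CAG'), ('GAT', 'GAC'), ('TAT', 'TAC'), ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ATT', 'ATC', 'ATA'), ('AAT', 'AAC'), ('GTT', 'GTC', 'GTA', 'GTG'), ('ATG',),
 ('CAA', 'CAG'), ('TGT', 'TGC'), ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG')]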
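Is there a way to compute all possible codon combinations for the sequence (essentially the Cartesian product of the elements of these tuples)? And can I also count the number of combinations without generating the sequences first?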
I tried using the product function from itertools, but my notebook crashed :s
from itertools import product
combs = []
for a in product(*tRNA):
    combs.append(a)  # stores every combination in memory
    print(a)
Solution
To compute the total number of combinations:
sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'
import numpy as np
total_number_combinations = np.prod([len(codon_table[aa]) for aa in sequence_protein])
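As a cross-check, the same count can be sketched with the standard library only. This is not part of the original answer: it assumes Python 3.8+ for math.prod, and it uses Python integers so the result does not depend on NumPy's default integer width.

```python
import math

# Codon table and sequence copied from the question.
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'), 'C': ('TGT', 'TGC'), 'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'), 'F': ('TTT', 'TTC'), 'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'H': ('CAT', 'CAC'), 'I': ('ATT', 'ATC', 'ATA'), 'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'), 'M': ('ATG',),
    'N': ('AAT', 'AAC'), 'P': ('CCT', 'CCC', 'CCA', 'CCG'), 'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'), 'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',), 'Y': ('TAT', 'TAC'),
}
sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'

# Multiply the number of synonymous codons of every residue.
# Python ints are arbitrary precision, so this cannot overflow.
total = math.prod(len(codon_table[aa]) for aa in sequence_protein)
print(total)  # 37572373905408, i.e. 2**34 * 3**7
```

With roughly 3.8e13 combinations, storing every one of them in a list is not feasible, which would explain the crash described in the question.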
To generate all possible combinations:
The most elegant way is itertools:
from itertools import product
tRNA = [codon_table[aa] for aa in sequence_protein]
for i in product(*tRNA):
    # ...do whatever you have to do with each combination
    pass
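Since product is lazy, you can also look at just the first few combinations without building the rest. The sketch below is only an illustration: toy_table and toy_peptide are made-up names, and the table is a small subset of the real one so the output stays readable.

```python
from itertools import islice, product

# Hypothetical toy subset of the codon table, for illustration only.
toy_table = {'M': ('ATG',), 'I': ('ATT', 'ATC', 'ATA'), 'E': ('GAA', 'GAG')}
toy_peptide = 'MIE'  # made-up short peptide

choices = [toy_table[aa] for aa in toy_peptide]

# islice pulls only the first four tuples from the lazy product;
# the remaining combinations are never generated.
first_four = [''.join(codons) for codons in islice(product(*choices), 4)]
print(first_four)
# ['ATGATTGAA', 'ATGATTGAG', 'ATGATCGAA', 'ATGATCGAG']
```

The rightmost position varies fastest, so the last residue cycles through its codons before the one before it changes.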
But you can also use a custom function. Just use yield, so that you do not generate all the sequences at once and avoid memory problems.
import itertools
# One tuple of synonymous codons per residue (same as tRNA above)
list_codons = [codon_table[aa] for aa in sequence_protein]
counter = 0
max_proc = 1000000
list_seq = []
for x in itertools.product(*list_codons):
    counter += 1
    if counter % max_proc == 0:
        # Do your stuff with the current slice, then clear the list
        list_seq = []
    list_seq.append(x)
# Total number of combinations seen, and the last one generated
print(counter)
print(x)
That's it, no more memory problems.
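The custom function with yield mentioned above is not shown in the answer, so here is one possible sketch of it. The name codon_batches and its parameters are my own, and the toy input is a small made-up example rather than the real sequence.

```python
from itertools import islice, product

def codon_batches(codon_choices, batch_size):
    """Yield lists of at most batch_size DNA strings, one batch at a time."""
    combos = product(*codon_choices)  # lazy iterator, nothing is stored yet
    while True:
        # Pull the next batch_size combinations from the iterator.
        batch = [''.join(c) for c in islice(combos, batch_size)]
        if not batch:
            return  # iterator exhausted
        yield batch

# Toy input: 1 * 3 * 2 = 6 combinations in total.
toy_choices = [('ATG',), ('ATT', 'ATC', 'ATA'), ('GAA', 'GAG')]
batch_sizes = [len(batch) for batch in codon_batches(toy_choices, 4)]
print(batch_sizes)  # [4, 2]
```

Only one batch lives in memory at a time, and the final partial batch is yielded as well, so nothing is left unprocessed after the loop.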