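Problem description
I am trying to code a theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P (i.e., trypsin cleaves a protein sequence after every K or R, unless that K or R is followed by a P).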
Example: cleaving the sequence "MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK" should produce these 4 sequences (peptides):
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
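Note that there is no cleavage after the K in the second peptide (because a P follows the K), and no cleavage after the R in the third peptide (because a P follows the R).
I have written this code in Python, but it does not work well. Is there a better way to implement this regular expression?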
import os
import re

# Open the file and read it line by line.
myprotein = open(raw_input('Enter input filename: '), 'r')
if os.path.exists("trypsin_digest.txt"):
    os.remove("trypsin_digest.txt")
outfile = open("trypsin_digest.txt", 'w+')
for line in myprotein:
    protein = line.rstrip()
    protein = re.sub('(?<=[RK])(?=[^P])', '', protein)
    for peptide in protein:
        outfile.write(peptide)
print 'results written to:\n', os.getcwd() + '\ trypsin_digest.txt'
This is how I got it working:
import re

myprotein = open(raw_input('Enter input filename: '), 'r')
my_protein = []
for line in myprotein:
    my_protein.append(line.rstrip('\n'))
my_pro = ''.join(my_protein)
# cleaves sequence
peptides = re.sub(r'(?<=[RK])(?=[^P])', '\n', my_pro)
print peptides
Protein sequences:
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
Output (trypsin cleavage sites), i.e. the peptides:
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
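Solution
Regular expressions are nice, but here is a solution that uses plain Python. Since you are looking for subsequences within the bases, it makes sense to build this as a generator that yields the fragments.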
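example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

def trypsin(bases):
    sub = ''
    while bases:
        k, r = bases.find('K'), bases.find('R')
        if k >= 0 and r >= 0:
            cut = min(k, r) + 1      # nearest of the two residues
        elif k >= 0 or r >= 0:
            cut = max(k, r) + 1      # only one of them is left
        else:
            cut = len(bases)         # no K or R left: take the remainder
        sub += bases[:cut]
        bases = bases[cut:]
        # emit the fragment unless the next residue is P (no cleavage before P)
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''

print(list(trypsin(example)))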
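The same generator idea can also be written as a single left-to-right pass, which avoids the repeated `find` calls and re-slicing. The following is a minimal sketch, assuming Python 3; the name `trypsin_single_pass` is made up for illustration and is not part of the answer above.

```python
def trypsin_single_pass(sequence):
    """Yield tryptic peptides: cut after K or R unless the next residue is P."""
    start = 0
    for i, residue in enumerate(sequence):
        # sequence[i + 1:i + 2] is '' at the end of the string, so a final
        # K or R still counts as a cleavage site.
        if residue in "KR" and sequence[i + 1:i + 2] != "P":
            yield sequence[start:i + 1]
            start = i + 1
    if start < len(sequence):
        # Trailing fragment that does not end in a cleavable K or R.
        yield sequence[start:]


example = "MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK"
print(list(trypsin_single_pass(example)))
# ['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']
```

Because it is a generator, peptides are produced lazily, so a long sequence can be consumed fragment by fragment.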
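Edit: with a slight modification, your regular expression works fine.
In your comments you mentioned a file that contains several sequences (let's call it sequences.dat):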
$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
>>> import re
>>> with open('sequences.dat') as f:
...     s = f.read()
>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
>>> # re.sub's fourth positional argument is count, not flags, so re.DOTALL is dropped (the pattern has no '.')
>>> protein = re.sub(r'(?<=[RK])(?=[^P])', '\n', s)
>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']
>>> print(protein)
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK
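If each line of the file is a separate protein, it may be safer to digest the lines one at a time instead of substituting over the whole file contents, so that a peptide can never run across a line boundary. A minimal sketch, assuming Python 3.7 or later (where `re.split` accepts zero-width matches); `digest` and `digest_lines` are hypothetical helper names, and an in-memory `io.StringIO` stands in for `sequences.dat`. The lookahead is written as `(?!P)` rather than `(?=[^P])`, an assumption that also lets the pattern match at the end of a line; the resulting empty string is filtered out.

```python
import io
import re

# Zero-width cleavage site: preceded by K or R, not followed by P.
CLEAVAGE_SITE = re.compile(r"(?<=[KR])(?!P)")


def digest(sequence):
    """Return the tryptic peptides of one sequence, dropping empty pieces."""
    return [piece for piece in CLEAVAGE_SITE.split(sequence) if piece]


def digest_lines(handle):
    """Yield one list of peptides per non-empty line of an open text file."""
    for line in handle:
        sequence = line.strip()
        if sequence:
            yield digest(sequence)


# Stand-in for open('sequences.dat'); the contents mirror the example file.
fake_file = io.StringIO("MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK\n" * 3)
for peptides in digest_lines(fake_file):
    print(peptides)
```

Keeping one list per input line also preserves which peptide came from which protein, which the flat `protein.split()` result does not.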
I believe the following regular expression will do what you describe:
([KR]?[^P].*?[KR](?!P))
The following results are from pythonregexp:
>>> regex = re.compile("([KR]?[^P].*?[KR](?!P))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xb1a9f49eb4111980>
>>> regex.match(string)
<_sre.SRE_Match object at 0xb1a9f49eb4102980>
# List the groups found
>>> r.groups()
(u'MVPPPPSR',)
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[u'MVPPPPSR', u'GGAAKPGQLGR', u'SLGPLLLLLRPEEPEDGDR', u'EICSESK']
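One caveat may be worth checking before relying on the capturing pattern above: it only returns fragments that end in a cleavable K or R, and each match needs at least two residues. The sketch below, assuming Python 3, compares it with slicing at zero-width cleavage sites found by `finditer`. The helper name `digest_by_positions` and the two short test sequences are made up for illustration.

```python
import re

FINDALL_PATTERN = re.compile(r"([KR]?[^P].*?[KR](?!P))")
# Zero-width cleavage site: preceded by K or R, not followed by P.
SITE = re.compile(r"(?<=[KR])(?!P)")


def digest_by_positions(sequence):
    """Slice the sequence at every cleavage position, skipping empty slices."""
    cuts = [match.start() for match in SITE.finditer(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


for seq in ("AAKAAA", "RK"):
    print(seq, FINDALL_PATTERN.findall(seq), digest_by_positions(seq))
# AAKAAA ['AAK'] ['AAK', 'AAA']
# RK ['RK'] ['R', 'K']
```

On the example sequence from the question both approaches return the same four peptides; they differ only on inputs with a trailing fragment that lacks K or R, or with adjacent cleavage sites.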