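Parsing a specific string from the header of a FASTA file

Problem description

I want to get the organism name from the headers of a FASTA file. Specifically, I am interested in extracting the organism given as OS=(Organism Name) in the description.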

FASTA HEADER
>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
MPICEFSATSKSRKIDVHAHVLPKNIPDFQEKFGYPGFVRLDHKEDGTTHMVKDGKLFRV
VEPNCFDTETRIADMNRANVNVQCLSTVPVMFSYWAKPADTEIVARFVNDDLLAECQKFP
GKEHIVLGTDYPFPLGEL
EVGRVVEEYKPFSAKDREDLLWKNAVKMLDIDENLLFNKDF
>sp|P34455|ACON_CAEEL Probable aconitate hydratase,mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
MNSLLRLSHLAGPAHYRALHSSSSIWSKVAISKFEPKSYLPYEKLSQTVKIVKDRLKRPL
TLSEKILYGHLDQPKTQDIERGVSYLRLRPDRVAMQDATAQMAMLQFISSGLPKTAVPST
IHCDHLIEAQKGGAQDLARAKDLNKEVFNFLATAGSKYGVGFWKPGSGIIHQIILENYAF
Code to get the FASTA header
from Bio import SeqIO

input_file = "ANIMAL.fasta"

# SeqIO.parse accepts a filename directly, so no explicit open() is needed
for fasta in SeqIO.parse(input_file, "fasta"):
    fasta_id, sequence = fasta.id, str(fasta.seq)
    # description holds the full header line of the record
    print(fasta.description)

Current output:

>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1

>sp|P34455|ACON_CAEEL Probable aconitate hydratase,mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2

Desired output:

Caenorhabditis elegans
Caenorhabditis elegans

Solution

You can use a regular expression to search for the information:

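import re
example = "sp|P34455|ACON_CAEEL Probable aconitate hydratase,mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2"

# Search for "OS=" rather than bare "OS", which could also occur inside a protein name
start = re.search("OS=", example).start()
# Skip the 3 characters of "OS=", cut at the "GN" field, and trim whitespace
result = example[start + 3:].split(" GN=")[0].strip()
print(result)
# Output: Caenorhabditis elegans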

This code takes the text that follows "OS=", keeps everything up to the "GN" field, and strips the whitespace at the ends.
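Cutting at "GN" only works when every header has a GN field right after the organism name. The variant below is a minimal sketch of a more tolerant approach. It assumes every field after the organism is written as a two-letter uppercase key followed by "=" (as GN, PE and SV are in the sample headers) and stops at whichever such key comes next. The function name extract_organism and the pattern are illustrative choices, not part of the original answer.

```python
import re

# Capture the text after "OS=" up to the next " XX=" style field,
# or up to the end of the string if no further field follows.
OS_PATTERN = re.compile(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)")


def extract_organism(description):
    """Return the organism name from a UniProt-style header, or None if absent."""
    match = OS_PATTERN.search(description)
    return match.group(1).strip() if match else None


headers = [
    "sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1",
    "sp|P34455|ACON_CAEEL Probable aconitate hydratase,mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2",
]

for header in headers:
    print(extract_organism(header))
```

In the loop from the question, the same function could be called as extract_organism(fasta.description). Headers without an OS field give None rather than raising an error, so the caller can decide how to handle them.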
