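Removing header descriptions from a FASTA file

Question

I have a cDNA.fa file in FASTA format, and I want to strip everything from the FASTA headers except the ENST name (the literal ID). How can I remove the rest? I don't want to lose any cDNA sequence. Thanks in advance.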

>ENST00000390567.1 cdna chromosome:GRCh38:14:105881034:105881053:-1 gene:ENSG00000211907.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD1-26 description:immunoglobulin heavy diversity 1-26 [Source:HGNC Symbol;Acc:HGNC:5485]
GGTATAGTGGGAGCTACTAC
>ENST00000452198.1 cdna chromosome:GRCh38:14:105881539:105881556:-1 gene:ENSG00000225825.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD6-25 description:immunoglobulin heavy diversity 6-25 [Source:HGNC Symbol;Acc:HGNC:5516]
GGGTATAGCAGCGGCTAC
>ENST00000390569.1 cdna chromosome:GRCh38:14:105883903:105883922:-1 gene:ENSG00000211909.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD5-24 description:immunoglobulin heavy diversity 5-24 (non-functional) [Source:HGNC Symbol;Acc:HGNC:5510]
GTAGAGATGGCTACAATTAC
>ENST00000437320.1 cdna chromosome:GRCh38:14:105884870:105884888:-1 gene:ENSG00000227196.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4-23 description:immunoglobulin heavy diversity 4-23 (non-functional) [Source:HGNC Symbol;Acc:HGNC:5504]
TGACTACGGTGGTAACTCC
>ENST00000390571.1 cdna chromosome:GRCh38:14:105886031:105886061:-1 gene:ENSG00000211911.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD3-22 description:immunoglobulin heavy diversity 3-22 [Source:HGNC Symbol;Acc:HGNC:5497]
GTATTACTATGATAGTAGTGGTTATTACTAC

I would like my file to look like this:

>ENST00000390567.1
GGTATAGTGGGAGCTACTAC
>ENST00000452198.1 
GGGTATAGCAGCGGCTAC

Solutions

FASTA files are often processed quickly with awk. There is even a dedicated awk variant, bioawk, designed to handle FASTA files.

With standard awk, you can do:

awk '/>/{$0=$1}1' file.fasta

In bioawk, such a simple task is slightly more involved:

bioawk -c fastx '{sub(/ */,"",$name)}{ print ">"$name ORS $seq }' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, as documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I am not sure whether this version is POSIX-compatible.

Use this Perl one-liner:

perl -lpe 's/^(>\S+).*/$1/' input.fa > output.fa

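The Perl one-liner uses these command-line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-p: Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l: Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.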

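s/^(>\S+).*/$1/: In this substitution, ^ is the beginning of the line, > is the literal > that marks a FASTA header, \S+ is a non-whitespace character repeated 1 or more times, and .* is any character repeated 0 or more times (here it matches from the first whitespace character to the end of the line). The parentheses in (>\S+) capture the part we want into the first capture variable, $1, which is then used to replace the whole line.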
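For readers more at home in Python, the same substitution can be sketched with the standard re module. This is only a rough equivalent of the Perl one-liner, and trim_header is a name made up for this illustration, not part of any library.

```python
import re

# Same pattern as the Perl one-liner: capture ">" plus the run of
# non-whitespace characters, then match (and discard) the rest of the line.
HEADER_RE = re.compile(r"^(>\S+).*")


def trim_header(line: str) -> str:
    """Return the line with everything after the FASTA ID removed.

    Sequence lines do not start with ">", so they pass through unchanged.
    (trim_header is a hypothetical helper name used for illustration.)
    """
    return HEADER_RE.sub(r"\1", line)


if __name__ == "__main__":
    header = ">ENST00000390567.1 cdna chromosome:GRCh38:14:105881034:105881053:-1"
    print(trim_header(header))                  # header is cut at the first whitespace
    print(trim_header("GGTATAGTGGGAGCTACTAC"))  # sequence line is left as-is
```

Because . does not match a newline by default, a trailing "\n" on the line survives the substitution, so the function can be applied directly to lines read from a file.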

See also:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

There is already a good Perl answer, but if you prefer Python, you can try Biopython.

Based on your question, the sample.fa file can be parsed with the following code:

from Bio import SeqIO

for seq_record in SeqIO.parse("sample.fa","fasta"):
    print(seq_record.id)
    print(str(seq_record.seq))

Output (note that seq_record.id does not include the leading ">"):

ENST00000390567.1
GGTATAGTGGGAGCTACTAC
ENST00000452198.1
GGGTATAGCAGCGGCTAC
ENST00000390569.1
GTAGAGATGGCTACAATTAC
ENST00000437320.1
TGACTACGGTGGTAACTCC
ENST00000390571.1
GTATTACTATGATAGTAGTGGTTATTACTAC

If you want to write it back to a file:

from Bio import SeqIO

lines = [[sq_rec.id,sq_rec.seq] for sq_rec in SeqIO.parse("sample.fa","fasta")]
with open("out.fa","w") as out_file:
    out_file.write('\n'.join(f">{id_}\n{seq}" for id_,seq in lines) + "\n")

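Simpler still, just use cut (with your data in seqs.fa). This works because the ID is separated from the rest of the header by a space, and the sequence lines contain no spaces: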

$ cut -f 1 -d" " seqs.fa 
>ENST00000390567.1
GGTATAGTGGGAGCTACTAC
>ENST00000452198.1
GGGTATAGCAGCGGCTAC
>ENST00000390569.1
GTAGAGATGGCTACAATTAC
>ENST00000437320.1
TGACTACGGTGGTAACTCC
>ENST00000390571.1
GTATTACTATGATAGTAGTGGTTATTACTAC
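
Since the question stresses not losing any sequence, a quick check after running any of the approaches above may be reassuring. The sketch below uses only the Python standard library. read_fasta and same_sequences are hypothetical helper names, and it assumes that IDs are unique within each file.

```python
from typing import Dict, Optional


def read_fasta(path: str) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    with open(path) as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if line.startswith(">"):
                # Keep only the ID, i.e. the text up to the first whitespace.
                fields = line[1:].split()
                current_id = fields[0] if fields else ""
                records[current_id] = ""
            elif current_id is not None:
                # Sequences may be wrapped over several lines; join them.
                records[current_id] += line
    return records


def same_sequences(original_path: str, trimmed_path: str) -> bool:
    """True if both files hold the same IDs with identical sequences."""
    return read_fasta(original_path) == read_fasta(trimmed_path)
```

For example, same_sequences("cDNA.fa", "out.fa") should return True if out.fa is the trimmed copy of cDNA.fa; both file names here are placeholders for your own paths. The comparison ignores line wrapping, so it also works with tools that write each sequence on a single line.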