我尝试编写一个程序,读取包含多个DNA序列的FASTA格式文件,识别序列中所有重复的4聚体(即,多次出现的所有4聚体),并打印出重复的4聚体以及查找它的序列的标题. k聚体仅仅是k个核苷酸的序列(例如,“aaca”,“gacg”和“tttt”是4聚体).
这是我的代码:
use strict; use warnings; my $count = -1; my $file = "sequences.fa"; my $seq = ''; my @header = (); my @sequences = (); my $line = ''; open (READ,$file) || die "Cannot open $file: $!.\n"; while ($line = <READ>){ chomp $line; if ($line =~ /^>/){ push @header,$line; $count++; unless ($seq eq ''){ push @sequences,$seq; $seq = ''; } } else { $seq .= $line; } } push @sequences,$line; for (my $i = 0; $i <= $#sequences+1; $i++){ if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){ print $header[$i],"\n",$&,"\n"; } }
我有两个请求:首先,我不知道如何设计我的正则表达式模式以获得所需的输出.
第二,不太重要的是,我确信我的代码效率非常低,所以如果有办法缩短代码,请告诉我.
提前致谢!
以下是FASTA文件的示例:(请注意,序列之间有一条额外的行,原始的fasta文件不是这种情况)
>NC_001422.1 Enterobacteria phage phiX174 sensu lato,complete genome
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTttttttCGGATATTTCTGATGAGTCGAAAAAT
CCCTTACTTGAGGATAtatataAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCT>NC_001501.1 Enterobacteria phage phiX184 sensu lato,complete genome
AACGGCTGGTCAGTATTTAAGGTTAGTGCTGAGGTTGACTACATCTGTTTTTAGAGACCCAGACCTTTTA
TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTgagagagaGGTTTTCTTCATTGCATTCAGATGGA
TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC
CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG>NC_001622.5 Enterobacteria phage phiX199 sensu lato,complete genome
TTCGCTGAATCAGGTTATTAAAGAGTTGCCGAGATATTTATGTTGGTTTCATGCGGATTGGTCGTTTAAA
TTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAATGACCAAATCAAAGAACTCGTGATTAT
CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG
TTGACGCCGGATTTGAGAATCAAAAATGTGAGAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGA
GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC
CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA
TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA
AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC
TTAGATGAGTGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAG
解决方法
#!/usr/bin/env perl use strict; use warnings; use Data::Dumper; #set paragraph mode. Iterate on blank lines. local $/ = ''; #read from STDIN or a file specified on command line,#e.g. cat filename_here | myscript.pl #or myscript.pl filename_here while ( <> ) { #capture the header line,and then remove it from our data block my ($header) = m/\>(.*)/; s/>.*$//; #remove lineFeeds and whitespace. s/\s*\n\s*//g; #use lookahead pattern,so the data isn't 'consumed' by the regex. my @sequences = m/(?=([atcg]{4}))/gi; #increment a count for each sequence found. my %count_of; $count_of{$_}++ for @sequences; #print output. (Modify according to specific needs. print $header,"\n"; print "Found sequences:\n"; print Dumper \@sequences; print "Count:\n"; print Dumper \%count_of; #note - ordered,but includes duplicates. #you Could just use keys %count_of,but that would be unordered. foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) { print $sequence," => ",$count_of{$sequence},"\n"; } print "\n"; }
我们按记录迭代记录,捕获并删除“标题”行,然后将其余部分拼接在一起.然后捕获4的每个(重叠)序列,并对它们进行计数.
这样,对于您的样本数据(简洁的第一节):
NC_001422.1 Enterobacteria phage phiX174 sensu lato,complete genome Found sequences: GAGT => 2 AGTT => 2 TTAT => 2 CATG => 2 ATGA => 3 TGAC => 2 CGCA => 2 AGTT => 2 ACTT => 2 tttt => 3 tttt => 3 tttt => 3 GGAT => 2 GATA => 2 ATAT => 2 TATT => 2 ATGA => 3 TGAG => 2 GAGT => 2 AAAA => 2 AAAA => 2 ACTT => 2 TGAG => 2 GGAT => 2 GATA => 2 tata => 2 tata => 2 TTAT => 2 TATG => 2 ATAT => 2 TATT => 2 GCCG => 2 TATG => 2 GCCG => 2 CGCA => 2 CATG => 2 ATGA => 3 TGAC => 2
注意 – 因为它基于原始序列,它基于数据中的排序,你会看到TGAC两次因为……它在那里两次.
但是你可以改为:
foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} } grep { $count_of{$_} > 1 } keys %count_of ) { print $sequence,"\n"; } print "\n";
哪个将丢弃任何少于2个匹配,并按频率排序.