How to split FASTA strings into multiple rows while keeping the number of columns unchanged in R?

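Problem description

I am trying to read a FASTA file and display each sequence as individual amino acids in a data frame, with one sequence per column (1 seq = 1 column).

This is what I have so far:

FASTA_test.txt contains: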

>sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2
MatanSIIVLddddEDEAAAQPGPSHPLPNAASPGAEAPSSSEPHGARGSSSSGGKKCYK
LENEKLFEEFLELCKMQTADHPEVVPFLYNRQQRAHSLFLASAEFCNILSRVLSRARSRP
AKLYVYINELCTVLKAHSAKKKLNLAPAATTSNepsgNNPPTHLSLDPTNAENTASQSPR
TRGSRRQiqrLEQLLALYVAEIRRLQEKELDLSELDDPDSAYLQEARLKRKLIRLFGRLC
ELKDCSSLTGRVIEQRIPYRGTRYPEVNRRIERLINKPGPDTFPDYGDVLRAVEKAAARH
SLGLPRQQLQLMAQDAFRDVGIRLQERRHLDLIYNFGCHLTDDYRPGVDPALSDPVLARR
LRENRSLAMSRLDEVISKYAMLQDKSEEGERKKRRARLQGTSSHSADTPEASLDSGEGPS
GMASQGcpsASRAETDDEDDEESDEEEEEEEEEEEEEATDSEEEEDLEQMQEGQEDDEEE
DEEEEAAAGKDGDKSPMSSLQISNEKNLEPGKQISRSSGEQQNKGRIVSPsllSEEPLAP
SSIDAESNGEQPEELTLEEESPVsqlFELEIEALPLDTPSsveTdisSSRKQSEEPFTTV
LENGAGMVsstSFNGGVSPHNWGDSGPPCKKSRKEKKQTGSGPLGNSYVERQRSVHEKNG
KKICTLPSPPSPLASLAPVADsstRVDSPSHGLVTSSLCIPSPARLSQTPHSQPPRPGTC
KTSVATQCDPEEIIVLSDSD
>sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3
MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPteraPASEEEFQFLRCQQC
QAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGADTPALDNVFFESLQRRLSVYR
QIVDAQAVCTRCKESADFWCFECEQLLCAkcfEAHQWFLKHEARPLAELRNQSVREFLDG
TRKTNNIFCSNPNHRTPTLTSIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEEL
damTQALQEQDSAFGAVHAQMHAAVGQLGRaraETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHGFLRQALCRLR
QEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKDAAVSKKASPEAASTPRDPIDVDLPE
EAERVKAQVQALGLAEAQPMAVVQSVPGahpVPVYAFSIKGPSYGEDVSNTTTAQKRKCS
QTQCPRKVIKMESEEGKEARlarsSPEQPRPSTSKAVSPPHLDGPPSPRSPVIGSEVFLP
NSNHVASGAGEAEERVVVISSSEDSDAENSSSRELDDSSSESSDLQLEGPSTLRVLDENL
ADPQAEDRPLVFFDLKIDNETQKIsqlAAVNRESKFRVVIQPEAFFSIYSKAVSLEVGLQ
HFLSFLSSMRRPILACYKLWGPGLPNFFRALEDINRLWEFQEAISGFLAALPLIRERVPG
ASSFKLKNLAQTYLARNMSERSAMAAVLAMRDLCRLLEVSPGPQLAQHVYPFSSLQCFAS
LQPLVQAAVLPRAEARLLALHNVSFMELLSAHRRDRQGGLKKYSRYLSLQTTTLPPAQPA
FNLQALGTYFEGLLEGPALaraEGVSTPLAGRGLAERASQQS

My code

library("Biostrings")
fastaFile <- readAAStringSet("~/Desktop/FASTA_test.txt")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name,sequence)
view(df)

#separate the aa into separate columns
df_splited_1 <- as.data.frame(do.call(cbind,apply(df,1,function(x) {
  do.call(expand.grid,strsplit(df$sequence,""))
})))

view(df_splited_1)

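The problem I am facing is that the script above splits the sequences into individual amino acids, but it puts them all into a single column instead of keeping a separate column for each sequence.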

dput(fastaFile)
new("AAStringSet",pool = new("SharedRaw_Pool",xp_list = list(
    <pointer: 0x0>),.link_to_cached_object_list = list(<environment>)),ranges = new("GroupedIRanges",group = c(1L,1L),start = c(1L,741L),width = c(740L,882L),NAMES = c("sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2","sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3"
    ),elementType = "ANY",elementMetadata = NULL,Metadata = list()),elementType = "AAString",Metadata = list())

Thanks for your help!

Solution
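One possible approach, offered as a minimal sketch rather than a definitive answer: split each sequence into single residues with strsplit(), pad the shorter sequences with NA up to the length of the longest one, and then build the data frame so that each sequence becomes one column. The helper name split_to_columns and the toy input below are assumptions made for illustration; they do not come from the question. The sketch uses base R only and expects a named character vector of sequences.

```r
# Hypothetical helper (name chosen for illustration): split each sequence into
# single residues and place every sequence in its own column. Shorter
# sequences are padded with NA so that all columns have the same row count.
split_to_columns <- function(seqs) {
  residues <- strsplit(seqs, "")        # list: one character vector per sequence
  n_rows <- max(lengths(residues))      # length of the longest sequence
  padded <- lapply(residues, function(x) {
    length(x) <- n_rows                 # extends x with NA up to n_rows
    x
  })
  # check.names = FALSE keeps the FASTA headers unchanged as column names
  as.data.frame(padded, stringsAsFactors = FALSE, check.names = FALSE)
}

# Toy input standing in for the real sequences; names and residues are made up.
seqs <- c(seqA = "MATAN", seqB = "MEP")
df_split <- split_to_columns(seqs)
print(df_split)
```

With the Biostrings object from the question, the same helper could presumably be called as split_to_columns(as.character(fastaFile)), since as.character() on an AAStringSet returns a named character vector. For the two sequences shown (widths 740 and 882 according to the dput output), that should give a data frame with 882 rows and 2 columns, where the DAXX column is NA from row 741 onward. Note also that the base R viewer is View() with a capital V; the lowercase view() used in the question is only available when a package such as tibble is loaded.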
