Problem description
I am trying to read a FASTA file and get the sequences into a data frame as individual amino acids. 1 seq = 1 column.
This is what I have so far:
FASTA_test.txt contains:
>sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2
MATANSIIVLDDDDEDEAAAQPGPSHPLPNAASPGAEAPSSSEPHGARGSSSSGGKKCYK
LENEKLFEEFLELCKMQTADHPEVVPFLYNRQQRAHSLFLASAEFCNILSRVLSRARSRP
AKLYVYINELCTVLKAHSAKKKLNLAPAATTSNEPSGNNPPTHLSLDPTNAENTASQSPR
TRGSRRQIQRLEQLLALYVAEIRRLQEKELDLSELDDPDSAYLQEARLKRKLIRLFGRLC
ELKDCSSLTGRVIEQRIPYRGTRYPEVNRRIERLINKPGPDTFPDYGDVLRAVEKAAARH
SLGLPRQQLQLMAQDAFRDVGIRLQERRHLDLIYNFGCHLTDDYRPGVDPALSDPVLARR
LRENRSLAMSRLDEVISKYAMLQDKSEEGERKKRRARLQGTSSHSADTPEASLDSGEGPS
GMASQGCPSASRAETDDEDDEESDEEEEEEEEEEEEEATDSEEEEDLEQMQEGQEDDEEE
DEEEEAAAGKDGDKSPMSSLQISNEKNLEPGKQISRSSGEQQNKGRIVSPSLLSEEPLAP
SSIDAESNGEQPEELTLEEESPVSQLFELEIEALPLDTPSSVETDISSSRKQSEEPFTTV
LENGAGMVSSTSFNGGVSPHNWGDSGPPCKKSRKEKKQTGSGPLGNSYVERQRSVHEKNG
KKICTLPSPPSPLASLAPVADSSTRVDSPSHGLVTSSLCIPSPARLSQTPHSQPPRPGTC
KTSVATQCDPEEIIVLSDSD
>sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3
MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPTERAPASEEEFQFLRCQQC
QAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGADTPALDNVFFESLQRRLSVYR
QIVDAQAVCTRCKESADFWCFECEQLLCAKCFEAHQWFLKHEARPLAELRNQSVREFLDG
TRKTNNIFCSNPNHRTPTLTSIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEEL
DAMTQALQEQDSAFGAVHAQMHAAVGQLGRARAETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHGFLRQALCRLR
QEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKDAAVSKKASPEAASTPRDPIDVDLPE
EAERVKAQVQALGLAEAQPMAVVQSVPGAHPVPVYAFSIKGPSYGEDVSNTTTAQKRKCS
QTQCPRKVIKMESEEGKEARLARSSPEQPRPSTSKAVSPPHLDGPPSPRSPVIGSEVFLP
NSNHVASGAGEAEERVVVISSSEDSDAENSSSRELDDSSSESSDLQLEGPSTLRVLDENL
ADPQAEDRPLVFFDLKIDNETQKISQLAAVNRESKFRVVIQPEAFFSIYSKAVSLEVGLQ
HFLSFLSSMRRPILACYKLWGPGLPNFFRALEDINRLWEFQEAISGFLAALPLIRERVPG
ASSFKLKNLAQTYLARNMSERSAMAAVLAMRDLCRLLEVSPGPQLAQHVYPFSSLQCFAS
LQPLVQAAVLPRAEARLLALHNVSFMELLSAHRRDRQGGLKKYSRYLSLQTTTLPPAQPA
FNLQALGTYFEGLLEGPALARAEGVSTPLAGRGLAERASQQS
My code:
library("Biostrings")
fastaFile <- readAAStringSet("~/Desktop/FASTA_test.txt")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name,sequence)
View(df)
#separate the aa into separate columns
df_splited_1 <- as.data.frame(do.call(cbind,apply(df,1,function(x) {
do.call(expand.grid,strsplit(df$sequence,""))
})))
View(df_splited_1)
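The problem I am facing is that the script above splits the amino acids apart, but puts them into a single column instead of keeping the columns separate.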
dput(fastaFile)
new("AAStringSet",pool = new("SharedRaw_Pool",xp_list = list(
<pointer: 0x0>),.link_to_cached_object_list = list(<environment>)),ranges = new("GroupedIRanges",group = c(1L,1L),start = c(1L,741L),width = c(740L,882L),NAMES = c("sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2","sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3"
),elementType = "ANY",elementMetadata = NULL,Metadata = list()),elementType = "AAString",Metadata = list())
Thanks for your help!
Solution
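One possible approach, offered as a sketch rather than a confirmed answer: split each sequence into single residues with strsplit(), pad the shorter sequences with NA so all columns have the same length, and build the data frame from the padded list. The example uses only base R and a made-up toy vector (the names seqA and seqB and both short sequences are hypothetical) in place of the real input. With Biostrings, as.character(fastaFile) should give a comparable named character vector, but that step is assumed here and not run.

```r
# Toy stand-in for the sequences read from the FASTA file (hypothetical data).
# With Biostrings, as.character(fastaFile) is assumed to give a similar
# named character vector.
sequence <- c(seqA = "MATANS", seqB = "MEPA")

# Split every sequence into a character vector of single residues.
residues <- strsplit(sequence, "")

# The sequences differ in length, so pad each one with NA up to the longest.
max_len <- max(lengths(residues))
padded <- lapply(residues, function(aa) {
  c(aa, rep(NA_character_, max_len - length(aa)))
})

# One column per sequence, one row per residue position.
df_split <- as.data.frame(padded, stringsAsFactors = FALSE)
print(df_split)
```

Note that as.data.frame() sanitizes column names, so full FASTA headers containing spaces or "|" would be altered; assigning short names to the vector first avoids that. If one row per sequence is preferred instead, the result can be transposed with t(df_split).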