Improve and speed up code to determine a large number of combinations

Problem description

Okay, I'm going to describe the real data rather than a reprex, because I don't think that would make this any easier. To clear things up, though, this question involves a tiny bit of Biochemistry 101.

I work with DNA mutagenesis libraries in which certain DNA positions are randomized, so that the resulting proteins also carry randomized amino-acid positions. DNA consists of nucleotides (there are four of them: G, A, T, C), and an amino acid (there are 20, each denoted by a letter) is encoded by a group of three nucleotides (a "codon").

I have two vectors that describe the codon/amino-acid relationship:

cods <- c("GCT","GCC","GCA","GCG","CGT","CGC","CGA","CGG","AGA","AGG","AAT","AAC","GAT","GAC","TGT","TGC","CAA","CAG","GAA","GAG","GGT","GGC","GGA","GGG","CAT","CAC","TAA","TAG","TGA","ATT","ATC","ATA","CTT","CTC","CTA","CTG","TTA","TTG","AAA","AAG","ATG","TTT","TTC","CCT","CCC","CCA","CCG","TCT","TCC","TCA","TCG","AGT","AGC","ACT","ACC","ACA","ACG","TGG","TAT","TAC","GTT","GTC","GTA","GTG")
aas <- c("A","A","A","A","R","R","R","R","R","R","N","N","D","D","C","C","Q","Q","E","E","G","G","G","G","H","H","*","*","*","I","I","I","L","L","L","L","L","L","K","K","M","F","F","P","P","P","P","S","S","S","S","S","S","T","T","T","T","W","Y","Y","V","V","V","V")
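
As an aside, the same codon-to-amino-acid mapping can also be kept as a named lookup vector; this is just a minimal illustrative sketch (codon_to_aa is a made-up name), not something the code below relies on:

# alternative representation: named lookup vector, codon -> amino acid
codon_to_aa <- setNames(aas, cods)
codon_to_aa["GCT"]            # "A"
codon_to_aa[c("TGG", "TAA")]  # "W" and the stop symbol "*"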

The randomized positions allow only certain nucleotides at certain positions of the codon and are denoted by specific (distinct) letters. For example, the nucleotide codon "NYS" allows all four nucleotides (GATC) at the first position, but only C and T at the second and C and G at the third; likewise, "VRM" allows A, C and G at the first position, A and G at the second, and A and C at the third. I have now created all possible triplets for NYS and for the other library like this:

NYS <- expand.grid(list(c("A","C","G","T"),c("C","T"),c("C","G")))
VRM <- expand.grid(list(c("A","C","G"),c("A","G"),c("A","C")))
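
As an aside: if more ambiguity codes were involved, the same expansion could be generated from a small IUPAC lookup table; the iupac list and expand_codon helper here are only an illustrative sketch, not part of the code that follows.

# hypothetical helper: expand an ambiguity codon into all concrete triplets
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A","G"), Y = c("C","T"), S = c("C","G"), W = c("A","T"),
              K = c("G","T"), M = c("A","C"), B = c("C","G","T"), D = c("A","G","T"),
              H = c("A","C","T"), V = c("A","C","G"), N = c("A","C","G","T"))
expand_codon <- function(code) expand.grid(iupac[strsplit(code, "")[[1]]])
expand_codon("NYS")  # the same 16 nucleotide combinations as the NYS grid above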

Then I look up the corresponding amino acid for all of these combinations:

# make codon triplet strings
NYS[,"cods"] <- paste(NYS$Var1,NYS$Var2,NYS$Var3,sep='')
VRM[,"cods"] <- paste(VRM$Var1,VRM$Var2,VRM$Var3,sep='')

#look them up in the aa vector and add a column
NYS[,"aas"] <- aas[match(NYS$cods,cods)]
VRM[,"aas"] <- aas[match(VRM$cods,cods)]

#get only the relevant columns (select() needs dplyr)
library(dplyr)
VRM <- VRM %>% select("aas","cods")
NYS <- NYS %>% select("aas","cods")
NYS$cods <- "NYS"
VRM$cods <- "VRM"

Now comes the tricky part: based on some input vector that describes the number and type of randomized codons, e.g. library_cods <- c("NYS","VRM","NYS","NYS","VRM","VRM")

I now want to compute all amino-acid sequences that can occur in these libraries. I then want to create a data frame containing all unique sequences and their number of occurrences. I did it like this:

# make a string that contains n sort()s of the columns as determined by library_cods, evaluate, expand
library(stringr)
all_combos <- expand.grid(lapply(str_split(paste(gsub("(...)","sort(\\1$aas)",library_cods),collapse = ","),",",simplify = T),function(x) eval(parse(text=x))))

# get the string from the rows
unique_seqs <- apply(all_combos,1,function(x) paste(x,collapse = ""))

#rle() to count
unique_seqs <- data.frame(unclass(rle(sort(unique_seqs))))

#sort by count
unique_seqs <- unique_seqs[order(unique_seqs$lengths,decreasing = T),]

This all works, but the problem is that it is really slow, so my main question is: how can I make this faster? On my system, the rle() line and the two lines after it take about 70 seconds. For comparison, sort -n file | uniq -c | sort -n in bash took about 22 seconds on the same data. That is better, but still slow, so I figured maybe I should do some math instead of enumerating and counting ^^

As a side question: I also find my code pretty clumsy (especially the all_combos <- line; I know that evaluating strings as code is really bad). If anyone wants to point out how to make my code more elegant/efficient, I would appreciate that as well.

Solution

Some steps of your code can be made more concise. For the triplets we only need vectors, which we retrieve later using mget.

NYS <- expand.grid(list(c("A","C","G","T"),c("C","T"),c("C","G")))
VRM <- expand.grid(list(c("A","C","G"),c("A","G"),c("A","C")))

## triplets
NYS <- aas[match(Reduce(paste0,NYS),cods)]
VRM <- aas[match(Reduce(paste0,VRM),cods)]

## input vector
library_cods <- c("NYS","VRM","NYS","NYS","VRM","VRM")

# columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(mget(library_cods))

# get the string from the rows
unique_seqs <- Reduce(paste0,all_combos)

# sort by count
unique_seqs <- data.frame(sort(table(unique_seqs),decreasing=T))

Result

head(unique_seqs)
#   unique_seqs Freq
# 1      LRLLRR  729
# 2      ARLLRR  486
# 3      LGLLRR  486
# 4      LRALRR  486
# 5      LRLARR  486
# 6      LRLLGR  486

Runs in about 16 seconds on my system, which is reasonable.
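
As a follow-up to the question's idea of doing "some math instead of enumerating and counting": because the count of a given amino-acid sequence is simply the product of its per-position amino-acid counts, the frequencies can also be computed without materializing every raw nucleotide combination. A rough sketch of that idea, reusing library_cods and the NYS/VRM amino-acid vectors from above (tabs, grid and unique_seqs2 are illustrative names):

# per-position amino-acid counts (NYS/VRM are the amino-acid vectors defined above)
tabs <- lapply(mget(library_cods), table)

# every distinct sequence, and its count as the product of the per-position counts
grid   <- expand.grid(lapply(tabs, names), stringsAsFactors = FALSE)
counts <- Reduce(`*`, expand.grid(lapply(tabs, as.vector)))

unique_seqs2 <- data.frame(unique_seqs = do.call(paste0, grid), Freq = counts)
unique_seqs2 <- unique_seqs2[order(unique_seqs2$Freq, decreasing = TRUE), ]

Since this only enumerates the distinct amino-acid sequences rather than every underlying nucleotide combination, it should scale with the number of unique sequences instead of the size of the full grid.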