Building a simple data.frame: transitioning from the DADA2 pipeline to phyloseq

Problem description

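I successfully worked through the DADA2 pipeline tutorial (https://benjjneb.github.io/dada2/tutorial.html) with my own data, but I am stuck on the transition to phyloseq. I need to construct a simple data.frame from the information encoded in the file names. This is the code provided in the tutorial.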

#Make a data.frame holding the sample data
samples.out <- rownames(seqtab.nochim)
subject <- sapply(strsplit(samples.out,"D"),`[`,1)
gender <- substr(subject,1,1)
subject <- substr(subject,2,999)
day <- as.integer(sapply(strsplit(samples.out,"D"),`[`,2))
samdf <- data.frame(Subject=subject,Gender=gender,Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
rownames(samdf) <- samples.out

Mine should be simpler than this, because I don't have time as a factor. I only have six treatment groups.

This is what I am trying to figure out.

#Make a data.frame holding the sample data
samples.out <- rownames(seqtab.nochim)

#create vector with the treatments
trtmt <- c("EM","EP","EM","AR37","NEA2","AR1","Ctrl","AR37")

#Add a new column to the samples.out dataframe 
samples.out_2 <- samples.out
samples.out_2 <- cbind(samples.out,new_col = trtmt)

#Rename columns
colnames(samples.out_2)[colnames(samples.out_2) == "samples.out"] <- "Sample"
colnames(samples.out_2)[colnames(samples.out_2) == "new_col"] <- "Treatment"

#Head of my samples.out_2 data frame (I have a total of 39 samples and 6 treatment groups)
Sample Treatment
193    EM
194    EP
196    EM
197    AR37
198    NEA2

#Still stuck with how to make this relevant to my Metadata!
#What does the "D" mean? I think it has to do with the mouse dataset used in the tutorial, but I am not sure what I need to pull from my own data.
#Also, what does `[` mean? I know the meanings of operators like [] and (), but not of a single bracket in quotes.
sample <- sapply(strsplit(samples.out_2,"D"),`[`,1)
treatment <- substr(sample,39) #I don't understand what I am trying to extract or change
sample <- substr(sample,999) #I don't understand what I am trying to extract or change
samdf <- data.frame(Sample=sample,Treatment=treatment)
rownames(samdf) <- samples.out

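If anyone has worked through this tutorial with their own data and understands this transition, I would greatly appreciate your insight. Thank you.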

Solution

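You want to create a data frame from your metadata in an object called samdf (as in the tutorial). In the tutorial, the metadata for each sample is encoded in its file name, which does not appear to be the case for your data: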

For example, the first sample:

F3D0: gender (F), subject (no. 3), day (D0)

The tutorial lines that define Subject, Gender and Day therefore do not apply to your data.

subject <- sapply(strsplit(samples.out,"D"),`[`,1) # define subject as beginning of the filename string up to D
gender <- substr(subject,1,1) #gets first letter for the gender
subject <- substr(subject,2,999) #remove gender to actually get the subject number
day <- as.integer(sapply(strsplit(samples.out,"D"),`[`,2)) #define day as the part of the filename string after D
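
To see what these lines do, and what the backquoted `[` means, here is a small sketch run on three sample names written in the tutorial's format (the names themselves are illustrative, not your data). `[` is R's subsetting operator called as an ordinary function, so sapply(x, `[`, 1) picks the first element of each list entry.

```r
# Illustrative sample names in the tutorial's format: gender, subject, "D", day
samples.out <- c("F3D0", "F3D141", "M1D9")

# strsplit() returns a list with one character vector per sample name,
# e.g. parts[[1]] is c("F3", "0")
parts <- strsplit(samples.out, "D")

# `[` is the subsetting operator used as a function:
# sapply(parts, `[`, 1) is the same as sapply(parts, function(p) p[1])
subject <- sapply(parts, `[`, 1)              # "F3" "F3" "M1"
gender  <- substr(subject, 1, 1)              # "F"  "F"  "M"
subject <- substr(subject, 2, 999)            # "3"  "3"  "1"
day     <- as.integer(sapply(parts, `[`, 2))  # 0 141 9

print(data.frame(Subject = subject, Gender = gender, Day = day,
                 row.names = samples.out))
```

The "D" is simply the separator between subject and day in the tutorial's file names; it has no meaning for sample names such as 193 or 194.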

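The last two lines of the tutorial block are the important ones: the first creates the data frame from the metadata, and the second assigns the same row names as in seqtab.nochim, so that you can build the phyloseq object later in the pipeline. Make sure samdf and seqtab.nochim have the same number of rows: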

isTRUE(dim(seqtab.nochim)[1] == dim(samdf)[1]) #should be true
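
For data like yours, where the treatment is not encoded in the sample name, one minimal sketch is to build samdf directly from a treatment vector. The small seqtab.nochim matrix, its column names, and the five sample names below are stand-ins based on the head of your table. In practice trtmt needs one entry per sample (39 in your case), in the same order as rownames(seqtab.nochim).

```r
# Stand-in for the real sequence table: 5 samples x 3 sequence variants
# (column names are placeholders; the real ones are the sequences themselves)
seqtab.nochim <- matrix(0L, nrow = 5, ncol = 3,
                        dimnames = list(c("193", "194", "196", "197", "198"),
                                        c("ASV1", "ASV2", "ASV3")))

samples.out <- rownames(seqtab.nochim)

# One treatment per sample, in the same order as samples.out
trtmt <- c("EM", "EP", "EM", "AR37", "NEA2")
stopifnot(length(trtmt) == length(samples.out))

# Build the metadata table and give it the same row names as the sequence table
samdf <- data.frame(Sample = samples.out, Treatment = trtmt)
rownames(samdf) <- samples.out

# Row names must match so that metadata and counts can be paired by sample
stopifnot(identical(rownames(samdf), rownames(seqtab.nochim)))
print(samdf)
```

From there, samdf can be passed to sample_data() when building the phyloseq object, as in the tutorial.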
