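R: Matching combinations of nested list values in an index and returning values

Problem description

Hi, I have two datasets. The first is a list of genes associated with a given cluster (0-7):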

# gene output

Cluster <- rep(0:7,each = 10)

Gene <- c("LMO3","NEUROD6","NFIB","SNAP25","RTN1","CPE","SOX11","CSRP2","VAMP2","ID2","EMX2","LHX5-AS1","PEG10","HES1","TRH","WLS","TPBG","RPS29","CRABP2","RSPO3","RPL17","RPL7","PTMA","RPL36A","HMGN2","H2AFZ","PABPC1","HNRNPH1","PTN","FABP7","IGFBP2","ID4","C1orf61","VIM","RPS27L","FABP5","SDCBP","BNIP3","TCF7L2","NEFL","HMGCS1","GAP43","GPM6A","SQLE","MSMO1","SCOC","BASP1","TTR","MEST","MDK","TMBIM6","RCN1","C8orf59","ID3","PKM","NCOR1","ELAVL4","NNAT","ETFB","STMN2","TUBA1A","GNG3","MALAT1","SOX4","TUBB2B","CRYAB","GFAP","CHCHD2","HOPX","LGALS1","SCRG1","ISG15","AC090498.1","B2M","CLU")

df <- data.frame(cbind(Cluster,Gene))

The second is an index that provides cell type annotations for specific combinations of genes:

# index

Type <- c("Radial Glia","Excitatory Neuron ","Inhibitory Neuron","IPC","Radial Glia","Microglia","Inhibitory Neuron")

Subtype <- c("early","Layer IV","SST-MGE1","IPC-div2","Parietal and Temporal","oRG/Astrocyte","IPC-new","MGE2")

Markers <- c("TOP2A AURK HMGB CTNNB1","PPP1R1B SCN2A RORB CRYM","DLX6-AS1 DLX1 SST DCX","ERBB4 SST DLX2 DLX5 DLX6-AS1","CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3","MEF2C STMN2 FLT ROBO CRYM","AQP4 GFAP AGT DIO2 IL33","C1QB AIF1 CCL4 C1QC","CENPK EOMES","CCK LHX6 SCGN SST")

index <- data.frame(cbind(Type,Subtype,Markers))

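I am trying to find the specific combinations outlined in Markers within the gene list in df. When such a match is found, the corresponding type and subtype should be returned. However, there are a few caveats that I am struggling to wrap my head around.

  1. Each cluster's gene list may contain several marker combinations, so the function should iterate over every marker combination rather than stop at the first match.
  2. The index matching should be done separately for each cluster, i.e. check the genes in cluster 0 for marker matches and return the type/subtype, then repeat for cluster 1, and so on.

My project data contains dozens of outputs like df, made up of varying numbers of clusters, each containing hundreds to thousands of genes. I have done my best to search for a solution online, but unfortunately I have drawn a blank here.

Any help, suggestions or advice would be greatly appreciated.

Edit:

The output would look something like this: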

  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>

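A correct match would add the corresponding type and subtype to df for each cluster, with the rest left empty (NA).

Solution

There is probably a simpler way to do this, but here is a loop: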

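# Start with an empty result; matching rows are appended gene by gene
output = data.frame(Cluster=character(),Gene=character(),Type=character(),Subtype=character())

for(i in seq_len(nrow(df))){
  cluster = df[i,1]
  gene = as.character(df[i,2])
  # Rows of index whose Markers string contains this gene name
  type = index[grep(gene,index$Markers,fixed=TRUE),]
  n_types = nrow(type)
  # One output row per matching type; genes with no match add no rows
  tmp = data.frame(Cluster=rep(cluster,n_types),Gene=rep(gene,n_types),Type=type[,1],Subtype=type[,2])
  output = rbind(output,tmp)
}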
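The loop matches gene names as substrings of the Markers string and skips genes that have no match. As a rough alternative sketch, the markers can be split on spaces and joined on exact gene names, which also keeps unmatched genes with NA as in the requested output. The names `df_toy`, `index_toy`, `index_long` and `annotated` are made up for this illustration, and the toy values are placeholders rather than the real data.

```r
# Toy stand-ins for df and index (hypothetical values for illustration only)
df_toy <- data.frame(
  Cluster = c(0, 0, 1),
  Gene    = c("GFAP", "LMO3", "SST"),
  stringsAsFactors = FALSE
)
index_toy <- data.frame(
  Type    = c("Radial Glia", "Inhibitory Neuron"),
  Subtype = c("oRG/Astrocyte", "MGE2"),
  Markers = c("AQP4 GFAP AGT", "CCK LHX6 SST"),
  stringsAsFactors = FALSE
)

# Split the space-separated marker strings: one row per (Type, Subtype, marker)
marker_list <- strsplit(index_toy$Markers, " ", fixed = TRUE)
index_long <- data.frame(
  Type    = rep(index_toy$Type, lengths(marker_list)),
  Subtype = rep(index_toy$Subtype, lengths(marker_list)),
  Gene    = unlist(marker_list),
  stringsAsFactors = FALSE
)

# Left join on exact gene names; genes without a marker entry get NA
annotated <- merge(df_toy, index_long, by = "Gene", all.x = TRUE)
annotated <- annotated[order(annotated$Cluster, annotated$Gene),
                       c("Cluster", "Gene", "Type", "Subtype")]
annotated
```

Note that this, like the loop, annotates individual genes; it does not check whether a whole marker combination is present in a cluster.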

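I'm assuming that you want to annotate each cluster with a type from the index when all of that type's markers are present in the cluster's set of genes.

I'll also use some simplified datasets. First, an index with two simplified types: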

library(tidyverse)

index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)

index
#> # A tibble: 4 x 3
#>   type  subtype markers
#>   <chr> <chr>   <chr>  
#> 1 AB    X       A      
#> 2 AB    X       B      
#> 3 BC    Y       B      
#> 4 BC    Y       C

And three different clusters that illustrate different matching scenarios:

clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)

clusters
#> # A tibble: 9 x 2
#>   cluster genes
#>     <dbl> <chr>
#> 1       0 A    
#> 2       0 B    
#> 3       0 C    
#> 4       1 B    
#> 5       1 C    
#> 6       1 D    
#> 7       2 C    
#> 8       2 D    
#> 9       2 E

I'd approach this by first creating a function that returns the matching types for a given set of genes:

match_index <- function(genes) {
  matches <- index %>% 
    group_by(type,subtype) %>% 
    filter(all(markers %in% genes)) %>% 
    distinct(type,subtype)

  # If none matched,return a row of NAs  
  if (nrow(matches)) matches else matches[NA_integer_,]
}
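For readers who prefer to avoid the tidyverse, the same "all markers present" check can be sketched in base R. This is only an illustration: `index_toy` and `match_index_base()` are hypothetical names that mirror the simplified index and the function above.

```r
# Simplified index in long format: one row per marker
index_toy <- data.frame(
  type    = c("AB", "AB", "BC", "BC"),
  subtype = c("X", "X", "Y", "Y"),
  markers = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical base R counterpart of match_index()
match_index_base <- function(genes, index = index_toy) {
  # Distinct type/subtype pairs to test
  types <- unique(index[, c("type", "subtype")])
  # A pair matches only if every one of its markers is in `genes`
  complete <- vapply(seq_len(nrow(types)), function(i) {
    m <- index$markers[index$type == types$type[i] &
                         index$subtype == types$subtype[i]]
    all(m %in% genes)
  }, logical(1))
  out <- types[complete, , drop = FALSE]
  # If none matched, return a single row of NAs
  if (nrow(out) == 0) {
    out <- data.frame(type = NA_character_, subtype = NA_character_,
                      stringsAsFactors = FALSE)
  }
  rownames(out) <- NULL
  out
}

r0 <- match_index_base(c("A", "B", "C"))  # both types match
r1 <- match_index_base(c("B", "C", "D"))  # only BC/Y matches
r2 <- match_index_base(c("C", "D", "E"))  # nothing matches
```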

You can then summarise each cluster with match_index():

clusters %>% 
  group_by(cluster) %>% 
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups:   cluster [3]
#>   cluster type  subtype
#>     <dbl> <chr> <chr>  
#> 1       0 AB    X      
#> 2       0 BC    Y      
#> 3       1 BC    Y      
#> 4       2 <NA>  <NA>
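
The simplified index holds one marker per row, whereas the index in the question stores each marker combination as a single space-separated string. A possible way to bridge the two, assuming tidyr is available, is `separate_rows()`. The data frame below copies the first two entries of the question's vectors purely for illustration, and `index_wide` and `index_long` are made-up names.

```r
library(tidyr)

# Two entries in the question's format: markers packed into one string
index_wide <- data.frame(
  type    = c("Radial Glia", "Excitatory Neuron"),
  subtype = c("early", "Layer IV"),
  markers = c("TOP2A AURK HMGB CTNNB1", "PPP1R1B SCN2A RORB CRYM"),
  stringsAsFactors = FALSE
)

# One row per marker, repeating type and subtype for each
index_long <- separate_rows(index_wide, markers, sep = " ")
index_long
```

With the index in this long shape (and the gene table renamed to `cluster` and `genes`), the per-cluster summary above should carry over to the real data.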