Parsing PMCID table rows into columns

Problem description

dput(t1)
structure(list(
  PMCID = c("PMC7809753","PMC7809753","PMC7809753","PMC7809753","PMC7809753",
            "PMC7790830","PMC7790830","PMC7790830","PMC7790830","PMC7790830"),
  table = c("Table 1","Table 1","Table 1","Table 1","Table 1",
            "Table 1","Table 1","Table 1","Table 1","Table 1"),
  row = c(1L,2L,3L,4L,5L,1L,2L,3L,4L,5L),
  text = c(
    "Drug=Cytarabine (ara-C); Target=DNA polymerases; Influx=ENT1,CNT3,OCTN1; Metabolisma=Activation: dCK,dCMPK,NDK. Inactivation: CDA,dCMPD,PN-I.; Efflux=MRP4,7,8; Refs.=[14,30–33,78–80]",
    "Drug=Daunorubicin (DNR); Target=DNA,Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp,MRP1,BCRP; Refs.=[44,51,81–84]",
    "Drug=Mitoxantrone (MX); Target=DNA,85–90]",
    "Drug=etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp,MRP1-3,6,BCRP; Refs.=[16,91,92]",
    "Drug=Methotrexate (MTX); Target=DHFR,TS,AICArft; Influx=RFC,PCFT; Metabolisma=Aldehyde oxidase,FPGS (polyglutamylation); Efflux=P-gp,MRP1-5,93,94]",
    "Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; Cell count(×109/l): plt=9; BM Blast (%)=70.5; Karyotype=46,XX,t(8,21)(q22;q22)",
    "Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103; Cell count(×109/l): plt=62; BM Blast (%)=60.4; Karyotype=46,XX",
    "Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; Cell count(×109/l): plt=100; BM Blast (%)=88; Karyotype=45,XY,-7",
    "Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; Cell count(×109/l): plt=52; BM Blast (%)=86.8; Karyotype=46,XY",
    "Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; Cell count(×109/l): plt=197; BM Blast (%)=32.4; Karyotype=46,XX"
  )),
  row.names = c(NA,-10L),class = c("tbl_df","tbl","data.frame"))

Above is my sample data frame, which looks like this:

head(t1)
# A tibble: 6 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (ara-C); Target=DNA polymerases; Influx=ENT1,NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA,BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA,…
4 PMC7809753 Table…     4 Drug=etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp,…
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR,FPGS (polyglutam…
6 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …

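Take, for example, the paper PMC7809753, whose output is shown above. In the paper, the first table is "Characteristics of chemotherapeutic drugs used in AML", and it looks like this. In my data frame, Table 1 for PMCID PMC7809753 is repeated 5 times, which corresponds to the image I attached above.

The question now is how to parse each table of a given PMCID into a tabular or columnar structure, as shown in the paper.

Update: Using the PMCID, I can split the rows into a list with one data frame per article.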

aa <- split(t1,f = t1$PMCID)

This gives me:

$PMC7790830
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
2 PMC7790830 Table…     2 Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103…
3 PMC7790830 Table…     3 Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; …
4 PMC7790830 Table…     4 Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; C…
5 PMC7790830 Table…     5 Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; …

$PMC7809753
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (ara-C); Target=DNA polymerases; Influx=ENT1,FPGS (polyglutam…

Update v2

I tried to collapse the rows that share a PMCID into a single row, based on the following solution.

Convert duplicate rows to separate columns in R

library(splitstackshape)
library(data.table)
DT <- setDT(t1)[,do.call(paste,c(.SD,list(collapse=','))),PMCID]
DT1 <- cSplit(DT,'V1',sep='[,]+',fixed=FALSE,stripwhite=TRUE)
setnames(DT1,2:ncol(DT1),rep(names(t1)[-1],41))
DT1

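So the question remains as above: how can I separate the rows belonging to each list element into columns or some tabular form, as shown in the image?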

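Solution

I think using the tidypmc package together with europepmc output may help. Below is an example that uses pmc_table to extract the first table from a PMC article. It also uses map from purrr, which is part of the tidyverse.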

library(tidypmc)
library(tidyverse)
library(europepmc)

doc <- map("PMC7809753",epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

Output

# A tibble: 7 x 6
  Drug                Target           Influx            Metabolisma                                 Efflux         Refs.        
  <chr>               <chr>            <chr>             <chr>                                       <chr>          <chr>        
1 Cytarabine (Ara-C)  DNA polymerases  ENT1,CNT3,OCTN1 "Activation: dCK,dCMPK,NDK. Inactivation… MRP4,7,8       [14,30–33,…
2 Daunorubicin (DNR)  DNA,Topoisomer… Passive diffusion ""                                          P-gp,MRP1,… [44,51,81–…
3 Mitoxantrone (MX)   DNA,B… [44,85–90]  
4 Etoposide (VP-16)   Topoisomerase II Passive diffusion ""                                          P-gp,MRP1-3,… [16,91,92] 
5 Methotrexate (MTX)  DHFR,TS,AICAR… RFC,PCFT         "Aldehyde oxidase,FPGS (polyglutamylation… P-gp,MRP1-5,93,94] 
6 Venetoclax (VEN)    Bcl-2            Passive diffusion ""                                          P-gp           [72,95]     
7 Gemtuzumab Ozogami… DNA              Ab-mediated endo… "Lysosomal Calicheamicin cleavage from Ab,… P-gp,MRP1     [73,77]
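
If only the data frame t1 is at hand, the text column can also be parsed directly, because each row is a run of key=value fields separated by "; ". The following is a minimal base R sketch under that assumption, not part of the tidypmc approach above. The names t1_demo, parse_row and bind_fill are hypothetical, and the sample rows are shortened copies of the rows in the question. Fields missing from a row (for example a drug without an Influx entry) are filled with NA.

```r
# Toy rows in the same "key=value; key=value" layout as t1$text (values shortened).
t1_demo <- data.frame(
  PMCID = c("PMC7809753", "PMC7809753", "PMC7790830"),
  table = "Table 1",
  row   = c(1L, 2L, 1L),
  text  = c(
    "Drug=Cytarabine (ara-C); Target=DNA polymerases; Efflux=MRP4,7,8",
    "Drug=Daunorubicin (DNR); Target=DNA,Topoisomerase II; Influx=Passive diffusion",
    "Patients no.=1; Gender=M; Karyotype=46,XX,t(8,21)(q22;q22)"
  ),
  stringsAsFactors = FALSE
)

# Turn one text cell into a one-row data frame: split fields on "; ",
# then split each field at its first "=" into a column name and a value.
parse_row <- function(txt) {
  fields <- strsplit(txt, "; ", fixed = TRUE)[[1]]
  eq   <- regexpr("=", fields, fixed = TRUE)
  keys <- substr(fields, 1, eq - 1)
  vals <- as.list(substr(fields, eq + 1, nchar(fields)))
  names(vals) <- keys
  data.frame(vals, check.names = FALSE, stringsAsFactors = FALSE)
}

# Stack one-row data frames whose columns may differ, filling gaps with NA.
bind_fill <- function(dfs) {
  cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (m in setdiff(cols, names(d))) d[[m]] <- NA_character_
    d[cols]
  })
  out <- do.call(rbind, filled)
  rownames(out) <- NULL
  out
}

# One parsed data frame per PMCID/table combination, rows kept in table order.
groups <- split(t1_demo, list(t1_demo$PMCID, t1_demo$table), drop = TRUE)
tables <- lapply(groups, function(d) {
  d <- d[order(d$row), ]
  bind_fill(lapply(d$text, parse_row))
})

tables[["PMC7809753.Table 1"]]
```

Splitting on "; " (semicolon plus space) rather than on ";" alone keeps values such as the karyotype t(8,21)(q22;q22) intact. Rows whose text was truncated upstream will still parse, but their values stay truncated.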

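Edit (1/30/21): To automate this process for multiple articles (and based on your other question and approach), consider the following.

You can keep a vector of PMCIDs, pmcids, and use it with map. This creates docs, which holds the XML of every article in pmcids.

Then you can use map again to store all the tables in my_tables, which will be a list.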

b <-epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y',limit = 6)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]
docs <- map(pmcids,epmc_ftxt)
my_tables <- map(docs,pmc_table)

You can then access, for example, Table 1 of article 2 with:

my_tables[[2]][[1]]

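Edit (1/31/21): To name each article by its PMCID, you can use set_names together with %>% and map. set_names adds names to your vector; when you call it without supplying other names, it uses the vector's elements as names. For example:

docs <- pmcids %>% set_names() %>% map(.,epmc_ftxt)

After that you can call my_tables <- map(docs,pmc_table) separately, or, if you are only interested in the tables and not the full documents, even add it to the chain (storing the whole thing as my_tables).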

Ultimately, you can access an individual table by its PMCID like this:

my_tables
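
To make the name-based access concrete without downloading anything, here is a small offline sketch. It assumes purrr is installed; fake_tables is a hypothetical stand-in for the epmc_ftxt and pmc_table steps, and my_tables_demo is an illustrative name, so the real list contents will differ.

```r
library(purrr)

# Hypothetical stand-in for epmc_ftxt() followed by pmc_table(): returns a
# list holding one table for the given ID, without touching the network.
fake_tables <- function(id) {
  list(data.frame(pmcid = id, note = "first table", stringsAsFactors = FALSE))
}

pmcids <- c("PMC7809753", "PMC7790830")

# set_names() with no extra arguments names each element after itself,
# and map() carries those names over to its result.
my_tables_demo <- map(set_names(pmcids), fake_tables)

names(my_tables_demo)

# Index by PMCID first, then by the table's position within that article.
my_tables_demo[["PMC7809753"]][[1]]
```

Indexing with [["PMC7809753"]] keeps working if the order of the search results changes, which positional access such as [[2]] does not.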
