Extracting affiliation data from a single PubMed record

Problem description

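Using easyPubMed and a lot of searching (I am still very new to R), I have managed to extract affiliation data from a single PubMed record. The problem is that the result contains only part of the affiliation information, which I think is because the affiliation strings are non-standard and mix various kinds of information.

My code is as follows: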

library(easyPubMed)

# PubMed query via easyPubMed using the URL of the XML

my_query <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml"
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id,format = "abstract")
print(my_abstracts_txt[1:16])


my_abstracts_xml <- fetch_pubmed_data(my_entrez_id)
class(my_abstracts_xml)


print(my_titles)


#EasyPubMed Extracting Affiliation data from a single PubMed Record

#Convert XML PubMed records to strings using the articles_to_list function
#Each record in the list is a string that still includes XML tags
my_PM_list <- articles_to_list(my_abstracts_xml)
class(my_PM_list[[4]])
cat(substr(my_PM_list[[4]],1,984))

#Affiliation can be extracted from a specific record using the custom_grep() function
#The fields extracted from the record will be returned as elements of a list or a character vector

curr_PM_record <- my_PM_list[[(length(my_PM_list) - 3)]]
Affiliation_Info.data <- custom_grep(curr_PM_record, tag = "AffiliationInfo")

View(Affiliation_Info.data)

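Ideally, I would like to produce a data frame such as: PMID : Author : Affiliation

(but first I want to focus on extracting all the affiliation information from the PubMed URL)

I am really struggling with this, though, and would appreciate any help on the matter.

Thanks!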

Solution

Here is an approach using xml2...

library( xml2 )
library( magrittr )

#read the xml-data
doc <- xml2::read_xml( "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml" )

#first PMID found in the document
pmid <- xml2::xml_find_first( doc, ".//PMID" ) %>% xml2::xml_text()

#"LastName,ForeName" for each Author in the AuthorList with Type = 'authors'
authors <- paste(
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/LastName" ) %>% xml2::xml_text(),
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/ForeName" ) %>% xml2::xml_text(),
  sep = "," )

#affiliation text for the same set of authors
affiliate <- xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/AffiliationInfo/Affiliation" ) %>% xml2::xml_text()

df <- data.frame( pmid = pmid, authors = authors, affiliate = affiliate )

which gives a data frame with the columns pmid, authors, and affiliate:

[image: the resulting data frame]
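
The code above collects authors and affiliations as separate vectors, so they can fall out of step if an author has no AffiliationInfo element or has more than one. A minimal sketch of a per-author variant follows, assuming the same element names as in the XPath expressions above. The inline XML is a made-up stand-in for the efetch response (not the real record), and joining multiple affiliations with "; " is just one possible choice.

```r
library(xml2)

# Hypothetical stand-in for an efetch response; the values are invented
xml_txt <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">00000001</PMID>
      <AuthorList Type="authors">
        <Author>
          <LastName>Smith</LastName><ForeName>Ann</ForeName>
          <AffiliationInfo><Affiliation>Dept A, University X</Affiliation></AffiliationInfo>
          <AffiliationInfo><Affiliation>Institute B</Affiliation></AffiliationInfo>
        </Author>
        <Author>
          <LastName>Jones</LastName><ForeName>Bob</ForeName>
        </Author>
      </AuthorList>
    </BookDocument>
  </PubmedBookArticle>
</PubmedArticleSet>'
doc <- read_xml(xml_txt)

pmid <- xml_text(xml_find_first(doc, ".//PMID"))
author_nodes <- xml_find_all(doc, ".//AuthorList[@Type = 'authors']/Author")

# Build one row per author, looking up affiliations relative to that author node
rows <- lapply(author_nodes, function(a) {
  aff <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
  data.frame(
    pmid = pmid,
    author = paste(xml_text(xml_find_first(a, "./LastName")),
                   xml_text(xml_find_first(a, "./ForeName")), sep = ","),
    # Collapse several affiliations into one string; NA when there are none
    affiliation = if (length(aff) > 0) paste(aff, collapse = "; ") else NA_character_,
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)
```

To try this on the real record, replace the inline string with the efetch URL in the read_xml() call, as in the answer above.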