Using the Python ElementTree XML parser with a loop

Problem description

I am developing an application, https://share.streamlit.io/carrlucy/hsl_oa/main, that recurses through the Europe PMC database to find open data. The RESTful API it uses returns a "nextCursorMark" field so that queries can be paginated...

I am struggling with how to handle this information and would appreciate some ideas.

I know that the value I am looking for is stored in root[2] of the parsed response.
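Relying on the position root[2] is fragile, because the index shifts if the service adds or reorders elements. A minimal sketch of looking the value up by tag name instead, assuming the response uses the element names shown in the sample output in this question (the inline XML string is a shortened stand-in, not a real response):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a Europe PMC response (assumption: same element
# names as the sample output shown in the question).
sample = """<responseWrapper>
  <version>6.5</version>
  <hitCount>277624</hitCount>
  <nextCursorMark>AoIIQJRo5Sg0MzQwNzg5MQ==</nextCursorMark>
  <resultList/>
</responseWrapper>"""

root = ET.fromstring(sample)

# findtext returns None when the element is absent, so a missing
# nextCursorMark can be detected instead of raising an IndexError.
next_cursor = root.findtext('nextCursorMark')
hit_count = int(root.findtext('hitCount', default='0'))
```

The same two calls should work on the root obtained from ET.parse(...).getroot(), provided the element names match.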

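The following works to get the first set of results (root[4] is the XML element that feeds the other for loops). I think I need to wrap it in another loop, so that each time it sees a new nextCursorMark value it builds a fresh element tree, which is then parsed by the for loops below. I am also concerned that my code is not very polished, so there may be a simpler way to do this; if anything in it does not make sense, I would appreciate thoughts on that too.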

'''

import math
import pandas as pd
import streamlit as st
import numpy as np
import json
import xml.etree.ElementTree as ET
import urllib.request
import rdflib
import altair as alt
from urllib.request import urlopen
from xml.etree.ElementTree import parse

"""
# Europe PMC Open Data Dashboard
"""

builtQuery=('https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=virginia&resultType=core&cursorMark=*&pageSize=50&format=xml')

#https://www.foxinfotech.in/2019/04/python-how-to-read-xml-from-url.html
restQuery=urlopen(builtQuery)

xmlTree=ET.parse(restQuery)
root = xmlTree.getroot()


   
#https://towardsdatascience.com/converting-multi-layered-xml-files-to-dataframes-in-python-using-xmltree-a13f9b043b48


openAccess=[]
authors=[]
date=[]
title=[]
iso=[]
doi=[]

nextPage=root[2].text

if int(root[1].text)<1000:
    for a in root[4]:
        # Each child of root[4] is one <result> element. findtext returns None
        # when a field is missing, so every list stays the same length and no
        # value is carried over from the previous result.
        openAccess.append(a.findtext('.//isOpenAccess'))
        authors.append(a.findtext('.//authorString'))
        date.append(a.findtext('.//firstPublicationDate'))
        title.append(a.findtext('.//title'))
        iso.append(a.findtext('.//ISOAbbreviation'))
        doi.append(a.findtext('.//doi'))
       


df = pd.DataFrame({'Authors':authors,'ArticleTitle':title,'JournalTitle':iso,'date':date,'DOI':doi,'openAccess': openAccess})
df['date'] = pd.to_datetime(df['date'])


openFilter = sorted(df['openAccess'].drop_duplicates()) # select the open access values 
open_Filter = st.sidebar.selectbox('Open Access?',openFilter) # render the streamlit widget on the sidebar of the page using the list we created above for the menu
df2=df[df['openAccess'].str.contains(open_Filter)] # create a dataframe filtered below
st.write(df2.sort_values(by='date'))


df['year']=df['date'].dt.to_period('Y')
df['yearDate'] = df['year'].astype(str)
df3 = df[['yearDate','openAccess']].copy()


valLayer = alt.Chart(df3).mark_bar().encode(x='yearDate',y='count(openAccess)',color='openAccess')

st.altair_chart(valLayer,use_container_width=True)

'''

By the way, I have fixed the URL; its output is:

'''

<responseWrapper xmlns:slx="http://www.scholix.org" xmlns:epmc="https://www.europepmc.org/data" nighteye="disabled">
<script id="tinyhippos-injected"/>
<version>6.5</version>
<hitCount>277624</hitCount>
<nextCursorMark>AoIIQJRo5Sg0MzQwNzg5MQ==</nextCursorMark>
<request>
<queryString>virginia</queryString>
<resultType>core</resultType>
<cursorMark>*</cursorMark>
<pageSize>50</pageSize>
<sort/>
<synonym>false</synonym>
</request>
<resultList>
<result>
'''

Solution
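One possible way to wrap the per-page parsing in an outer loop, offered as a sketch rather than a tested fix for the app. The function names fetch_page, iter_results and result_to_row are hypothetical helpers introduced here. The stop condition (no nextCursorMark, an unchanged one, or an empty page) is an assumption about how the service signals the last page and should be checked against the Europe PMC documentation.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from urllib.request import urlopen

BASE_URL = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search'


def fetch_page(query, cursor_mark, page_size=50):
    """Download one page of results and return the parsed XML root."""
    params = urllib.parse.urlencode({
        'query': query,
        'resultType': 'core',
        'cursorMark': cursor_mark,  # urlencode escapes '=', '+' and '/'
        'pageSize': page_size,
        'format': 'xml',
    })
    with urlopen(f'{BASE_URL}?{params}') as response:
        return ET.parse(response).getroot()


def iter_results(query, fetch=fetch_page, max_pages=None):
    """Yield every <result> element, following nextCursorMark page by page."""
    cursor = '*'  # '*' requests the first page, as in the original URL
    pages = 0
    while max_pages is None or pages < max_pages:
        root = fetch(query, cursor)
        results = root.findall('./resultList/result')
        if not results:
            break
        yield from results
        pages += 1
        next_cursor = root.findtext('nextCursorMark')
        # Assumption: the last page has no new cursor to follow.
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor


def result_to_row(result):
    """Flatten one <result> into a dict; missing fields become None."""
    fields = {
        'openAccess': 'isOpenAccess',
        'Authors': 'authorString',
        'date': 'firstPublicationDate',
        'ArticleTitle': 'title',
        'JournalTitle': 'ISOAbbreviation',
        'DOI': 'doi',
    }
    return {column: result.findtext(f'.//{tag}')
            for column, tag in fields.items()}
```

In the app, the six lists could then be replaced with something like rows = [result_to_row(r) for r in iter_results('virginia', max_pages=20)] followed by pd.DataFrame(rows). With 277624 hits at 50 per page, an uncapped run would need more than 5,000 requests, so keeping a page limit of some kind is probably worthwhile.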


europepmc opendata python xml xml-parsing