Problem description
I am trying to parse a data-heavy XML file. I am using lxml to parse each tag:
import sys
from lxml import etree

sourceFile = sys.argv[1]
events = ("start", "end")
context = etree.iterparse(sourceFile, events=events)
for eachEvent, eachElement in context:
    <the code goes here>
I am running into a problem with the following data:
<QualityData>
<Measure>Care for Older Adults - Functional Status Assessment</Measure>
<Question>Patients,ages 66 years or older,should have a functional status assessment completed every calendar year.</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition,event,or body system) today</SubAnswer2>
<Measure>Care for Older Adults - Pain Screening</Measure>
<Question>Patients,should have a pain assessment at least annually</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today</SubAnswer2>
</QualityData>
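The SubAnswer2 tag occurs twice. The second occurrence comes back with no value (None). Note that the other tags in the second group are read correctly. Also, I only have this problem with this data: there are other samples where SubAnswer2 occurs multiple times and they are parsed successfully. The code I use to read the SubAnswer2 value is: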
if eachElement.tag == 'SubAnswer2' and QDstart and eachEvent == 'start':
    QDlist.append(eachElement.text)
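I also tried parsing with ElementTree, but there I got missing values for other tags. To debug the problem, I wrote a simple program that parses the data and prints it. It looks like, for the missing data, eachElement.text only gets its value when the event is "end". The code I used to print the data: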
for eachEvent, eachElement in context:
    print(eachElement.tag, eachEvent, eachElement.text, sep='::')
The output I get:
QualityData::start::None
Measure::start::Care for Older Adults - Functional Status Assessment
Measure::end::Care for Older Adults - Functional Status Assessment
Question::start::Patients,should have a functional status assessment completed every calendar year.
Question::end::Patients,should have a functional status assessment completed every calendar year.
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
SubAnswer2::end::Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
Measure::start::Care for Older Adults - Pain Screening
Measure::end::Care for Older Adults - Pain Screening
Question::start::Patients,should have a pain assessment at least annually
Question::end::Patients,should have a pain assessment at least annually
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::None
SubAnswer2::end::Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today
QualityData::end::None
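Look at the text of SubAnswer2 when the event is "end". What can I do to make sure the tag text is available at the "start" event?
Thanks.
Solution
Using etree feels too complicated here. This is another way, using SimplifiedDoc from the simplified_scrapy package: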
from simplified_scrapy import SimplifiedDoc
html = '''
<QualityData>
<Measure>Care for Older Adults - Functional Status Assessment</Measure>
<Question>Patients,ages 66 years or older,should have a functional status assessment completed every calendar year.</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition,event,or body system) today</SubAnswer2>
<Measure>Care for Older Adults - Pain Screening</Measure>
<Question>Patients,should have a pain assessment at least annually</Question>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today</SubAnswer2>
</QualityData>
'''
doc = SimplifiedDoc(html)
nodes = doc.QualityData.children
for node in nodes:
    print('{:<15}{}'.format(node.tag, node.text))
# Or
print(doc.SubAnswer2s.text)
Result:
Measure        Care for Older Adults - Functional Status Assessment
Question       Patients,should have a functional status assessment completed every calendar year.
Answer         Date:12/31/2019
SubAnswer2     Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
Measure        Care for Older Adults - Pain Screening
Question       Patients,should have a pain assessment at least annually
Answer         Date:12/31/2019
SubAnswer2     Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today
['Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today','Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today']
More examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
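If you would rather stay with iterparse, the likely cause of the behaviour in the question is that a "start" event fires as soon as the opening tag has been seen, so the element's text may not have been read yet. Both the lxml and the standard-library ElementTree documentation only promise a fully populated element on "end" events. Below is a minimal sketch of reading the text on "end" instead. It assumes the standard-library xml.etree.ElementTree (lxml.etree.iterparse accepts the same events argument) and a shortened, made-up copy of the sample data; collect_text is a hypothetical helper name, not part of either library.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the data from the question (not the real file).
SAMPLE = b"""<QualityData>
<Measure>Care for Older Adults - Pain Screening</Measure>
<Answer>Date:12/31/2019</Answer>
<SubAnswer2>Comprehensive pain assessment completed today</SubAnswer2>
</QualityData>"""


def collect_text(source, tag):
    """Return the text of every <tag> element, read on the "end" event."""
    values = []
    # Only "end" events are requested: at that point the closing tag has
    # been parsed, so element.text is fully populated.
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            values.append(element.text)
    return values


sub_answers = collect_text(io.BytesIO(SAMPLE), "SubAnswer2")
print(sub_answers)
```

In the original loop this would amount to testing eachEvent == 'end' instead of 'start' before appending to QDlist. Whether "start" happens to carry the text depends on how the parser's input was chunked, which would explain why only some data shows the problem.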