XML parsing with lxml iterparse: some tags come back with no text

Problem description

I am trying to parse a data-heavy XML file. I am using lxml to parse each tag:

import sys
from lxml import etree

sourceFile = sys.argv[1]
events = ("start", "end")
context = etree.iterparse(sourceFile, events=events)
for eachEvent, eachElement in context:
    pass  # <the code goes here>

I am running into a problem with the following data:

<QualityData>
        <Measure>Care for Older Adults - Functional Status Assessment</Measure>
        <Question>Patients,ages 66 years or older,should have a functional status assessment completed every calendar year.</Question>
        <Answer>Date:12/31/2019</Answer>
        <SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition,event,or body system) today</SubAnswer2>
        <Measure>Care for Older Adults - Pain Screening</Measure>
        <Question>Patients,should have a pain assessment at least annually</Question>
        <Answer>Date:12/31/2019</Answer>
        <SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today</SubAnswer2>
    </QualityData>

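The SubAnswer2 tag occurs twice, and the second occurrence comes back with no value. Note that the second occurrences of the other tags are read correctly. Also, I only have this problem with this data: there are other samples in which SubAnswer2 occurs multiple times and all of them are parsed successfully. The code I use to read the SubAnswer2 value is: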

if eachElement.tag=='SubAnswer2' and QDstart and eachEvent=='start':
    QDlist.append(eachElement.text)

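I also tried parsing with ElementTree, but that did not give me the tags either. To debug the problem, I wrote a simple program that parses the data and prints it. It looks like, for the missing data, eachElement.text only gets its value when the event is "end". The code I used to print the data: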

for eachEvent, eachElement in context:
    print(eachElement.tag, eachEvent, eachElement.text, sep='::')

The output I get:

QualityData::start::None
Measure::start::Care for Older Adults - Functional Status Assessment
Measure::end::Care for Older Adults - Functional Status Assessment
Question::start::Patients,should have a functional status assessment completed every calendar year.
Question::end::Patients,should have a functional status assessment completed every calendar year.
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
SubAnswer2::end::Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
Measure::start::Care for Older Adults - Pain Screening
Measure::end::Care for Older Adults - Pain Screening
Question::start::Patients,should have a pain assessment at least annually
Question::end::Patients,should have a pain assessment at least annually
Answer::start::Date:12/31/2019
Answer::end::Date:12/31/2019
SubAnswer2::start::None
SubAnswer2::end::Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today
QualityData::end::None

Look at the text of SubAnswer2 when the event is "end". What can I do to make sure the tag text is available on the "start" event?

Thanks.

Solution

Using etree feels too complicated. Here is another way:

from simplified_scrapy import SimplifiedDoc
html = '''
<QualityData>
    <Measure>Care for Older Adults - Functional Status Assessment</Measure>
    <Question>Patients,ages 66 years or older,should have a functional status assessment completed every calendar year.</Question>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Completed comprehensive functional status assessment (not limited to an acute or single condition,event,or body system) today</SubAnswer2>
    <Measure>Care for Older Adults - Pain Screening</Measure>
    <Question>Patients,should have a pain assessment at least annually</Question>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today</SubAnswer2>
</QualityData>
'''
doc = SimplifiedDoc(html)
nodes = doc.QualityData.children
for node in nodes:
    print('{:<15}{}'.format(node.tag,node.text))

# Or, to get the text of every SubAnswer2 element at once
print(doc.SubAnswer2s.text)

Result:

Measure        Care for Older Adults - Functional Status Assessment
Question       Patients,should have a functional status assessment completed every calendar year.
Answer         Date:12/31/2019
SubAnswer2     Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today
Measure        Care for Older Adults - Pain Screening
Question       Patients,should have a pain assessment at least annually
Answer         Date:12/31/2019
SubAnswer2     Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today
['Completed comprehensive functional status assessment (not limited to an acute or single condition,or body system) today','Comprehensive pain assessment (not limited to an acute or single condition,or body system) completed today']

More examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
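
For readers who would rather stay with iterparse: with both lxml and the standard library's ElementTree, an element's text is not guaranteed to be present yet when the "start" event is delivered; only the "end" event guarantees that the element has been parsed completely. That matches the debug output in the question, where the second SubAnswer2 has text None on "start" and the full text on "end". A minimal sketch of reading the text on "end" instead follows. The helper name collect_text and the shortened inline sample are illustrative assumptions, not part of the original code, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse
    # accepts the same source/events arguments used below.
    import xml.etree.ElementTree as etree

# Shortened copy of the sample data from the question.
SAMPLE = b"""<QualityData>
    <Measure>Care for Older Adults - Functional Status Assessment</Measure>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Completed comprehensive functional status assessment today</SubAnswer2>
    <Measure>Care for Older Adults - Pain Screening</Measure>
    <Answer>Date:12/31/2019</Answer>
    <SubAnswer2>Comprehensive pain assessment completed today</SubAnswer2>
</QualityData>"""


def collect_text(source, tag):
    """Return the text of every <tag> element, read on the 'end' event.

    collect_text is a hypothetical helper for this sketch. 'source' may be
    a file name or a binary file-like object.
    """
    values = []
    for event, element in etree.iterparse(source, events=("end",)):
        # On 'end' the element has been fully parsed, so .text is reliable.
        if element.tag == tag:
            values.append(element.text)
    return values


sub_answers = collect_text(io.BytesIO(SAMPLE), "SubAnswer2")
for value in sub_answers:
    print(value)
```

In the original loop this would amount to testing eachEvent == 'end' rather than 'start' before appending eachElement.text. For very large files, calling element.clear() once an element has been processed is a common way to keep memory use down, though whether it is needed depends on the data.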
