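How do I parse attribute values from multiple XML files into a single pandas DataFrame?

Problem description

I have several XML files in a folder. They are system-generated and appear every night; on any given night there may be anywhere from 1 to 200 of them. Their structure is fixed and never changes. They contain more data than the sample I've provided, but the timestamp data is enough to illustrate my problem.

I am writing a script (only the part relevant to my problem is included below) that reads the data from these files into a pandas DataFrame for further processing and then deletes the files from the folder.

My XML files look like this: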

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
</scan>

My script is as follows:

import os
import pandas as pd 
import xml.etree.ElementTree as et 

path = 'my\\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path,filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot() 
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
                
            rows.append({'timestamp': s_timestamp})                    
            out_df = pd.DataFrame(rows,columns = df_cols)

Now, if I print(s_timestamp), I get:

20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41

This is what I want the DataFrame to contain once everything has been appended. But if I print(rows), I get this:

[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'},{'timestamp': '20200909T08:41:41'}]

As a result, when I print(out_df) I also get four results:

              timestamp
0     20200909T08:13:42
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

The result I want is:

              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

I understand that something about the loop and the append is causing this, but I can't see why it happens.

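Solution

Nothing is wrong with the append itself. rows keeps growing as each file is read, and because out_df is rebuilt inside the inner loop, printing there shows every intermediate state. Create the DataFrame only once, after the loops have finished, by moving the last line out of the loops: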

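import os
import pandas as pd
import xml.etree.ElementTree as et

path = 'my\\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            # Collect one row per <scan> element
            s_timestamp = n.attrib.get('timestamp')
            rows.append({'timestamp': s_timestamp})

# Build the DataFrame once, after all files have been read
out_df = pd.DataFrame(rows, columns=df_cols)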
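
If it helps, the same idea can be wrapped in a small function. This is only a sketch, not part of the original script: the name load_timestamps is made up for illustration, and using pathlib with sorted() (for a deterministic file order) and the 'scans/scan' path expression are my own choices. It assumes every file has the layout shown in the question.

```python
import xml.etree.ElementTree as et
from pathlib import Path

import pandas as pd


def load_timestamps(folder):
    """Return one DataFrame holding the timestamp of every <scan> in every XML file.

    Hypothetical helper for illustration; assumes the <scan>/<scans>/<scan> layout
    from the question.
    """
    rows = []
    # sorted() gives a deterministic row order; directory listings are otherwise unordered
    for xml_file in sorted(Path(folder).glob('*.xml')):
        root = et.parse(xml_file).getroot()
        # 'scans/scan' matches each <scan> child of the <scans> element;
        # it yields nothing (instead of raising) when <scans> is missing
        for node in root.findall('scans/scan'):
            rows.append({'timestamp': node.get('timestamp')})
    # Build the DataFrame once, after all rows have been collected
    return pd.DataFrame(rows, columns=['timestamp'])
```

Calling out_df = load_timestamps(path) would then give a single DataFrame with one row per scan element. Collecting plain dicts in a list and constructing the DataFrame at the end also avoids rebuilding it once per row.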