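How do I parse attribute values from multiple XML files into one pandas DataFrame?

Problem description

I have several XML files in a folder. They are system-generated and arrive every night; there can be anywhere from 1 to 200 of them per night. The structure is fixed and never changes. They contain more data than the sample I've provided, but the timestamp data is enough to illustrate my problem.

What I'm doing is writing a script (only the part relevant to my problem is included below) that reads the data from these files into a pandas DataFrame for further processing and then deletes the files from the folder.

My XML files look like this: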

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
</scan>


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
</scan>


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
</scan>

My script is as follows:

import os
import pandas as pd 
import xml.etree.ElementTree as et 

path = 'my\\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path,filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot() 
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
                
            rows.append({'timestamp': s_timestamp})                    
            out_df = pd.DataFrame(rows,columns = df_cols)

Now, if I print(s_timestamp), I get:

20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41

This is what I want the DataFrame to contain once everything has been appended. But if I print(rows), I get this:

[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'},{'timestamp': '20200909T08:41:41'}]

As a result, when I print(out_df) I also get four outputs:

              timestamp
0     20200909T08:13:42
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

The result I want is:

              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

I understand that something about the loop and the append is causing this, but I can't see why it happens.

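Solution

Build the DataFrame once, after the loops have finished, instead of inside them. In your script the pd.DataFrame(...) call sits in the inner for loop, so out_df is rebuilt every time a row is appended, and a print placed inside the loop shows each intermediate state. The rows list itself accumulates correctly, so moving the DataFrame creation outside the loop is enough: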

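import os
import pandas as pd
import xml.etree.ElementTree as et

path = 'my\\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xroot = et.parse(fullname).getroot()
        scans = xroot.find('scans')
        # Collect one row per <scan> element; no DataFrame is built here.
        for n in scans.findall('scan'):
            rows.append({'timestamp': n.attrib.get('timestamp')})

# Build the DataFrame once, after every file has been processed.
out_df = pd.DataFrame(rows, columns=df_cols)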
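
If you want to try the same idea without touching your real folder, here is a minimal, self-contained sketch. It is only an illustration under some assumptions: the helper name collect_scan_rows, the file names scan_0.xml etc., and the temporary directory are all made up for the demo, and the sample XML assumes the root element is properly closed with </scan>. The final pd.to_datetime step is optional and assumes every timestamp follows the YYYYMMDDTHH:MM:SS pattern shown in the question.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd


def collect_scan_rows(folder):
    """Return one dict per <scan> element found under <scans> in each .xml file."""
    rows = []
    # sorted() gives a deterministic file order; directory listing order is not guaranteed.
    for xml_file in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # "scans/scan" matches <scan> children of the root's <scans> child.
        for scan in root.findall("scans/scan"):
            rows.append({"timestamp": scan.get("timestamp")})
    return rows


# Hypothetical sample data modelled on the files in the question.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    '<scan><scans><scan timestamp="{ts}" more_attributes="more_values"/></scans></scan>\n'
)
STAMPS = [
    "20200909T08:13:42",
    "20200909T08:22:55",
    "20200909T08:29:27",
    "20200909T08:41:41",
]

with tempfile.TemporaryDirectory() as tmp:
    # Write one demo file per timestamp, then read them all back.
    for i, ts in enumerate(STAMPS):
        Path(tmp, f"scan_{i}.xml").write_text(SAMPLE.format(ts=ts), encoding="utf-8")
    rows = collect_scan_rows(tmp)

# Build the DataFrame once, after all files have been read.
out_df = pd.DataFrame(rows, columns=["timestamp"])
# Optional: turn the strings into real datetimes for further processing.
out_df["timestamp"] = pd.to_datetime(out_df["timestamp"], format="%Y%m%dT%H:%M:%S")
print(out_df)
```

Keeping the parsing in a function that returns a plain list of dicts makes it easy to add more attributes to each row later, and the DataFrame is still created in exactly one place.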

elementtree pandas python python-3.x xml