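Problem description
I have several XML files in a folder. They are system generated and show up every night. There can be anywhere from 1 to 200 of them each night. The structure is fixed and never changes. They contain more data than the sample I provide, but the timestamp data is enough to illustrate my problem.
What I'm doing is writing a script (only the part related to my problem is included below) that reads the data in the files into a pandas DataFrame for further processing and then deletes the files in the folder.
My XML files look like this: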
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
</scan>
My script looks like this:
import os
import pandas as pd
import xml.etree.ElementTree as et

path = 'my\\path'
df_cols = ['timestamp']
rows = []
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
            rows.append({'timestamp': s_timestamp})
            out_df = pd.DataFrame(rows, columns=df_cols)
Now, if I print(s_timestamp), I get:
20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41
This is what I want the DataFrame to contain after appending. But if I print(rows), I get this:
[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'},{'timestamp': '20200909T08:22:55'},{'timestamp': '20200909T08:29:27'},{'timestamp': '20200909T08:41:41'}]
So when I print(out_df), I also get four results:
timestamp
0 20200909T08:13:42
timestamp
0 20200909T08:13:42
1 20200909T08:22:55
timestamp
0 20200909T08:13:42
1 20200909T08:22:55
2 20200909T08:29:27
timestamp
0 20200909T08:13:42
1 20200909T08:22:55
2 20200909T08:29:27
3 20200909T08:41:41
The result I want is:
timestamp
0 20200909T08:13:42
1 20200909T08:22:55
2 20200909T08:29:27
3 20200909T08:41:41
I understand that something in the loop and the append is causing this, but I can't see why it happens.
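Solution
rows keeps growing across iterations, so building (and printing) the DataFrame inside the loop gives you one progressively longer frame per pass. Create the df only once, after the loops have finished, with the pd.DataFrame(...) line moved out of the loop: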
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
            rows.append({'timestamp': s_timestamp})

# Build the DataFrame once, after every file has been read.
out_df = pd.DataFrame(rows, columns=df_cols)
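
If it helps to see the same idea end to end, here is a minimal self-contained sketch that you can run without your real folder. The function name collect_timestamps, the temporary demo folder, and the scan_0.xml / scan_1.xml file names are all made up for illustration; the sample files only mimic the structure shown in the question. It also assumes you are fine with sorting the file names, since os.listdir does not guarantee any order.

```python
import os
import tempfile
import xml.etree.ElementTree as et

import pandas as pd


def collect_timestamps(path):
    """Return one DataFrame with the timestamp of every <scan> under <scans>."""
    rows = []
    # sorted() gives a deterministic row order (an assumption, not a requirement).
    for filename in sorted(os.listdir(path)):
        if not filename.endswith('.xml'):
            continue
        xroot = et.parse(os.path.join(path, filename)).getroot()
        # iterfind yields nothing (instead of failing) when <scans> is missing.
        for n in xroot.iterfind('scans/scan'):
            rows.append({'timestamp': n.attrib.get('timestamp')})
    # The DataFrame is built once, after both loops are done.
    return pd.DataFrame(rows, columns=['timestamp'])


# Hypothetical demo data written to a temporary folder.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<scan><scans><scan timestamp="{ts}"/></scans></scan>'
)
demo_dir = tempfile.mkdtemp()
stamps = ['20200909T08:13:42', '20200909T08:22:55']
for i, ts in enumerate(stamps):
    file_path = os.path.join(demo_dir, f'scan_{i}.xml')
    with open(file_path, 'w', encoding='utf-8') as fh:
        fh.write(SAMPLE.format(ts=ts))

out_df = collect_timestamps(demo_dir)
print(out_df)
```

Wrapping the loops in a function makes it hard to build the frame too early by accident, because the return statement can only sit after the loops.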