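Problem description
I am new to XML. Is there an efficient way to match text and update an XML file using a pandas DataFrame?
This is a small part of my large XML file; it still follows the proper format.
XML file (input.xml):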
<?xml version="1.0" encoding="UTF-8"?>
<brand by="hhdhdh" date="2014/01/01" name="OOP-112200" Insti="TGA">
<design name="OOP-112200" own="TGA" descri="" sound_db="JJKO">
<sec name="abcd" sound_freq="abcd" c_ty="pv">
<feature number="48">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="abcd" first_name="g7tty" description="xyz" />
</sec>
<sec name="M_20_K40745170" sound_freq="mhr17:7907527-7907589" tension="SGCGSCGSCGSCGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="g7tty" description="xyz">
</mwan>
</sec>
<sec name="M_20_K40745171" sound_freq="mhr17:7907528-7907599" tension="SGCGSCGSCGSHHGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="gtftty" description="xyz">
<xyz abc="trt" id="abc" />
<per fre="acc" value="abc" />
<per fre="xyz" value="abc" />
<per fre="yy" value="abc" />
</mwan>
</sec>
#file continue....
</design>
</brand>
DataFrame (used as input):
name Volum_5mb Volum_40mb Volum_70mb
1 M_20_K40745170 89.00 44.00 77.00
2 M_20_K40745171 77.00 65.00 94.00
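I want to match the values in the name column against the name attribute of the sec elements and, when there is a match, use the remaining columns to create new per elements as shown below. For example, if M_20_K40745170 (from df['name']) is present in the XML, the corresponding mwan node in the output file should be updated with the following lines: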
<per fre="Volum_5mb" value="89.00"/>
<per fre="Volum_40mb" value="44.00"/>
<per fre="Volum_70mb" value="77.00"/>
And so on for the other rows.
Desired XML (output.xml):
<?xml version="1.0" encoding="UTF-8"?>
<brand by="hhdhdh" date="2014/01/01" name="OOP-112200" Insti="TGA">
<design name="OOP-112200" own="TGA" descri="" sound_db="JJKO">
<sec name="abcd" sound_freq="abcd" c_ty="pv">
<feature number="48">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="abcd" first_name="g7tty" description="xyz" />
</sec>
<sec name="M_20_K40745170" sound_freq="mhr17:7907527-7907589" tension="SGCGSCGSCGSCGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="g7tty" description="xyz">
<per fre="Volum_5mb" value="89.00" />
#new attribute FYI
<per fre="Volum_40mb" value="44.00" />
#new attribute FYI
<per fre="Volum_70mb" value="77.00" />
#new attribute FYI
</mwan>
</sec>
<sec name="M_20_K40745171" sound_freq="mhr17:7907528-7907599" tension="SGCGSCGSCGSHHGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="gtftty" description="xyz">
<xyz abc="trt" id="abc" />
<per fre="acc" value="abc" />
<per fre="xyz" value="abc" />
<per fre="yy" value="abc" />
<per fre="Volum_5mb" value="77.00" />
#new attribute FYI
<per fre="Volum_40mb" value="65.00" />
#new attribute FYI
<per fre="Volum_70mb" value="94.00" />
#new attribute FYI
</mwan>
</sec>
#file continue....
</design>
</brand>
I am trying the xml.etree.ElementTree module:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
for i in range(len(df)):
    for node in tree.findall("./design/sec"):
        name = node.attrib.get('name')
        if name == df.loc[i,'name']:
            print(name)
I am new to working with XML in Python, and I do not know how to add these new entries to the XML file from a pandas DataFrame. Please help. Thanks and regards.
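Solution
You may want to learn XML and XPath, because the main problem is not related to pandas but to XML.
You can use a more complex XPath expression to find the sec node named M_20_K40745170 and its mwan descendant, in which you then search for per elements and update them (or add new per elements):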
node = root.find('./design/sec[@name="M_20_K40745170"]//mwan')
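To make the expression above concrete, here is a small self-contained sketch. The XML string is a trimmed copy of the question's data (an assumption made only to keep the example short). Note that xml.etree.ElementTree supports just a limited XPath subset, which does include the [@attrib="value"] predicate and // used here.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's input.xml
sample = '''<brand>
  <design>
    <sec name="abcd"><mwan first_name="g7tty" /></sec>
    <sec name="M_20_K40745170"><mwan first_name="g7tty" /></sec>
  </design>
</brand>'''

root = ET.fromstring(sample)

# find() returns the first matching element, or None when nothing matches
node = root.find('./design/sec[@name="M_20_K40745170"]//mwan')
missing = root.find('./design/sec[@name="no_such_name"]//mwan')

# findall() returns a list of every match (possibly empty)
all_secs = root.findall('./design/sec')

print(node.tag, node.attrib)
print(missing)
print(len(all_secs))
```

Because find() returns None for a name that is not in the file, it is worth checking the result before using it.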
You can use df.iterrows() to do this for every row:
for index,row in df.iterrows():
    node = root.find('./design/sec[@name="{}"]//mwan'.format(row['name']))
Then you can search that node for the per element whose fre is "Volum_5mb":
item = node.find('./per[@fre="Volum_5mb"]')
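and create the element if it is missing and/or update its value:
if item is None: # do not use "if not item": a found element without children is also falsy
    item = ET.SubElement(node,'per')
    item.set('fre',"Volum_5mb")
item.set('value',str(row['Volum_5mb']))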
You can loop over the list ['Volum_5mb','Volum_40mb','Volum_70mb'] to handle all columns:
for fre in ['Volum_5mb','Volum_40mb','Volum_70mb']:
    item = node.find('./per[@fre="{}"]'.format(fre))
    #print(fre,item)
    if item is None:
        item = ET.SubElement(node,'per')
        item.set('fre',fre)
    item.set('value',str(row[fre]))
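The update-or-create step can also be wrapped in a small helper. The sketch below is only an illustration: the function name upsert_per and the one-line mwan sample are assumptions, not part of the original code. It compares the result of find() with None explicitly, because in ElementTree an element that has no children evaluates as false, so a truthiness test could create a duplicate per element.

```python
import xml.etree.ElementTree as ET

def upsert_per(node, fre, value):
    """Update the <per fre="..."> child of node, creating it if absent.

    upsert_per is a hypothetical helper name used for illustration.
    """
    item = node.find('./per[@fre="{}"]'.format(fre))
    if item is None:  # compare with None; do not rely on truthiness
        item = ET.SubElement(node, 'per')
        item.set('fre', fre)
    item.set('value', str(value))
    return item

# Tiny stand-in for one <mwan> node from the question
mwan = ET.fromstring('<mwan><per fre="Volum_5mb" value="1.00" /></mwan>')

upsert_per(mwan, 'Volum_5mb', '89.00')   # existing element: value is updated
upsert_per(mwan, 'Volum_40mb', '44.00')  # missing element: a new one is appended

print(ET.tostring(mwan).decode())
```

With such a helper, the inner loop body reduces to one call per column name.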
Minimal working code with the example data embedded directly in the code; in practice you should read the data from files.
text = ''' name Volum_5mb Volum_40mb Volum_70mb
1 M_20_K40745170 89.00 44.00 77.00
2 M_20_K40745171 77.00 65.00 94.00'''
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<brand by="hhdhdh" date="2014/01/01" name="OOP-112200" Insti="TGA">
<design name="OOP-112200" own="TGA" descri="" sound_db="JJKO">
<sec name="abcd" sound_freq="abcd" c_ty="pv">
<feature number="48">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="abcd" first_name="g7tty" description="xyz" />
</sec>
<sec name="M_20_K40745170" sound_freq="mhr17:7907527-7907589" tension="SGCGSCGSCGSCGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="g7tty" description="xyz">
</mwan>
</sec>
<sec name="M_20_K40745171" sound_freq="mhr17:7907528-7907599" tension="SGCGSCGSCGSHHGSC" s_c="0">
<feature number="5748">
<tfgt v="0.1466469683747654" y="0.0" units="sec" />
</feature>
<mwan sound_freq="mhr17:7907527-7907589" first_name="gtftty" description="xyz">
<xyz abc="trt" id="abc" />
<per fre="acc" value="abc" />
<per fre="xyz" value="abc" />
<per fre="yy" value="abc" />
</mwan>
</sec>
</design>
</brand>'''
import pandas as pd
import io
import xml.etree.ElementTree as ET
#df = pd.read_csv('input.csv')
df = pd.read_csv(io.StringIO(text),sep=r'\s+') # raw string avoids an invalid-escape warning
#print(df)
#tree = ET.parse('input.xml')
#root = tree.getroot()
root = ET.fromstring(xml)
tree = ET.ElementTree(root)
for index,row in df.iterrows():
    node = root.find('./design/sec[@name="{}"]//mwan'.format(row['name']))
    if node is None:
        continue # no matching sec/mwan for this row
    for fre in ['Volum_5mb','Volum_40mb','Volum_70mb']:
        item = node.find('./per[@fre="{}"]'.format(fre))
        #print('item:',fre,'=',item)
        if item is None: # "not item" is also true for a found element without children
            #print('new',item,fre)
            item = ET.SubElement(node,'per')
            #item.tail = '\n ' # to pretty print
            item.set('fre',fre)
        # read_csv parses the values as floats; two decimals reproduce "89.00"
        item.set('value','{:.2f}'.format(row[fre]))
    #print(ET.tostring(node).decode())
#---
print( ET.tostring(root).decode() )
#tree.write('output.xml')
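Since the question mentions a large XML file, a possible variation is to walk the sec elements once and look each name up in a dictionary built from the DataFrame, instead of running one XPath search per row. The following is a sketch under assumptions, not a benchmarked solution: the shortened XML and the variable names lookup and xml_text are made up for the example, dtype=str is used so that values keep their original text such as "89.00", and ET.indent requires Python 3.9 or newer.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

text = '''name Volum_5mb Volum_40mb Volum_70mb
M_20_K40745170 89.00 44.00 77.00
M_20_K40745171 77.00 65.00 94.00'''

# Shortened stand-in for input.xml
xml_text = '''<brand><design>
<sec name="abcd"><mwan /></sec>
<sec name="M_20_K40745170"><mwan></mwan></sec>
<sec name="M_20_K40745171"><mwan><per fre="acc" value="abc" /></mwan></sec>
</design></brand>'''

# dtype=str keeps the values exactly as written (e.g. "89.00")
df = pd.read_csv(io.StringIO(text), sep=r'\s+', dtype=str)
# {name: {column: value, ...}} for direct lookup by name
lookup = df.set_index('name').to_dict(orient='index')

root = ET.fromstring(xml_text)
for sec in root.iterfind('./design/sec'):
    values = lookup.get(sec.get('name'))
    mwan = sec.find('.//mwan')
    if values is None or mwan is None:
        continue  # no matching row, or no <mwan> to attach to
    for fre, value in values.items():
        item = mwan.find('./per[@fre="{}"]'.format(fre))
        if item is None:
            item = ET.SubElement(mwan, 'per')
            item.set('fre', fre)
        item.set('value', value)

ET.indent(root)  # pretty-print in place; requires Python 3.9+
result = ET.tostring(root, encoding='unicode')
print(result)
# To save instead of printing:
# ET.ElementTree(root).write('output.xml', encoding='UTF-8', xml_declaration=True)
```

The dictionary lookup keeps the work proportional to the number of sec elements plus the number of rows, which may matter when both are large.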