BeautifulSoup select with grandchildren

Problem description

I have been struggling with this for quite some time now.

Given the following XML file, I came up with a BeautifulSoup solution that uses the child combinator to get the IDs from the entry tags. The XML file:

<?xml version='1.0' encoding='UTF-8'?>
<html>
    <body>
        <feed xml:base="https:newrecipes.org"
            xmlns="http://www.w3.org/2005/Atom"
            xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
            xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
            <id>https://recipes.com</id>
            <title>Cuisine</title>
            <updated>2020-08-10T08:48:56.800Z</updated>
            <link href="Cuisine" rel="self" title="Cuisine"/>
            <entry>
                <id>https://www.cuisine.org(53198770598313985)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313985</d:id>
                        <d:name m:type="Edm.String">American</d:name>
                    </m:properties>
                </content>
            </entry>
            <entry>
                <id>https://www.cuisine.org(53198770598313986)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313986</d:id>
                        <d:name m:type="Edm.String">Asian</d:name>
                    </m:properties>
                </content>
            </entry>
        </feed>
      </body>
     </html>
    

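The code:

from bs4 import BeautifulSoup
import re

# Make a BS object to parse the xml string.
xml_soup = BeautifulSoup(xml_string, features="lxml")

# Use the child combinator to select the ids that are direct descendants of entry.
cuisine_ids_unparsed = xml_soup.select("entry > id")

# Get the ids from the Tag value using regex.
# Then return the first occurrence of the regex found.
cuisine_ids = [re.findall(r"\((.*)\)", cuisine_id.text)[0] for cuisine_id in cuisine_ids_unparsed]

This returns all cuisine IDs found inside the parentheses in the file. But I would also like to access the properties inside each entry's <content>, because they contain both the ID and the name of the cuisine without any parsing. Unfortunately, with the CSS child combinator (>) I have not been able to go any deeper, and I am wondering whether there is a better way than looping over the elements to extract the values. Something like this:

cuisine_ids_unparsed = xml_soup.select("entry > content > properties > id")

to retrieve all IDs, and the same selector ending in name instead of id to retrieve all names.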

Solution

You can use the zip() function to "tie" the two tags together:

import re
from bs4 import BeautifulSoup


txt = '''<?xml version='1.0' encoding='UTF-8'?>
<html>
    <body>
        <feed xml:base="https:newrecipes.org"
            xmlns="http://www.w3.org/2005/Atom"
            xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
            xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
            <id>https://recipes.com</id>
            <title>Cuisine</title>
            <updated>2020-08-10T08:48:56.800Z</updated>
            <link href="Cuisine" rel="self" title="Cuisine"/>
            <entry>
                <id>https://www.cuisine.org(53198770598313985)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313985</d:id>
                        <d:name m:type="Edm.String">American</d:name>
                    </m:properties>
                </content>
            </entry>
            <entry>
                <id>https://www.cuisine.org(53198770598313986)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313986</d:id>
                        <d:name m:type="Edm.String">Asian</d:name>
                    </m:properties>
                </content>
            </entry>
        </feed>
      </body>
</html>'''

soup = BeautifulSoup(txt,'xml')


for id_,name in zip(soup.select('entry > id'),soup.select('entry > content > m|properties > d|name')):
    print(re.search(r'\((.*?)\)',id_.text).group(1))
    print(name.text)
    print('-' * 80)

Prints:

53198770598313985
American
--------------------------------------------------------------------------------
53198770598313986
Asian
--------------------------------------------------------------------------------
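One caveat with pairing two separate select() results through zip() is that they are matched purely by position, so an entry that lacks a name would shift every later pair. Below is a minimal sketch of a per-entry lookup, assuming the same feed layout as above. The function name cuisines_by_entry, the trimmed SAMPLE feed (whose second entry deliberately omits its name) and the explicit NAMESPACES mapping are illustrative assumptions, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Namespace prefixes copied from the feed in the question; passing them
# explicitly is an assumption made here so the sketch does not rely on
# namespace auto-detection.
NAMESPACES = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Trimmed-down sample feed; the second entry has no <d:name> on purpose.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313985</d:id>
        <d:name m:type="Edm.String">American</d:name>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313986</d:id>
      </m:properties>
    </content>
  </entry>
</feed>"""


def cuisines_by_entry(xml_text):
    """Return one (id, name) tuple per <entry>, using None for a missing tag."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser requires lxml
    rows = []
    for entry in soup.select("entry", namespaces=NAMESPACES):
        # Both lookups are scoped to the current entry, so a missing tag
        # cannot misalign the ids and names of other entries.
        id_tag = entry.select_one(
            "content > m|properties > d|id", namespaces=NAMESPACES)
        name_tag = entry.select_one(
            "content > m|properties > d|name", namespaces=NAMESPACES)
        rows.append((
            id_tag.get_text(strip=True) if id_tag else None,
            name_tag.get_text(strip=True) if name_tag else None,
        ))
    return rows


if __name__ == "__main__":
    for cuisine_id, cuisine_name in cuisines_by_entry(SAMPLE):
        print(cuisine_id, cuisine_name)
```

Because each row is built from a single entry, an entry without a name yields None in that slot instead of borrowing the name of the next entry.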

Using some of @Andrej Kesely's suggestions, but with a regular expression in place of zip:

import re
from bs4 import BeautifulSoup

txt = '''<?xml version='1.0' encoding='UTF-8'?>
<html>
    <body>
        <feed xml:base="https:newrecipes.org"
            xmlns="http://www.w3.org/2005/Atom"
            xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
            xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
            <id>https://recipes.com</id>
            <title>Cuisine</title>
            <updated>2020-08-10T08:48:56.800Z</updated>
            <link href="Cuisine" rel="self" title="Cuisine"/>
            <entry>
                <id>https://www.cuisine.org(53198770598313985)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313985</d:id>
                        <d:name m:type="Edm.String">American</d:name>
                    </m:properties>
                </content>
            </entry>
            <entry>
                <id>https://www.cuisine.org(53198770598313986)</id>
                <category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
                <title></title>
                <updated>1970-01-01T00:00:00.000Z</updated>
                <content type="application/xml">
                    <m:properties>
                        <d:id m:type="Edm.Int64">53198770598313986</d:id>
                        <d:name m:type="Edm.String">Asian</d:name>
                    </m:properties>
                </content>
            </entry>
        </feed>
      </body>
</html>'''


xml_soup = BeautifulSoup(txt,features="xml")

properties_unparsed = xml_soup.select('entry > content > m|properties')

for prop in properties_unparsed:
    # Extract the id and name from the text of the property.
    # The id is a sequence of digits and the name a sequence of letters.
    # re.search with \s* tolerates the whitespace that the parser keeps
    # around and between the two values.
    id_, name = re.search(r'(\d+)\s*(\w+)', prop.text).groups()
    print(id_, name)
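A pattern such as (\w+) only captures a single run of word characters, so a cuisine name containing a space would be cut off at the first word. As a hedged alternative, the sketch below reads the two text nodes through Tag.stripped_strings instead of a regex. It assumes every m:properties element holds exactly an id followed by a name; the function name id_name_pairs, the SAMPLE feed and the "Middle Eastern" entry with its ID are made up for illustration:

```python
from bs4 import BeautifulSoup

# Only the "m" prefix is needed by the selector below; the URI is copied
# from the feed in the question.
NAMESPACES = {
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hypothetical sample: the second entry is invented to show a
# multi-word name, it does not appear in the original feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313985</d:id>
        <d:name m:type="Edm.String">American</d:name>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313987</d:id>
        <d:name m:type="Edm.String">Middle Eastern</d:name>
      </m:properties>
    </content>
  </entry>
</feed>"""


def id_name_pairs(xml_text):
    """Return (id, name) tuples read from each m:properties element."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser requires lxml
    pairs = []
    for prop in soup.select("entry > content > m|properties",
                            namespaces=NAMESPACES):
        # stripped_strings yields every non-empty text node with the
        # surrounding whitespace removed; unpacking into two names
        # assumes exactly two such nodes (id first, then name).
        id_, name = prop.stripped_strings
        pairs.append((id_, name))
    return pairs


if __name__ == "__main__":
    for id_, name in id_name_pairs(SAMPLE):
        print(id_, name)
```

The unpacking raises ValueError if an element holds more or fewer than two text nodes, which makes a layout change visible instead of silently producing a wrong pair.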
