BeautifulSoup: finding nested tags

Question

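I'm stuck on this:

<span>Alpha<span class="class_xyz">Beta</span></span>

I want to scrape only the text of the outer span, "Alpha", without the text of the nested span, "Beta". How would you do it?

I'm trying to write a function that finds all span tags that have no class attribute, but some of the approaches I've tried don't work...

Thanks.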

Answers

One way to handle it:

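from bs4 import BeautifulSoup as bs

txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt, 'lxml')  # requires the lxml package to be installed
# Select the first span that carries a class attribute (the nested one)
target = soup.select_one('span[class]')
# Remove it from the tree, so only the outer span's own text is left
target.decompose()
soup.text.strip()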

Output:

'Alpha'
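
Note that decompose() modifies the parsed tree. If the nested span is still needed afterwards, a non-destructive variant is to join only the text nodes that are direct children of the outer span. This is a minimal sketch, not part of the original answer; the helper name direct_text is made up for illustration, and it uses html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Return the text that belongs to `tag` itself, skipping child tags.

    Hypothetical helper for illustration: it keeps only NavigableString
    children, so text inside nested tags is ignored.
    """
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


html = '<span>Alpha<span class="class_xyz">Beta</span></span>'
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("span")  # first span in document order, i.e. the outer one
print(direct_text(outer))  # Alpha
```

The nested span stays in the tree, so it can still be queried later.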

Here is another way, which collects the text of every span tag that has no class attribute:

from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(html, "html.parser")

# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()

# The remaining spans are the ones without a class attribute
out = [span.text.strip() for span in soup.select("span")]

print(out)

Output:

['Alpha', 'Gamma', 'Epsilon']

Or, if you need the whole span tags rather than just their text:

from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(html, "html.parser")

# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()

# What is left are the outer spans, now without their nested children
out = soup.select("span")

print(out)

Output:

[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
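
Since the question asks for a function that finds all span tags without a class attribute, here is one possible sketch that does so without removing anything from the tree. It is not from the answers above: the names has_no_class and classless_span_texts are hypothetical, and it relies on find_all accepting a filter function and on find_all(string=True, recursive=False) returning only a tag's direct text nodes.

```python
from bs4 import BeautifulSoup


def has_no_class(tag):
    """Filter for find_all: True for <span> tags with no class attribute."""
    return tag.name == "span" and not tag.has_attr("class")


def classless_span_texts(html):
    """Return the own text of each class-less span, ignoring nested tags.

    Hypothetical helper for illustration; the parsed tree is left untouched.
    """
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for span in soup.find_all(has_no_class):
        # Only direct text children, so "Beta", "Delta", "Zeta" are skipped
        own_strings = span.find_all(string=True, recursive=False)
        texts.append("".join(own_strings).strip())
    return texts


html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""

print(classless_span_texts(html))  # ['Alpha', 'Gamma', 'Epsilon']
```

Compared with the decompose() versions, this keeps the nested spans available for later queries, at the cost of a slightly longer helper.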
