Problem description
I'm rather confused by this:
<span>Alpha<span class="class_xyz">Beta</span></span>
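I'm trying to scrape only the text of the first span, "Alpha" (excluding the second, nested "Beta"). How would you do it?
I'm trying to write a function that finds all span tags without a class attribute, but something isn't working...
Thanks.
Solution
One way to handle it: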
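from bs4 import BeautifulSoup as bs

txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt, 'lxml')  # the 'lxml' parser requires the lxml package
# Select the first span that has a class attribute (the nested one)
target = soup.select_one('span[class]')
# Remove it, and its text, from the tree
target.decompose()
soup.text.strip()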
Output:
'Alpha'
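Note that decompose() modifies the parsed tree. If the nested span is still needed afterwards, a non-destructive alternative is to read only the outer span's direct text nodes. The following is a minimal sketch rather than part of the original answer; it assumes the outer span is the first span in the document, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '<span>Alpha<span class="class_xyz">Beta</span></span>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first <span> in document order, which is the outer one here
outer = soup.find("span")

# string=True matches text nodes; recursive=False limits the search to
# direct children, so text inside the nested span is skipped
direct_text = "".join(outer.find_all(string=True, recursive=False)).strip()
print(direct_text)
```

Because nothing is removed, the nested span remains available in soup for later lookups.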
Here is another way to get the text of every span tag that has no class attribute:
from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()

# The spans that remain have no class attribute; collect their text
out = [span.text.strip() for span in soup.select("span")]
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or, if you need the whole span tags:
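from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()

# What remains are the outer span tags, now without their nested children
out = soup.select("span")
print(out)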
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
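Since the question asks for a function that finds all span tags without a class attribute, the same idea can also be wrapped up without deleting anything from the tree. This is only a sketch under stated assumptions: the function name direct_text_of_classless_spans is hypothetical (it does not come from the answers above), and "without a class attribute" is taken to mean that the attribute is absent altogether, so a span with class="" would not match.

```python
from bs4 import BeautifulSoup


def direct_text_of_classless_spans(html):
    """Return the direct text of every <span> that has no class attribute.

    Hypothetical helper for illustration; the tree is left unmodified.
    """
    soup = BeautifulSoup(html, "html.parser")

    # A function filter: keep tags named "span" that lack a class attribute
    spans = soup.find_all(
        lambda tag: tag.name == "span" and not tag.has_attr("class")
    )

    # For each match, join only its direct text nodes (recursive=False),
    # which leaves out the text of any nested tags
    return [
        "".join(span.find_all(string=True, recursive=False)).strip()
        for span in spans
    ]


html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
result = direct_text_of_classless_spans(html)
print(result)
```

With the sample HTML above, this should print the same three strings as the decompose() version. The trade-off is that it only looks at direct text nodes, so text inside a nested tag that itself has no class (for example a <b> inside the span) would be left out as well.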