Problem description
How do I extract the text that follows a nested span? Specifically, I am trying to scrape the text node 2007-05-02 from this markup:
<td style="width: 33%"><span class="label">Start Date<span class="info-tip startdatetip">*</span>:</span> 2007-05-02</td>
My code raises AttributeError: 'NoneType' object has no attribute 'next_sibling'
from bs4 import BeautifulSoup
import urllib.request
import csv
source = urllib.request.urlopen('https://www.clinicaltrialsregister.eu/ctr-search/search?query=&page=1').read()
soup = BeautifulSoup(source,'lxml')
Start_date=soup.find('span',{'class':'label'},text = 'Start Date').next_sibling
print(Start_date)
Alternatively, I tried the code below, which yields None
Start_date=soup.find('span',{'class':'info-tip startdatetip'}).next_sibling.next_sibling
print(Start_date)
Solution
You can use stripped_strings on the td
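from bs4 import BeautifulSoup
import requests
import csv
# verify=False disables TLS certificate checks and makes requests emit a warning
source = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=&page=1',verify=False)
soup = BeautifulSoup(source.text,'lxml')
# Container div that holds all the result tables
table = soup.find("div",class_="results grid_8plus")
# First result table on the page
first_table = table.find_all("table",class_="result")[0]
# The last td of the first row holds the Start Date label; its final stripped string is the date
start_date = list(first_table.find("tr").find_all("td")[-1].stripped_strings)[-1]
print(start_date)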
Output:
2007-05-02
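To see why taking the last stripped string works, here is a minimal offline sketch that parses only the td from the question rather than the live page. It assumes the built-in "html.parser" backend so that it runs without lxml; the variable names `html`, `td` and `parts` are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# The <td> copied from the question, so no network access is needed
html = ('<td style="width: 33%"><span class="label">Start Date'
        '<span class="info-tip startdatetip">*</span>:</span> 2007-05-02</td>')

# Assumption: "html.parser" is used here instead of lxml to avoid an extra dependency
td = BeautifulSoup(html, "html.parser").find("td")

# stripped_strings yields every text node under the td, whitespace-trimmed
parts = list(td.stripped_strings)
print(parts)       # ['Start Date', '*', ':', '2007-05-02']

# The date is the last text node, after the label span has closed
start_date = parts[-1]
print(start_date)  # 2007-05-02
```

The label text, the asterisk and the colon all come from inside the spans, so indexing from the end is what isolates the date.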
Or, by using next_sibling:
start_date = first_table.find("tr").find_all("td")[-1].find("span").next_sibling.strip()
print(start_date)
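As for why the attempts in the question fail: when a tag name is combined with text=..., Beautiful Soup compares the pattern against the tag's .string, and .string is None for a tag with more than one child. The label span contains a text node, a nested span and another text node, so nothing matches and find returns None. In the second attempt the inner span is followed only by ":", so the second next_sibling is None. The sketch below illustrates this on the td from the question and shows one possible way to locate the label by its text; `is_start_date_label` is a hypothetical helper, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

html = ('<td style="width: 33%"><span class="label">Start Date'
        '<span class="info-tip startdatetip">*</span>:</span> 2007-05-02</td>')
soup = BeautifulSoup(html, "html.parser")

# The label span has three children, so .string is None and text= cannot match it
label = soup.find("span", class_="label")
print(label.string)  # None

# The inner span's only following sibling is ":", so a second hop gives None
inner = soup.find("span", class_="startdatetip")
print(inner.next_sibling.next_sibling)  # None

# Hypothetical helper: match the label span on its combined text instead
def is_start_date_label(tag):
    return (tag.name == "span"
            and "label" in tag.get("class", [])
            and tag.get_text().startswith("Start Date"))

# The date is the text node that directly follows the label span
start_date = soup.find(is_start_date_label).next_sibling.strip()
print(start_date)  # 2007-05-02
```

Matching on the label text may be less brittle than positional indexing if the column order on the page changes, though that has not been checked against the live site.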