Problem description
I am trying to scrape this website (Link). Specifically, I want to scrape the section below; here is its HTML:
<div style="padding:20px;">
<h1>
ABDULLA SALEM CONTRACTING EST
</h1>
<strong>
<a href="directory/umm-al-quwain/umm-al-quwain/building-contractors.html" title="Building
Contractors in Umm Al Quwain">
Building Contractors
</a>
</strong>
<br> P.O. Box: 200
<br> Location: Umm Al Quwain
<br> Phone: 06-7655445
</div>
import requests
import re
import csv
from bs4 import BeautifulSoup

def comp_links():
    url=requests.get("https://www.uae-business-directory.com/directory/umm-al-quwain/umm-al-quwain/building-contractors.html").text
    soup=BeautifulSoup(url,'lxml')
    links=soup.find_all('a',attrs={'href': re.compile("^directory/umm-al-quwain/umm-al-quwain/building-contractors/")})
    return links

def comp_details(z):
    filename='comp.csv'
    f=open(filename,'w')
    music=csv.writer(f)
    a=[]
    def email_format():
        if 'E-Mail' in details.text:
            mail=details.img['src']
            email=mail.replace('typo3temp/GB/','').replace('%40','@').split('_')[0]
            return email
    for i in z:
        comp=requests.get('https://www.uae-business-directory.com/'+i['href']).text
        soup_comp=BeautifulSoup(comp,'lxml')
        details=soup_comp.find('div',class_='details')
        for i in details:
            print(i.text)
            music.writerow([i.get_text(),email_format()]) #Writing to CSV

z=comp_links()
comp_details(z)
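The output looks like this:
ABDULLA SALEM CONTRACTING ESTBuilding ContractorsP.O. Box: 200Location: Umm Al QuwainPhone: 06-7655445
How can I get it like this instead:
- ABDULLA SALEM CONTRACTING EST
- Building Contractors
- P.O. Box: 200
- Location: Umm Al Quwain
- Phone: 06-7655445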
Solution
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.uae-business-directory.com/directory/umm-al-quwain/umm-al-quwain/building-contractors/abdulla-salem-contracting-est.html"
soup = BeautifulSoup(requests.get(url).content,"html.parser")
print(soup.h1.parent.get_text(strip=True,separator="\n"))
Prints:
ABDULLA SALEM CONTRACTING EST
Building Contractors
P.O. Box: 200
Location: Umm Al Quwain
Phone: 06-7655445
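To see why this works without hitting the site, here is a minimal offline sketch that runs the same get_text call on the HTML fragment from the question. The variable names html, block and fields are only illustrative, and the sketch assumes the live page has the same structure as the fragment.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question, so no network request is needed.
html = """<div style="padding:20px;">
<h1>
ABDULLA SALEM CONTRACTING EST
</h1>
<strong>
<a href="directory/umm-al-quwain/umm-al-quwain/building-contractors.html" title="Building Contractors in Umm Al Quwain">
Building Contractors
</a>
</strong>
<br> P.O. Box: 200
<br> Location: Umm Al Quwain
<br> Phone: 06-7655445
</div>"""

soup = BeautifulSoup(html, "html.parser")
block = soup.h1.parent  # the <div> that wraps the company details

# strip=True trims each text node and drops the ones that are empty;
# the separator is then placed between the remaining nodes.
text = block.get_text(strip=True, separator="\n")
fields = text.split("\n")

for field in fields:
    print(field)
```

Plain .text (or get_text() with no arguments) joins the text nodes with an empty string, which is why the original output ran together. Splitting on the separator gives one list entry per field, which is convenient if each field should become its own CSV column.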

Since the question is tagged scrapy, you can also try this:
details = response.css(".details ::text").getall()
This gets all the text nodes inside the details div.
On inspection, details has a structure like this:
['\n','\n','<!--\ngoogle_ad_client = "ca-pub-7955553446826172";\ngoogle_ad_slot = "2007388357";\ngoogle_ad_width = 300;\ngoogle_ad_height = 600;\n//-->\n','ABDULLA SALEM CONTRACTING EST','Building Contractors','P.O. Box: 200','Location: Umm Al Quwain','Phone: 06-7655445']
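You can use details[-5:] to get the last five items as a sub-list. It returns:
['ABDULLA SALEM CONTRACTING EST','Building Contractors','P.O. Box: 200','Location: Umm Al Quwain','Phone: 06-7655445']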
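Slicing by position only works if every page ends with exactly five fields. As a rough alternative, the sketch below filters the list by content instead and writes the result as one CSV row. The helper clean_details is hypothetical (not part of Scrapy), and it assumes the only unwanted entries are whitespace-only strings and the ad snippet that starts with "<!--". The sample list is the one shown above.

```python
import csv
import io

# Sample list copied from the inspection output above.
details = [
    '\n',
    '\n',
    '<!--\ngoogle_ad_client = "ca-pub-7955553446826172";\ngoogle_ad_slot = "2007388357";\ngoogle_ad_width = 300;\ngoogle_ad_height = 600;\n//-->\n',
    'ABDULLA SALEM CONTRACTING EST',
    'Building Contractors',
    'P.O. Box: 200',
    'Location: Umm Al Quwain',
    'Phone: 06-7655445',
]

def clean_details(items):
    """Hypothetical helper: drop blank strings and the '<!--' ad snippet."""
    cleaned = []
    for item in items:
        text = item.strip()
        if text and not text.startswith("<!--"):
            cleaned.append(text)
    return cleaned

fields = clean_details(details)

# Write the fields as a single CSV row; StringIO stands in for a real file.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
csv_line = buffer.getvalue()

print(fields)
```

For this sample the filtered list is the same as details[-5:]. In a real spider you would write to a file opened with newline="" rather than to a StringIO buffer.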