问题描述
我用Python编写了一个程序,用于在Wikipedia中抓取查询的第一个图像链接
类似该图像的东西:
我的Python程序需要以下库:
- 请求
- bs4
- html
- re
当我运行我的代码时,我给出一个参数,它返回定义的错误(“图像未找到”)。请帮我解决问题。
我的Python程序源代码:
import requests
import bs4
import re
import html
# Create the parser
my_parser = argparse.ArgumentParser(description='Wikipedia Image Grabber')
# Add the arguments
my_parser.add_argument('Phrase',Metavar='Phrase',type=str,help='Phrase to Search')
# Execute the parse_args() method
args = my_parser.parse_args()
Phrase = args._get_kwargs()[0][1]
if '.' in Phrase or '-' in Phrase:
if '.' in Phrase and '-' in Phrase:
Phrase = str(Phrase).replace('-',' ')
elif '-' in Phrase and not '.' in Phrase:
Phrase = str(Phrase).replace('-',' ')
Phrase = html.escape(Phrase)
request = requests.get('https://fa.wikipedia.org/wiki/Special:Search?search=%s&go=Go&ns0=1' % Phrase).text
parser = bs4.BeautifulSoup(request,'html.parser')
none_search_finder = parser.find_all('p',attrs = {'class':'mw-search-nonefound'})
if len(none_search_finder)==1:
print('No-Result')
exit()
else:
search_results = parser.find_all('div',attrs = {'class':'mw-search-result-heading'})
if len(search_results)==0:
search_result = parser.find_all('h1',attrs = {'id':'firstheading'})
if len(search_result)!=0:
link = 'https://fa.wikipedia.org/wiki/'+str(Phrase)
else:
print('Result-Error')
exit()
else:
selected_result = search_results[0]
regex_exp = r".*<a href=\"(.*)\" title="
regex_get_uri = re.findall(regex_exp,str(selected_result))
regex_result = str(regex_get_uri[0])
link = 'https://fa.wikipedia.org'+regex_result
#---------------
second_request = requests.get(link)
second_request_source = second_request.text
second_request_parser = bs4.BeautifulSoup(second_request_source,'html.parser')
image_finder = second_request_parser.find_all('a',attrs = {'class':'image'})
if len(image_finder) == 0:
print('No-Image')
exit()
else:
image_finder_e = image_finder[0]
second_regex = r".*src=\"(.*)\".*decoding=\"async\""
regex_finder = re.findall(second_regex,str(image_finder_e))
if len(regex_finder)!=0:
regexed_uri = str(regex_finder[0])
img_link = regexed_uri.replace('//','https://')
print(img_link)
else:
print("Image-Not-Found")
解决方法
您可以在不使用正则表达式的情况下执行此操作,并且代码不起作用的原因是浏览器和响应中decoding = "async"
的位置不同。
这是不使用正则表达式的解决方案。
import re
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Google'
soup = BeautifulSoup(requests.get(url).text,'html.parser')
imglinks = soup.find_all('a',attrs = {'class':'image'})[0]
for img in imglinks.find_all('img'):
print(img['src'].replace('//','https://'))
输出:
https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Google_2015_logo.svg/196px-Google_2015_logo.svg.png