Problem description
urllib.robotparser.RobotFileParser() gives me a different result every time I run it.
https://www.alza.cz/robots.txt says this - /search.htm* is disallowed:
# robots.txt for https://www.alza.cz/
User-Agent: *
disallow: /Order1.htm
disallow: /Order2.htm
disallow: /Order3.htm
disallow: /Order4.htm
disallow: /Order5.htm
disallow: /download/
disallow: /muj-ucet/
disallow: /Secure/
disallow: /LostPassword.htm
disallow: /search.htm*
Sitemap: https://www.alza.cz/_sitemap-categories.xml
Sitemap: https://www.alza.cz/_sitemap-categories-producers.xml
Sitemap: https://www.alza.cz/_sitemap-live-product.xml
Sitemap: https://www.alza.cz/_sitemap-dead-product.xml
Sitemap: https://www.alza.cz/_sitemap-before_listing.xml
Sitemap: https://www.alza.cz/_sitemap-SEO-sorted-categories.xml
Sitemap: https://www.alza.cz/_sitemap-bazaar-categories.xml
Sitemap: https://www.alza.cz/_sitemap-sale-categories.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages-producer.xml
Sitemap: https://www.alza.cz/_sitemap-articles.xml
Sitemap: https://www.alza.cz/_sitemap-producers.xml
Sitemap: https://www.alza.cz/_sitemap-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-dead-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-branch-categories.xml
Sitemap: https://www.alza.cz/_sitemap-installments.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-slots-of-accessories.xml
Sitemap: https://www.alza.cz/_sitemap-reviews.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-bazaar.xml
Sitemap: https://www.alza.cz/_sitemap-productgroups.xml
Sitemap: https://www.alza.cz/_sitemap-accessories.xml
But when I first ran the commands below I got FALSE (which is correct), and now every time I run them I get TRUE (which is not correct):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.alza.cz/robots.txt")
rp.read()
# Expected False because of "disallow: /search.htm*", but now returns True
rp.can_fetch("*", "https://www.alza.cz/search.htm?exps=asdf")
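One way to narrow this down (a diagnostic sketch, not part of the original question): since read() quietly changes behaviour depending on the HTTP status it receives, fetch robots.txt by hand, look at the status and body the server actually returns, and then hand the text to the parser with parse(). The browser-style User-Agent header below is an assumption; the server may treat the default "Python-urllib" agent differently.

import urllib.request
import urllib.robotparser

url = "https://www.alza.cz/robots.txt"

# Fetch robots.txt manually so the status code and body are visible.
# The browser-like User-Agent is an assumption; the default "Python-urllib"
# agent may be treated differently by the server.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    print("HTTP status:", resp.status)
    body = resp.read().decode("utf-8")
print(body[:300])  # the first part of what the server actually sent back

# Feed the downloaded text to the parser instead of letting read() fetch it.
rp = urllib.robotparser.RobotFileParser()
rp.parse(body.splitlines())
print(rp.can_fetch("*", "https://www.alza.cz/search.htm?exps=asdf"))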
From the source code I found this snippet, which suggests the server is answering with an HTTP status code between 400 and 499, which would be really strange; unfortunately I cannot check that myself.
def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    try:
        f = urllib.request.urlopen(self.url)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            self.disallow_all = True
        elif err.code >= 400 and err.code < 500:
            self.allow_all = True
    else:
        raw = f.read()
        self.parse(raw.decode("utf-8").splitlines())

# Until the robots.txt file has been read or found not
# to exist, we must assume that no url is allowable.
# This prevents false positives when a user erroneously
# calls can_fetch() before calling read().
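To see which of those branches actually ran, the flags that read() sets can be inspected afterwards. This is only a debugging sketch; allow_all, disallow_all and the parser's entries are internal attributes of RobotFileParser rather than documented API.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.alza.cz/robots.txt")
rp.read()

# Internal flags set by read(); not part of the documented API.
print("allow_all:   ", rp.allow_all)     # True -> a 4xx other than 401/403 was seen
print("disallow_all:", rp.disallow_all)  # True -> a 401 or 403 was seen
print("parsed rules:")
print(rp)                                # str(rp) lists the entries that were parsed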
Any idea what might be going on here?
EDIT: I stepped through the source code and there is no error status; it returns 200. I don't see any reason why the URL should pass.
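To take the network out of the picture entirely, the rules quoted above can be fed straight to the parser with parse() and checked offline. As far as I know, urllib.robotparser does plain prefix matching and does not treat the '*' inside a Disallow path as a wildcard, so this offline sketch (using just the two relevant lines) shows whether the matching itself, rather than the HTTP fetch, drives the result:

import urllib.robotparser

# Only the two relevant lines from the robots.txt above; no HTTP request involved.
robots_txt = """\
User-Agent: *
disallow: /search.htm*
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "https://www.alza.cz/search.htm?exps=asdf"))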