Problem description
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)
            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        name = response.request.meta['country_name']
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1])[1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'year': year,
                'population': population
            }
But I get this error:
(new_Virtual_workspace) SubhrajyotisAir:worldometer subhrajyotisaha$ scrapy crawl countries
2021-05-29 23:33:14 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: worldometer)
2021-05-29 23:33:14 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0,libxml2 2.9.10,cssselect 1.1.0,parsel 1.5.2,w3lib 1.21.0,Twisted 21.2.0,Python 3.8.10 (default,May 19 2021,11:01:55) - [Clang 10.0.0 ],pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021),cryptography 3.4.7,Platform macOS-10.14.1-x86_64-i386-64bit
2021-05-29 23:33:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-05-29 23:33:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'worldometer','NEWSPIDER_MODULE': 'worldometer.spiders','ROBOTSTXT_OBEY': True,'SPIDER_MODULES': ['worldometer.spiders']}
2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet Password: 87f0a20eef9428d7
2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats']
2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-05-29 23:33:14 [scrapy.core.engine] INFO: Spider opened
2021-05-29 23:33:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-05-29 23:33:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)
2021-05-29 23:33:18 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-05-29 23:33:18 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2021-05-29 23:33:18 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-05-29 23:33:18 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2021-05-29 23:33:18 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2021-05-29 23:33:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)
2021-05-29 23:33:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)
2021-05-29 23:33:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)
Traceback (most recent call last):
  File "/Users/subhrajyotisaha/opt/anaconda3/envs/new_Virtual_workspace/lib/python3.8/site-packages/parsel/selector.py", line 236, in xpath
    result = xpathev(query, namespaces=nsp,
  File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
I am using a conda virtual environment and VS Code on macOS.
Solution
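The traceback ends in lxml's XPath evaluator, so the likely culprit is the expression passed to response.xpath() in parse_country rather than the crawl itself. That expression appears to contain a stray ")[1]": it has one opening parenthesis but two closing ones, which would make it syntactically invalid. The sketch below is only an illustration under that assumption. The helper brackets_balanced is hypothetical (it is not part of Scrapy, parsel, or lxml), and the corrected expression assumes the intent was "first matching table, then its body rows".

```python
# Hypothetical helper, not part of Scrapy/parsel/lxml: reports whether the
# parentheses and square brackets in an XPath string are balanced, ignoring
# anything inside quoted string literals.
def brackets_balanced(xpath: str) -> bool:
    pairs = {")": "(", "]": "["}
    stack = []
    quote = None  # the quote character currently open, if any
    for ch in xpath:
        if quote:
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch in "([":
            stack.append(ch)
        elif ch in pairs:
            # A closer with no matching opener means the expression is malformed.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack and quote is None


TABLE_CLASS = "table table-striped table-bordered table-hover table-condensed table-list"

# Expression from the question: the second ")[1]" has no matching "(".
broken_xpath = f"(//table[@class='{TABLE_CLASS}'])[1])[1]/tbody/tr"

# Assumed intent: select the first matching table, then its body rows.
fixed_xpath = f"(//table[@class='{TABLE_CLASS}'])[1]/tbody/tr"

print(brackets_balanced(broken_xpath))  # False
print(brackets_balanced(fixed_xpath))   # True
```

In the spider, this would mean changing the line in parse_country to rows = response.xpath(fixed_xpath), or pasting the corrected string inline. Whether the table on the live page still carries exactly that class, and whether its rows sit under a tbody element in the raw HTML, is not confirmed by the question, so the selector may still need adjusting after the syntax error is gone.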