Error when extracting data with Scrapy in Python

Problem description

import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # Each country name is a link inside a table cell.
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)

            # Pass the country name to the next callback through request meta.
            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        name = response.request.meta['country_name']
        # The XPathEvalError in the log below is raised by this expression.
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1])[1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'year': year,
                'population': population,
            }

But I get this error:

(new_Virtual_workspace) SubhrajyotisAir:worldometer subhrajyotisaha$ scrapy crawl countries

2021-05-29 23:33:14 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: worldometer)

2021-05-29 23:33:14 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0,libxml2 2.9.10,cssselect 1.1.0,parsel 1.5.2,w3lib 1.21.0,Twisted 21.2.0,Python 3.8.10 (default,May 19 2021,11:01:55) - [Clang 10.0.0 ],pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021),cryptography 3.4.7,Platform macOS-10.14.1-x86_64-i386-64bit

2021-05-29 23:33:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor

2021-05-29 23:33:14 [scrapy.crawler] INFO: Overridden settings:

{'BOT_NAME': 'worldometer','NEWSPIDER_MODULE': 'worldometer.spiders','ROBOTSTXT_OBEY': True,'SPIDER_MODULES': ['worldometer.spiders']}

2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet Password: 87f0a20eef9428d7

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled extensions:

['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.logstats.LogStats']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']

2021-05-29 23:33:14 [scrapy.middleware] INFO: Enabled item pipelines:

[]

2021-05-29 23:33:14 [scrapy.core.engine] INFO: Spider opened

2021-05-29 23:33:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)

2021-05-29 23:33:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2021-05-29 23:33:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.

2021-05-29 23:33:18 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.

2021-05-29 23:33:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)

2021-05-29 23:33:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)

2021-05-29 23:33:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.worldometers.info/world-population/ethiopia-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)

Traceback (most recent call last):
  File "/Users/subhrajyotisaha/opt/anaconda3/envs/new_Virtual_workspace/lib/python3.8/site-packages/parsel/selector.py", line 236, in xpath
    result = xpathev(query, namespaces=nsp,
  File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression

I am using a conda virtual environment and VS Code on macOS.

Solution
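
The log shows the listing page and the first country page both downloading with status 200, and the failure coming from parsel's xpath call with "Invalid expression". That points at the XPath string used for the country page rather than at the request itself. In that string, the part "(//table[...])[1]" is already complete, so the extra ")[1]" that follows closes a parenthesis that was never opened. The sketch below is one way to check this outside the spider. It assumes lxml is importable (the traceback shows Scrapy calling into it); the helper names is_valid_xpath and extract_rows and the sample HTML are made up for illustration and are not taken from the real page.

```python
from lxml import etree, html

TABLE_CLASS = "table table-striped table-bordered table-hover table-condensed table-list"

# Expression from the question: the trailing ")[1]" has no matching "(".
BROKEN = f"(//table[@class='{TABLE_CLASS}'])[1])[1]/tbody/tr"
# Same expression with the stray ")[1]" removed.
FIXED = f"(//table[@class='{TABLE_CLASS}'])[1]/tbody/tr"


def is_valid_xpath(expression):
    """Return True if lxml can compile the XPath expression (hypothetical helper)."""
    try:
        etree.XPath(expression)
    except etree.XPathError:
        return False
    return True


# Made-up sample markup (NOT the real page) that reuses the same table class.
SAMPLE_HTML = f"""
<html><body>
<table class="{TABLE_CLASS}">
  <tbody>
    <tr><td>2020</td><td><strong>200</strong></td></tr>
    <tr><td>2019</td><td><strong>100</strong></td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_rows(document_text):
    """Apply the fixed expression, then the per-row XPaths from the question."""
    tree = html.fromstring(document_text)
    results = []
    for row in tree.xpath(FIXED):
        year = row.xpath(".//td[1]/text()")
        population = row.xpath(".//td[2]/strong/text()")
        results.append({
            "year": year[0] if year else None,
            "population": population[0] if population else None,
        })
    return results


if __name__ == "__main__":
    print("broken expression compiles:", is_valid_xpath(BROKEN))
    print("fixed expression compiles:", is_valid_xpath(FIXED))
    print(extract_rows(SAMPLE_HTML))
```

In the spider, the FIXED string can be passed to response.xpath in parse_country in place of the original expression. If it then matches no rows, one thing worth checking is whether the raw HTML really contains a tbody element: browsers add one when displaying a table, but the downloaded markup may not have it, in which case dropping "/tbody" from the path is a reasonable next step. Note also that the keyword argument and attribute are spelled meta in lowercase (meta={'country_name': name} and response.request.meta['country_name']).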
