Problem description
I am trying to download PDF forms from: https://apps.irs.gov/app/picklist/list/priorFormPublication.html
I want to download all the PDF files available for a range of years (for example 2018-2020). Each downloaded PDF should go into a directory named after the form, and the file name should be "form name-year" (for example: Form W-2/Form W-2-2020.pdf).
I am not sure what I am doing wrong, but the files are never downloaded.
** pdf.py **
import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdfSpider'
    start_urls = [
        'https://apps.irs.gov/app/picklist/list/priorFormpublication.html',
    ]

    def parse(self, response):
        for link in response.css('.LeftCellSpacer').xpath('@href').extract():
            url = response.url
            path = response.css('a::text').extract()
            next_link = response.urljoin(link)
            yield scrapy.Request(next_link, callback=self.parse_det,
                                 meta={'url': url, 'path': path})

    def parse_det(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()
        yield {
            'path': response.meta['path'],
            'file_urls': [extract_with_css('a::attr(href)')],
            'url': response.meta['url'],
        }

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    'FILES_STORE': '.',
})
c.crawl(PdfSpider)
When I run the command scrapy runspider pdf.py, I get the following terminal output.
** Terminal output **
2021-02-03 19:40:29 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-02-03 19:40:29 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.1 (default, Dec 24 2020, 16:23:16) - [Clang 12.0.0 (clang-1200.0.32.28)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020), cryptography 3.3.1, Platform macOS-11.1-x86_64-i386-64bit
2021-02-03 19:40:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-02-03 19:40:29 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0'}
2021-02-03 19:40:29 [scrapy.extensions.telnet] INFO: Telnet Password: 6b9b91bc6d1b537e
2021-02-03 19:40:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Spider opened
2021-02-03 19:40:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-03 19:40:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-03 19:40:30 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: taxform_scraper)
2021-02-03 19:40:30 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, Platform macOS-11.1-x86_64-i386-64bit
2021-02-03 19:40:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-02-03 19:40:30 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-02-03 19:40:30 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, Platform macOS-11.1-x86_64-i386-64bit
2021-02-03 19:40:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-02-03 19:40:30 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0'}
2021-02-03 19:40:30 [scrapy.extensions.telnet] INFO: Telnet Password: 77210fa8243f5811
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', scraped 0 items (at 0 items/min)
2021-02-03 19:40:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-02-03 19:40:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'taxform_scraper', 'NEWSPIDER_MODULE': 'taxform_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_LOADER_WARN_ONLY': True, 'SPIDER_MODULES': ['taxform_scraper.spiders']}
2021-02-03 19:40:30 [scrapy.extensions.telnet] INFO: Telnet Password: 3666830b830f31d0
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-03 19:40:30 [scrapy.middleware] INFO: Enabled item pipelines:
['taxform_scraper.pipelines.TaxformScraperPipeline']
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Spider opened
2021-02-03 19:40:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-03 19:40:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6025
2021-02-03 19:40:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://apps.irs.gov/app/picklist/list/priorFormpublication.html> (referer: None)
2021-02-03 19:40:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://apps.irs.gov/app/picklist/list/priorFormpublication.html> (referer: None)
2021-02-03 19:40:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.irs.gov/404> from <GET https://apps.irs.gov/robots.txt>
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-03 19:40:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 4264, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.467604, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2021, 2, 4, 40, 30, 521028), 'log_count/DEBUG': 3, 'log_count/INFO': 19, 'memusage/max': 51257344, 'memusage/startup': 51257344, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2021, 53424)}
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Spider closed (finished)
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-03 19:40:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232, 'elapsed_time_seconds': 0.483096, 522640), 'log_count/DEBUG': 5, 'log_count/INFO': 35, 'memusage/max': 51032064, 'memusage/startup': 51032064, 39544)}
2021-02-03 19:40:30 [scrapy.core.engine] INFO: Spider closed (finished)
2021-02-03 19:40:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.irs.gov/404> (referer: None)
2021-02-03 19:40:30 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
... (dozens more identical protego DEBUG lines, one per robots.txt rule) ...
2021-02-03 19:40:30 [protego] DEBUG: Rule at line 1395 without any user agent to enforce it on.
2021-02-03 19:40:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://apps.irs.gov/app/picklist/list/priorFormpublication.html> (referer: None)
2021-02-03 19:40:31 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-03 19:40:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1299, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'downloader/response_bytes': 24872, 'downloader/response_count': 3, 'downloader/response_status_count/200': 2, 'downloader/response_status_count/302': 1, 'elapsed_time_seconds': 0.986045, 31, 54803), 'log_count/DEBUG': 65, 'log_count/INFO': 16, 'memusage/max': 51466240, 'memusage/startup': 51466240, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 68758)}
2021-02-03 19:40:31 [scrapy.core.engine] INFO: Spider closed (finished)
I have tried many of the solutions on Stack Overflow, but nothing works. What exactly am I doing wrong, and how can I download the files for a range of years?
Solution
Try this; it works. Your original selector response.css('.LeftCellSpacer').xpath('@href') matches the <td> cells themselves, which have no href attribute of their own (the links sit on the <a> elements inside them), so your parse loop never yields a single request. Selecting td.LeftCellSpacer > a::attr("href") fixes that:
import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdfSpider'
    start_urls = [
        'https://apps.irs.gov/app/picklist/list/priorFormPublication.html',
    ]

    def parse(self, response):
        url = response.url
        for link in response.css('table.picklist-dataTable'):
            # The hrefs live on the <a> elements inside the cells
            links = link.css('td.LeftCellSpacer > a::attr("href")').extract()
            for pdfurl in links:
                yield scrapy.Request(pdfurl, callback=self.download_pdf,
                                     meta={'url': url, 'path': pdfurl})

    def download_pdf(self, response):
        # Save the response body under the PDF's own file name
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
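
The question also asked for a year range (2018-2020). The picklist table includes a revision-date column, so parse can filter on it before requesting the PDF. The sketch below is a guess at the markup: the td:last-child selector and the plain-digit year text are assumptions that you should verify against the live page.

    # Drop-in replacement for parse above: only request PDFs whose
    # revision year falls in 2018-2020. The cell selectors are assumptions.
    def parse(self, response):
        url = response.url
        for row in response.css('table.picklist-dataTable tr'):
            href = row.css('td.LeftCellSpacer > a::attr(href)').get()
            # Assumed: the last cell of each row holds the revision year
            year_text = (row.css('td:last-child::text').get() or '').strip()
            if not href or not year_text.isdigit():
                continue  # header rows or rows without a parseable year
            if 2018 <= int(year_text) <= 2020:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.download_pdf,
                                     meta={'url': url, 'year': year_text})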
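
To also get the "Form W-2/Form W-2-2020.pdf" layout the question asks for, it is easier to let Scrapy's FilesPipeline write the files instead of doing it by hand in download_pdf, by overriding its file_path method. A minimal sketch, assuming the spider yields items carrying hypothetical form_name and year fields next to the standard file_urls field (Scrapy 2.4+, where file_path receives the item):

import os
from scrapy.pipelines.files import FilesPipeline

class FormNamePipeline(FilesPipeline):
    """Store each PDF as <form name>/<form name>-<year>.pdf.

    'form_name' and 'year' are illustrative item fields, not part of
    the original question's code; the spider has to fill them in.
    """

    def file_path(self, request, response=None, info=None, *, item=None):
        form = item['form_name']
        year = item['year']
        return os.path.join(form, f'{form}-{year}.pdf')

Enable it with ITEM_PIPELINES = {'myproject.pipelines.FormNamePipeline': 1} (the module path is whatever your project uses) and point FILES_STORE at the output directory; the spider then yields items like {'file_urls': [pdfurl], 'form_name': 'Form W-2', 'year': '2020'} instead of writing files in download_pdf.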