Celery with Scrapy does not parse a CSV file

Problem description

The task itself starts right away, but it finishes almost immediately and I don't see any result from it - nothing ever reaches the pipeline. When I run the code directly with the scrapy crawl <spider_name> command, everything works fine. I only run into this problem when the spider is started via Celery.

My Celery worker log:

[2021-02-13 14:25:00,208: INFO/MainProcess] Received task: crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3]  
[2021-02-13 16:25:00,867: INFO/ForkPoolWorker-1] Scrapy 2.4.0 started (bot: crawling)
[2021-02-13 16:25:00,869: INFO/ForkPoolWorker-1] Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.7 (default, Jan 12 2021, 17:06:28) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Linux-5.8.0-41-generic-x86_64-with-glibc2.2.5
[2021-02-13 16:25:00,869: DEBUG/ForkPoolWorker-1] Using reactor: twisted.internet.epollreactor.EPollReactor
[2021-02-13 16:25:00,879: INFO/ForkPoolWorker-1] Overridden settings:
{'BOT_NAME': 'crawling', 'DOWNLOAD_TIMEOUT': 600, 'DOWNLOAD_WARNSIZE': 267386880, 'NEWSPIDER_MODULE': 'crawling.crawling.spiders', 'SPIDER_MODULES': ['crawling.crawling.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64)'}
[2021-02-13 16:25:01,018: INFO/ForkPoolWorker-1] Telnet Password: d95c783294fc93df
[2021-02-13 16:25:01,064: INFO/ForkPoolWorker-1] Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
[2021-02-13 16:25:01,151: INFO/ForkPoolWorker-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2021-02-13 16:25:01,172: INFO/ForkPoolWorker-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2021-02-13 16:25:01,183: INFO/ForkPoolWorker-1] Task crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3] succeeded in 0.9719750949989248s: None
[2021-02-13 16:25:01,285: INFO/ForkPoolWorker-1] Received SIGTERM, shutting down gracefully. Send again to force

I have the following spider:

from scrapy.spiders import CSVFeedSpider

class copartSpider(CSVFeedSpider):
    name = '<spider_name>'
    allowed_domains = ['<allowed_domain>']
    start_urls = [
        'file:///code/autotracker/crawling/data/salesdata.cgi'
    ]
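(The spider is abbreviated here. For completeness: a CSVFeedSpider only turns rows into items once a parse_row callback is defined, so a fuller version would look roughly like the sketch below - the delimiter, headers and field names are illustrative assumptions, not the project's actual code.)

    # continuation of copartSpider - illustrative sketch only
    delimiter = ','                      # assumed column separator of salesdata.cgi
    headers = ['lot', 'make', 'model']   # hypothetical CSV column names

    def parse_row(self, response, row):
        # Called once per CSV row; whatever is yielded here is what
        # the item pipelines receive.
        yield {
            'lot': row.get('lot'),
            'make': row.get('make'),
            'model': row.get('model'),
        }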

Part of my Scrapy settings (there is nothing else directly related to Scrapy):

BOT_NAME = 'crawling'

SPIDER_MODULES = ['crawling.crawling.spiders']
NEWSPIDER_MODULE = 'crawling.crawling.spiders'

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

ROBOTSTXT_OBEY = False

DOWNLOAD_TIMEOUT = 600    # 10 min
DOWNLOAD_WARNSIZE = 255 * 1024 * 1024    # 255 mb

DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'crawling.pipelines.Autopipeline': 1,
}
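The Autopipeline referenced here is not shown in the question; a minimal skeleton of what such a pipeline class looks like (the body is an assumption - only the dotted path crawling.pipelines.Autopipeline comes from the settings):

# crawling/pipelines.py - skeleton only, the real implementation is not shown
class Autopipeline:
    def process_item(self, item, spider):
        # Persist or transform the scraped item here, then return it so
        # any later pipelines (and exporters) can see it.
        return item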

I have two Celery configuration files:

celery.py

from celery import Celery
from celery.schedules import crontab

BROKER_URL = 'redis://redis:6379/0'
app = Celery('crawling', broker=BROKER_URL)

app.conf.beat_schedule = {
    'scrape-every-20-minutes': {
        'task': 'crawling.crawling.tasks.start_crawler_process',
        'schedule': crontab(minute='*/5'),
    },
}
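With a beat_schedule like this, both a worker and Celery Beat have to be running for the task to fire; during development they can be combined into one command such as celery -A crawling worker -B --loglevel=info (the -B flag runs an embedded beat inside the worker; the exact -A module path depends on the project layout).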

tasks.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .celery import app  # the Celery app defined in celery.py above

@app.task
def start_crawler_process():
    process = CrawlerProcess(get_project_settings())
    process.crawl('<spider_name>')
    process.start()
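For manual testing the task can also be fired outside of the beat schedule with the standard Celery call start_crawler_process.delay().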

Solution

Cause: Scrapy's CrawlerProcess does not allow being run from inside another process this way - here, the forked Celery worker.

Fix: I used my own script - https://github.com/dtalkachou/scrapy-crawler-script
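Independent of that script, one common way to drive Scrapy from a Celery task is simply to shell out to the same scrapy crawl command that already works, so every crawl gets its own OS process and its own Twisted reactor. A minimal sketch of that variant (this is not the linked script, and the working directory is an assumption about where scrapy.cfg lives):

import subprocess

from .celery import app  # reuse the app from celery.py

@app.task
def start_crawler_process():
    # Run the crawl in a separate OS process, exactly as it is run from
    # the shell, so the Celery worker never touches the Twisted reactor.
    subprocess.run(
        ['scrapy', 'crawl', '<spider_name>'],
        check=True,                        # raise if the crawl exits non-zero
        cwd='/code/autotracker/crawling',  # assumed location of scrapy.cfg
    )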