Problem description
The task itself starts immediately, but it finishes almost at once; I never see any crawl results, and nothing ever reaches the pipeline. When I run the same code with the scrapy crawl <spider_name> command, everything works fine. The problem only occurs when I run it via Celery.
My Celery worker log:
[2021-02-13 14:25:00,208: INFO/MainProcess] Received task: crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3]
[2021-02-13 16:25:00,867: INFO/ForkPoolWorker-1] Scrapy 2.4.0 started (bot: crawling)
[2021-02-13 16:25:00,869: INFO/ForkPoolWorker-1] Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.7 (default, Jan 12 2021, 17:06:28) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Linux-5.8.0-41-generic-x86_64-with-glibc2.2.5
[2021-02-13 16:25:00,869: DEBUG/ForkPoolWorker-1] Using reactor: twisted.internet.epollreactor.EPollReactor
[2021-02-13 16:25:00,879: INFO/ForkPoolWorker-1] Overridden settings:
{'BOT_NAME': 'crawling', 'DOWNLOAD_TIMEOUT': 600, 'DOWNLOAD_WARNSIZE': 267386880, 'NEWSPIDER_MODULE': 'crawling.crawling.spiders', 'SPIDER_MODULES': ['crawling.crawling.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64)'}
[2021-02-13 16:25:01,018: INFO/ForkPoolWorker-1] Telnet Password: d95c783294fc93df
[2021-02-13 16:25:01,064: INFO/ForkPoolWorker-1] Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
[2021-02-13 16:25:01,151: INFO/ForkPoolWorker-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2021-02-13 16:25:01,172: INFO/ForkPoolWorker-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2021-02-13 16:25:01,183: INFO/ForkPoolWorker-1] Task crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3] succeeded in 0.9719750949989248s: None
[2021-02-13 16:25:01,285: INFO/ForkPoolWorker-1] Received SIGTERM, shutting down gracefully. Send again to force
I have the following spider:
class copartSpider(CSVFeedSpider):
    name = '<spider_name>'
    allowed_domains = ['<allowed_domain>']
    start_urls = [
        'file:///code/autotracker/crawling/data/salesdata.cgi'
    ]
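The spider body is elided above; purely as an illustration of the CSVFeedSpider contract, each CSV row is handed to parse_row as a dict of header → value. A plain helper (field names invented for illustration, not from the original spider) shows the kind of row-to-item mapping parse_row typically returns:

```python
def row_to_item(row):
    # Hypothetical mapping from a raw CSV row to a cleaned item dict;
    # the column names here are invented for illustration only.
    return {
        "lot_number": row.get("Lot Number", "").strip(),
        "sale_date": row.get("Sale Date", "").strip(),
    }

# Inside a CSVFeedSpider this logic would live in
# parse_row(self, response, row), returning the item for the pipeline.
```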
Part of my Scrapy settings (nothing else is directly related to Scrapy):
BOT_NAME = 'crawling'
SPIDER_MODULES = ['crawling.crawling.spiders']
NEWSPIDER_MODULE = 'crawling.crawling.spiders'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'
ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 600  # 10 min
DOWNLOAD_WARNSIZE = 255 * 1024 * 1024  # 255 MB
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'crawling.pipelines.Autopipeline': 1,
}
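The pipeline itself is not shown in the question; as a rough sketch (only the class name comes from the settings above, everything else is assumed), a Scrapy item pipeline just needs a process_item method:

```python
class Autopipeline:
    """Sketch of what crawling.pipelines.Autopipeline might look like;
    the in-memory store is a placeholder, not the author's implementation."""

    def __init__(self):
        self.items = []  # placeholder store; a real pipeline would persist items

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; returning the item
        # passes it on to the next pipeline in ITEM_PIPELINES order.
        self.items.append(item)
        return item
```

If items never reach this method, the crawl itself is not producing items, which matches the log above.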
celery.py
from celery import Celery
from celery.schedules import crontab

BROKER_URL = 'redis://redis:6379/0'

app = Celery('crawling', broker=BROKER_URL)

app.conf.beat_schedule = {
    'scrape-every-20-minutes': {
        'task': 'crawling.crawling.tasks.start_crawler_process',
        'schedule': crontab(minute='*/5'),  # note: this actually fires every 5 minutes, despite the entry name
    },
}
tasks.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# `app` is the Celery instance defined in celery.py above.

@app.task
def start_crawler_process():
    process = CrawlerProcess(get_project_settings())
    process.crawl('<spider_name>')
    process.start()
Workaround
Cause: Scrapy cannot be run from inside another long-lived process this way (the Twisted reactor behind CrawlerProcess cannot be started again within an already-running Celery worker).
Solution: I used my own script - https://github.com/dtalkachou/scrapy-crawler-script
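A simpler alternative workaround (not the script linked above, just a common pattern for this situation) is to launch each crawl in a fresh OS process from the Celery task, so every run gets its own Twisted reactor. The helper below is illustrative; the spider name placeholder matches the question:

```python
import subprocess

def build_crawl_command(spider_name):
    # Assemble the same `scrapy crawl <spider_name>` invocation that
    # works from the shell; running it in a child process sidesteps
    # the non-restartable reactor inside the long-lived Celery worker.
    return ["scrapy", "crawl", spider_name]

# Sketch of how the Celery task could use it (assumes `app` from celery.py):
# @app.task
# def start_crawler_process():
#     subprocess.run(build_crawl_command("<spider_name>"), check=True)
```

The child process exits after each crawl, so repeated beat-scheduled runs never reuse a dead reactor.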