How to handle illegal URLs with scrapy-redis when running it in a subprocess?

Problem description

I push the URLs whose HTML I want to request to Redis, and that part works well.

If I run the spider from the command line, I can handle the ValueError easily and keep going.

However, when I run it from Python (under subprocess), as soon as it hits a ValueError from an illegal "URL" (e.g. javascript:void(0) or a mailto: email link), the spider finishes immediately and the remaining URLs are never crawled.

How do you handle illegal URLs when running scrapy-redis through Python's subprocess module?

    # request html (or other info)
    # TODO: dupefilter implemented from scratch
    import redis
    import subprocess

    # push the start urls to redis
    r = redis.Redis(host='localhost', port=6379, db=0)
    r.delete('MySpider:start_urls')
    r.delete('MySpider:items')
    for url in url_list:
        print(url)
        r.lpush('MySpider:start_urls', url)

    # scrape info from the urls
    urls_from_scrapy = []
    html_strs_from_scrapy = []

    # run the spider in a subprocess (adding a LOG argument raises an error)
    worker = subprocess.Popen('scrapy crawl MySpider'.split())
    worker.wait(timeout=None)
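
Side note: since the bad entries here are things like javascript:void(0) and mailto: links, one way to avoid the crash entirely is to filter them out before they ever reach Redis. A minimal sketch, assuming `url_list` is the same list used above:

    from urllib.parse import urlparse

    def is_crawlable(url):
        # Keep only absolute http/https URLs; this rejects mailto:,
        # javascript:, tel: and other non-fetchable schemes.
        return urlparse(url).scheme in ('http', 'https')

    for url in filter(is_crawlable, url_list):
        r.lpush('MySpider:start_urls', url)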

Here is the error I see:

    2020-08-11 09:26:21 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RedisMixin.spider_idle of <MySpider 'MySpider' at 0x201fde4ef48>>
    Traceback (most recent call last):
      File "c:\programdata\anaconda3\lib\site-packages\scrapy\utils\signal.py", line 32, in send_catch_log
        response = robustApply(receiver, signal=signal, sender=sender, *arguments, **named)
      File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
        return receiver(*arguments, **named)
      File "D:\PycharmProjects\genetic_algo_crawl\demo\scrapy_redis\spiders.py", line 128, in spider_idle
        self.schedule_next_requests()
      File "D:\PycharmProjects\genetic_algo_crawl\demo\scrapy_redis\spiders.py", line 122, in schedule_next_requests
        for req in self.next_requests():
      File "D:\PycharmProjects\genetic_algo_crawl\demo\scrapy_redis\spiders.py", line 91, in next_requests
        req = self.make_request_from_data(data)
      File "D:\PycharmProjects\genetic_algo_crawl\demo\scrapy_redis\spiders.py", line 117, in make_request_from_data
        return self.make_requests_from_url(url)
      File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 87, in make_requests_from_url
        return Request(url, dont_filter=True)
      File "c:\programdata\anaconda3\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
        self._set_url(url)
      File "c:\programdata\anaconda3\lib\site-packages\scrapy\http\request\__init__.py", line 69, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    ValueError: Missing scheme in request url: mailto:sjxx@xidian.edu.cn
    2020-08-11 09:26:21 [scrapy.core.engine] INFO: Closing spider (finished)

Workaround

No effective solution to this problem has been found yet.
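
One direction the traceback suggests (a sketch only, not verified by the original poster): the ValueError escapes from make_request_from_data inside the spider_idle signal handler, which is why the whole spider shuts down. In scrapy-redis, next_requests skips any entry for which make_request_from_data returns a falsy value, so catching the error there and returning None should let the crawl continue past bad URLs:

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = 'MySpider'
        redis_key = 'MySpider:start_urls'

        def make_request_from_data(self, data):
            # Redis hands back raw bytes; decode before building the request.
            url = data.decode('utf-8')
            try:
                return self.make_requests_from_url(url)
            except ValueError:
                # Invalid scheme (mailto:, javascript:, ...): log and skip
                # instead of letting the error kill the idle handler.
                self.logger.warning('Skipping invalid start URL: %r', url)
                return None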

