Scrapy: Couldn't bind: 24: Too many open files

Problem description

I started getting this error:

2020-09-04 20:45:25 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url.com/> (Failed 2 times): Couldn't bind: 24: Too many open files.

I'm running Scrapy on Ubuntu and saving the results to a Django database (Postgres).
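
For reference, errno 24 (EMFILE) means the process has hit its per-process limit on open file descriptors (mostly sockets in this case). A minimal diagnostic sketch for checking the limit and the current count from inside a process on Linux (illustration only, not part of my spider):

import os
import resource

# Per-process soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# On Linux, /proc/self/fd lists every descriptor the current process holds.
open_fds = len(os.listdir('/proc/self/fd'))

print(f"limit: soft={soft} hard={hard}, currently open: {open_fds}")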

I don't know where the problem is. I have:

class Profilesspider(BaseSpiderMixin, scrapy.Spider):
    name = 'db_profiles_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 20,
        'LOG_FILE': 'profiles_spider.log',
        'DOWNLOAD_TIMEOUT': 30,
        'DNS_TIMEOUT': 30,
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'RETRY_TIMES': 1,
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    }

    def start_requests(self):
        self._lock()  # creates lock file
        self.load_websites()
        self.buffer = []

        for website in self.websites:
            try:
                yield scrapy.Request(website.url, self.parse, meta={'website': website})
            except ValueError:
                continue

    def parse(self, response: Response):

        meta = response.meta
        website = meta['website']
        meta_tags = utils.meta_tags.extract_meta_tags(response)

        ....

        website.profile_scraped_at = now()
        website.save()
        profile.save()

    def error(self, failure):

        # log all failures
        meta = failure.request.meta
        website = meta['website']

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            website.set_response_code(response.status, save=False)

        elif failure.check(DNSLookupError):
            website.set_response_code(WebSite.RESPONSE_CODE__DNS_LOOKUP_ERROR, save=False)

        elif failure.check(TimeoutError, TCPTimedOutError):
            website.set_response_code(WebSite.RESPONSE_CODE__TIMEOUT, save=False)
        else:
            website.set_response_code(WebSite.RESPONSE_CODE__UNKNOWN, save=False)

        website.scraped_at = now()
        website.save()
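
One thing worth noting: the error() callback above only fires if each request registers it as an errback, and the snippet as pasted doesn't show that. A minimal sketch of how that wiring usually looks in Scrapy (assuming error() is indeed meant to handle download failures):

yield scrapy.Request(
    website.url,
    callback=self.parse,
    errback=self.error,  # without this, error() is never called on failed requests
    meta={'website': website},
)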

Here are my settings:

CONCURRENT_REQUESTS = 60
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# CONCURRENT_REQUESTS_PER_IP = 4
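
Note that custom_settings on the spider takes precedence over the project settings, so the effective concurrency for this spider should be 20, not 60. Just as an illustrative snippet, the effective value can be checked from inside the spider:

# Inside any spider method, self.settings holds the merged, effective settings.
self.logger.info("effective CONCURRENT_REQUESTS: %s",
                 self.settings.getint('CONCURRENT_REQUESTS'))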

Do you know what the problem could be?

EDIT:

I have also done: ulimit -n 1000000
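
As far as I understand, ulimit -n only raises the limit for the shell it was run in and its children; the spider processes started from the Django process inherit that process's limit instead. A minimal sketch (standard library only, stated as an assumption, not a confirmed fix) of raising the soft limit up to the hard limit from inside the process itself:

import resource

# Raise the soft limit on open file descriptors up to the hard limit
# (raising the soft limit does not require extra privileges).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))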

EDIT2:

I'm executing the spider from a Django admin action using subprocess:

def runspider__profiles(modeladmin, request, queryset):
    ids = '.'.join([str(x) for x in queryset.values_list('id', flat=True)])
    cmd = ' '.join(["nohup", settings.CRAWL_SH_ABS_PATH, "db_profiles_spider", "ids", ids, '&'])
    subprocess.call(cmd, shell=True)
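
Since each admin action starts another detached spider via nohup ... &, several spider processes can be running and holding sockets at the same time. A small diagnostic sketch (Linux /proc only, purely for illustration) to see how many descriptors each db_profiles_spider process currently holds:

import os

# Count open file descriptors per running db_profiles_spider process.
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open(f'/proc/{pid}/cmdline', 'rb') as f:
            cmdline = f.read().replace(b'\x00', b' ').decode(errors='replace')
        if 'db_profiles_spider' in cmdline:
            n_fds = len(os.listdir(f'/proc/{pid}/fd'))
            print(pid, n_fds, cmdline.strip())
    except (FileNotFoundError, PermissionError):
        continue  # process exited or is owned by another user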
