Problem description
I started getting this error:
2020-09-04 20:45:25 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url.com/> (Failed 2 times): Couldn't bind: 24: Too many open files.
I'm running Scrapy on Ubuntu and saving the results into a Django database (Postgres).
I don't know where the problem is. I have:
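Errno 24 (EMFILE) means the crawling process has hit its limit of open file descriptors; in a crawler those are usually sockets that were never closed. A quick Linux-only diagnostic (just a sketch, not part of the spider) to see what the process currently holds open:

import os

# List every descriptor the current process has open (Linux: /proc/self/fd).
# A long tail of "socket:[...]" entries would suggest connections piling up.
for fd in os.listdir('/proc/self/fd'):
    try:
        print(fd, os.readlink(f'/proc/self/fd/{fd}'))
    except OSError:
        pass  # the fd may have been closed between listdir() and readlink()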
class Profilesspider(BaseSpiderMixin, scrapy.Spider):
    name = 'db_profiles_spider'

    custom_settings = {
        'CONCURRENT_REQUESTS': 20,
        'LOG_FILE': 'profiles_spider.log',
        'DOWNLOAD_TIMEOUT': 30,
        'DNS_TIMEOUT': 30,
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'RETRY_TIMES': 1,
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def start_requests(self):
        self._lock()  # creates lock file
        self.load_websites()
        self.buffer = []
        for website in self.websites:
            try:
                yield scrapy.Request(website.url, self.parse, meta={'website': website})
            except ValueError:
                continue
    def parse(self, response: Response):
        meta = response.meta
        website = meta['website']
        meta_tags = utils.meta_tags.extract_meta_tags(response)
        ....
        website.profile_scraped_at = now()
        website.save()
        profile.save()
    def error(self, failure):
        # log all failures
        meta = failure.request.meta
        website = meta['website']

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            website.set_response_code(response.status, save=False)
        elif failure.check(DNSLookupError):
            website.set_response_code(WebSite.RESPONSE_CODE__DNS_LOOKUP_ERROR, save=False)
        elif failure.check(TimeoutError, TCPTimedOutError):
            website.set_response_code(WebSite.RESPONSE_CODE__TIMEOUT, save=False)
        else:
            website.set_response_code(WebSite.RESPONSE_CODE__UNKNOWN, save=False)

        website.scraped_at = now()
        website.save()
These are my settings:
CONCURRENT_REQUESTS = 60
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# CONCURRENT_REQUESTS_PER_IP = 4
Do you know where the problem might be?
EDIT:
I also did: ulimit -n 1000000
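Note that ulimit -n only changes the limit for the shell it was run in and for processes started from that shell. To confirm the limit the spider process itself sees, a small check with the standard resource module (e.g. printed from start_requests) should be enough:

import resource

# Soft/hard limits on open file descriptors as seen by this process;
# errno 24 is raised once the soft limit is reached.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")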
EDIT2:
I'm launching the spider from a Django admin action via subprocess:
def runspider__profiles(modeladmin, request, queryset):
    ids = '.'.join([str(x) for x in queryset.values_list('id', flat=True)])
    cmd = ' '.join(["nohup", settings.CRAWL_SH_ABS_PATH, "db_profiles_spider", "ids", ids, '&'])
    subprocess.call(cmd, shell=True)
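Because the spider is launched from the Django process via the crawl shell script (CRAWL_SH_ABS_PATH) rather than from the interactive shell where ulimit was raised, it inherits the Django process's descriptor limit. A sketch of the same admin action with that limit logged just before the launch (assuming the imports shown):

import resource
import subprocess
from django.conf import settings

def runspider__profiles(modeladmin, request, queryset):
    # The spawned spider inherits the limits of this (Django) process,
    # not those of the interactive shell where `ulimit -n` was run.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE in the Django process: soft={soft}, hard={hard}")

    ids = '.'.join(str(x) for x in queryset.values_list('id', flat=True))
    cmd = ' '.join(["nohup", settings.CRAWL_SH_ABS_PATH, "db_profiles_spider", "ids", ids, '&'])
    subprocess.call(cmd, shell=True)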