Forbidden problem

Problem description

I am using Scrapy 1.12 to crawl a classifieds website.

The crawler works on localhost, but it does not work on the server (CentOS).

I am using randomuseragent and randomproxy.

My settings.py file:

BOT_NAME = 'xx(https://xx.com)'

SPIDER_MODULES = ['xyz_crawler.spiders']
NEWSPIDER_MODULE = 'xyz_crawler.spiders'
ITEM_PIPELINES = {
    'xyz_crawler.pipelines.XmlWriterPipeline': 800,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'null'


# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8


# Retry many times since proxies often fail
RETRY_TIMES = 1
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500,503,504,400,403,404,408]

DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

RANDOM_UA_FILE = "xyz_crawler/useragents.txt"
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = 'xyz_crawler/proxies.txt'

# Proxy mode
# 0 = Every requests have different proxy
# 1 = Take only one proxy from the list and assign it to every requests
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0 

# If proxy mode is 2 uncomment this sentence :
#CUSTOM_PROXY = "http://host1:port"


DOWNLOADER_CLIENTCONTEXTFACTORY = 'xyz_crawler.contextfactory.CustomContextFactory'
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# disable cookies (enabled by default)
#COOKIES_ENABLED=False

# disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xyz_crawler.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
#    'xyz_crawler.middlewares.TestDownloader': 100,
#}
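For reference, here is a rough sketch of how the numbers in DOWNLOADER_MIDDLEWARES are interpreted: Scrapy calls each middleware's process_request() in increasing order of its number, and a value of None disables that middleware. The helper names `enabled_in_order` and `disabled` are made up for illustration, and this simplified model only looks at the project dict; it ignores the built-in defaults that Scrapy merges in.

```python
# Simplified model of downloader middleware ordering.
# Assumption: only the project-level dict is considered; Scrapy's built-in
# default middlewares, which are merged in at runtime, are left out.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}


def enabled_in_order(middlewares):
    """Return enabled middleware paths sorted by their number (lowest first)."""
    enabled = [(order, path) for path, order in middlewares.items()
               if order is not None]
    return [path for _order, path in sorted(enabled)]


def disabled(middlewares):
    """Return the middleware paths that are switched off with None."""
    return sorted(path for path, order in middlewares.items() if order is None)


if __name__ == "__main__":
    print("process_request order:")
    for path in enabled_in_order(DOWNLOADER_MIDDLEWARES):
        print("  ", path)
    print("disabled:", disabled(DOWNLOADER_MIDDLEWARES))
```

Under this model the retry middleware sees each request first, then the random proxy is chosen, and the random user agent is applied last.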

The proxy IP addresses and user agents are the same in both places.
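One way to double-check that the two list files really match on both machines is to fingerprint them and compare the output. This is only a sketch: the function name `list_fingerprint` is made up, and it assumes the files are UTF-8 text with one entry per line. Line endings, blank lines and entry order are deliberately ignored.

```python
import hashlib
from pathlib import Path


def list_fingerprint(path):
    """Hash the non-empty, stripped lines of a text file.

    Line endings, surrounding whitespace and entry order do not affect
    the result, so only real differences in the entries change the hash.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    entries = sorted(line.strip() for line in lines if line.strip())
    return hashlib.sha256("\n".join(entries).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Paths taken from PROXY_LIST and RANDOM_UA_FILE in settings.py.
    for name in ("xyz_crawler/proxies.txt", "xyz_crawler/useragents.txt"):
        p = Path(name)
        if p.exists():
            print(name, list_fingerprint(p))
        else:
            print(name, "not found")
```

Running the script on localhost and on the server and comparing the printed hashes shows whether the lists differ.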

I tried setting COOKIES_ENABLED to False, but it still does not work.

Why isn't this working?
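When reading the server log, it may help to know how many downloads to expect per request. The settings include 403 in RETRY_HTTP_CODES with RETRY_TIMES = 1, and RETRY_TIMES counts retries in addition to the first download. The sketch below is a simplified model of that rule, not Scrapy's actual code; `should_retry` and `simulate` are hypothetical names.

```python
# Simplified model of the retry rule, using the values from settings.py.
RETRY_TIMES = 1
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]


def should_retry(status, retries_so_far,
                 retry_times=RETRY_TIMES, retry_codes=RETRY_HTTP_CODES):
    """Return True if a response with this status would be retried again."""
    if status not in retry_codes:
        return False
    return retries_so_far + 1 <= retry_times


def simulate(statuses, retry_times=RETRY_TIMES, retry_codes=RETRY_HTTP_CODES):
    """Walk through the statuses the server would return on each attempt.

    Returns (number_of_downloads, final_status).
    """
    attempts = 0
    retries = 0
    final_status = None
    for status in statuses:
        attempts += 1
        final_status = status
        if should_retry(status, retries, retry_times, retry_codes):
            retries += 1
            continue
        break
    return attempts, final_status


if __name__ == "__main__":
    # A server that always answers 403: one download plus one retry.
    print(simulate([403, 403, 403]))
    # A server that answers 403 once and then 200.
    print(simulate([403, 200]))
```

Under this model, a request that keeps getting 403 is downloaded twice and then dropped, so a longer run of 403s for the same URL in the log would point to something other than the retry settings.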

Solution

No working solution to this problem has been found yet.


scrapy scrapy-splash screen-scraping web-crawler