Scrapy框架基本命令与settings.py设置

这篇文章主要介绍了Scrapy框架基本命令与settings.py设置,结合实例形式分析了创建爬虫项目、创建爬虫文件、存储、打开网页及settings.py设置等相关操作技巧,需要的朋友可以参考下

本文实例讲述了Scrapy框架基本命令与settings.py设置。分享给大家供大家参考,具体如下:

Scrapy框架基本命令

1.创建爬虫项目

scrapy startproject [项目名称]

2.创建爬虫文件

scrapy genspider +文件名+网址

3.运行(crawl)

scrapy crawl 爬虫名称 # -o output 输出数据到文件 scrapy crawl [爬虫名称] -o zufang.json scrapy crawl [爬虫名称] -o zufang.csv

4.check检查错误

scrapy check

5.list返回项目所有spider

scrapy list

6.view 存储、打开网页

scrapy view http://www.baidu.com

7.scrapy shell, 进入终端

scrapy shell https://www.baidu.com

8.scrapy runspider

scrapy runspider zufang_spider.py

Scrapy框架: settings.py设置

# -*- coding: utf-8 -*- # Scrapy settings for maitian project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # https://doc.scrapy.org/en/latest/topics/settings.html # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html # https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'maitian' SPIDER_MODULES = ['maitian.spiders'] NEWSPIDER_MODULE = 'maitian.spiders' #不能批量设置 # Crawl responsibly by identifying yourself (and your website) on the user-agent USER_AGENT = 'maitian (+http://www.yourdomain.com)' #认遵守robots协议 # Obey robots.txt rules ROBOTSTXT_OBEY = False #设置日志文件 LOG_FILE="maitian.log" #日志等级分为5种:1.DEBUG 2.INFO 3.Warning 4.ERROR 5.CRITICAL #等级越高 输出的日志越少 # LOG_LEVEL="INFO" #scrapy设置最大并发数 认16 # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 #设置批量延迟请求16 等待3秒再发16 秒 # Configure a delay for requests for the same website (default: 0) # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 #cookie 不生效 认是True # disable cookies (enabled by default) #COOKIES_ENABLED = False #远程 # disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False #加载认的请求头 # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} #爬虫中间件 # Enable or disable spider middlewares # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'maitian.middlewares.MaitianSpiderMiddleware': 543, #} #下载中间件 # Enable or disable downloader middlewares # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'maitian.middlewares.MaitianDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://doc.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} #在配置文件 开启管道 #优先级的范围 0--1000;值越小 优先级越高 # Configure item pipelines # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html #ITEM_PIPELInes = { # 'maitian.pipelines.MaitianPipeline': 300, #} # Enable and configure the AutoThrottle extension (disabled by default) # See https://doc.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHRottLE_ENABLED = True # The initial download delay #AUTOTHRottLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHRottLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each Remote Server #AUTOTHRottLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHRottLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGnorE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

更多相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家基于Scrapy框架的Python程序设计有所帮助。

相关文章

功能概要:(目前已实现功能)公共展示部分:1.网站首页展示...
大体上把Python中的数据类型分为如下几类: Number(数字) ...
开发之前第一步,就是构造整个的项目结构。这就好比作一幅画...
源码编译方式安装Apache首先下载Apache源码压缩包,地址为ht...
前面说完了此项目的创建及数据模型设计的过程。如果未看过,...
python中常用的写爬虫的库有urllib2、requests,对于大多数比...