Scrapy Framework: Basic Commands and settings.py Configuration
This article walks through the Scrapy framework's basic commands and its settings.py configuration file, with examples. It is shared here for your reference; the details are as follows.
Scrapy framework: basic commands
1. Create a crawler project
scrapy startproject [project_name]
2. Create a spider file
scrapy genspider [spider_name] [domain]
3. Run a spider (crawl)
scrapy crawl [spider_name]
# -o exports the scraped items to a file:
scrapy crawl [spider_name] -o zufang.json
scrapy crawl [spider_name] -o zufang.csv
4. check: check the project's spiders for errors
scrapy check
5. list: list all spiders in the project
scrapy list
6. view: download a page and open it in the browser
scrapy view http://www.baidu.com
7. shell: open an interactive scraping console
scrapy shell https://www.baidu.com
8. runspider: run a self-contained spider file without creating a project
scrapy runspider zufang_spider.py
Scrapy framework: settings.py configuration
# -*- coding: utf-8 -*-

# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maitian'

SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'

# This cannot be set in batch
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'

# robots.txt is obeyed by default
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Write log output to a file
LOG_FILE = "maitian.log"

# There are 5 log levels: 1. DEBUG  2. INFO  3. WARNING  4. ERROR  5. CRITICAL
# The higher the level, the less log output
#LOG_LEVEL = "INFO"

# Maximum number of concurrent requests; defaults to 16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between batches of requests: send 16, wait 3 seconds, send the next 16
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Set to False to disable cookies; the default is True
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Remote (telnet) console
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Default request headers
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}

# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Enable item pipelines in this configuration file
# Priority values range from 0 to 1000; lower values run first
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maitian.pipelines.MaitianPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
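The commented-out ITEM_PIPELINES setting above refers to a MaitianPipeline class in maitian/pipelines.py. A pipeline is simply a class with a process_item method that Scrapy calls once per scraped item. A minimal sketch follows; the 'price' field and the cleaning rule are assumptions for illustration, not part of the original project.

```python
# A minimal item-pipeline sketch. ITEM_PIPELINES maps the class path to a
# priority, e.g. {'maitian.pipelines.MaitianPipeline': 300}; lower numbers
# run earlier. The 'price' field and its cleaning rule are hypothetical.
class MaitianPipeline:
    def process_item(self, item, spider):
        # Normalize the price field: strip whitespace and a trailing
        # currency character ("元")
        price = item.get("price")
        if isinstance(price, str):
            item["price"] = price.strip().rstrip("元")
        # Returning the item hands it to the next pipeline in priority order
        return item
```

Returning the item keeps it in the pipeline chain; raising scrapy.exceptions.DropItem inside process_item would discard it instead.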
Hopefully this article is of some help to readers doing Python development with the Scrapy framework.