Scrapy Framework Basic Commands and settings.py Configuration
This article walks through the basic Scrapy commands and the settings available in settings.py, shared here for reference. The details follow.
Scrapy framework basic commands

1. Create a crawler project
scrapy startproject [project_name]
2. Create a spider file
scrapy genspider [spider_name] [url]
3. Run a spider (crawl)
scrapy crawl [spider_name]
# -o exports the scraped data to a file:
scrapy crawl [spider_name] -o zufang.json
scrapy crawl [spider_name] -o zufang.csv
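The -o flag above uses Scrapy's feed exporters; a .json feed is written as a single JSON array of item objects. A quick standalone sketch of loading such a feed back into Python (the sample data below is invented for illustration):

```python
import json

# A "scrapy crawl ... -o zufang.json" feed is one JSON array of item objects.
# The sample content here is a made-up stand-in for a real exported file.
sample_feed = '[{"title": "two-room flat", "price": 3500}, {"title": "studio", "price": 1800}]'

items = json.loads(sample_feed)
first_price = items[0]["price"]
```

The same data exported with -o zufang.csv would instead be one CSV row per item.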
4. check: check the project for errors
scrapy check
5. list: list all spiders in the project
scrapy list
6. view: save a page and open it in the browser
scrapy view http://www.baidu.com
7. shell: open an interactive scraping console
scrapy shell https://www.baidu.com
8. runspider: run a spider file directly, without a project
scrapy runspider zufang_spider.py
Scrapy framework: settings.py configuration
# -*- coding: utf-8 -*-

# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maitian'

SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'

# Cannot be set in batches
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'

# The robots protocol is obeyed by default
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Set the log file
LOG_FILE = "maitian.log"
# There are five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output
#LOG_LEVEL = "INFO"

# Maximum concurrency; the default is 16
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between batches of requests: send 16, wait 3 seconds, then send the next 16
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Whether cookies take effect; the default is True
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Remote console
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Default request headers to load
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Spider middlewares
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}

# Downloader middlewares
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Enable item pipelines in this configuration file
# Priority ranges from 0 to 1000; the smaller the value, the higher the priority
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maitian.pipelines.MaitianPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
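Each key in ITEM_PIPELINES points at a pipeline class. As a sketch of what such a class looks like (the class name and the price-cleaning logic here are hypothetical, not from the maitian project), a pipeline only needs a process_item(item, spider) method that returns the item:

```python
# pipelines.py -- sketch of a pipeline that ITEM_PIPELINES could enable, e.g.
#   ITEM_PIPELINES = {'maitian.pipelines.MaitianPricePipeline': 300}
# The class name and cleaning rule are hypothetical examples.
class MaitianPricePipeline:
    """Normalize the price field of each scraped item to an int."""

    def process_item(self, item, spider):
        raw = str(item.get("price", ""))
        # Keep only the digits, e.g. " 3500 yuan/month " -> 3500.
        digits = "".join(ch for ch in raw if ch.isdigit())
        item["price"] = int(digits) if digits else None
        return item
```

With priority 300 it runs after any pipeline registered with a smaller value, since lower values run first.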
Hopefully this article is helpful for readers doing Python programming with the Scrapy framework.