Scrapy Crawler Example: the Xiaohuar Site (校花网)
I have been learning web scraping for a while; today I used the Scrapy framework to download the images from the Xiaohuar site to my local disk. Compared with crawling pages directly with the requests library, the Scrapy framework offers considerably better performance.
The official definition: Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Creating a Scrapy Project
With Scrapy installed, create the project directly from the command line:
E:\ScrapyDemo>scrapy startproject xiaohuar
New Scrapy project 'xiaohuar', using template directory 'c:\\users\\lei\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    E:\ScrapyDemo\xiaohuar
You can start your first spider with:
    cd xiaohuar
    scrapy genspider example example.com
Creating a Scrapy Spider
Creating the project also creates a directory with the same name as the project. Change into that directory and run:
E:\ScrapyDemo\xiaohuar>scrapy genspider -t basic xiaohua xiaohuar.com
Created spider 'xiaohua' using template 'basic' in module:
  xiaohuar.spiders.xiaohua
In this command, "xiaohua" becomes the file name of the generated *.py spider file, and "xiaohuar.com" is the domain of the site to crawl; both can be changed later in the code.
Writing the Spider Code
Edit the file xiaohua.py under E:\ScrapyDemo\xiaohuar\xiaohuar\spiders. The main tasks are configuring the URLs and defining how the requested pages are parsed.
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['xiaohuar.com']
    start_urls = []
    for i in range(43):
        url = "http://www.xiaohuar.com/list-1-%s.html" % i
        start_urls.append(url)

    def parse(self, response):
        if "www.xiaohuar.com/list-1" in response.url:
            # The downloaded HTML source
            html = response.text
            # Images are referenced in the page like:
            # src="/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg"
            # Match all of them with a regular expression; the result
            # is a list of relative image paths.
            img_urls = re.findall(r'/d/file/\d+/\w+\.jpg', html)
            # Request each image in turn
            for img_url in img_urls:
                # Complete the image URL if it is relative
                if "http://" not in img_url:
                    img_url = "http://www.xiaohuar.com%s" % img_url
                # Yield the request; its response comes back to parse()
                yield Request(img_url)
        else:
            # This response is an image: save it to disk
            url = response.url
            # File name under which to save the image
            title = re.findall(r'\w+\.jpg', url)[0]
            # Write the binary image data
            with open('E:\\xiaohua_img\\%s' % title, 'wb') as f:
                f.write(response.body)
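As an aside (not how the spider above does it), the URL completion done with string concatenation can also be handled by the standard library's urljoin, which resolves relative paths more generally:

```python
from urllib.parse import urljoin

# The list page the image path was extracted from
base = "http://www.xiaohuar.com/list-1-0.html"
# A relative image path, as matched by the regular expression
rel = "/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg"

# urljoin resolves the relative path against the page URL
img_url = urljoin(base, rel)
print(img_url)
```

This also copes with paths that are relative to the current directory rather than the site root, which plain concatenation does not.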
Regular expressions are used here to match the image addresses. Other sites work much the same way; you just need to analyze the specific page source and adjust the pattern accordingly.
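To illustrate, here is the same pattern run against a small made-up HTML fragment (the file names are invented for the example):

```python
import re

# A made-up fragment mimicking the list-page markup
html = '''
<img src="/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg">
<img src="/d/file/20151010/0123456789abcdef.jpg">
<img src="/images/logo.png">
'''

# The spider's pattern: /d/file/<digits>/<word chars>.jpg
img_urls = re.findall(r'/d/file/\d+/\w+\.jpg', html)
print(img_urls)
```

Note that the logo.png path is not matched, since the pattern only accepts .jpg files under /d/file/.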
Running the Crawler
E:\ScrapyDemo\xiaohuar>scrapy crawl xiaohua
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiaohuar)
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'xiaohuar', 'SPIDER_MODULES': ['xiaohuar.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'xiaohuar.spiders'}
2017-10-22 22:30:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-22 22:30:12 [scrapy.core.engine] INFO: Spider opened
2017-10-22 22:30:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-22 22:30:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-22 22:30:12 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://www.xiaohuar.com/list-1-0.html)
... (a further 21 near-identical "Crawled (200)" lines omitted)
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-22 22:30:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8785,
 'downloader/request_count': 24,
 'downloader/request_method_count/GET': 24,
 'downloader/response_bytes': 2278896,
 'downloader/response_count': 24,
 'downloader/response_status_count/200': 24,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 22, 14, 30, 15, 892287),
 'log_count/DEBUG': 25,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 24,
 'scheduler/dequeued': 23,
 'scheduler/dequeued/memory': 23,
 'scheduler/enqueued': 23,
 'scheduler/enqueued/memory': 23,
 'start_time': datetime.datetime(2017, 10, 22, 14, 30, 12, 698874)}
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Spider closed (finished)
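As an aside, instead of writing files by hand in parse(), Scrapy also ships a built-in ImagesPipeline that handles downloading and storing images. This is not what the spider above uses, but a minimal settings.py sketch would look roughly like this (the pipeline requires Pillow, and the spider would then yield items carrying an 'image_urls' field instead of requesting images itself):

```python
# settings.py -- sketch using Scrapy's built-in ImagesPipeline
# (assumption: Scrapy 1.x default class path; requires Pillow)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Directory where the pipeline stores downloaded images
IMAGES_STORE = 'E:\\xiaohua_img'
```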
Saving the Images
When saving images on Windows, the backslash "\" in the path must be escaped.
>>> import requests
>>> r = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1508693697147&di=23eb655d8e450f84cf39453bc1029bc0&imgtype=0&src=http%3A%2F%2Fb.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2Fc9fcc3cec3fdfc038b027f7bde3f8794a5c226fe.jpg")
>>> open("E:\xiaohua_img\01.jpg", 'wb').write(r.content)
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \xXX escape
>>> open("E:\\xiaohua_img\1.jpg", 'wb').write(r.content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument: 'E:\\xiaohua_img\x01.jpg'
>>> open("E:\\xiaohua_img\\1.jpg", 'wb').write(r.content)
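The errors above come from how Python interprets backslash escapes inside ordinary string literals. A small sketch of what each variant of the path actually contains, plus the raw-string alternative that sidesteps escaping entirely:

```python
# Fully escaped backslashes give the intended Windows path
p1 = "E:\\xiaohua_img\\01.jpg"
# An unescaped "\1" is an octal escape and becomes the control character \x01,
# which is why open() failed with "Invalid argument"
p2 = "E:\\xiaohua_img\1.jpg"
print(repr(p2))
# A raw string leaves backslashes untouched, so no escaping is needed
p3 = r"E:\xiaohua_img\01.jpg"
print(p1 == p3)
```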
That is the whole of this Scrapy crawler example for the Xiaohuar site. I hope it serves as a useful reference, and thank you for your continued support of 毛票票.