Scrapy Crawler Example: the Xiaohuar Site (校花网)
I have been learning web scraping for a while; today I used the Scrapy framework to download the images from the Xiaohuar site to my local disk. Compared with crawling pages directly with the requests library, the Scrapy framework offers considerably better performance.
The official definition: Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Creating a Scrapy Project
With Scrapy installed, create the project directly from the command line:
E:\ScrapyDemo>scrapy startproject xiaohuar
New Scrapy project 'xiaohuar', using template directory 'c:\\users\\lei\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    E:\ScrapyDemo\xiaohuar
You can start your first spider with:
    cd xiaohuar
    scrapy genspider example example.com
Creating a Scrapy Spider
Creating the project also creates a directory with the same name as the project. Change into that directory and run:
E:\ScrapyDemo\xiaohuar>scrapy genspider -t basic xiaohua xiaohuar.com
Created spider 'xiaohua' using template 'basic' in module:
  xiaohuar.spiders.xiaohua
In this command, "xiaohua" becomes the file name of the generated *.py spider file, and "xiaohuar.com" is the domain of the site to crawl; both can be changed later in the code.
Writing the Spider Code
Edit the file xiaohua.py under E:\ScrapyDemo\xiaohuar\xiaohuar\spiders. The main tasks are configuring the URLs and defining how the requested pages are parsed.
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['xiaohuar.com']
    start_urls = []
    for i in range(43):
        url = "http://www.xiaohuar.com/list-1-%s.html" % i
        start_urls.append(url)

    def parse(self, response):
        if "www.xiaohuar.com/list-1" in response.url:
            # The downloaded HTML source
            html = response.text
            # Images are referenced in the page like:
            # src="/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg"
            # Match all of them with a regular expression; the result
            # is a list of relative image paths.
            img_urls = re.findall(r'/d/file/\d+/\w+\.jpg', html)
            # Request each image in turn
            for img_url in img_urls:
                # Complete the image URL if it is relative
                if "http://" not in img_url:
                    img_url = "http://www.xiaohuar.com%s" % img_url
                # Yield the request; its response comes back to parse()
                yield Request(img_url)
        else:
            # This response is an image: save it to disk
            url = response.url
            # File name under which to save the image
            title = re.findall(r'\w+\.jpg', url)[0]
            # Write the binary image data
            with open('E:\\xiaohua_img\\%s' % title, 'wb') as f:
                f.write(response.body)
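As an aside (not how the spider above does it), the URL completion done with string concatenation can also be handled by the standard library's urljoin, which resolves relative paths more generally:

```python
from urllib.parse import urljoin

# The list page the image path was extracted from
base = "http://www.xiaohuar.com/list-1-0.html"
# A relative image path, as matched by the regular expression
rel = "/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg"

# urljoin resolves the relative path against the page URL
img_url = urljoin(base, rel)
print(img_url)
```

This also copes with paths that are relative to the current directory rather than the site root, which plain concatenation does not.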
Regular expressions are used here to match the image addresses. Other sites work much the same way; you just need to analyze the specific page source and adjust the pattern accordingly.
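To illustrate, here is the same pattern run against a small made-up HTML fragment (the file names are invented for the example):

```python
import re

# A made-up fragment mimicking the list-page markup
html = '''
<img src="/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg">
<img src="/d/file/20151010/0123456789abcdef.jpg">
<img src="/images/logo.png">
'''

# The spider's pattern: /d/file/<digits>/<word chars>.jpg
img_urls = re.findall(r'/d/file/\d+/\w+\.jpg', html)
print(img_urls)
```

Note that the logo.png path is not matched, since the pattern only accepts .jpg files under /d/file/.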
Running the Crawler
E:\ScrapyDemo\xiaohuar>scrapy crawl xiaohua
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiaohuar)
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'xiaohuar', 'SPIDER_MODULES': ['xiaohuar.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'xiaohuar.spiders'}
2017-10-22 22:30:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-22 22:30:12 [scrapy.core.engine] INFO: Spider opened
2017-10-22 22:30:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-22 22:30:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-22 22:30:12 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://www.xiaohuar.com/list-1-0.html)
... (a further 21 near-identical "Crawled (200)" lines omitted)
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-22 22:30:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8785,
 'downloader/request_count': 24,
 'downloader/request_method_count/GET': 24,
 'downloader/response_bytes': 2278896,
 'downloader/response_count': 24,
 'downloader/response_status_count/200': 24,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 22, 14, 30, 15, 892287),
 'log_count/DEBUG': 25,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 24,
 'scheduler/dequeued': 23,
 'scheduler/dequeued/memory': 23,
 'scheduler/enqueued': 23,
 'scheduler/enqueued/memory': 23,
 'start_time': datetime.datetime(2017, 10, 22, 14, 30, 12, 698874)}
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Spider closed (finished)
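As an aside, instead of writing files by hand in parse(), Scrapy also ships a built-in ImagesPipeline that handles downloading and storing images. This is not what the spider above uses, but a minimal settings.py sketch would look roughly like this (the pipeline requires Pillow, and the spider would then yield items carrying an 'image_urls' field instead of requesting images itself):

```python
# settings.py -- sketch using Scrapy's built-in ImagesPipeline
# (assumption: Scrapy 1.x default class path; requires Pillow)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Directory where the pipeline stores downloaded images
IMAGES_STORE = 'E:\\xiaohua_img'
```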
Saving the Images
When saving images on Windows, the backslash "\" in the path must be escaped.
>>> import requests
>>> r = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1508693697147&di=23eb655d8e450f84cf39453bc1029bc0&imgtype=0&src=http%3A%2F%2Fb.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2Fc9fcc3cec3fdfc038b027f7bde3f8794a5c226fe.jpg")
>>> open("E:\xiaohua_img\01.jpg", 'wb').write(r.content)
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \xXX escape
>>> open("E:\\xiaohua_img\1.jpg", 'wb').write(r.content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument: 'E:\\xiaohua_img\x01.jpg'
>>> open("E:\\xiaohua_img\\1.jpg", 'wb').write(r.content)
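The errors above come from how Python interprets backslash escapes inside ordinary string literals. A small sketch of what each variant of the path actually contains, plus the raw-string alternative that sidesteps escaping entirely:

```python
# Fully escaped backslashes give the intended Windows path
p1 = "E:\\xiaohua_img\\01.jpg"
# An unescaped "\1" is an octal escape and becomes the control character \x01,
# which is why open() failed with "Invalid argument"
p2 = "E:\\xiaohua_img\1.jpg"
print(repr(p2))
# A raw string leaves backslashes untouched, so no escaping is needed
p3 = r"E:\xiaohua_img\01.jpg"
print(p1 == p3)
```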
That is the whole of this Scrapy crawler example for the Xiaohuar site. I hope it serves as a useful reference, and thank you for your continued support of 毛票票.