Scraping Douban movie data with Python Scrapy and saving it to MongoDB
Create the Scrapy project `douban`:

```shell
scrapy startproject douban
```
Set up `items.py` to declare the fields that will be stored:

```python
# -*- coding: utf-8 -*-
import scrapy


class DoubanItem(scrapy.Item):
    # title
    title = scrapy.Field()
    # content
    content = scrapy.Field()
    # rating
    rating_num = scrapy.Field()
    # short quote
    quote = scrapy.Field()
```
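A `scrapy.Item` behaves like a Python dictionary whose allowed keys are the declared fields; assigning an undeclared key raises an error. A minimal sketch of that behaviour, using a plain `dict` subclass as a stand-in so it runs without Scrapy installed (`FakeItem` and its hard-coded field list are illustrative assumptions, not Scrapy API):

```python
class FakeItem(dict):
    """Stand-in for scrapy.Item: only declared fields may be set (assumption)."""
    fields = ("title", "content", "rating_num", "quote")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        super().__setitem__(key, value)


item = FakeItem()
item["title"] = ["肖申克的救赎"]
item["rating_num"] = ["9.7"]
print(dict(item))  # a plain dict, which is what the pipeline later stores
```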
Set up the spider file `doubanmovies.py`:
```python
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanmoviesSpider(scrapy.Spider):
    name = 'doubanmovies'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/top250?start='
    start_urls = [url + str(offset)]

    def parse(self, response):
        # print('*' * 60)
        # print(response.url)
        # print('*' * 60)
        item = DoubanItem()
        info = response.xpath("//div[@class='info']")
        for each in info:
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
            item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
            item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
            item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
            yield item
            # print(item)

        # Follow the next page; the last page of the Top 250 starts at offset 225
        self.offset += 25
        if self.offset < 250:
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
```
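The XPath expressions above can be tried offline against a hand-written snippet that mimics one entry of the Top 250 list. In the sketch below, the markup is an assumed simplification of the real page (not the actual Douban source), and the standard library's `xml.etree.ElementTree` stands in for Scrapy's selectors; it supports only the limited XPath subset used here:

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup for one movie entry; not the real page source.
SNIPPET = """
<div class="info">
  <div class="hd"><span class="title">肖申克的救赎</span></div>
  <div class="bd">
    <p>导演: 弗兰克·德拉邦特</p>
    <span class="rating_num">9.7</span>
    <span class="inq">希望让人自由。</span>
  </div>
</div>
"""

root = ET.fromstring("<html>%s</html>" % SNIPPET)
for each in root.findall(".//div[@class='info']"):
    # Same paths as the spider, minus /text(): ElementTree exposes .text instead
    title = each.find(".//span[@class='title']").text
    content = each.find(".//div[@class='bd']/p[1]").text
    rating_num = each.find(".//span[@class='rating_num']").text
    quote = each.find(".//span[@class='inq']").text
    print(title, content, rating_num, quote)
```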
Set up the pipeline file to save the scraped data to MongoDB. This is the key part.
```python
# -*- coding: utf-8 -*-
from scrapy.conf import settings
import pymongo


class DoubanPipeline(object):
    def __init__(self):
        self.host = settings['MONGODB_HOST']
        self.port = settings['MONGODB_PORT']

    def process_item(self, item, spider):
        # Create the MongoDB client connection. Here the host and port are
        # read from settings.py; they could also be written in directly.
        self.client = pymongo.MongoClient(self.host, self.port)
        # Create (or reuse) the database douban
        self.mydb = self.client['douban']
        # Create (or reuse) the collection doubanmovies inside douban
        self.mysheetname = self.mydb['doubanmovies']
        # Convert the dict-like item into a plain Python dict
        content = dict(item)
        # Insert the data into the collection
        # (insert_one replaces the insert() method removed in pymongo 4)
        self.mysheetname.insert_one(content)
        return item
```
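One refinement worth noting: the pipeline above creates a new `MongoClient` for every item. Scrapy pipelines usually open the connection once in `open_spider` and close it in `close_spider`. A minimal sketch of that pattern follows; to keep it runnable without a MongoDB server, `StubCollection` is an illustrative stand-in for a real pymongo collection, and the real connection lines appear only as comments:

```python
class StubCollection:
    """Stand-in for a pymongo Collection; just records inserted documents."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class DoubanPipeline:
    def open_spider(self, spider):
        # Real code would be:
        #   self.client = pymongo.MongoClient(host, port)
        #   self.sheet = self.client['douban']['doubanmovies']
        self.sheet = StubCollection()

    def close_spider(self, spider):
        # Real code would call self.client.close()
        pass

    def process_item(self, item, spider):
        self.sheet.insert_one(dict(item))  # convert the item to a plain dict
        return item


pipeline = DoubanPipeline()
pipeline.open_spider(None)
pipeline.process_item({"title": ["肖申克的救赎"], "rating_num": ["9.7"]}, None)
pipeline.close_spider(None)
print(pipeline.sheet.docs)
```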
Set up `settings.py`:
```python
# -*- coding: utf-8 -*-
BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
```
Test from the terminal (note that `scrapy crawl` takes the spider's `name` attribute, `doubanmovies`, not the project name):

```shell
scrapy crawl doubanmovies
```
That's all for this article. I hope it helps with your studies, and I hope you will continue to support 毛票票.