Saving scraped Douban movie data to MongoDB with Python: a walkthrough
Create the Scrapy project douban
scrapy startproject douban
Edit items.py to define the item type and the names of the fields to be saved
# -*- coding: utf-8 -*-
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    # Content (director, cast and similar details)
    content = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Short description
    quote = scrapy.Field()
Write the spider file doubanmovies.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanmoviesSpider(scrapy.Spider):
    name = 'doubanmovies'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/top250?start='
    start_urls = [url + str(offset)]

    def parse(self, response):
        # print(response.url)
        # Each movie on the page sits in its own <div class="info"> block
        info = response.xpath("//div[@class='info']")
        for each in info:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanItem()
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
            item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
            item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
            item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
            yield item

        # The list is paged 25 movies at a time: start=0, 25, ..., 225
        self.offset += 25
        if self.offset < 250:
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
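The paging arithmetic in parse can be checked on its own, without running a crawl. The sketch below assumes the list holds 250 entries at 25 per page, as the offset step in the spider implies, and lists the URLs that contain movies. The helper name page_urls is hypothetical and not part of Scrapy.

```python
# Base URL taken from the spider; page_urls is a hypothetical helper.
BASE_URL = 'https://movie.douban.com/top250?start='


def page_urls(total=250, page_size=25):
    """Return the listing URLs for offsets 0, page_size, ... below total."""
    return [BASE_URL + str(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of listing pages
print(urls[0])     # first page
print(urls[-1])    # last page that still contains movies
```

Generating the URLs up front like this is an alternative to incrementing self.offset inside parse; the list could be assigned to start_urls directly.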
Write the pipeline file (pipelines.py), which saves the scraped data to MongoDB. This is the key part
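# -*- coding: utf-8 -*-
import pymongo


class DoubanPipeline(object):
    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # This example reads the MongoDB host and port from settings.py;
        # they could also be written here directly
        return cls(crawler.settings.get('MONGODB_HOST'),
                   crawler.settings.getint('MONGODB_PORT'))

    def open_spider(self, spider):
        # Create the MongoDB client once, when the spider starts
        self.client = pymongo.MongoClient(self.host, self.port)
        # Select the database douban (MongoDB creates it on first write)
        self.mydb = self.client['douban']
        # Select the collection doubanmovies inside the database douban
        self.mysheetname = self.mydb['doubanmovies']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Convert the dict-like item into a plain Python dict
        content = dict(item)
        # Add the data to the collection
        self.mysheetname.insert_one(content)
        return item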
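Because extract() returns a list of strings for every field, the pipeline stores lists that may include stray whitespace. One possible refinement, shown here only as a sketch, is to flatten each field before it is inserted. The helpers first_clean and to_document are hypothetical names, and the sample values are made up to resemble extract() output.

```python
def first_clean(values):
    """Return the first non-empty string in a list, with whitespace collapsed."""
    for value in values:
        text = ' '.join(value.split())
        if text:
            return text
    return None


def to_document(item):
    """Build a flat dict from a dict-like item whose values are lists of strings."""
    raw = dict(item)
    rating = first_clean(raw.get('rating_num', []))
    return {
        'title': first_clean(raw.get('title', [])),
        'content': first_clean(raw.get('content', [])),
        # Store the rating as a number so it can be sorted and compared
        'rating_num': float(rating) if rating is not None else None,
        'quote': first_clean(raw.get('quote', [])),
    }


# Made-up sample shaped like the spider's output (not real scraped data)
sample = {
    'title': ['Example Movie'],
    'content': ['\n      Director: Someone   Starring: Someone Else\n    '],
    'rating_num': ['9.5'],
    'quote': [],
}
doc = to_document(sample)
print(doc)
```

In process_item, to_document(item) would take the place of dict(item). A movie without a quote then ends up with None in that field instead of an empty list.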
Configure settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
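pymongo.MongoClient accepts either a host and a port, as in the pipeline, or a single connection string of the form mongodb://host:port. The sketch below builds that string from the two settings; the helper name mongo_uri is hypothetical.

```python
# Values copied from settings.py
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017


def mongo_uri(host, port):
    """Build a basic MongoDB connection string (no credentials or options)."""
    return 'mongodb://{}:{}'.format(host, port)


uri = mongo_uri(MONGODB_HOST, MONGODB_PORT)
print(uri)
```

Keeping one URI setting instead of two separate values can be convenient if credentials or a replica set are added later, but the two-setting form used in this article works just as well for a local server.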
Test from the terminal
scrapy crawl doubanmovies
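Once the crawl has finished, the stored data can be inspected from Python. The sketch below is only an illustration: summarize is a hypothetical helper that works on any object offering pymongo's count_documents and find_one methods, and FakeCollection is an in-memory stand-in so the example runs without a database server.

```python
def summarize(collection):
    """Return (document count, one sample document) for a pymongo-style collection."""
    count = collection.count_documents({})
    sample = collection.find_one()
    return count, sample


class FakeCollection(object):
    """In-memory stand-in with the two methods used above (demonstration only)."""

    def __init__(self, docs):
        self.docs = list(docs)

    def count_documents(self, query):
        # This stand-in ignores the filter and counts everything
        return len(self.docs)

    def find_one(self):
        return self.docs[0] if self.docs else None


# Made-up document shaped like the spider's items
fake = FakeCollection([{'title': ['Example Movie'], 'rating_num': ['9.5']}])
count, sample = summarize(fake)
print(count, sample)

# Against a real server (assumes MongoDB is running on 127.0.0.1:27017):
#   import pymongo
#   client = pymongo.MongoClient('127.0.0.1', 27017)
#   print(summarize(client['douban']['doubanmovies']))
```

If every page was fetched and parsed, the count against the real collection should come to 250 after a single run; running the crawl again adds the same movies a second time, since the pipeline inserts without checking for duplicates.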
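A side note on code snippets in cnblogs (博客园): does indentation really require four spaces? As far as I can tell, using four spaces was the only way to get the code blocks above to indent correctly.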