A brief introduction to talonspider, a Python crawler framework
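1. Why write this?
Some simple pages do not justify a heavyweight framework, yet writing a crawler entirely by hand is tedious.
talonspider was written to fill that gap. It provides:
• 1. Item extraction from a single page (see section 2.1)
• 2. A spider module (see section 2.2)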
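2. Introduction and usage
2.1. item
This module can be used on its own. For sites whose requests are simple (for example, a plain GET is enough), this module alone lets you write the crawler you want quickly. The examples below use Python 3; for Python 2, see the examples directory.
2.1.1. Single page, single target
For example, to fetch the book details, cover image and so on from http://book.qidian.com/info/1004608738, you can write: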
from talonspider import Item, TextField, AttrField
from pprint import pprint


class TestSpider(Item):
    title = TextField(css_select='.book-info>h1>em')
    author = TextField(css_select='a.writer')
    cover = AttrField(css_select='a#bookImg>img', attr='src')

    def tal_title(self, title):
        return title

    def tal_cover(self, cover):
        # The page gives a protocol-relative URL, so prepend the scheme
        return 'http:' + cover


if __name__ == '__main__':
    item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
    pprint(item_data)
For the full example, see qidian_details_by_item.py.
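The tal_title and tal_cover methods in the example look like per-field post-processing hooks: a method named tal_<field> receives the raw extracted value and returns the cleaned one. Below is a minimal sketch of how such a naming convention could be dispatched in plain Python. It is not talonspider's actual implementation; HookedRecord, clean and BookRecord are hypothetical names used only for illustration, and the input dictionary stands in for values a selector would have extracted.

```python
class HookedRecord:
    """Hypothetical base class: applies tal_<field> hooks to raw values."""

    def clean(self, raw):
        cleaned = {}
        for name, value in raw.items():
            # Look up an optional hook named tal_<field>; keep the value as-is otherwise.
            hook = getattr(self, 'tal_' + name, None)
            cleaned[name] = hook(value) if callable(hook) else value
        return cleaned


class BookRecord(HookedRecord):
    def tal_cover(self, cover):
        # Protocol-relative URLs such as //host/img.jpg need a scheme.
        return 'http:' + cover if cover.startswith('//') else cover


if __name__ == '__main__':
    raw = {'title': 'Some Book', 'cover': '//example.com/cover.jpg'}
    print(BookRecord().clean(raw))
```

Fields without a matching hook pass through unchanged, which is why the sketch does not need a tal_title method at all.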
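2.1.2. Single page, multiple targets
For example, to fetch the 25 movies shown on the first page of the Douban Top 250 list, where a single page contains 25 targets, you can write: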
from talonspider import Item, TextField, AttrField
from pprint import pprint


# Define a crawler class that inherits from Item
class DoubanSpider(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


if __name__ == '__main__':
    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
    result = []
    for item in items_data:
        result.append({
            'title': item.title,
            'cover': item.cover,
            'abstract': item.abstract,
        })
    pprint(result)
For the full example, see douban_page_by_item.py.
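In the multi-target case, target_item selects the repeating container, and the remaining fields are then resolved inside each container. The following is a rough standard-library sketch of that idea. It assumes a small well-formed fragment and uses xml.etree.ElementTree path expressions rather than talonspider's CSS selectors; the markup and the extract_items helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for a listing page.
PAGE = """
<ol>
  <li><div class="item"><span class="title">Movie A</span><span class="inq">Quote A</span></div></li>
  <li><div class="item"><span class="title">Movie B</span></div></li>
</ol>
"""


def extract_items(markup):
    """Return one dict per div.item, resolving fields relative to that div."""
    root = ET.fromstring(markup)
    items = []
    for block in root.iterfind(".//div[@class='item']"):
        title = block.find("./span[@class='title']")
        quote = block.find("./span[@class='inq']")
        items.append({
            'title': title.text.strip() if title is not None else '',
            # Not every block has a quote, so default to an empty string.
            'abstract': quote.text.strip() if quote is not None else '',
        })
    return items


if __name__ == '__main__':
    for item in extract_items(PAGE):
        print(item)
```

Resolving each field relative to its container keeps values from different entries from getting mixed up when one entry is missing a field.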
2.2. spider
When the pages to crawl are hierarchical, for example when crawling all of the Douban Top 250 movies, the spider module comes into play:
#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


# Define an item class that inherits from Item
class DoubanItem(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
    # Start URLs (required)
    start_urls = ['https://movie.douban.com/top250']
    # Request configuration
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,
        'TIMEOUT': 20
    }

    # A parse method is required
    def parse(self, html):
        # Convert the HTML into an etree
        etree = self.e_html(html)
        # Extract the pagination links and build new URLs from them
        pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
        pages.insert(0, '?start=0&filter=')
        headers = {
            "User-Agent": get_random_user_agent()
        }
        for page in pages:
            url = self.start_urls[0] + page
            yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

    def parse_item(self, html):
        items_data = DoubanItem.get_items(html=html)
        # result = []
        for item in items_data:
            # result.append({
            #     'title': item.title,
            #     'cover': item.cover,
            #     'abstract': item.abstract,
            # })
            # Save the title
            with open('douban250.txt', 'a+') as f:
                f.writelines(item.title + '\n')


if __name__ == '__main__':
    DoubanSpider.start()
Console output:
/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage: 0:00:04.462108

Process finished with exit code 0
At this point a douban250.txt file has been created in the current directory. For the full example, see douban_page_by_spider.py.
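The spider example relies on a callback pattern: parse yields Request objects, and each response is handed to the callback attached to its request. The sketch below shows that control flow in isolation. It is an illustration under stated assumptions, not talonspider's scheduler: FakeRequest, run and the in-memory PAGES dictionary (which replaces real HTTP requests) are all hypothetical, and the URLs are placeholders.

```python
from collections import deque
from urllib.parse import urljoin

# In-memory stand-in for the network: URL -> response body.
PAGES = {
    'https://example.com/list': '?start=0 ?start=25',
    'https://example.com/list?start=0': 'Movie A\nMovie B',
    'https://example.com/list?start=25': 'Movie C',
}


class FakeRequest:
    """Hypothetical request object pairing a URL with a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def run(start_url, parse):
    """Process requests first-in, first-out, queueing any a callback yields."""
    queue = deque([FakeRequest(start_url, parse)])
    while queue:
        request = queue.popleft()
        body = PAGES[request.url]
        # A callback may return None or yield follow-up requests.
        for follow_up in request.callback(body) or ():
            queue.append(follow_up)


titles = []


def parse(body):
    # The listing page holds relative links; resolve each against the base URL.
    for href in body.split():
        yield FakeRequest(urljoin('https://example.com/list', href), parse_item)


def parse_item(body):
    # Leaf callback: collect results and yield nothing further.
    titles.extend(body.splitlines())


if __name__ == '__main__':
    run('https://example.com/list', parse)
    print(titles)
```

The sketch resolves links with urljoin where the original example concatenates start_urls[0] + page. Both give the same result for query-only links like these; urljoin is used here only because it would also cope with absolute links.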
3. Notes
This is a learning project and there is still plenty to improve. Suggestions are welcome. Project page: talonspider.