利用Node.js制作爬取大众点评的爬虫

2024-03-12 10:03:04 362

前言

Node.js天生支持并发，但是对于习惯了顺序编程的人，一开始会对Node.js不适应，比如，变量作用域是函数块式的（与C、Java不一样）；for循环体（{}）内引用i的值实际上是循环结束之后的值，因而引起各种undefined的问题；嵌套函数时，内层函数的变量并不能及时传导到外层（因为是异步）等等。

一、 API分析

大众点评开放了查询餐馆信息的API，这里给出了城市与cityid之间的对应关系，

链接：http://m.api.dianping.com/searchshop.json?&regionid=0&start=0&categoryid=10&sortid=0&cityid=110

以GET方式给出了餐馆的信息（JSON格式）。

首先解释下GET参数的含义：

1、start为步进数，表示分步获取信息的index，与nextStartIndex字段相对应；

2、cityid表示城市id，比如，合肥对应于110；

3、regionid表示区域id，每一个id代表含义在start=0时rangeNavs字段中有解释；

4、categoryid表示搜索商家的分类id，比如，美食对应的id为10，具体每一个id的含义参见在start=0时categoryNavs字段；

5、sortid表示商家结果的排序方式，比如，0对应智能排序，2对应评价最好，具体每一个id的含义参见在start=0时sortNavs字段。

在GET返回的JSON串中list字段为商家列表，id表示商家的id，作为商家的唯一标识。在返回的JSON串中是没有商家的口味、环境、服务的评分信息以及经纬度的；

因而我们还需要爬取两个商家页面：http://m.dianping.com/shop/<id>、http://m.dianping.com/shop/<id>/map。

通过以上分析，确定爬取策略如下（与dianping_crawler的思路相类似）：

1、逐步爬取searchshopAPI的取商家基本信息列表；

2、通过爬取的所有商家的id，异步并发爬取评分信息、经纬度；

3、最后将三份数据通过id做聚合，输出成json文件。

二、爬虫实现

Node.js爬虫代码用到如下的第三方模块：

1、superagent，轻量级http请求库，模仿了浏览器登录；

2、cheerio，采用jQuery语法解析HTML元素，跟Python的PyQuery相类似；

3、async，牛逼闪闪的异步流程控制库，Node.js的必学库。

导入依赖库：

varutil=require("util");varsuperagent=require("superagent");varcheerio=require("cheerio");varasync=require("async");varfs=require('fs');

声明全局变量，用于存放配置项及中间结果：

varcityOptions={"cityId":110,//合肥//全部商区,蜀山区,庐阳区,包河区,政务区,瑶海区,高新区,经开区,滨湖新区,其他地区,肥西县"regionIds":[0,356,355,357,8840,354,8839,8841,8843,358,-922],"categoryId":10,//美食"sortId":2,//人气最高"threshHold":5000//最多餐馆数};varidVisited={};//usedtodistinctshopvarratingDict={};//id->ratingsvarposDict={};//id->pos

判断一个id是否在前面出现过，若object没有该id，则为undefined（注意不是null）：

functionisVisited(id){if(idVisited[id]!=undefined){returntrue;}else{idVisited[id]=true;returnfalse;}}

采取回调函数的方式，实现顺序逐步地递归调用爬虫函数：

functionDianpingSpider(regionId,start,callback){console.log('crawlingregion=',regionId,',start=',start);varsearchBase='http://m.api.dianping.com/searchshop.json?&regionid=%s&start=%s&categoryid=%s&sortid=%s&cityid=%s';varurl=util.format(searchBase,regionId,start,cityOptions.categoryId,cityOptions.sortId,cityOptions.cityId);superagent.get(url).end(function(err,res){if(err)returnconsole.err(err.stack);varrestaurants=[];vardata=JSON.parse(res.text);varshops=data['list'];shops.forEach(function(shop){varrestaurant={};if(!isVisited(shop['id'])){restaurant.id=shop['id'];restaurant.name=shop['name'];restaurant.branchName=shop['branchName'];varregex=/(.*?)(\d+)(.*)/g;if(shop['priceText'].match(regex)){restaurant.price=parseInt(regex.exec(shop['priceText'])[2]);}else{restaurant.price=shop['priceText'];}restaurant.star=shop['shopPower']/10;restaurant.category=shop['categoryName'];restaurant.region=shop['regionName'];restaurants.push(restaurant);}});varnextStart=data['nextStartIndex'];if(nextStart>start&&nextStart<cityOptions.threshHold){DianpingSpider(regionId,nextStart,function(err,restaurants2){if(err)returncallback(err);callback(null,restaurants.concat(restaurants2))});}else{callback(null,restaurants);}});}

在调用爬虫函数时，采用async的mapLimit函数实现对并发的控制；采用async的until对并发的协同处理，保证三份数据结果的id一致性（不会因为并发完成时间不一致而丢数据）：

DianpingSpider(0,0,function(err,restaurants){if(err)returnconsole.err(err.stack);varconcurrency=0;varcrawlMove=function(id,callback){vardelay=parseInt((Math.random()*30000000)%1000,10);concurrency++;console.log('currentconcurrency:',concurrency,',nowcrawlingid=',id,',costs(ms):',delay);parseShop(id);parseMap(id);setTimeout(function(){concurrency--;callback(null,id);},delay);};async.mapLimit(restaurants,5,function(restaurant,callback){crawlMove(restaurant.id,callback)},function(err,ids){console.log('crawledids:',ids);varresultArray=[];async.until(function(){returnrestaurants.length===Object.keys(ratingDict).length&&restaurants.length===Object.keys(posDict).length},function(callback){setTimeout(function(){callback(null)},1000)},function(err){restaurants.forEach(function(restaurant){varrating=ratingDict[restaurant.id];varpos=posDict[restaurant.id];varresult=Object.assign(restaurant,rating,pos);resultArray.push(result);});writeAsJson(resultArray);});});});

其中，parseShop与parseMap分别为解析商家详情页、商家地图页：

functionparseShop(id){varshopBase='http://m.dianping.com/shop/%s';varshopUrl=util.format(shopBase,id);superagent.get(shopUrl).end(function(err,res){if(err)returnconsole.err(err.stack);console.log('crawlingshop:',shopUrl);varrestaurant={};var$=cheerio.load(res.text);vardesc=$("div.shopInfoPagelet>div.desc>span");restaurant.taste=desc.eq(0).text().split(":")[1];restaurant.surrounding=desc.eq(1).text().split(":")[1];restaurant.service=desc.eq(2).text().split(":")[1];ratingDict[id]=restaurant;});}functionparseMap(id){varmapBase='http://m.dianping.com/shop/%s/map';varmapUrl=util.format(mapBase,id);superagent.get(mapUrl).end(function(err,res){if(err)returnconsole.err(err.stack);console.log('crawlingmap:',mapUrl);varrestaurant={};var$=cheerio.load(res.text);vardata=$("body>script").text();varlatRegex=/(.*lat:)(\d+.\d+)(.*)/;varlngRegex=/(.*lng:)(\d+.\d+)(.*)/;if(data.match(latRegex)&&data.match(lngRegex)){restaurant.latitude=latRegex.exec(data)[2];restaurant.longitude=lngRegex.exec(data)[2];}else{restaurant.latitude='';restaurant.longitude='';}posDict[id]=restaurant;});}

将array的每一个商家信息，逐行写入到json文件中：

functionwriteAsJson(arr){fs.writeFile('data.json',arr.map(function(data){returnJSON.stringify(data);}).join('\n'),function(err){if(err)returnerr.stack;})}

总结

以上就是这篇文章的全部内容，希望本文能给学习或者使用node.js的朋友们带来一定的帮助，如果有疑问大家可以留言交流。

利用Node.js制作爬取大众点评的爬虫

热门推荐

随机推荐