MongoDB: MapReduce Basics and Examples
Outline:
- Background
- Examples
- Summary
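Background
MapReduce is a flexible and powerful data aggregation tool. Its main benefit is that one aggregation job can be split into many small tasks and distributed across multiple servers for parallel processing.
MongoDB also provides MapReduce, and the query language is, naturally, JavaScript. MapReduce in MongoDB goes through the following stages: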
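- Map: applies an operation to every document in the collection.
- Shuffle: groups the documents by key and builds, for each distinct key, one or more lists of values.
- Reduce: processes the elements of a value list until only one element remains. The result is then fed back into the Shuffle step, and the cycle repeats until each key maps to exactly one value list containing exactly one element. That is the MR result.
- Finalize: this step is optional. Once the final MR result is available, it applies some "trimming" post-processing to the data.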
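The stages above can be sketched in plain JavaScript, without a MongoDB server. This is only an illustration under stated assumptions: runMapReduce is a hypothetical helper (not a MongoDB API), emit is passed to map as an argument rather than being a global, the batch size of 3 is arbitrary, and the sample documents are made up.

```javascript
// runMapReduce is a hypothetical helper for illustration; it is not a MongoDB API.
// Unlike MongoDB, emit is passed to map as an argument instead of being a global.
function runMapReduce(docs, map, reduce, finalize) {
    var groups = {}; // key -> list of emitted values

    // Map + Shuffle: every emitted value is appended to the list for its key.
    function emit(key, value) {
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
        }
        groups[key].push(value);
    }
    docs.forEach(function (doc) {
        map.call(doc, emit); // inside map, "this" is the current document
    });

    var results = {};
    Object.keys(groups).forEach(function (key) {
        var values = groups[key];
        // Reduce: work in batches of 3 and feed the partial results back in
        // until a single value is left for this key.
        while (values.length > 1) {
            var next = [];
            for (var i = 0; i < values.length; i += 3) {
                next.push(reduce(key, values.slice(i, i + 3)));
            }
            values = next;
        }
        // Finalize: optional post-processing of the single remaining value.
        results[key] = finalize ? finalize(key, values[0]) : values[0];
    });
    return results;
}

// Made-up sample documents shaped like the test collection in this article.
var docs = [
    {user: "Joe", sku: 0, price: 1.5},
    {user: "Joe", sku: 1, price: 2.5},
    {user: "Josh", sku: 4, price: 3},
    {user: "Ken", sku: 7, price: 4},
    {user: "Joe", sku: 2, price: 1},
    {user: "Joe", sku: 3, price: 2}
];

var countMap = function (emit) {
    emit(this.user, {count: 1});
};
var countReduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {count: cnt};
};

var counts = runMapReduce(docs, countMap, countReduce);
console.log(counts); // Joe has 4 documents, Josh 1, Ken 1
```

Because Joe's four values are reduced in two passes, the sketch only works when reduce returns something it can consume again.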
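In MongoDB, the map function calls emit to hand key/value pairs to MapReduce.
The reduce function takes two parameters: key and emits. key is the key that was passed to emit. emits is an array whose elements are the values that were passed to emit.
The value returned by reduce may be fed back into a later reduce call for the same key, so it must have the same structure as the elements of emits.
Inside the map function, the this keyword refers to the document currently being mapped.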
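The requirement that reduce return the same structure it receives can be checked with a few lines of plain JavaScript. This is a minimal sketch: the emitted array and the names badReduce, partialA and partialB are made up for illustration.

```javascript
// A reduce whose return value has the same shape as the emitted values.
var reduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {count: cnt};
};

// Five values as map would emit them for one key (made-up data).
var emitted = [{count: 1}, {count: 1}, {count: 1}, {count: 1}, {count: 1}];

// Reducing everything at once...
var allAtOnce = reduce("Joe", emitted);

// ...must give the same answer as reducing two partial results.
var partialA = reduce("Joe", emitted.slice(0, 2));
var partialB = reduce("Joe", emitted.slice(2));
var reReduced = reduce("Joe", [partialA, partialB]);

// Counter-example: returning a bare number breaks the second pass,
// because a number has no count property.
var badReduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return cnt;
};
var badReReduced = badReduce("Joe", [
    badReduce("Joe", emitted.slice(0, 2)),
    badReduce("Joe", emitted.slice(2))
]);

console.log(allAtOnce, reReduced, badReReduced);
```

Here allAtOnce and reReduced both hold a count of 5, while badReReduced is NaN.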
Examples
Test data: this collection holds the products bought by three users, together with the product prices.
for (var i = 0; i < 1000; i++) {
    var rID = Math.floor(Math.random() * 10);
    var price = parseFloat((Math.random() * 10).toFixed(2));
    if (rID < 4) {
        db.test.insert({"user": "Joe", "sku": rID, "price": price});
    } else if (rID >= 4 && rID < 7) {
        db.test.insert({"user": "Josh", "sku": rID, "price": price});
    } else {
        db.test.insert({"user": "Ken", "sku": rID, "price": price});
    }
}
- How many products did each user buy? (MR with a single key)
// SQL equivalent:
//   select user, count(sku) from test
//   group by user

// MapReduce implementation
map = function () {
    emit(this.user, {count: 1});
};
reduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {"count": cnt};
};

// Store the MR result in collection mr1
db.test.mapReduce(map, reduce, {out: "mr1"})

// Inspect the result
db.mr1.find()
// {"_id":"Joe","value":{"count":416}}
// {"_id":"Josh","value":{"count":287}}
// {"_id":"Ken","value":{"count":297}}
- How many of each distinct product did each user buy? (MR with a composite key)
// SQL equivalent:
//   select user, sku, count(*) from test
//   group by user, sku

// MapReduce implementation
map = function () {
    emit({user: this.user, sku: this.sku}, {count: 1});
};
reduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {"count": cnt};
};
db.test.mapReduce(map, reduce, {out: "mr2"})

db.mr2.find()
// {"_id":{"user":"Joe","sku":0},"value":{"count":103}}
// {"_id":{"user":"Joe","sku":1},"value":{"count":106}}
// {"_id":{"user":"Joe","sku":2},"value":{"count":102}}
// {"_id":{"user":"Joe","sku":3},"value":{"count":105}}
// {"_id":{"user":"Josh","sku":4},"value":{"count":87}}
// {"_id":{"user":"Josh","sku":5},"value":{"count":107}}
// {"_id":{"user":"Josh","sku":6},"value":{"count":93}}
// {"_id":{"user":"Ken","sku":7},"value":{"count":98}}
// {"_id":{"user":"Ken","sku":8},"value":{"count":83}}
// {"_id":{"user":"Ken","sku":9},"value":{"count":116}}
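To see how a composite key groups documents, the same map and reduce can be run over an in-memory array. This is a rough sketch, not how MongoDB works internally: the emit function, the groups object and the use of JSON.stringify to turn the key object into a grouping id are all assumptions made for illustration, and the documents are made up.

```javascript
// Made-up documents; only user and sku matter for this grouping.
var docs = [
    {user: "Joe", sku: 0}, {user: "Joe", sku: 0}, {user: "Joe", sku: 1},
    {user: "Josh", sku: 4}, {user: "Josh", sku: 4}, {user: "Ken", sku: 7}
];

// Stand-in for MongoDB's emit: the key object is serialized so that
// two keys with the same user and sku land in the same group.
var groups = {};
function emit(key, value) {
    var id = JSON.stringify(key);
    (groups[id] = groups[id] || []).push(value);
}

var map = function () {
    emit({user: this.user, sku: this.sku}, {count: 1});
};
var reduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {"count": cnt};
};

// Run map with "this" bound to each document, then reduce every group.
docs.forEach(function (doc) { map.call(doc); });
var mr2 = Object.keys(groups).map(function (id) {
    var key = JSON.parse(id);
    return {_id: key, value: reduce(key, groups[id])};
});

console.log(JSON.stringify(mr2));
```

Each entry of mr2 has the same _id/value shape as the documents in the mr2 collection above, with one entry per distinct user and sku pair.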
- How many products did each user buy, and what was the total amount? (Reduce result with multiple fields)
// SQL equivalent:
//   select user, count(sku), sum(price) from test
//   group by user

// MapReduce implementation
map = function () {
    emit(this.user, {amount: this.price, count: 1});
};
reduce = function (key, values) {
    var res = {amount: 0, count: 0};
    values.forEach(function (val) {
        res.amount += val.amount;
        res.count += val.count;
    });
    return res;
};
db.test.mapReduce(map, reduce, {out: "mr3"})

db.mr3.find()
// {"_id":"Joe","value":{"amount":2053.8899999999994,"count":395}}
// {"_id":"Josh","value":{"amount":1409.2600000000002,"count":292}}
// {"_id":"Ken","value":{"amount":1547.7700000000002,"count":313}}
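- The amount returned in the previous example (quantity and total amount per user) needs to be rounded to two decimal places, and the average product price is also required. (Using finalize to process the reduce result set)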
// SQL equivalent:
//   select user, cast(sum(price) as decimal(10,2)) as amount, count(sku) as [count],
//          cast((sum(price)/count(sku)) as decimal(10,2)) as avgPrice
//   from test
//   group by user

// MapReduce implementation
map = function () {
    emit(this.user, {amount: this.price, count: 1, avgPrice: 0});
};
reduce = function (key, values) {
    var res = {amount: 0, count: 0, avgPrice: 0};
    values.forEach(function (val) {
        res.amount += val.amount;
        res.count += val.count;
    });
    return res;
};
// toFixed returns a string, so amount and avgPrice end up stored as strings
finalizeFun = function (key, reduceResult) {
    reduceResult.amount = (reduceResult.amount).toFixed(2);
    reduceResult.avgPrice = (reduceResult.amount / reduceResult.count).toFixed(2);
    return reduceResult;
};
db.test.mapReduce(map, reduce, {out: "mr4", finalize: finalizeFun})

db.mr4.find()
// {"_id":"Joe","value":{"amount":"2053.89","count":395,"avgPrice":"5.20"}}
// {"_id":"Josh","value":{"amount":"1409.26","count":292,"avgPrice":"4.83"}}
// {"_id":"Ken","value":{"amount":"1547.77","count":313,"avgPrice":"4.94"}}
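The finalize step can be tried on its own in plain JavaScript, using Joe's reduce result from the mr3 output as input. finalizeFun is copied from the example; finalizeNumeric is a hypothetical variant, not part of the original article, that keeps the fields numeric in case strings are not wanted.

```javascript
// Same finalize function as in the example: toFixed returns strings.
var finalizeFun = function (key, reduceResult) {
    reduceResult.amount = (reduceResult.amount).toFixed(2);
    reduceResult.avgPrice = (reduceResult.amount / reduceResult.count).toFixed(2);
    return reduceResult;
};

// Hypothetical variant (an assumption, not from the article): round to
// two decimals with Math.round so the stored values stay numbers.
var finalizeNumeric = function (key, reduceResult) {
    var avg = reduceResult.amount / reduceResult.count;
    reduceResult.amount = Math.round(reduceResult.amount * 100) / 100;
    reduceResult.avgPrice = Math.round(avg * 100) / 100;
    return reduceResult;
};

// Joe's reduce result as shown in the mr3 output.
var asStrings = finalizeFun("Joe", {amount: 2053.8899999999994, count: 395, avgPrice: 0});
var asNumbers = finalizeNumeric("Joe", {amount: 2053.8899999999994, count: 395, avgPrice: 0});

console.log(asStrings);  // amount "2053.89", avgPrice "5.20" (both strings)
console.log(asNumbers);  // amount 2053.89, avgPrice 5.2 (both numbers)
```

The string form keeps the trailing zero in "5.20", which matches the mr4 output above. The numeric form drops it but can still be sorted and compared as a number.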
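- Count how many items each user bought among SKUs with a unit price greater than 6. (MR on a filtered subset of the data)
This one is simple: add a filter query to the mapReduce call from the first example (products purchased per user). Everything else stays the same.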
db.test.mapReduce(map, reduce, {query: {price: {"$gt": 6}}, out: "mr5"})
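The effect of the query option can be pictured as a filter applied before map ever runs. The plain JavaScript sketch below assumes exactly that. The query function stands in for {price: {"$gt": 6}}; the emit function, the groups object and the sample documents are made up for illustration.

```javascript
// Made-up documents; note that a price of exactly 6 does not satisfy $gt 6.
var docs = [
    {user: "Joe", sku: 0, price: 6.5},
    {user: "Joe", sku: 1, price: 2.0},
    {user: "Josh", sku: 4, price: 7.25},
    {user: "Ken", sku: 7, price: 6},
    {user: "Ken", sku: 8, price: 9.9},
    {user: "Joe", sku: 2, price: 8}
];

// Stand-in for the query option {price: {"$gt": 6}}.
var query = function (doc) { return doc.price > 6; };

// Stand-in for MongoDB's emit: collect values per key.
var groups = {};
function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
}

var map = function () {
    emit(this.user, {count: 1});
};
var reduce = function (key, values) {
    var cnt = 0;
    values.forEach(function (val) { cnt += val.count; });
    return {"count": cnt};
};

// Only documents that pass the filter are mapped.
docs.filter(query).forEach(function (doc) { map.call(doc); });

var mr5 = {};
Object.keys(groups).forEach(function (key) {
    mr5[key] = reduce(key, groups[key]);
});

console.log(mr5); // Joe: 2, Josh: 1, Ken: 1
```

The map and reduce functions are unchanged from the first example. Only the set of documents they see is smaller.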
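Summary
The MR tool in MongoDB is very powerful, and the examples in this article cover only the basics. Its real strength shows when it is combined with sharding, so that multiple servers process the data set in parallel.
If time permits, I hope to write up and share more about MongoDB and SQL Server.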
Original link: https://www.cnblogs.com/Joe-T/p/4264910.html