Intersection of two map RDDs in Scala

sauutmhj · published 2021-05-27 · in Spark

For example, I have two RDDs:

firstMapRDD - (0-14, List(0, 4, 19, 19079, 42697, 4444748))

secondMapRDD - (0-14, List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94))

I want to find the intersection. I tried `var interResult = firstMapRDD.intersection(secondMapRDD)`, but it shows no result in the output file.

I also tried a key-based cogroup, `mapRDD.cogroup(secondMapRDD).filter(x => ...)`, but I don't know how to find the intersection between the two values. Is it `x._1.intersect(x._2)`? Can someone help me with the syntax?

Even this throws a compile-time error: `mapRDD.cogroup(secondMapRDD).filter(x => x._1.intersect(x._2))`

var mapRDD = sc.parallelize(map.toList)
var secondMapRDD = sc.parallelize(secondMap.toList)
var interResult = mapRDD.intersection(secondMapRDD)
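[Editor's note] The empty result above is expected behavior, not a bug in the data. `RDD.intersection` keeps an element only when the entire element, here the whole `(key, value)` pair, occurs in both RDDs; it never looks inside the value lists. A minimal plain-Scala sketch of the same comparison (no Spark needed; the sample pairs are made up for illustration):

```scala
// RDD.intersection compares whole elements for equality. With (key, List)
// pairs, two pairs match only if their lists are IDENTICAL, so differing
// lists under the same key produce an empty intersection.
object IntersectionSemantics {
  def main(args: Array[String]): Unit = {
    val first  = Seq(("0-14", List(0, 4, 19)))
    val second = Seq(("0-14", List(1, 2, 3, 4, 19)))

    // Whole-pair equality, same as RDD.intersection:
    println(first.intersect(second))   // List() -- the lists differ

    // Only an identical pair on both sides would survive:
    println(first.intersect(Seq(("0-14", List(0, 4, 19)))))
  }
}
```

So to intersect the values per key, the pairing has to happen first (via `join` or `cogroup`), as the answer below does.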

This might be because the values are `ArrayBuffer[List[...]]`, and intersection doesn't work on those. Is there any hack to get around it?

I also tried this:

var interResult = mapRDD.cogroup(secondMapRDD)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l.toList.intersect(r.toList)) }

Still getting an empty list!
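[Editor's note] The cogroup attempt likely fails for a subtle reason: after `cogroup`, `l` and `r` are `Iterable[List[Int]]`, so `l.toList.intersect(r.toList)` compares the inner lists as whole values instead of intersecting their elements. Flattening first fixes that. A plain-Scala sketch with made-up data mimicking the cogroup output:

```scala
// After cogroup, each side is an Iterable of the value-lists grouped under
// that key. Intersecting the Iterables compares whole inner lists; flattening
// first intersects the actual elements.
object CogroupFix {
  def main(args: Array[String]): Unit = {
    val l: Iterable[List[Int]] = Iterable(List(0, 4, 19))
    val r: Iterable[List[Int]] = Iterable(List(1, 4, 19, 20))

    // Compares List(0,4,19) against List(1,4,19,20) as single values:
    println(l.toList.intersect(r.toList))                  // List()

    // Flatten to the elements, then intersect:
    println(l.flatten.toList.intersect(r.flatten.toList))  // List(4, 19)
  }
}
```

Applied to the RDD code above, the `map` step would become `(k, l.flatten.toList.intersect(r.flatten.toList))` (untested against the asker's exact data).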


2vuwiymt1#

Since you are looking for an intersect on the values, you need to join the two RDDs, get all the matching values by key, and then intersect the values.

Sample code:

val firstMap = Map(1 -> List(1, 2, 3, 4, 5))
val secondMap = Map(1 -> List(1, 2, 5))

val firstKeyRDD = sparkContext.parallelize(firstMap.toList, 2)
val secondKeyRDD = sparkContext.parallelize(secondMap.toList, 2)

val joinedRDD = firstKeyRDD.join(secondKeyRDD)
val finalResult = joinedRDD.map(tuple => {
  val matchedLists = tuple._2
  val intersectValues = matchedLists._1.intersect(matchedLists._2)
  (tuple._1, intersectValues)
})

finalResult.foreach(println)

The output will be:

(1,List(1, 2, 5))
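[Editor's note] One property of this approach worth knowing: `join` is an inner join, so keys that exist in only one RDD are silently dropped before the intersect ever runs. The join-then-intersect step can be mimicked with plain Scala maps (no Spark; key `2` is a hypothetical extra entry added to show the drop):

```scala
// Inner-join semantics on plain Maps: keep a key only if it is present on
// both sides, then intersect the two value lists for that key.
object JoinIntersect {
  def main(args: Array[String]): Unit = {
    val firstMap  = Map(1 -> List(1, 2, 3, 4, 5), 2 -> List(9))
    val secondMap = Map(1 -> List(1, 2, 5))

    val joined = for {
      (k, v1) <- firstMap.toList
      v2      <- secondMap.get(k)   // keys missing from secondMap are dropped
    } yield (k, v1.intersect(v2))

    println(joined)   // key 2 has no partner, so only key 1 survives
  }
}
```

If you need to keep unmatched keys (e.g. with an empty intersection as their value), `cogroup` or `leftOuterJoin` would be the Spark-side alternatives.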
