I have the following sample data:
UserId,ProductId,Category,Action
1,111,Electronics,Browse
2,112,Fashion,Click
3,113,Kids,AddtoCart
4,114,Food,Purchase
5,115,Books,Logout
6,114,Food,Click
7,113,Kids,AddtoCart
8,115,Books,Purchase
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
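I need to generate the list of users who are interested in either the "Fashion" or the "Electronics" category, but not in both. A user counts as interested if they performed any of these actions: Click, AddtoCart, or Purchase. Using Spark/Scala code, I have gotten as far as the following: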
val rrd1 = sc.textFile("/user/harshit.kacker/datametica_logs.csv")
val rrd2 = rrd1.map( x=> {
  val c = x.split(",")
  (c(0).toInt , x)})
val rrd3 = rrd1.filter(x=> x.split(",")(2) == "Fashion" || x.split(",")(2) == "Electronics")
val rrd4 = rrd3.filter(x=> x.split(",")(3)== "Click" || x.split(",")(3)=="Purchase" || x.split(",")(3)=="AddtoCart")
rrd4.collect.foreach(println)
2,112,Fashion,Click
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
4,111,Electronics,Click
19,112,Fashion,Click
9,112,Fashion,Purchase
2,112,Fashion,Click
2,111,Electronics,Click
1,112,Fashion,Purchase
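Now I have to work on the remaining part, "generate the list of users who are interested in either the Fashion or the Electronics category but not in both", and get the desired output: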
10,Fashion
3,Fashion
4,Electronics
19,Fashion
1,Fashion
This means that user IDs appearing in both Fashion and Electronics should be eliminated. How can I achieve this?
1 Answer
First, parse the input text file into tuples:
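A minimal sketch of this parsing step, assuming it is pasted into spark-shell (or run with Spark on the classpath). A few rows from the question stand in for `sc.textFile(...)`. The names `lines` and `parsed` and the header check are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("parse-logs").getOrCreate()
val sc = spark.sparkContext

// Stand-in for sc.textFile(...): the header plus a few rows from the question.
val lines = sc.parallelize(Seq(
  "UserId,ProductId,Category,Action",
  "1,111,Electronics,Browse",
  "2,112,Fashion,Click",
  "3,113,Kids,AddtoCart",
  "10,112,Fashion,Purchase"
))

// Skip the header (assumed to be present in the file), then keep
// (userId, category, action); productId is not needed for this task.
val parsed = lines
  .filter(line => !line.startsWith("UserId"))
  .map(_.split(","))
  .map(c => (c(0).toInt, c(2), c(3)))
```

Working with tuples means later steps can match on fields by position instead of calling `split` again in every filter.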
Filter the RDD by the interest criteria:
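One way this filter might look, assuming an RDD of `(userId, category, action)` tuples. The sample tuples come from the question's data; `categories`, `actions` and `interested` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("filter-interest").getOrCreate()
val sc = spark.sparkContext

// (userId, category, action) tuples taken from the question's sample data.
val parsed = sc.parallelize(Seq(
  (1, "Electronics", "Browse"),
  (2, "Fashion", "Click"),
  (3, "Kids", "AddtoCart"),
  (4, "Food", "Purchase"),
  (9, "Electronics", "Click"),
  (10, "Fashion", "Purchase")
))

val categories = Set("Fashion", "Electronics")
val actions = Set("Click", "AddtoCart", "Purchase")

// Keep rows whose category and action both signal interest,
// then drop the action since only (userId, category) is needed afterwards.
val interested = parsed
  .filter { case (_, category, action) => categories(category) && actions(action) }
  .map { case (userId, category, _) => (userId, category) }
```

User 1 is dropped here because Browse is not one of the three interest actions.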
Apply separate filters for Fashion and Electronics:
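A sketch of the split, assuming the filtered data is an RDD of `(userId, category)` pairs. The pairs below correspond to the ten rows the question prints from `rrd4`; the names `fashion` and `electronics` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("split-categories").getOrCreate()
val sc = spark.sparkContext

// (userId, category) pairs for the rows printed in the question.
val interested = sc.parallelize(Seq(
  (2, "Fashion"), (9, "Electronics"), (10, "Fashion"), (3, "Fashion"),
  (4, "Electronics"), (19, "Fashion"), (9, "Fashion"), (2, "Fashion"),
  (2, "Electronics"), (1, "Fashion")
))

// One RDD per category, so the two sets of users can be compared.
val fashion = interested.filter { case (_, category) => category == "Fashion" }
val electronics = interested.filter { case (_, category) => category == "Electronics" }
```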
Find the user IDs common to Fashion and Electronics:
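The common IDs can be found with `RDD.intersection`, as in the sketch below. It assumes two pair RDDs keyed by user ID, built here from the rows in the question; `commonIds` is an illustrative name.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("common-ids").getOrCreate()
val sc = spark.sparkContext

// Per-category (userId, category) pairs for the rows printed in the question.
val fashion = sc.parallelize(Seq(
  (2, "Fashion"), (10, "Fashion"), (3, "Fashion"), (19, "Fashion"),
  (9, "Fashion"), (2, "Fashion"), (1, "Fashion")
))
val electronics = sc.parallelize(Seq(
  (9, "Electronics"), (4, "Electronics"), (2, "Electronics")
))

// intersection() returns each shared user ID once, even if it occurs repeatedly.
val commonIds = fashion.keys.intersection(electronics.keys)
```

For this data the common IDs are 2 and 9, the two users the question wants eliminated.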
Union the Fashion and Electronics rows and filter out the IDs common to both:
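A possible version of this last step, sketched with `subtractByKey` so the common IDs never have to be collected to the driver. That choice is an assumption; a plain `filter` against a collected `Set` would also work for small data. All names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("either-not-both").getOrCreate()
val sc = spark.sparkContext

// Per-category (userId, category) pairs for the rows printed in the question.
val fashion = sc.parallelize(Seq(
  (2, "Fashion"), (10, "Fashion"), (3, "Fashion"), (19, "Fashion"),
  (9, "Fashion"), (2, "Fashion"), (1, "Fashion")
))
val electronics = sc.parallelize(Seq(
  (9, "Electronics"), (4, "Electronics"), (2, "Electronics")
))

// Users that appear in both categories.
val commonIds = fashion.keys.intersection(electronics.keys)

// Combine both categories, drop every row whose user ID is common to both,
// and de-duplicate users who acted more than once in the same category.
val result = fashion
  .union(electronics)
  .subtractByKey(commonIds.map(id => (id, 1)))
  .distinct()

// Print in the "userId,category" form used in the question (order is not guaranteed).
result.collect().foreach { case (id, category) => println(s"$id,$category") }
```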
Edit: using a DataFrame
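A DataFrame version could be sketched as follows. It is only one possible formulation: it counts distinct categories per user and keeps users with exactly one. The in-memory rows replace the CSV file from the question, and the column name `nCategories` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, min}

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("either-not-both-df").getOrCreate()

// Rows from the question, plus two rows that the filter should reject.
val df = spark.createDataFrame(Seq(
  (2, 112, "Fashion", "Click"), (9, 111, "Electronics", "Click"),
  (10, 112, "Fashion", "Purchase"), (3, 112, "Fashion", "Click"),
  (4, 111, "Electronics", "Click"), (19, 112, "Fashion", "Click"),
  (9, 112, "Fashion", "Purchase"), (2, 112, "Fashion", "Click"),
  (2, 111, "Electronics", "Click"), (1, 112, "Fashion", "Purchase"),
  (1, 111, "Electronics", "Browse"), (5, 115, "Books", "Logout")
)).toDF("UserId", "ProductId", "Category", "Action")

val result = df
  // Keep only rows that signal interest in one of the two categories.
  .filter(col("Category").isin("Fashion", "Electronics") &&
          col("Action").isin("Click", "AddtoCart", "Purchase"))
  .groupBy("UserId")
  // With exactly one distinct category, min() simply returns that category.
  .agg(countDistinct("Category").as("nCategories"), min("Category").as("Category"))
  .filter(col("nCategories") === 1)
  .select("UserId", "Category")
```

Calling `result.show()` displays the remaining users; the row order is not guaranteed.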