FP-Growth algorithm

ct2axkht · posted 2021-06-28 · in Hive

Below is my code for generating frequent itemsets from a Hive table:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val sparkConf = new SparkConf().setAppName("Recommender").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._
    import hiveContext.sql

    val schema = new StructType(Array(
      StructField("col1", StringType, false)
    ))

    // Each row of col1 holds one comma-separated transaction.
    val dataRow = hiveContext.sql("select col1 from hive_table limit 100000").cache()
    val dataRDD = hiveContext.createDataFrame(dataRow.rdd, schema).cache()
    dataRDD.show()

    // Split every row into an Array[String] of items for FP-Growth.
    val transactions = dataRDD.map((row: Row) => {
      val stringarray = row.getAs[String](0).split(",")
      val arr = new Array[String](stringarray.length)
      for (a <- 0 to arr.length - 1) {
        arr(a) = stringarray(a)
      }
      arr
    })

    val fpg = new FPGrowth().setMinSupport(0.1).setNumPartitions(10)
    val model = fpg.run(transactions)
    val size: Double = transactions.count()
    println("MODEL FreqItemCount " + model.freqItemsets.count())
    println("Transactions count : " + size)

But the frequent-itemset count (model.freqItemsets.count()) is always 0.
The input query returns rows like the following:

    270035_1,249134_1,929747_1
    259138_1,44072_1,326046_1
    385448_1,747230_1,74440_1,68096_1,610434_1,215589_3,999507_1,74439_1,36260_1,925018_1,588394_1,986622_1,64585_1,942893_1,5421_1,37041_1,52500_1,4925_1,553613 415353_1,600036_1,75955_1
    693780_1,31379_1
    465624_1,28993_1,1899_2,823631_1
    667863_1,95623_3,345830_8,168966_1
    837337_1,95586_1,350341_1,67379_1,837347_1,20556_1,17567_1,77713_1,361216_1,39535_1,525748_1,646241_1,346425_1,219266_1,77717_1,179382_3,702935_1
    249882_1,28977_1
    78025_1,113415_1,136718_1,640967_1,787444_1
    193307_1,266303_1,220199_2,459193_1,352411_1,371579_1,45906_1,505334_1,9816_1,12627_1,135294_1,28182_1,132470_1
    526260_1,305646_1,65438_1

But when I run the code with the hard-coded input below, I get the expected frequent itemsets:

    val transactions = sc.parallelize(Seq(
      Array("Tuna", "Banana", "Strawberry"),
      Array("Melon", "Milk", "Bread", "Strawberry"),
      Array("Melon", "Kiwi", "Bread"),
      Array("Bread", "Banana", "Strawberry"),
      Array("Milk", "Tuna", "Tomato"),
      Array("Pepper", "Melon", "Tomato"),
      Array("Milk", "Strawberry", "Kiwi"),
      Array("Kiwi", "Banana", "Tuna"),
      Array("Pepper", "Melon")
    ))

Can you tell me what I am doing wrong? I am using Spark 1.6.2 and Scala 2.10.


svujldwt1#

The root of the problem appears to be the support threshold. setMinSupport(0.1) is a relative threshold: with the roughly 100,000 transactions returned by your query, an itemset must occur in at least 10,000 of them to be reported, which is very unlikely for real transaction data with this many distinct items. The hard-coded example works because with only 9 transactions a support of 0.1 is already satisfied by a single occurrence. Try lowering the threshold gradually until frequent itemsets start to appear.
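As a minimal sketch, the change would look roughly like this; the value 0.001 is only an illustrative starting point, not a recommended setting, and should be tuned for your data:

    // Lower the relative support threshold; 0.001 over ~100,000 transactions
    // means an itemset only needs to appear in about 100 of them.
    // (Illustrative value only - adjust for your data.)
    val fpg = new FPGrowth().setMinSupport(0.001).setNumPartitions(10)
    val model = fpg.run(transactions)

    // Print a sample of whatever was found before tuning further.
    model.freqItemsets.take(20).foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
    }

Inspecting the printed frequencies tells you how common the most frequent itemsets actually are, which makes it easier to pick a sensible threshold instead of guessing.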
