ML algorithm

sqyvllje · posted 2021-07-13 in Spark
Follow (0) | Answers (0) | Views (287)

I am new to both Scala and Spark ML. I am trying to build a string-matching algorithm based on a PySpark string-matching recommendation. Based on that, here is what I have been able to implement so far:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.sql._
    import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer}
    import spark.implicits._

    // Load vendor JSON files into a Dataset
    val vendorData = spark.read.option("header", "true").option("inferSchema", "true").json(path = "Data/*.json").as[vendorData]

    // Load the IMDB file into a Dataset
    val imdbData = spark.read.option("header", "true").option("inferSchema", "true").option("sep", "\t").csv(path = "Data/title.basics.tsv").as[imdbData]

    // Remove special characters
    val newVendorData = vendorData.withColumn("newtitle", functions.regexp_replace(vendorData.col("title"), "[^A-Za-z0-9_]", ""))
    val newImdbData = imdbData.withColumn("newprimaryTitle", functions.regexp_replace(imdbData.col("primaryTitle"), "[^A-Za-z0-9_]", ""))

    // Algorithm to find the match percentage
    val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
    val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
    val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
    val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")

    val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
    val model = pipeline.fit(newVendorData.select("newtitle"))

    val vendorHashed = model.transform(newVendorData.select("newtitle"))
    val imdbHashed = model.transform(newImdbData.select("newprimaryTitle"))

    model.stages.last.asInstanceOf[ml.feature.MinHashLSHModel].approxSimilarityJoin(vendorHashed, imdbHashed, .85).show()

When I run it, I get the error below. On further investigation, I found that the problem is in this line:

    val model = pipeline.fit(newVendorData.select("newtitle"))

but I cannot see what it is.

    Exception in thread "main" java.lang.IllegalArgumentException: text does not exist. Available: newtitle
        at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
        at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
        at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
        at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:109)
        at org.apache.spark.ml.Pipeline.$anonfun$transformSchema$4(Pipeline.scala:184)
        at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
        at MatchingJob$.$anonfun$main$1(MatchingJob.scala:84)
        at MatchingJob$.$anonfun$main$1$adapted(MatchingJob.scala:43)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at MatchingJob$.main(MatchingJob.scala:43)
        at MatchingJob.main(MatchingJob.scala)
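For context, the exception comes from the pipeline's schema validation: before fitting, each stage checks that its configured input column exists in the incoming DataFrame's schema, and the tokenizer above was built with `setInputCol("text")` while the DataFrame passed to `fit` only contains `newtitle`. A minimal plain-Scala sketch of that check (the `SchemaCheckSketch` object and `validate` helper are illustrative stand-ins, not Spark API):

```scala
// Sketch of the column check that Pipeline.fit performs per stage.
// No Spark needed: a schema is modeled as just its set of column names.
object SchemaCheckSketch {
  // Stand-in for UnaryTransformer.transformSchema's input-column lookup
  def validate(inputCol: String, available: Set[String]): Unit =
    if (!available.contains(inputCol))
      throw new IllegalArgumentException(
        s"$inputCol does not exist. Available: ${available.mkString(", ")}")

  def main(args: Array[String]): Unit = {
    val cols = Set("newtitle")      // what newVendorData.select("newtitle") provides
    validate("newtitle", cols)      // would pass: name matches
    try validate("text", cols)      // what setInputCol("text") asks for
    catch {
      case e: IllegalArgumentException =>
        println(e.getMessage)       // prints: text does not exist. Available: newtitle
    }
  }
}
```

Under this reading, the column names just need to line up: either point `setInputCol` at the actual column, or rename both datasets' columns to one common name before `transform`, since the fitted model reuses a single input column name for `newtitle` and `newprimaryTitle` alike.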

I am not sure what I am doing wrong.
My inputs are as follows:

    +------------------+
    |          newtitle|
    +------------------+
    |  BhaagMilkhaBhaag|
    |            Fukrey|
    | DilTohBacchaHaiJi|
    |IndiasJungleHeroes|
    |     HrudayaGeethe|
    +------------------+

    **newprimaryTitle**
    BhaagMilkhaBhaag
    Fukrey
    Carmencita
    Leclownetseschiens
    PauvrePierrot
    Unbonbock
    BlacksmithScene
    ChineseOpiumDen
    DilTohBacchaHaiJi
    IndiasJungleHeroes
    CorbettandCourtne
    EdisonKinetoscopi
    MissJerry
    LeavingtheFactory
    AkrobatischesPotp
    TheArrivalofaTrain
    ThePhotographical
    TheWatererWatered
    Autourdunecabine
    Barquesortantduport
    ItalienischerBaue
    DasboxendeKnguruh
    TheClownBarber
    TheDerby1895
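For reference, the pipeline above approximates character-3-gram Jaccard similarity: `MinHashLSH` estimates Jaccard distance, so the `.85` passed to `approxSimilarityJoin` is a distance threshold (lower means more similar), not a match percentage. A small plain-Scala sketch of the exact quantity being approximated (no Spark; the object and method names are illustrative):

```scala
// Exact character-3-gram Jaccard distance, which MinHashLSH approximates
// via hashed n-gram vectors in the pipeline above.
object JaccardSketch {
  // All character n-grams of a string, as a set (case-insensitive)
  def ngrams(s: String, n: Int): Set[String] =
    s.toLowerCase.sliding(n).toSet

  // Jaccard distance = 1 - |A ∩ B| / |A ∪ B| over 3-gram sets
  def jaccardDistance(a: String, b: String): Double = {
    val (x, y) = (ngrams(a, 3), ngrams(b, 3))
    1.0 - x.intersect(y).size.toDouble / x.union(y).size
  }

  def main(args: Array[String]): Unit = {
    println(jaccardDistance("BhaagMilkhaBhaag", "BhaagMilkhaBhaag")) // 0.0: identical
    println(jaccardDistance("Fukrey", "MissJerry"))                  // 1.0: no shared 3-grams
  }
}
```

With a 0.85 threshold, `approxSimilarityJoin` would keep the first pair (distance 0.0) and drop the second (distance 1.0).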
