How do I use VectorAssembler with Spark's Correlation util?

esyap4oy · posted 2021-05-27 in Spark

I am trying to correlate the columns of a DataFrame in Spark Scala by piping the columns of the original DataFrame into VectorAssembler and then using the Correlation util. For some reason the VectorAssembler appears to be producing empty vectors, as shown below. Here is what I have so far.

val numericalCols = Array(
  "price", "bedrooms", "bathrooms",
  "sqft_living", "sqft_lot"
)

val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
 |-- price: decimal(38,18) (nullable = true)
 |-- bedrooms: decimal(38,18) (nullable = true)
 |-- bathrooms: decimal(38,18) (nullable = true)
 |-- sqft_living: decimal(38,18) (nullable = true)
 |-- sqft_lot: decimal(38,18) (nullable = true)
...
*/

println("total record:" + data.count()) // total record:21613

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")
  .setHandleInvalid("skip")

val df = assembler.transform(data).select("features", "price")
df.printSchema()
/*
 |-- features: vector (nullable = true)
 |-- price: decimal(38,18) (nullable = true)
*/

df.show()
/*  THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/

println("df row count:" + df.count())
// df row count:21613

val Row(coeff1: Matrix) = Correlation.corr(df, "features").head // ERROR HERE

println("Pearson correlation matrix:\n" + coeff1.toString)

which ends with the following exception:

java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
    at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
    at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
    at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
    ...
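One way to confirm what is happening before the assembler runs is to count the nulls in each input column. Below is a minimal, self-contained sketch of that check; the SparkSession setup and the two-column toy DataFrame are assumptions for illustration, standing in for the housing data from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder().master("local[*]").appName("nullCheck").getOrCreate()
import spark.implicits._

// Toy frame standing in for the housing data; the None mimics a null cell.
val data = Seq(
  (Some(1.0), Some(2.0)),
  (Some(2.0), None)
).toDF("price", "bedrooms")
val numericalCols = Array("price", "bedrooms")

// Count nulls per column; any non-zero count means
// setHandleInvalid("skip") will drop those rows.
val nullCounts = data.select(
  numericalCols.map(c => count(when(col(c).isNull, c)).alias(c + "_nulls")): _*
)
nullCounts.show()
```

A non-zero count in every column of the real data would explain an entirely empty assembler output, since "skip" drops a row if any one of its features is null.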

iih3973s1#

It looks like at least one of your feature columns always contains a null value. setHandleInvalid("skip") will skip any row that has a null in any of its features. Can you fill the nulls with fillna(0) and check the result? That should fix your problem.
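A minimal, self-contained sketch of that suggestion, filling nulls with `na.fill` before assembling so no rows are dropped; the SparkSession setup and the toy DataFrame are assumptions for illustration, not part of the question:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("corr").getOrCreate()
import spark.implicits._

// Toy data standing in for the housing DataFrame; the None in
// "bedrooms" mimics the null rows that setHandleInvalid("skip") drops.
val data = Seq(
  (Some(1.0), Some(2.0)),
  (Some(2.0), None),
  (Some(3.0), Some(6.0)),
  (Some(4.0), Some(8.0))
).toDF("price", "bedrooms")
val numericalCols = Array("price", "bedrooms")

// Fill nulls with 0 so every row survives assembly.
val filled = data.na.fill(0.0, numericalCols)

val assembler = new VectorAssembler()
  .setInputCols(numericalCols)
  .setOutputCol("features")

val df = assembler.transform(filled).select("features")
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff.toString)
```

Note that filling with 0 changes the correlation estimate (a dropped row and a zero-valued row are not the same thing), so whether 0 is an appropriate fill value depends on the data; the point is that the vector column is no longer empty and `Correlation.corr` can run.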
