我试图在spark scala中关联一个Dataframe的两列,方法是将原始Dataframe的列管道化到vectorassembler中,然后使用correlation util。由于某些原因,向量汇编程序似乎正在生成空向量,如下所示。这是我目前掌握的情况。
val numericalCols = Array(
"price", "bedrooms", "bathrooms",
"sqft_living", "sqft_lot"
)
val data: DataFrame = HousingDataReader(spark)
data.printSchema()
/*
...
|-- price: decimal(38,18) (nullable = true)
|-- bedrooms: decimal(38,18) (nullable = true)
|-- bathrooms: decimal(38,18) (nullable = true)
|-- sqft_living: decimal(38,18) (nullable = true)
|-- sqft_lot: decimal(38,18) (nullable = true)
...
*/
println("total record:"+data.count()) //total record:21613
val assembler = new VectorAssembler().setInputCols(numericalCols)
.setOutputCol("features").setHandleInvalid("skip")
val df = assembler.transform(data).select("features","price")
df.printSchema()
/*
|-- features: vector (nullable = true)
|-- price: decimal(38,18) (nullable = true)
*/
df.show
/* THIS IS ODD
+--------+-----+
|features|price|
+--------+-----+
+--------+-----+
*/
println("df row count:" + df.count())
// df row count:21613
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head //ERROR HERE
println("Pearson correlation matrix:\n" + coeff1.toString)
最后出现以下异常
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:73)
at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:84)
at
1条答案
按热度按时间iih3973s1#
看起来任何一个功能列都始终包含空值。sethandleinvalid(“skip”)将跳过其中一个特性中包含null的任何行。你能用fillna(0)填充空值并检查结果吗。这一定能解决你的问题。