Suppose I have a Spark DataFrame like this:
Row(Y=a, X1=3.2, X2=4.5)
What I want is:
Row(Y=a, features=SparseVector(2, {X1: 3.2, X2: 4.5}))
w41d8nur1#
This may help. It is written in Scala, but it can be implemented in PySpark with only minor changes.
val df = spark.sql("select 'a' as Y, 3.2 as X1, 4.5 as X2")
df.show(false)
df.printSchema()
/**
  * +---+---+---+
  * |Y  |X1 |X2 |
  * +---+---+---+
  * |a  |3.2|4.5|
  * +---+---+---+
  *
  * root
  *  |-- Y: string (nullable = false)
  *  |-- X1: decimal(2,1) (nullable = false)
  *  |-- X2: decimal(2,1) (nullable = false)
  */

import org.apache.spark.ml.feature.VectorAssembler
val features = new VectorAssembler()
  .setInputCols(Array("X1", "X2"))
  .setOutputCol("features")
  .transform(df)
features.show(false)
features.printSchema()
/**
  * +---+---+---+---------+
  * |Y  |X1 |X2 |features |
  * +---+---+---+---------+
  * |a  |3.2|4.5|[3.2,4.5]|
  * +---+---+---+---------+
  *
  * root
  *  |-- Y: string (nullable = false)
  *  |-- X1: decimal(2,1) (nullable = false)
  *  |-- X2: decimal(2,1) (nullable = false)
  *  |-- features: vector (nullable = true)
  */
VectorAssembler creates a single vector column from the input columns.
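The question asks for a SparseVector, while the output in the answer shows a dense vector ([3.2,4.5]). A minimal sketch of one way to get the sparse representation follows, assuming a local SparkSession and the same single-row input. The names `assembled`, `toSparse`, and `sparseDf` are hypothetical and chosen only for illustration. Note that a sparse vector is keyed by integer positions (0 for X1, 1 for X2 here), not by column names, so the exact shape written in the question is not available.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{SparseVector, Vector}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for this sketch. In spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("sparse-features-sketch")
  .getOrCreate()

// Same single-row input as in the answer.
val df = spark.sql("select 'a' as Y, 3.2 as X1, 4.5 as X2")

// Assemble X1 and X2 into one vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("X1", "X2"))
  .setOutputCol("features")
  .transform(df)

// Hypothetical helper: re-encode each assembled vector in sparse form.
val toSparse = udf[Vector, Vector](_.toSparse)

// Keep only the label and the sparse feature column.
val sparseDf = assembled
  .withColumn("features", toSparse(col("features")))
  .select("Y", "features")

sparseDf.show(false)
```

With this, the features column should print in Spark's sparse notation (size, [indices], [values]) rather than as a dense list. For two non-zero values the dense form is the more compact one, so the conversion is mainly useful when a downstream step specifically expects a sparse vector.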