Scala (not PySpark): mapping linear regression coefficients to feature names (categorical and continuous)

Posted by o2gm4chl on 2021-07-14 in Spark

I have a DataFrame in Scala:

df.show
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| id|group|  normalized_amount|query_id|                 y|      y1|group1|groupIndexed| groupEncoded|
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
|  1|    B|   0.22874172014806|       1| 0.317739988492575|       0|     B|         1.0|(2,[1],[1.0])|
|  2|    A|  -1.42432215217563|       2| -1.32008967486074|       0|     C|         0.0|(2,[0],[1.0])|
|  3|    B|  -2.03644548423379|       3| -1.65740392834359|       0|     B|         1.0|(2,[1],[1.0])|
|  4|    B|  0.425753803902096|       4|-0.127591370989296|       0|     C|         0.0|(2,[0],[1.0])|
|  5|    A|  0.521050829955076|       5| 0.824285664580579|       1|     A|         2.0|    (2,[],[])|
|  6|    A|-0.0416682439998418|       6| 0.321350404322885|       1|     C|         0.0|(2,[0],[1.0])|
|  7|    A|   -1.2787327462978|       7| -0.88099379032367|       0|     A|         2.0|    (2,[],[])|
|  8|    A|  0.431780409975322|       8| 0.575249966796747|       1|     C|         0.0|(2,[0],[1.0])|

I am running a linear regression of y on group1 (a categorical variable with 3 categories) and normalized_amount (a continuous variable), as follows:

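import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
// y is the response; without setLabelCol the estimator looks for a column named "label"
val lr = new LinearRegression().setLabelCol("y").setFeaturesCol("features")
val lrModel = lr.fit(dfFeatures)
val lrPrediction = lrModel.transform(dfFeatures)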

I can access the coefficients and standard errors as follows:

lrModel.intercept
lrModel.coefficients //model coefficient estimates (not intercept)
lrModel.summary.coefficientStandardErrors //standard error of intercept and coefficients, not sure in which order

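My questions are:
1. How can I work out which feature corresponds to which coefficient estimate (for the categorical variable, I need the coefficient for each category)? The same goes for the standard errors.
2. How do I choose which category is "omitted" as the reference category?
3. How do I run a linear regression without an intercept?

I have seen answers to similar questions, but they are all in PySpark rather than Scala, and I only use Scala.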

scala apache-spark Encoding regression categorical-data

Source: https://stackoverflow.com/questions/67015724/scala-not-pyspark-map-linear-regression-coefficients-to-feature-names-categor

1 answer

mfpqipee1#

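Given a DataFrame that is your transformed df (including the prediction column) and the fitted model, you can read the attributes of the VectorAssembler output column. This code comes from Databricks; I adapted it slightly to work on a fitted regression model rather than a Pipeline. Note that you can choose whether or not the intercept is estimated: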

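import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.DataFrame

// Set fitIntercept to false on the estimator, before fitting, to regress without an intercept
val lrToFit : LinearRegression = ???
lrToFit.setFitIntercept(false)

// With this dataframe as your transformed df that includes the prediction
val df: DataFrame = ???
val lr : LinearRegressionModel = ???
val schema = df.schema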

// Using the schema, the attributes of the Vector Assembler(features) can be extracted
val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol)).attributes.get.map(_.name.get)
val featureNames: Array[String] = if (lr.getFitIntercept) {
  Array("(Intercept)") ++ features
} else {
  features
}

val coefficients = lr.coefficients.toArray
// The intercept goes first so the values line up with featureNames
val coeffs = if (lr.getFitIntercept) {
  Array(lr.intercept) ++ coefficients
} else {
  coefficients
}

featureNames.zip(coeffs).foreach { case (feature, coeff) =>
  println(s"$feature\t$coeff")
}
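
The feature names extracted above can also label the standard errors from the question. The Spark documentation for coefficientStandardErrors states that the values follow the coefficient order and that, when fitIntercept is true, the last element is the intercept; it also notes that the values are only available with the "normal" solver. Below is a minimal sketch of that pairing in plain Scala. The helper name `label` and the sample inputs are hypothetical; in practice the names would come from `features` and the errors from `lrModel.summary.coefficientStandardErrors`.

```scala
object StdErrorLabels {
  // Hypothetical helper: pair each standard error with a name.
  // Assumes `stdErrors` follows the order Spark documents for
  // coefficientStandardErrors: one entry per feature, then the
  // intercept LAST when fitIntercept is true.
  def label(featureNames: Seq[String],
            stdErrors: Seq[Double],
            fitIntercept: Boolean): Seq[(String, Double)] = {
    val names = if (fitIntercept) featureNames :+ "(Intercept)" else featureNames
    require(names.length == stdErrors.length,
      s"expected ${names.length} standard errors, got ${stdErrors.length}")
    names.zip(stdErrors)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder names and made-up numbers, purely for illustration.
    val names = Seq("groupEncoded_C", "groupEncoded_B", "normalized_amount")
    val errors = Seq(0.11, 0.12, 0.05, 0.09)
    label(names, errors, fitIntercept = true).foreach { case (name, se) =>
      println(s"$name\t$se")
    }
  }
}
```

Note that the intercept position differs from the coefficient listing above, where "(Intercept)" is placed first, so the two arrays should not be zipped against the same name array.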

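This approach is also useful when you load a pretrained model, because in that case you may not know the order of the features in the VectorAssembler transformation. I believe you need to choose the reference category manually.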
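
One possible way to choose the reference category by hand is sketched below. By default, StringIndexer orders labels by descending frequency and OneHotEncoder drops the last index (dropLast = true), so the least frequent category becomes the reference. Building a StringIndexerModel from an explicit label array and putting the desired reference last changes that. This sketch assumes Spark 3.x, where OneHotEncoder is an estimator with setInputCols/setOutputCols; the helper `indexerWithReference` and the toy data are hypothetical.

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexerModel}
import org.apache.spark.sql.SparkSession

object ReferenceCategorySketch {
  // Hypothetical helper: index `inputCol` with an explicit label order.
  // With OneHotEncoder's default dropLast = true, the LAST label is
  // encoded as the all-zeros vector, so it acts as the reference category.
  def indexerWithReference(inputCol: String,
                           outputCol: String,
                           others: Seq[String],
                           reference: String): StringIndexerModel =
    new StringIndexerModel((others :+ reference).toArray)
      .setInputCol(inputCol)
      .setOutputCol(outputCol)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reference-category-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the question's group1 column
    val df = Seq("B", "C", "A", "C").toDF("group1")

    // Make "C" the reference category: A -> 0.0, B -> 1.0, C -> 2.0
    val indexed = indexerWithReference("group1", "groupIndexed", Seq("A", "B"), "C")
      .transform(df)

    val encoded = new OneHotEncoder()
      .setInputCols(Array("groupIndexed"))
      .setOutputCols(Array("groupEncoded"))
      .fit(indexed)
      .transform(indexed)

    encoded.show(false)
    spark.stop()
  }
}
```

The coefficients for the remaining categories are then read relative to the chosen reference, and the attribute-based name lookup shown earlier still applies to the assembled features column.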

Posted 2021-07-14
