ChiSqSelector selecting the wrong features?

2izufjch posted on 2021-07-14 in Java

I copied and pasted this example from the documentation into my Spark 2.3.0 shell.

    import org.apache.spark.ml.feature.ChiSqSelector
    import org.apache.spark.ml.linalg.Vectors

    val data = Seq(
      (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )

    val df = spark.createDataset(data).toDF("id", "features", "clicked")

    val selector = new ChiSqSelector()
      .setNumTopFeatures(1)
      .setFeaturesCol("features")
      .setLabelCol("clicked")
      .setOutputCol("selectedFeatures")

    val selectorModel = selector.fit(df)
    val result = selectorModel.transform(df)
    result.show

    +---+------------------+-------+----------------+
    | id|          features|clicked|selectedFeatures|
    +---+------------------+-------+----------------+
    |  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
    |  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
    |  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
    +---+------------------+-------+----------------+

    selectorModel.selectedFeatures
    res2: Array[Int] = Array(2)

`ChiSqSelector` mistakenly picks feature 2 instead of feature 3 (based on the documentation and common sense, feature 3 should be the right one).

lkaoscv7 (answer 1)

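Chi-squared feature selection operates on categorical data. As the Spark documentation puts it: "ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features."

Therefore features 2 and 3 are equally good: each takes a distinct value in every row, so as categorical variables the test cannot tell them apart (although it is worth stressing that both features, even when used as continuous variables, can be used to derive a trivial perfect classifier).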

    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.stat.Statistics

    // Run the chi-squared test for every feature, then keep the results for features 2 and 3.
    Statistics.chiSqTest(sc.parallelize(data.map {
      case (_, v, l) => LabeledPoint(l, OldVectors.fromML(v))
    })).slice(2, 4)

    Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
    Array(Chi squared test summary:
    method: pearson
    degrees of freedom = 2
    statistic = 3.0
    pValue = 0.22313016014843035
    No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
    method: pearson
    degrees of freedom = 2
    statistic = 3.0000000000000004
    pValue = 0.22313016014843035
    No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..)

The test results are consistent with other tools. For example, in R (which is used as the reference for the selector tests):

    y <- as.factor(c("1.0", "0.0", "0.0"))
    x2 <- as.factor(c("18.0", "12.0", "15.0"))
    x3 <- as.factor(c("1.0", "0.0", "0.1"))

    chisq.test(table(x2, y))

        Pearson's Chi-squared test

    data:  table(x2, y)
    X-squared = 3, df = 2, p-value = 0.2231

    Warning message:
    In chisq.test(table(x2, y)) : Chi-squared approximation may be incorrect

    chisq.test(table(x3, y))

        Pearson's Chi-squared test

    data:  table(x3, y)
    X-squared = 3, df = 2, p-value = 0.2231

    Warning message:
    In chisq.test(table(x3, y)) : Chi-squared approximation may be incorrect
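
The statistic of 3 with 2 degrees of freedom that both Spark and R report can also be reproduced by hand. The following is a minimal sketch in plain Scala (no Spark needed) that treats every distinct feature value as its own category and computes the Pearson statistic from observed and expected counts. The names `ChiSqByHand`, `chiSq` and `pValueDf2` are made up for this illustration and are not part of any library; the p-value shortcut is only valid for 2 degrees of freedom.

```scala
object ChiSqByHand {
  /** Pearson chi-squared statistic and degrees of freedom for one feature
    * against the label, treating every distinct value as its own category. */
  def chiSq(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
    val n = feature.size.toDouble
    // Marginal counts per feature value (rows) and per label value (columns).
    val rowTotals = feature.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val colTotals = label.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    // Observed counts per (feature value, label value) cell.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }

    // Sum (observed - expected)^2 / expected over every cell of the table.
    val cells = for {
      (f, rowTotal) <- rowTotals.toSeq
      (l, colTotal) <- colTotals.toSeq
    } yield {
      val expected = rowTotal * colTotal / n
      val diff = observed.getOrElse((f, l), 0.0) - expected
      diff * diff / expected
    }
    (cells.sum, (rowTotals.size - 1) * (colTotals.size - 1))
  }

  // For 2 degrees of freedom the chi-squared upper tail is exp(-x / 2);
  // this shortcut does not hold for other degrees of freedom.
  def pValueDf2(statistic: Double): Double = math.exp(-statistic / 2)

  // Label and features 2 and 3 from the question's data.
  val clicked  = Seq(1.0, 0.0, 0.0)
  val feature2 = Seq(18.0, 12.0, 15.0)
  val feature3 = Seq(1.0, 0.0, 0.1)

  def main(args: Array[String]): Unit = {
    for ((name, xs) <- Seq("feature 2" -> feature2, "feature 3" -> feature3)) {
      val (stat, df) = chiSq(xs, clicked)
      println(f"$name: statistic = $stat%.4f, df = $df, pValue = ${pValueDf2(stat)}%.4f")
    }
  }
}
```

Because both features have three distinct values, one per row, their contingency tables against `clicked` have the same shape and the same counts, which is why the two statistics coincide up to floating-point rounding.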

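Since the selector simply sorts by p-value and `sortBy` is stable, it is first come, first served. If you switch the order of the features, the other one will be selected.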
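
The tie-breaking effect can be illustrated without Spark. This is a rough sketch of the ranking step only, not Spark's actual implementation: `TieBreakSketch` and `pickTop` are hypothetical names, and the p-value of 0.39 used for the first two features is a placeholder, since the output above only shows features 2 and 3.

```scala
object TieBreakSketch {
  // Hypothetical helper: rank (feature name, p-value) pairs by p-value only
  // and keep the first k. Scala's sortBy is stable, so ties keep input order.
  def pickTop(features: Seq[(String, Double)], k: Int): Seq[String] =
    features.sortBy { case (_, pValue) => pValue }.take(k).map(_._1)

  // 0.2231 is the p-value reported for features 2 and 3;
  // 0.39 for f0 and f1 is a placeholder, not taken from the original output.
  val original = Seq("f0" -> 0.39, "f1" -> 0.39, "f2" -> 0.2231, "f3" -> 0.2231)
  // The same features with the last two columns swapped.
  val swapped  = Seq("f0" -> 0.39, "f1" -> 0.39, "f3" -> 0.2231, "f2" -> 0.2231)

  def main(args: Array[String]): Unit = {
    println(pickTop(original, 1)) // List(f2)
    println(pickTop(swapped, 1))  // List(f3)
  }
}
```

With tied p-values the winner is decided purely by column position, so if feature 3 is the one you want, placing it before feature 2 in the vector (or selecting more than one feature) is a possible workaround.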
