ChiSqSelector picks the wrong feature?

Posted by 2izufjch on 2021-07-14 in Java


I copy-pasted this example from the documentation into my Spark 2.3.0 shell.

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = spark.createDataset(data).toDF("id", "features", "clicked")
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
val selectorModel = selector.fit(df)
val result = selectorModel.transform(df)
result.show
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+
selectorModel.selectedFeatures
res2: Array[Int] = Array(2)
`ChiSqSelector` wrongly picks `feature 2` instead of `feature 3` (going by the documentation and common sense, feature 3 should be the correct one).

scala apache-spark apache-spark-ml feature-selection chi-squared

Source: https://stackoverflow.com/questions/54775423/chisqselector-picks-the-wrong-feature

1 answer


lkaoscv71#

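Chi-squared feature selection operates on categorical data. As the documentation puts it: ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features.
So the two features are equally good: treated as categorical, features 2 and 3 each take three distinct values across the three rows, so the test cannot tell them apart (although we should emphasize that either feature, even when used as a continuous variable, can be used to derive a trivial perfect classifier).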

import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
Statistics.chiSqTest(sc.parallelize(data.map { 
  case (_, v, l) => LabeledPoint(l, OldVectors.fromML(v)) 
})).slice(2, 4)

Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0000000000000004
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..)

The test results are consistent with other tools. For example, in R (which is used as the reference for the selector tests):

y <- as.factor(c("1.0", "0.0", "0.0"))
x2 <- as.factor(c("18.0", "12.0", "15.0"))
x3 <- as.factor(c("1.0", "0.0", "0.1"))
chisq.test(table(x2, y))

Pearson's Chi-squared test
data:  table(x2, y)
X-squared = 3, df = 2, p-value = 0.2231
Warning message:
In chisq.test(table(x2, y)) : Chi-squared approximation may be incorrect

chisq.test(table(x3, y))

Pearson's Chi-squared test
data:  table(x3, y)
X-squared = 3, df = 2, p-value = 0.2231
Warning message:
In chisq.test(table(x3, y)) : Chi-squared approximation may be incorrect
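
The tie can also be checked by hand. Below is a minimal sketch in plain Scala (no Spark required); the helper name `chiSq` is made up for illustration and is not part of Spark or any other library. It treats every distinct feature value as its own category, builds the contingency table against the label, and sums (observed - expected)^2 / expected over all cells.

```scala
// Hypothetical helper: Pearson chi-squared statistic and degrees of freedom
// for one feature against the label, with every distinct value as a category.
def chiSq(xs: Seq[Double], ys: Seq[Double]): (Double, Int) = {
  val n = xs.size.toDouble
  // Marginal counts per feature value (rows) and per label value (columns).
  val rowTotals = xs.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
  val colTotals = ys.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
  // Observed count for each (feature value, label) cell that actually occurs.
  val observed = xs.zip(ys).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
  val cells = for {
    (x, r) <- rowTotals.toSeq
    (y, c) <- colTotals.toSeq
  } yield {
    val expected = r * c / n
    val obs = observed.getOrElse((x, y), 0.0)
    (obs - expected) * (obs - expected) / expected
  }
  val df = (rowTotals.size - 1) * (colTotals.size - 1)
  (cells.sum, df)
}

// Same three rows as in the question.
val labels   = Seq(1.0, 0.0, 0.0)
val feature2 = Seq(18.0, 12.0, 15.0)
val feature3 = Seq(1.0, 0.0, 0.1)

val (stat2, df2) = chiSq(feature2, labels)
val (stat3, df3) = chiSq(feature3, labels)

// For 2 degrees of freedom only, the p-value has the closed form exp(-statistic / 2).
val p2 = math.exp(-stat2 / 2)
val p3 = math.exp(-stat3 / 2)
```

Both features should come out with a statistic of about 3 on 2 degrees of freedom and a p-value of about 0.2231, in line with the Spark and R outputs above.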

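Because the selector simply sorts by p-value and `sortBy` is stable, it is first come, first served. If you switch the order of the features, the other one will be selected.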
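
A small sketch of that tie-breaking in plain Scala. It only imitates the ranking step with a hypothetical `pickTop` helper and placeholder feature names (`x2`, `x3`, borrowed from the R snippet); it is not Spark's actual selector code.

```scala
// Tied p-value reported above for both features.
val tiedPValue = 0.22313016014843035

// Hypothetical stand-in for the ranking step: sort by p-value, keep the first.
// Scala's sortBy is stable, so entries with equal keys keep their input order.
def pickTop(ranked: Seq[(String, Double)]): String =
  ranked.sortBy { case (_, p) => p }.head._1

val originalOrder = Seq(("x2", tiedPValue), ("x3", tiedPValue))
val swappedOrder  = Seq(("x3", tiedPValue), ("x2", tiedPValue))

val pickedOriginal = pickTop(originalOrder) // whichever feature comes first wins
val pickedSwapped  = pickTop(swappedOrder)
```

With the original order `x2` is picked, and with the order swapped `x3` is picked, which is the behaviour described above.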
