How do I compute the Gini index for a PySpark classification model using Spark ML?

pokxtpni · published 2023-10-15 in Spark
Follow (0) | Answers (2) | Views (133)

I am trying to compute the Gini index for a classification model built with GBTClassifier from pyspark.ml. I can't find a metric that gives the ROC AUC score the way sklearn's roc_auc_score does in Python.
Below is the code I have so far, running on Databricks; I'm currently using a dataset that ships with Databricks.

%fs ls databricks-datasets/adult/adult.data

from pyspark.sql.functions import *
from pyspark.ml.classification import  RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

dataset = spark.table("adult")
# splitting the train and test data frames
splits = dataset.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]

def churn_predictions(train_df,
                     target_col, 
#                      algorithm, 
#                      model_parameters = conf['model_parameters']
                    ):
  """
  #Function attributes
  dataframe        - training df
  target           - target variable in the model
  Algorithm        - Algorithm used 
  model_parameters - model parameters used to fine tune the model
  """

  # one hot encoding and assembling
  encoding_var = [i[0] for i in train_df.dtypes if (i[1]=='string') & (i[0]!=target_col)]
  num_var = [i[0] for i in train_df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!=target_col)]

  string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
  onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
  label_indexes = StringIndexer(inputCol = target_col, outputCol = 'label', handleInvalid = 'keep')
  assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
  gbt = GBTClassifier(featuresCol = 'features', labelCol = 'label',
                     maxDepth = 5, 
                     maxBins  = 45,
                     maxIter  = 20)

  pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, gbt])
  model = pipe.fit(train_df)

  return model  

gbt_model = churn_predictions(train_df = train_df,
                     target_col = 'income')

#### prediction in test sample ####
gbt_predictions = gbt_model.transform(test_df)
# display(gbt_predictions)
gbt_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")

accuracy = gbt_evaluator.evaluate(gbt_predictions) * 100
print("Accuracy on test data = %g" % accuracy)

gini_train = 2 * metrics.roc_auc_score(Y, pred_prob) - 1

As you can see in the last line of code, there is clearly no metric called roc_auc_score available here to compute the Gini coefficient.
I would really appreciate any help.


t0ybt7op1#

The Gini coefficient is commonly used to evaluate binary classification models.
You can compute it in PySpark as follows:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(gbt_predictions, {evaluator.metricName: "areaUnderROC"})
gini = 2 * auc - 1.0
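To see where that last line comes from, here is a tiny standalone sketch (plain Python, no Spark session needed; the toy labels and scores are made up for illustration) of the identity Gini = 2 * AUC - 1, with AUC computed from its pairwise-ranking definition:

```python
def auc_score(labels, scores):
    """AUC = fraction of (positive, negative) pairs where the positive
    example gets the higher score, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]  # one positive is mis-ranked

auc = auc_score(labels, scores)   # 8 of 9 pairs correctly ordered
gini = 2 * auc - 1                # rescales AUC from [0.5, 1] to [0, 1]
```

A perfect ranking gives AUC = 1 and Gini = 1; a random one gives AUC = 0.5 and Gini = 0, which is why Gini is just AUC rescaled.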

jk9hmnmh2#

In PySpark, getting the ROC AUC score is slightly different from sklearn.
Replace MulticlassClassificationEvaluator with BinaryClassificationEvaluator:

gbt_evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

Note the change here from predictionCol to rawPredictionCol. rawPredictionCol contains the raw prediction values, i.e. the score for the positive class, which is what the ROC AUC computation uses.
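The reason the raw score works just as well as a calibrated probability: ROC AUC depends only on how the scores rank the examples, and the probability is a monotone transform of the raw margin. A quick standalone check (plain Python, no Spark; auc_score and the sample margins below are illustrative, not Spark API):

```python
import math

def auc_score(labels, scores):
    """AUC via pairwise comparisons of positive vs. negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
raw = [2.1, 0.9, 0.7, -1.3, 1.5]                  # hypothetical raw margins
prob = [1 / (1 + math.exp(-2 * m)) for m in raw]  # monotone (sigmoid) transform

auc_raw = auc_score(labels, raw)
auc_prob = auc_score(labels, prob)  # identical: AUC is rank-based
```

Since a strictly increasing transform never changes the ordering of the scores, auc_raw and auc_prob are equal, and so are the Gini values derived from them.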
Then compute the Gini coefficient:
roc_auc = gbt_evaluator.evaluate(gbt_predictions)
gini = 2*roc_auc - 1
