Python: How can I use SHAP values to explain only some of my predictions (rather than every prediction)?

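Background

I fit a classifier on my training data. When testing my best fitted estimator, I predict the probability of one of the classes. I sort my X_test and y_test by that probability in descending order.

Question

I want to understand which features are important to the classifier (and to what extent) for only the 500 predictions with the highest overall probability, rather than for every prediction.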

y_test_probas = clf.predict_proba(X_test)[:, 1]

explainer = shap.Explainer(clf, X_train)  # <-- here I put the X which the classifier was trained on?

top_n_indices = np.argsort(y_test_probas)[-500:]

shap_values = explainer(X_test.iloc[top_n_indices])  # <-- here I put the X I want the SHAP values for?

shap.plots.bar(shap_values)

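Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
1. They use the data the classifier was trained on (I want to use the data the classifier is tested on).
2. They use the whole X rather than part of it (I want to use only part of the data).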

Minimal reproducible example

import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)

# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)

# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]

# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]

# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)

# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])

# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)

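I don't want to know how the classifier decides all predictions, only how it decides the predictions with the highest probability. Is this code suitable for answering that question? If not, what would suitable code look like?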

First of all: your code is correct, with one potential point to discuss (see below).

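How are the SHAP values calculated?

Your model is linear, so shap.Explainer() uses the linear explainer in its simplest form. This means: the SHAP value of feature j for an observation is simply
coef_j * (value - mean_j),
where

coef_j: the j-th regression coefficient
value: the value of feature j in the row being explained
mean_j: the mean of feature j over 100 randomly sampled observations from the dataset passed to shap.Explainer()

Big mystery: calculating a mean is much cheaper than subsampling, so why does the linear explainer subsample in this case? Oh well...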
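As a rough illustration of the formula above, here is a minimal NumPy sketch that does not call shap at all. The function name linear_shap_values, the coefficients, and the data are made up for this example. It also checks the additivity property: the base value plus the SHAP values recovers the model's linear score for the row (for a logistic regression, that score is on the log-odds scale).

```python
import numpy as np

def linear_shap_values(coef, x, background):
    """SHAP values of a linear model, assuming independent features.

    coef:       (n_features,) regression coefficients
    x:          (n_features,) observation to explain
    background: (n_rows, n_features) background data
    """
    mean = background.mean(axis=0)
    return coef * (x - mean)

# Made-up numbers, only to exercise the formula
rng = np.random.default_rng(0)
background = rng.normal(size=(100, 3))
coef = np.array([0.5, -1.0, 2.0])
intercept = 0.25
x = np.array([1.0, 0.0, -1.0])

phi = linear_shap_values(coef, x, background)

# Base value: the linear score evaluated at the background mean
base_value = intercept + coef @ background.mean(axis=0)

# Additivity: base value + sum of SHAP values equals the score for x
score_x = intercept + coef @ x
print(np.isclose(base_value + phi.sum(), score_x))
```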

Example

We can verify the calculation above by using only the first 100 rows of the training data, so that no subsampling takes place:

top_1 = top_n_indices[0]
explainer = shap.Explainer(clf, X_train[0:100])
explainer(X_test.iloc[[top_1]])

This gives

.values =
array([[-0.71299643, -0.90613007,  0.53047584, -1.87629253, -0.21875983,
         0.04461552]])

.base_values =
array([-0.7436993])

.data =
array([[ 3. ,  1. , 14. ,  5. ,  2. , 46.9]])

We can easily reproduce this by hand:

(X_test.iloc[[top_1]] - X_train[0:100].mean()) * clf.coef_

with output:

Pclass       Sex        Age         Sib/Spo     Par/Child   Fare
-0.712996   -0.90613    0.530476    -1.876293   -0.21876    0.044616

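Point for discussion

The explainer uses 100 randomly drawn rows of the training data to calculate the means. These rows can be very unrepresentative of the data you want to explain, so it might be more correct to pass the data of interest, X_test.iloc[top_n_indices, :], to shap.Explainer() as the background data instead of the training data.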
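To make the effect of the background choice concrete, here is a small NumPy sketch using the same linear formula. The arrays X_train_demo and X_top_demo and the coefficients are hypothetical stand-ins, not the Titanic data. For a linear model, switching the background only shifts every row's SHAP values by the same per-feature constant, and with the rows of interest as background the signed SHAP values average to zero over those rows.

```python
import numpy as np

rng = np.random.default_rng(1)
coef = np.array([1.0, -2.0])  # made-up coefficients

# Hypothetical stand-ins: training rows and the high-probability rows
X_train_demo = rng.normal(loc=0.0, size=(200, 2))
X_top_demo = rng.normal(loc=[2.0, -1.0], size=(50, 2))

# SHAP values of the rows of interest under each background
phi_vs_train = coef * (X_top_demo - X_train_demo.mean(axis=0))
phi_vs_top = coef * (X_top_demo - X_top_demo.mean(axis=0))

# The two explanations differ by a constant shift per feature
shift = coef * (X_top_demo.mean(axis=0) - X_train_demo.mean(axis=0))
print(np.allclose(phi_vs_train - phi_vs_top, shift))

# With the rows of interest as background, signed values average to zero
print(np.allclose(phi_vs_top.mean(axis=0), 0.0))
```

Put differently, the training background describes how the selected rows differ from a typical training row, while the selected rows as background describe how each row differs from a typical selected row. Which of the two is appropriate depends on the question being asked.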
