pandas: Running RFECV in Python and understanding the output

I am running RFECV in Python with pandas. My step is 1, and I start with 174 features. My function call is as follows:

rfecv = RFECV(estimator=LogisticRegression(solver='lbfgs'), step=1,
              cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=44),
              scoring='recall', min_features_to_select=30, verbose=0)
rfecv.fit(X_train, y['tag'])

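The optimal number of features returned by rfecv is 89. I noticed that the length of cv_results_['mean_test_score'] is 145.
Shouldn't it be 174-89=85? If RFECV removes 1 feature at a time and ends up with 89 of the 174 features, I would expect 85 steps (the length of 'mean_test_score').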

Adding a dummy example:

In the example below, we start with 150 features. The minimum number of features to select is 3, and it selects 4 features. But if one feature is eliminated at a time, why is print(len(selector.cv_results_['std_test_score'])) equal to 148?

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=150, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5, min_features_to_select=3)
selector = selector.fit(X, y)
print(selector.support_)
print(selector.ranking_)
print(selector.n_features_)

print(len(selector.cv_results_['std_test_score']))


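In the first example you start with 174 features and end up with 89 as the optimal number. The length of cv_results_['mean_test_score'] is 145 because of how the cross-validation is run. RFECV does not stop once it reaches the optimal number of features. It eliminates features step by step all the way down to min_features_to_select, scores every intermediate feature count with cross-validation, and only afterwards picks the count with the best mean score.
Starting from 174 features, RFECV evaluates the model with cross-validation (10-fold stratified here) and records the mean test score. With step=1 it then drops the lowest-ranked feature, evaluates again with 173 features, and so on. The only stopping condition is min_features_to_select; there is no early stop when performance stops improving.
The length of cv_results_['mean_test_score'] is therefore the number of feature counts that were evaluated, from 174 down to 30 inclusive: 174 - 30 + 1 = 145. It does not depend on the optimal number of features (89), so it is not equal to the difference between the initial and the selected number of features.
In the second example with 150 features and min_features_to_select=3, RFECV evaluates every feature count from 150 down to 3, so cv_results_['std_test_score'] has 150 - 3 + 1 = 148 entries. min_features_to_select is only a lower bound: the 4 features that were selected are simply the count with the best cross-validated score.
I created a simple example with 10 features to demonstrate the RFECV process: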
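The counting can be written down as a small helper. This is a minimal sketch, assuming an integer step of at least 1; the function name n_evaluated_feature_counts is made up for illustration and is not part of scikit-learn. It uses the size rule given in the scikit-learn documentation, ceil((n_features - min_features_to_select) / step) + 1.

```python
import math


def n_evaluated_feature_counts(n_features, min_features_to_select=1, step=1):
    """Expected length of each array in RFECV's cv_results_.

    Hypothetical helper (not a scikit-learn function). Assumes `step` is an
    integer >= 1. RFECV scores the full feature set once, then scores again
    after every elimination round until min_features_to_select is reached.
    """
    n_eliminations = math.ceil((n_features - min_features_to_select) / step)
    return n_eliminations + 1


# The two cases from the question
print(n_evaluated_feature_counts(174, min_features_to_select=30))  # 145
print(n_evaluated_feature_counts(150, min_features_to_select=3))   # 148
```

Note that the optimal number of features (89 or 4) does not appear anywhere in the calculation.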

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Generate synthetic data with 10 features and 100 samples
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Define the estimator and RFECV parameters
estimator = LogisticRegression(solver='lbfgs')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)
step_size = 1
min_features_to_select = 3

# Create the RFECV object and fit it to the data
rfecv = RFECV(estimator=estimator, step=step_size, cv=cv, scoring='accuracy', 
              min_features_to_select=min_features_to_select, verbose=0)
rfecv.fit(X, y)

# Get the optimal number of features selected
optimal_num_features = rfecv.n_features_

# Get the mean test scores during the feature selection process
mean_test_scores = rfecv.cv_results_['mean_test_score']

# Print the results
print("Optimal number of features selected:", optimal_num_features)
print("Number of steps in RFECV:", len(mean_test_scores))

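With 10 features, step=1 and min_features_to_select=3, RFECV evaluates every feature count from 10 down to 3, so the reported number of steps is 10 - 3 + 1 = 8. The optimal number of features is whichever of those counts has the best mean cross-validated accuracy, and it is always at least 3.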
