Stacking classifiers (sklearn and keras models) with StackingCVClassifier

wtzytmuj · asked on 2023-01-02

I'm fairly new to the mlxtend and Keras packages, so please bear with me. I have been trying to use StackingCVClassifier to combine the predictions of several models, namely Random Forest, Logistic Regression, and a neural network model. I am trying to stack these classifiers so that each operates on a different feature subset. See the code below.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow import keras
from keras import layers
from keras.constraints import maxnorm
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation,  Flatten, Input
from mlxtend.classifier import StackingCVClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# defining neural network model
def create_model ():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=10, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Flatten())
    optimizer= keras.optimizers.RMSprop(lr=0.001)
    model.add(Dense(units = 1, activation = 'sigmoid'))  # Compile model
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer, metrics=[keras.metrics.AUC(), 'accuracy'])
    return model

# using KerasClassifier on the neural network model
NN_clf=KerasClassifier(build_fn=create_model, epochs=5, batch_size= 5)
NN_clf._estimator_type = "classifier"

# stacking of classifiers that operate on different feature subsets
pipeline1 = make_pipeline(ColumnSelector(cols=(np.arange(0, 5, 1))), LogisticRegression())
pipeline2 = make_pipeline(ColumnSelector(cols=(np.arange(5, 10, 1))), RandomForestClassifier())
pipeline3 = make_pipeline(ColumnSelector(cols=(np.arange(10, 20, 1))), NN_clf)

# final stacking
clf = StackingCVClassifier(classifiers=[pipeline1, pipeline2, pipeline3], meta_classifier=MLPClassifier())
clf.fit(X_train, y_train)

print("Stacking model score: %.3f" % clf.score(X_val, y_val))

However, I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-ef342536824f> in <module>
     42 # final stacking
     43 clf = StackingCVClassifier(classifiers=[pipeline1, pipeline2, pipeline3], meta_classifier=MLPClassifier())
---> 44 clf.fit(X_train, y_train)
     45 
     46 print("Stacking model score: %.3f" % clf.score(X_val, y_val))

~\anaconda3\lib\site-packages\mlxtend\classifier\stacking_cv_classification.py in fit(self, X, y, groups, sample_weight)
    282                 meta_features = prediction
    283             else:
--> 284                 meta_features = np.column_stack((meta_features, prediction))
    285 
    286         if self.store_train_meta_features:

~\anaconda3\lib\site-packages\numpy\core\overrides.py in column_stack(*args, **kwargs)

~\anaconda3\lib\site-packages\numpy\lib\shape_base.py in column_stack(tup)
    654             arr = array(arr, copy=False, subok=True, ndmin=2).T
    655         arrays.append(arr)
--> 656     return _nx.concatenate(arrays, 1)
    657 
    658 

~\anaconda3\lib\site-packages\numpy\core\overrides.py in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)

Any help is appreciated. Thanks!

ohfgkhjo (Answer 1)

The error occurs because you are combining predictions from traditional ML models with predictions from a DL model.
The ML models produce predictions of shape (80, 1), while the DL model produces predictions of shape (80, 1, 1), so there is a mismatch when all the predictions are stacked together.
A common workaround is to remove the extra dimension from the DL model's predictions, so that they become (80, 1) instead of (80, 1, 1).
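The shape mismatch can be reproduced with plain NumPy, independent of mlxtend (the shapes below assume the 80-sample training split from the question):

```python
import numpy as np

# sklearn-style predictions are 2-D, while the Keras model's predictions
# carry an extra trailing axis; column_stack cannot mix the two directly.
ml_pred = np.zeros((80, 1))      # e.g. RandomForest / LogisticRegression
dl_pred = np.zeros((80, 1, 1))   # e.g. the Keras model

try:
    np.column_stack((ml_pred, dl_pred))
except ValueError as err:
    print(err)  # "all the input arrays must have same number of dimensions ..."

# Squeezing the extra axis restores a common 2-D shape.
stacked = np.column_stack((ml_pred, dl_pred.squeeze(axis=1)))
print(stacked.shape)  # (80, 2)
```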
To do that, open the .py file located at anaconda3\lib\site-packages\mlxtend\classifier\stacking_cv_classification.py,
and at lines 280 and 356, outside the if block, add the following:

prediction = prediction.squeeze(axis=1) if len(prediction.shape)>2 else prediction

so that it looks like this:

...
...
...
if not self.use_probas:
    prediction = prediction[:, np.newaxis]
elif self.drop_proba_col == "last":
    prediction = prediction[:, :-1]
elif self.drop_proba_col == "first":
    prediction = prediction[:, 1:]
prediction = prediction.squeeze(axis=1) if len(prediction.shape)>2 else prediction

if meta_features is None:
    meta_features = prediction
else:
    meta_features = np.column_stack((meta_features, prediction))
...
...
...

for model in self.clfs_:
    if not self.use_probas:
        prediction = model.predict(X)[:, np.newaxis]
    else:
        if self.drop_proba_col == "last":
            prediction = model.predict_proba(X)[:, :-1]
        elif self.drop_proba_col == "first":
            prediction = model.predict_proba(X)[:, 1:]
        else:
            prediction = model.predict_proba(X)
    prediction = prediction.squeeze(axis=1) if len(prediction.shape)>2 else prediction
    per_model_preds.append(prediction)
...
...
...

ekqde3dh (Answer 2)

Prakash's answer makes some very good points.
If you want to run the program without too many changes, you can roll your own scikit-learn BaseEstimator/ClassifierMixin object, or wrap the model in the recommended KerasClassifier object.
That is, you can roll your own estimator like this:

class MyKerasModel(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        model = keras.Sequential()
        model.add(layers.Input(shape=(X.shape[1],)))
        model.add(layers.Dense(10, input_dim=10, activation='relu'))
        model.add(layers.Dropout(0.2))
        model.add(layers.Flatten())
        model.add(layers.Dense(units = 1, activation = 'sigmoid'))
        optimizer= keras.optimizers.RMSprop(learning_rate=0.001)
        model.compile(loss='binary_crossentropy',
                      optimizer=optimizer, metrics=[keras.metrics.AUC(), 'accuracy'])
        model.fit(X, y)
        self.model = model
        return self
    def predict(self, X):
        return (self.model.predict(X) > 0.5).flatten()
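One detail worth noting in `predict`: Keras's `model.predict` returns an `(n_samples, 1)` column of sigmoid outputs, so thresholding and flattening is what turns it into the 1-D label array scikit-learn expects. A NumPy-only sketch of that post-processing (the probability values are made up for illustration):

```python
import numpy as np

# Simulated Keras output: one sigmoid probability per sample, shape (5, 1).
probs = np.array([[0.1], [0.9], [0.4], [0.6], [0.8]])

# Threshold at 0.5 and flatten to the 1-D integer label array sklearn expects.
labels = (probs > 0.5).flatten().astype(int)
print(labels)        # [0 1 0 1 1]
print(labels.shape)  # (5,)
```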

Putting all the pieces together, you can stack the predictions:

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from mlxtend.classifier import StackingCVClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

class MyKerasModel(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        model = keras.Sequential()
        model.add(layers.Input(shape=(X.shape[1],)))
        model.add(layers.Dense(10, input_dim=10, activation='relu'))
        model.add(layers.Dropout(0.2))
        model.add(layers.Flatten())
        model.add(layers.Dense(units = 1, activation = 'sigmoid'))
        optimizer= keras.optimizers.RMSprop(learning_rate=0.001)
        model.compile(loss='binary_crossentropy',
                      optimizer=optimizer, metrics=[keras.metrics.AUC(), 'accuracy'])
        model.fit(X, y)
        self.model = model
        return self
    def predict(self, X):
        return (self.model.predict(X) > 0.5).flatten()

clf = StackingCVClassifier(
    classifiers=[RandomForestClassifier(), LogisticRegression(), MyKerasModel()],
    meta_classifier=MLPClassifier(),
).fit(X_train, y_train)
print("Stacking model score: %.3f" % clf.score(X_val, y_val))

Output:

2/2 [==============================] - 0s 11ms/step - loss: 0.8580 - auc: 0.5050 - accuracy: 0.5500
2/2 [==============================] - 0s 1ms/step
2/2 [==============================] - 0s 4ms/step - loss: 0.6955 - auc_1: 0.5777 - accuracy: 0.5750
2/2 [==============================] - 0s 1ms/step
3/3 [==============================] - 0s 3ms/step - loss: 0.7655 - auc_2: 0.6037 - accuracy: 0.6125
Stacking model score: 1.000
