pandas: ColumnTransformer and the train_test_split process

6pp0gazn  posted on 2022-11-05  in Other

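I'm currently learning Scikit-learn (please go easy on me), and I'm a bit confused about how ColumnTransformer, training, and prediction fit together. I have a dataset with features such as gender, marital status, graduation status, loan amount, and income. It contains a mix of object (string) and integer values, though I'd say most are objects. As I understand it, I need to convert the objects to numeric values before training a model, which I do with ColumnTransformer. But the process of training the model confuses me. Here is my current code: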

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

df = pd.read_csv("loan_data.csv", sep=",")
df.replace("", np.nan, inplace=True)
df.dropna(inplace=True)
df = df.drop(columns=["Loan_ID"])

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

loan_categories = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
ohe = OneHotEncoder()

ct = make_column_transformer (
    (ohe, loan_categories),
    remainder="passthrough")

ct.fit_transform(X)

Then I got confused about train_test_split: should I call train_test_split before passing X to fit_transform, or only after ct has been defined?
The rest of my code looks like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
8zzbczxx1#

Hi, if you want to use fit_transform, try the following:

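from sklearn.feature_extraction.text import CountVectorizer

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

# Caution: CountVectorizer is built for collections of text documents; iterating over a
# DataFrame yields its column names, so this call may not encode the rows as intended.
cv = CountVectorizer(max_features=5000, ngram_range=(1, 128), min_df=2, analyzer='word')
x = cv.fit_transform(X).toarray()
print("X.shape = ", x.shape)
print("y.shape = ", y.shape)

# Split the transformed features, not the raw DataFrame
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)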
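
For tabular data like the loan dataset in the question, a commonly recommended ordering is to split first, then fit the ColumnTransformer on the training rows only and reuse it to transform the test rows. The following is a minimal sketch of that ordering, not the asker's actual data: the toy DataFrame and its values are hypothetical stand-ins for loan_data.csv, and `handle_unknown="ignore"` is an addition here (an assumption, so that categories seen only in the test split do not raise an error).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for loan_data.csv
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 10,
    "Married": ["Yes", "No", "No", "Yes"] * 5,
    "ApplicantIncome": list(range(3000, 5000, 100)),
    "LoanAmount": [100, 120, 150, 200] * 5,
})

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

# 1. Split the raw data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Define the transformer; one-hot encode the string columns, pass the rest through
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Gender", "Married"]),
    remainder="passthrough",
)

# 3. Learn the encoding from the training rows only, then apply it to the test rows
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)

# 4. Train and predict on the encoded arrays
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_enc, y_train)
predictions = model.predict(X_test_enc)
```

Note that the return value of `fit_transform` is kept and passed to the model; calling `ct.fit_transform(X)` without using the result leaves `X` unchanged.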

Should I do train_test_split before passing X to fit_transform? The answer is
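
One way to sidestep the ordering question is to wrap the ColumnTransformer and the model in a single scikit-learn pipeline, so that `fit` learns the encoding from whatever rows it is given and `predict` only applies it. The sketch below assumes hypothetical toy data in place of loan_data.csv and again adds `handle_unknown="ignore"` as a precaution; it also uses `cross_val_score`, which the question imports, to show the encoder being refit inside each fold.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for loan_data.csv
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate"] * 20,
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"] * 10,
    "ApplicantIncome": list(range(2000, 6000, 100)),
    "LoanAmount": [100, 120, 150, 200] * 10,
})

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Education", "Property_Area"]),
    remainder="passthrough",
)

# The pipeline fits the transformer and the model together on the training rows
pipe = make_pipeline(ct, DecisionTreeClassifier(random_state=42))
pipe.fit(X_train, y_train)

# predict() applies the already-fitted transformer to the raw test rows
score = accuracy_score(y_test, pipe.predict(X_test))

# Cross-validation clones and refits the whole pipeline on each training fold
cv_scores = cross_val_score(pipe, X, y, cv=5)
```

With this layout the raw DataFrame is passed to the pipeline directly, so there is no separate encoded copy of the data to keep in sync with the split.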
