How to merge predictions back into the original pandas test DataFrame when X_test was transformed with CountVectorizer before the split

ztmd8pv5  posted on 2023-02-02  in Other

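I want to merge the predictions for my test data back into my X_test. I can merge them with y_test, but because my X_test is a vectorized corpus, I'm not sure how to identify the index to merge on. My code is below: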

def lr_model(df):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import numpy as np  # needed for np.concatenate below
    import pandas as pd
   
    # Create corpus as a list
    corpus = df['text'].tolist()
    cv = CountVectorizer()
    X = cv.fit_transform(corpus).toarray()
    y = df.iloc[:, -1].values

    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

    # Train Logistic Regression on Training set
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = classifier.predict(X_test)

    # Merge true vs predicted labels
    true_vs_pred = pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

    return true_vs_pred

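This gives me y_test and y_pred, but I'm not sure how to add X_test to it as rows of the original DataFrame (the IDs of the X_test rows). Any guidance is much appreciated. Thanks.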

e7arh2l6  (answer #1)

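Using a pipeline can help you link the original X_test to the predictions. Because the raw df['text'] Series is split before it is vectorized, X_test keeps the index of the original DataFrame: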

def lr_model(df):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import numpy as np  # needed for np.concatenate below
    import pandas as pd
    from sklearn.pipeline import Pipeline

    # Defining X and y
    cv = CountVectorizer()
    X = df['text']
    y = df.iloc[:, -1].values

    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

    # Create a pipeline
    pipeline = Pipeline([
        ('CountVectorizer', cv),
        ('LogisticRegression', LogisticRegression(random_state = 0)),
    ])

    # Train pipeline on Training set
    pipeline.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = pipeline.predict(X_test)

    # Merge true vs predicted labels
    true_vs_pred = pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

    return true_vs_pred, X_test
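
To get the predictions next to the original rows, one option is to use the index that X_test carries over from df. The following is only a minimal sketch of that idea, not a definitive implementation. The function name predict_with_original_rows, the parameters text_col and label_col, the output column name predicted, and the toy data are all assumptions made for illustration. It also assumes that df has a unique index.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def predict_with_original_rows(df, text_col="text", label_col="label"):
    """Return the test rows of df with the predicted label added as a column.

    Hypothetical helper: the column names are assumptions, and df is
    assumed to have a unique index.
    """
    X = df[text_col]  # a pandas Series, so the split keeps df's index
    y = df[label_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Vectorizing inside the pipeline means the split happens on raw text
    pipeline = Pipeline([
        ("vectorizer", CountVectorizer()),
        ("classifier", LogisticRegression(random_state=0)),
    ])
    pipeline.fit(X_train, y_train)

    # X_test.index holds the original row labels, in the same order as
    # the predictions returned by pipeline.predict(X_test)
    result = df.loc[X_test.index].copy()
    result["predicted"] = pipeline.predict(X_test)
    return result


# Toy data invented for illustration; the index deliberately starts at 100
# to show that the original row labels survive the split
toy = pd.DataFrame(
    {
        "text": [
            "good movie", "great film", "loved this movie", "really good film",
            "great and fun", "bad movie", "terrible film", "hated this movie",
            "really bad film", "boring and dull",
        ],
        "label": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    },
    index=range(100, 110),
)

result = predict_with_original_rows(toy)
print(result)
```

The returned DataFrame contains every original column of the test rows (here text and label) plus the predicted column, so the true and predicted labels can be compared row by row without a separate merge step.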
