pandas: How to combine additional features with TF-IDF vectors

j91ykkif  posted on 2023-10-14  in Other

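I am using the method below to train a linear regressor that predicts the number of retweets a tweet gets. I use "text" as the feature and "retweet_count" as the target to predict. However, my data has several additional features, such as hasMedia, hasHashtag, followers_count and sentiment (all of them numeric). How can I combine these features with the "text" column, which has already been converted to TF-IDF vectors?
I have already tried concatenating them with pandas. But when I then feed in new test data, the features do not match. Please see my question "Attributes mismatch between training and testing data in sklearn - linear regression".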

def predict_retweets(dataset):
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)

    keyword_response = tfidf.fit_transform(dataset['text']).toarray()

    X = keyword_response
    y = dataset['retweet_count']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    regressor = LinearRegression()

    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

    print(df)

    return None

Data sample


nmpmafwu #1

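I guess this question is no longer relevant to the asker, but maybe this can help someone else.
The solution is to use numpy's hstack, which stacks the TF-IDF matrix and the numeric columns side by side into a single feature matrix.
The code is as follows: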

# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
dataset = pd.DataFrame({
    "text": [
        "are red violets are blue iF you want to buy us here is a clue our eye amp cheek palettc",
        "is it too late now to say sorry",
        "oh no please email your order to social amp we can help this is a newest offer",
        "its best applied with our buffer brush",
        "dead",
        "our amazonian clay Full coverage Foundation comes in 40 shades of creamy goodn",
        "which one are you",
        "it got winter wanderlust swap the For with holiday in paradise collection packed with",
        "tartelettes peep our ig story For",
        "it nothing quite as pretty as a clutter",
        "hot deal alert pick up our shape tape matte or hydrating foundation for only 23 usd",
        "want to learn 2 ways to get with our busy gal brows tinted brow gel then head on ove",
        "please dm us your email address amp we can help",
        "the flamingo and Dineapplethemed palettes products and tools are bright fun and e9"
    ],
    "hasMedia" : [0,1,0,0,0,0,1,0,1,1,0,0,0,0],
    "hasHashtag": [1,1,0,0,0,1,1,0,1,1,0,1,0,0],
    "followers_count": [801745] * 14,
    "sentiment": [0.0772, 0, 0.5684, 0.6696, -0.7213, 0.5093, 0, 0.7845, 0, -0.428, 0, 0.3058, 0.6124, 0.7351],
    "retweet_count": [17,94, 0,0,0,13,88,37,3,10,11,1,0,28]
})

# Create a TfidfVectorizer for text feature extraction
tfidf = TfidfVectorizer(stop_words='english', lowercase=False)

# Calculate the TF-IDF features from the 'text' column of the dataset
keyword_response = tfidf.fit_transform(dataset['text']).toarray()

# Combine TF-IDF features with the numerical features
X = np.hstack((keyword_response, dataset[["hasMedia", "hasHashtag", "followers_count", "sentiment"]]))

# Set the target variable y as 'retweet_count'
y = dataset['retweet_count']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
regressor = LinearRegression()

# Train the model on the training data
regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = regressor.predict(X_test)

# Create a DataFrame to display actual and predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Print the DataFrame with actual and predicted values
print(df)
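
On the feature mismatch with new test data mentioned in the question: one way to avoid it, as far as I can tell, is to fit the vectorizer only once and reuse it for every later batch. The sketch below is not part of the original answer. It swaps the manual hstack for scikit-learn's ColumnTransformer inside a Pipeline, so the fitted vocabulary is applied to new rows automatically. The toy rows, the new_tweets frame and the names FEATURE_COLS, NUMERIC_COLS and preprocess are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

NUMERIC_COLS = ["hasMedia", "hasHashtag", "followers_count", "sentiment"]
FEATURE_COLS = ["text"] + NUMERIC_COLS

# Hypothetical toy training data with the same columns as the sample above
train = pd.DataFrame({
    "text": [
        "hot deal alert foundation palette",
        "please email your order",
        "holiday paradise collection",
        "buffer brush foundation",
    ],
    "hasMedia": [1, 0, 1, 0],
    "hasHashtag": [1, 0, 1, 0],
    "followers_count": [801745] * 4,
    "sentiment": [0.0, 0.5684, 0.7845, 0.6696],
    "retweet_count": [94, 0, 37, 0],
})

# TF-IDF on the text column, numeric columns passed through unchanged.
# "text" is given as a plain string so the vectorizer receives a 1-D column.
preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english"), "text"),
    ("numeric", "passthrough", NUMERIC_COLS),
])

model = Pipeline([
    ("features", preprocess),
    ("regressor", LinearRegression()),
])

# fit() learns the vocabulary and the regression weights together
model.fit(train[FEATURE_COLS], train["retweet_count"])

# Hypothetical unseen data: words outside the fitted vocabulary are ignored,
# so the number of feature columns stays the same as during training
new_tweets = pd.DataFrame({
    "text": ["brand new palette launch"],
    "hasMedia": [1],
    "hasHashtag": [0],
    "followers_count": [801745],
    "sentiment": [0.4],
})
predictions = model.predict(new_tweets[FEATURE_COLS])

n_train_features = model.named_steps["features"].transform(train[FEATURE_COLS]).shape[1]
n_new_features = model.named_steps["features"].transform(new_tweets[FEATURE_COLS]).shape[1]
print(n_train_features, n_new_features)
```

The same idea works with the hstack approach if you keep the fitted tfidf object around and call tfidf.transform (not fit_transform) on the new text before stacking.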
