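My dataset contains only categorical variables. I have little trouble using one categorical column of a DataFrame to predict another, but I am finding it hard to understand how to predict with multiple columns/predictors.
Suppose my dataset looks like this: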
ItemCode  ItemDescription  Kind_of_food
273       Snicker          Chocolate
230       Lay's Chips      Chips
274       KitKat           Chocolate
123       Gummy Bears      Candy
124       Oreo             Cookies
123       Gummy Bears      Candy
273       Snicker          Chocolate
... x 1000000 rows.
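If I use only ItemDescription to predict ItemCode, I first clean the dataset (not shown below: removing stopwords, apostrophes, etc.). Then I run it through a train/test split.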
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from nltk.stem.porter import PorterStemmer

x_train, x_test, y_train, y_test = train_test_split(
    df['ItemDescription'], df['ItemCode'],
    train_size=100000, test_size=30000, stratify=df['ItemCode'])

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed(doc):
    # Stem every token produced by the default word analyzer.
    return (stemmer.stem(w) for w in analyzer(doc))

# stopWords is defined earlier and not shown in this code.
# Note: when analyzer is a callable, scikit-learn ignores ngram_range,
# stop_words and tokenizer, so only the stemming analyzer takes effect.
vect = CountVectorizer(ngram_range=(2, 2), max_features=500, stop_words=stopWords,
                       analyzer=stemmed, tokenizer=word_tokenize)
X_train = vect.fit_transform(x_train)
X_test = vect.transform(x_test)

multiNB = MultinomialNB(alpha=0.2)
multiNB.fit(X_train, y_train)
predicted = multiNB.predict(X_test)
print("accuracy of test model is: ", accuracy_score(y_test, predicted))
This code works for one predictor, but suppose I also want to include the Kind_of_food column through dummy variables:
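dummies = pd.get_dummies(df.Kind_of_food)
df = pd.concat([df, dummies], axis='columns')
# Drop only the original column now that it is one-hot encoded;
# ItemCode and the dummy columns are still needed below.
df = df.drop(['Kind_of_food'], axis=1)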
Then I create new variables,
X = df[['ItemDescription', 'Cookies', 'Chips', 'Candy', 'Chocolate']]
Y = df['ItemCode']
and change the train/test split from:
x_train, x_test, y_train, y_test = train_test_split(df['ItemDescription'], df['ItemCode'], train_size=100000, test_size=30000, stratify=df['ItemCode'])
to:
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=100000, test_size=30000, stratify=Y)
I get
Found input variables with inconsistent number of samples [3, 100000]
when I try to run the same code.
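One possible reading of the `[3, 100000]` in that message, sketched on toy data: iterating over a DataFrame yields its column labels rather than its rows, so handing a three-column frame to `CountVectorizer.fit_transform` would produce three "documents" while the target still has 100000 entries. The frame below is a hypothetical stand-in for the real data, not the actual dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy frame standing in for the real x_train.
X = pd.DataFrame({
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears"],
    "Chips": [0, 1, 0, 0],
    "Chocolate": [1, 0, 1, 0],
})

# Iterating a DataFrame walks over column labels, not rows.
labels = list(X)
print(labels)

vect = CountVectorizer()

# The vectorizer iterates its input, so it sees one "document" per column name.
counts_from_frame = vect.fit_transform(X)
print(counts_from_frame.shape[0])

# Passing the single text column gives one document per row, as intended.
counts_from_column = vect.fit_transform(X["ItemDescription"])
print(counts_from_column.shape[0])
```

If that is what happens here, the vectorizer needs to receive only the text column, and the dummy columns have to be joined to its output separately.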
The code breaks on the multiNB.fit line when trying to fit x_train (100000, 3) and y_train (100000,). How should I adjust the code so I can proceed?
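One direction that might work, offered as a sketch rather than a confirmed answer: scikit-learn's `ColumnTransformer` can send the text column to `CountVectorizer` and the categorical column to `OneHotEncoder`, then stack the results into one feature matrix for `MultinomialNB`. The toy rows below reuse the sample table above; the step names `"text"` and `"kind"` are arbitrary labels chosen for illustration, and the stemming, stopword, and `max_features` settings from the original code are left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data copied from the sample rows in the question.
df = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate"],
})

features = ColumnTransformer([
    # A plain string selects a 1-D column, which CountVectorizer expects.
    ("text", CountVectorizer(), "ItemDescription"),
    # A list selects a 2-D frame, which OneHotEncoder expects.
    ("kind", OneHotEncoder(handle_unknown="ignore"), ["Kind_of_food"]),
])

# The pipeline applies the same column handling at fit and predict time.
model = make_pipeline(features, MultinomialNB(alpha=0.2))

X = df[["ItemDescription", "Kind_of_food"]]
y = df["ItemCode"]

model.fit(X, y)
predicted = model.predict(X)

# One row of combined features per input row.
combined = model[:-1].transform(X)
print(combined.shape)
```

With this layout the `pd.get_dummies` step is no longer needed, and `train_test_split(X, y, ...)` can be applied to the raw columns before calling `model.fit` on the training part.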