Text classification with multiple predictors using Multinomial Naive Bayes

vjrehmav posted on 2021-08-25 in Java

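My dataset contains only categorical variables. I have little trouble using one categorical column of a DataFrame to predict another, but I find it hard to understand how to predict with multiple columns/predictors.
Suppose my dataset looks like this: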

ItemCode  ItemDescription  Kind_of_food
273       Snicker          Chocolate
230       Lay's Chips      Chips
274       KitKat           Chocolate
123       Gummy Bears      Candy
124       Oreo             Cookies
123       Gummy Bears      Candy
273       Snicker          Chocolate

. . . x 1000000 rows.

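If I use only ItemDescription to predict ItemCode, I first clean the dataset (not shown below: removing stopwords, apostrophes, etc.). Then I run it through train_test_split.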
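The cleaning step is not shown in the question. As a rough sketch only, assuming it amounts to lowercasing, dropping apostrophes and other punctuation, and removing stop words, it might look like the following. The `clean_description` helper and the tiny `STOP_WORDS` set are hypothetical and stand in for whatever the original code used.

```python
import re

# Hypothetical stop-word list for illustration; the question's own stopWords is not shown.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def clean_description(text):
    """Lowercase, drop apostrophes and other punctuation, and remove stop words."""
    text = text.lower()
    text = re.sub(r"['’]", "", text)          # "lay's" -> "lays"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # any other punctuation becomes a space
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_description("Lay's Chips"))       # lays chips
print(clean_description("The Gummy Bears!"))  # gummy bears
```

With pandas, a helper like this could be applied with `df['ItemDescription'].map(clean_description)` before splitting the data.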

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from nltk.stem.porter import PorterStemmer

x_train, x_test, y_train, y_test = train_test_split(df['ItemDescription'], df['ItemCode'], train_size = 100000, test_size = 30000, stratify = df['ItemCode'])

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed(doc):
    # Stem every token produced by the default analyzer
    return (stemmer.stem(w) for w in analyzer(doc))

# stopWords is defined earlier and is not shown here.
# Note: when analyzer is a callable, scikit-learn does not use ngram_range, stop_words or tokenizer.
vect = CountVectorizer(ngram_range = (2, 2), max_features = 500, stop_words = stopWords, analyzer = stemmed, tokenizer = word_tokenize)

X_train = vect.fit_transform(x_train)
X_test = vect.transform(x_test)

multiNB = MultinomialNB(alpha = 0.2)
multiNB.fit(X_train, y_train)
predicted = multiNB.predict(X_test)

print("accuracy of test model is: ", accuracy_score(y_test, predicted))

This code works with one predictor, but suppose I also want to include the Kind_of_food column via dummy variables.

dummies = pd.get_dummies(df.Kind_of_food)
df = pd.concat([df, dummies], axis = 'columns')
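# Drop the original Kind_of_food column now that it is encoded. ItemCode (the target) and Cookies are kept because the code below still uses them.
df = df.drop(['Kind_of_food'], axis = 1)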
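To make the dummy encoding concrete, here is a small sketch on a few rows taken from the sample table. The `toy` and `encoded` names are only for illustration and are not part of the question's code.

```python
import pandas as pd

# Toy rows copied from the sample table in the question.
toy = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears", "Oreo"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy", "Cookies"],
})

dummies = pd.get_dummies(toy["Kind_of_food"])        # one indicator column per category
encoded = pd.concat([toy, dummies], axis="columns")  # original columns plus indicators

print(list(dummies.columns))  # ['Candy', 'Chips', 'Chocolate', 'Cookies']
print(encoded.shape)          # (5, 7)
```

Each row has exactly one indicator set, so the four dummy columns carry the same information as the single Kind_of_food column.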

Then I create a new variable,

X = df[['ItemDescription', 'Cookies', 'Chips', 'Candy', 'Chocolate']]

and change the train/test split from:

x_train, x_test, y_train, y_test = train_test_split(df['ItemDescription'], df['ItemCode'], train_size = 100000, test_size = 30000, stratify = df['ItemCode'])

to:

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size = 100000, test_size = 30000, stratify = Y)

I get

Found input variables with inconsistent number of samples [3, 100000]

when I try to run the same code.
The code breaks on the multiNB.fit line when trying to fit x_train (100000, 3) and y_train (100000). How should I adjust the code to proceed?
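A likely cause, offered as a sketch rather than a confirmed answer: CountVectorizer expects an iterable of strings, and iterating over a DataFrame yields its column names, so fit_transform on a multi-column frame returns one row per column, which no longer matches the number of labels. One common way to combine a text column with a categorical column is scikit-learn's ColumnTransformer, which can vectorize ItemDescription and one-hot encode Kind_of_food inside a single Pipeline. The toy data, the `preprocess` and `model` names, and the parameter values below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data modelled on the sample table (illustrative only).
df = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273, 230],
    "ItemDescription": ["Snicker", "Lays Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker", "Lays Chips"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate", "Chips"],
})
X = df[["ItemDescription", "Kind_of_food"]]
Y = df["ItemCode"]

# Why the error appears: iterating a DataFrame yields column names, so the
# vectorizer sees one "document" per column instead of one per row.
bad = CountVectorizer().fit_transform(X)
print(bad.shape[0])  # 2, the number of columns, not the 8 rows

preprocess = ColumnTransformer([
    # A plain string (not a list) selects a 1-D column, which CountVectorizer needs.
    ("text", CountVectorizer(), "ItemDescription"),
    # A list selects a 2-D frame, which OneHotEncoder needs.
    ("kind", OneHotEncoder(handle_unknown="ignore"), ["Kind_of_food"]),
])
model = Pipeline([("prep", preprocess), ("nb", MultinomialNB(alpha=0.2))])

model.fit(X, Y)               # on real data, fit on the training split only
predicted = model.predict(X)
print(predicted.shape)        # (8,)
```

On the full dataset, X and Y would be split with train_test_split as in the question, and the pipeline fitted on x_train and y_train. Both transformers produce non-negative counts, which is what MultinomialNB expects.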
