Text classification with Multinomial Naive Bayes using multiple predictors

Posted by vjrehmav on 2021-08-25 in Java

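My dataset contains only categorical variables. I can predict one categorical column of the DataFrame from another without much trouble, but I find it hard to understand how to predict from multiple columns/predictors.
Suppose my dataset looks like this: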

    ItemCode  ItemDescription  Kind_of_food
    273       Snicker          Chocolate
    230       Lay's Chips      Chips
    274       KitKat           Chocolate
    123       Gummy Bears      Candy
    124       Oreo             Cookies
    123       Gummy Bears      Candy
    273       Snicker          Chocolate
    . . . x 1000000 rows.

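If I use only ItemDescription to predict ItemCode, I first clean the dataset (removing stopwords, apostrophes, etc.; not shown below). Then I run it through a train/test split.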

    import numpy as np
    import pandas as pd
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from sklearn.metrics import accuracy_score
    from nltk.stem.porter import PorterStemmer

    # Stratify on ItemCode to keep class proportions similar in both sets
    x_train, x_test, y_train, y_test = train_test_split(
        df['ItemDescription'], df['ItemCode'],
        train_size=100000, test_size=30000, stratify=df['ItemCode'])

    stemmer = PorterStemmer()
    analyzer = CountVectorizer().build_analyzer()

    def stemmed_words(doc):
        # Tokenize with the default analyzer, then stem each token
        return (stemmer.stem(w) for w in analyzer(doc))

    # stopWords is defined earlier and not shown in the code.
    # With a callable analyzer, scikit-learn ignores ngram_range,
    # stop_words and tokenizer; they are kept here as in the original call.
    vect = CountVectorizer(ngram_range=(2, 2), max_features=500, stop_words=stopWords,
                           analyzer=stemmed_words, tokenizer=word_tokenize)
    X_train = vect.fit_transform(x_train)
    X_test = vect.transform(x_test)

    multiNB = MultinomialNB(alpha=0.2)
    multiNB.fit(X_train, y_train)
    predicted = multiNB.predict(X_test)
    print("accuracy of test model is: ", accuracy_score(y_test, predicted))
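For reference, the single-predictor case can be reduced to a minimal sketch on the seven sample rows shown above. The NLTK cleaning and stemming steps and the train/test split are left out on purpose, and the names `descriptions` and `codes` are illustrative only. The point it shows is that a one-dimensional Series of strings gives one row of counts per description.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy rows copied from the sample table; the real data has about 1,000,000 rows.
descriptions = pd.Series(["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                          "Oreo", "Gummy Bears", "Snicker"])
codes = pd.Series([273, 230, 274, 123, 124, 123, 273])

# A 1-D Series of strings: each element is treated as one document.
vect = CountVectorizer()
X_counts = vect.fit_transform(descriptions)

clf = MultinomialNB(alpha=0.2)
clf.fit(X_counts, codes)

# Predict the item code for a description seen during training.
pred = clf.predict(vect.transform(["Gummy Bears"]))
print(X_counts.shape, pred)
```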

This code works for one predictor, but what if I want to add the Kind_of_food column through dummy variables?

    dummies = pd.get_dummies(df.Kind_of_food)
    df = pd.concat([df, dummies], axis='columns')
    df = df.drop(['ItemCode', 'Cookies'], axis=1)
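To make the dummy-variable step concrete, here is a small sketch of what `pd.get_dummies` returns for the sample rows. The toy DataFrame is rebuilt from the table above, and `dtype=int` is an added assumption so the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy rows copied from the sample table in the question.
df = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate"],
})

# One indicator column per distinct food kind.
dummies = pd.get_dummies(df["Kind_of_food"], dtype=int)

# Each row has exactly one indicator set.
row_sums = dummies.sum(axis=1).tolist()
print(list(dummies.columns))
print(row_sums)
```

Each row ends up with exactly one indicator set, so the four columns together carry the same information as the original Kind_of_food column.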

Then I create a new variable,

    X = df[['ItemDescription', 'Cookies', 'Chips', 'Candy', 'Chocolate']]

and change the train/test split from:

    x_train, x_test, y_train, y_test = train_test_split(
        df['ItemDescription'], df['ItemCode'],
        train_size=100000, test_size=30000, stratify=df['ItemCode'])

to:

    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=100000, test_size=30000, stratify=Y)

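I get

    ValueError: Found input variables with inconsistent numbers of samples: [3, 100000]

when I try to run the same code.
The code breaks on the multiNB.fit line when trying to fit x_train (100000, 3) and y_train (100000). How should I adjust the code to proceed?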
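One plausible reading of the error: `CountVectorizer` iterates over whatever it receives, and iterating over a DataFrame yields its column labels rather than its rows, so a three-column x_train is turned into three "documents". The sketch below reproduces that effect on the sample rows and then shows one possible way around it, using scikit-learn's `ColumnTransformer` to vectorize only ItemDescription and pass the dummy columns through unchanged. This is a swapped-in approach under stated assumptions, not the asker's code: the toy data, `dtype=int`, taking `Y` before any columns are dropped, and omitting the split (the toy data is too small to stratify) are all illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy rows copied from the sample table in the question.
df = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate"],
})

Y = df["ItemCode"]  # assumption: the target is taken before columns are dropped
dummies = pd.get_dummies(df["Kind_of_food"], dtype=int)
X = pd.concat([df[["ItemDescription"]], dummies], axis="columns")

# Reproducing the problem: a DataFrame is iterated by column label,
# so the vectorizer sees one "document" per column, not per row.
wrong = CountVectorizer().fit_transform(X)
print(wrong.shape[0], "rows from", X.shape[1], "columns")

# Possible fix: vectorize only the text column, keep the dummies as they are.
preprocess = ColumnTransformer(
    transformers=[("text", CountVectorizer(), "ItemDescription")],
    remainder="passthrough",
)
model = make_pipeline(preprocess, MultinomialNB(alpha=0.2))
model.fit(X, Y)

features = preprocess.fit_transform(X)
pred = model.predict(X)
print(features.shape[0], len(pred))
```

With this layout the pipeline receives the whole DataFrame, so `train_test_split(X, Y, ...)` can be applied before `model.fit` in the same way as in the single-predictor version.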
