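I am learning how to use machine learning with Spark MLlib in order to do sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/sentiment-analysis-dataset.zip
That dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet.
This is my current PySpark code: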
import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)
r = rdd.mapPartitions(lambda x : csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))
parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)
train_data_raw = base_words.select("base_words", "label")
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")
model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()
I run this in the pyspark interactive shell, which I start with the following command:
pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'
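(FYI: I have an HDFS cluster of 10 VMs running YARN, Spark, etc.)
The result of the last line of code is below; every row gets the same rawPrediction and probability, and therefore the same prediction: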
>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
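If I do the same with a smaller dataset that I created manually, it works. I don't know what is happening, and I have been working on this all day.
Any suggestions?
Thanks for your time!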
1 Answer
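TL;DR: Ten iterations is very low for any real application. On large, non-trivial datasets it can take thousands of iterations or more (as well as tuning of the remaining parameters) to converge.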
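To make the point about iteration counts concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: it uses plain batch gradient descent without regularization, and the synthetic data, the `fit` helper, and the learning rate are all assumptions made for illustration. The features are deliberately tiny (around 0.02 in magnitude, similar in scale to the Word2Vec features shown in the question), which makes progress per iteration small.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, b, xs, ys):
    # Mean negative log-likelihood of a one-feature logistic model.
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def fit(xs, ys, max_iter, lr=0.5):
    # Hypothetical helper: batch gradient descent that records the
    # objective value before training and after every iteration.
    w, b = 0.0, 0.0
    n = len(xs)
    history = [log_loss(w, b, xs, ys)]
    for _ in range(max_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(log_loss(w, b, xs, ys))
    return w, b, history


# Synthetic, balanced data with small-magnitude features (an assumption).
random.seed(0)
ys = [i % 2 for i in range(200)]
xs = [random.gauss(0.02 if y == 1 else -0.02, 0.01) for y in ys]

_, _, short_history = fit(xs, ys, max_iter=10)
_, _, long_history = fit(xs, ys, max_iter=2000)

print("objective after 10 iterations:  ", short_history[-1])
print("objective after 2000 iterations:", long_history[-1])
```

In this sketch the objective has barely moved from its starting value after 10 iterations, so the predicted probabilities are still close to 0.5 for every row. Recording the objective per iteration is what makes that visible.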
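The binomial LogisticRegressionModel has a summary attribute, which gives you access to a LogisticRegressionSummary object. Among other useful metrics, it contains objectiveHistory, which can be used to debug the training process: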