Python Spark IOException: (null) entry in command string error

Posted by 68bkxrlz on 2021-05-26 in Spark


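I am trying to run an Arabic sentiment classifier, but when I run the code I get the following error:

java.io.IOException: (null) entry in command string: null chmod 0644 C:\Spark\spark-3.0.1-bin-hadoop2.7\bin\lrModel.model\metadata\_temporary\0\_temporary\attempt_20201212141604_0144_m_000000_0\part-00000

I searched for solutions, but they all point to Hadoop and downloading winutils.exe. I have already created the c:\hadoop\bin folder and placed winutils.exe in it, and I have added HADOOP_HOME and %HADOOP_HOME%\bin to my environment variables. Is there any other solution for this case? Here is my code: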

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import rand
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StringIndexer
from pyspark.ml.classification import LogisticRegression

# initialize spark context and session
conf = SparkConf().setMaster("local").setAppName("TrainSentimentModel")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# read the sentiment tsv file into an rdd and split each line on tabs
lines = sc.textFile("train_Arabic_tweets_20190413.tsv").map(lambda x: x.split("\t"))

# define the schema
schema = StructType([StructField("target", StringType(), True), StructField("tweet", StringType(), True)])

# create dataframe from rdd
docs = spark.createDataFrame(lines, schema)

# split the dataset into training and testing
(train_set, test_set) = docs.orderBy(rand()).randomSplit([0.8, 0.2], seed=2000)

# define the processing pipeline (tokenize->tf->idf->label_indexer)
tokenizer = Tokenizer(inputCol="tweet", outputCol="words")
hashtf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features", minDocFreq=5)  # minDocFreq: remove sparse terms
indexedLabel = StringIndexer(inputCol="target", outputCol="label")
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, indexedLabel])

# apply the pipeline to the training and testing datasets
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)
test_df = pipelineFit.transform(test_set)

# initialize a logistic regression
lr = LogisticRegression(maxIter=100)

# train the classifier on the training dataset
lrModel = lr.fit(train_df)

# save the classifier
lrModel.save('lrModel.model')

# apply the model to testing data
predictions = lrModel.transform(test_df)

# compute the test set accuracy
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(test_set.count())
print("****************************************\n")
print("Test set accuracy " + str(accuracy) + "\n")
print("****************************************\n")

sc.stop()
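
For anyone reproducing this: the "null chmod" in the message suggests Hadoop could not resolve the path to winutils.exe when the model was saved, which usually means the process running Spark does not actually see HADOOP_HOME (for example, an IDE or terminal opened before the variable was set). Below is a minimal sketch for checking that from Python. The helper names check_hadoop_setup and apply_hadoop_home are hypothetical, not part of PySpark, and the C:\hadoop location is only taken from the description above. This is a diagnostic aid under those assumptions, not a confirmed fix.

```python
import os
from pathlib import Path


def check_hadoop_setup(env=None):
    """Return a list of problems with HADOOP_HOME / winutils.exe (hypothetical helper)."""
    env = os.environ if env is None else env
    problems = []
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # The variable may be set system-wide but not visible to this process.
        problems.append("HADOOP_HOME is not set in this process")
        return problems
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    if not winutils.is_file():
        problems.append("winutils.exe not found at " + str(winutils))
    return problems


def apply_hadoop_home(hadoop_home, env=None):
    """Set HADOOP_HOME and prepend its bin folder to PATH (hypothetical helper).

    Assumption: this has to run before the SparkContext is created, so that
    the JVM started by PySpark inherits the values.
    """
    env = os.environ if env is None else env
    bin_dir = str(Path(hadoop_home) / "bin")
    env["HADOOP_HOME"] = str(hadoop_home)
    env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")


if __name__ == "__main__":
    found = check_hadoop_setup()
    for line in found or ["HADOOP_HOME and winutils.exe look consistent"]:
        print(line)
```

If the check reports a problem, calling apply_hadoop_home(r"C:\hadoop") at the top of the script, before SparkContext(conf=conf), is one thing to try. Whether it resolves this particular error is not confirmed here.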

Here is my screenshot: