PySpark - Py4JJavaError thrown when using the RDD reduce function

mf98qq94 · posted 2024-01-06 in Spark
Follow (0) | Answers (1) | Views (192)

I have been trying to run this code, and when it reaches dist_reduced = dist_mapped.reduce(reducer) it throws a Py4JJavaError. I have tried it on 3 different machines and get the same error every time.
Java version "1.8.0_202", Python 3.12.1, PySpark 3.5.0

Code

import pandas as pd
from collections import Counter
from util import *
from pyspark import SparkConf, SparkContext

# Load the text column from the CSV.
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']
df = pd.read_csv('file.csv', encoding='latin1', names=cols)
data = df['text'].tolist()

# Local Spark context with 4 worker threads.
conf = SparkConf().setAppName('MapReduce').setMaster('local[4]')
sc = SparkContext(conf=conf)
dist = sc.parallelize(data)

def mapper(text):
    words = text.split()
    cleaned = map(clean_word, words)
    filtered = filter(word_not_in_stopwords, cleaned)
    return Counter(filtered)

def reducer(cnt1, cnt2):
    cnt1.update(cnt2)
    return cnt1

dist_mapped = dist.map(mapper)
dist_reduced = dist_mapped.reduce(reducer)
dist_reduced.most_common(5)

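Note: the from util import * line pulls in clean_word and word_not_in_stopwords, whose definitions are not shown in the question. For reference, a minimal sketch of what such helpers might look like (the names come from the question, but these implementations and the stopword list are assumptions, not the asker's actual util.py):

    # util.py -- hypothetical sketch; not the asker's actual helpers
    import string

    # Assumed stopword list; the real one is not shown in the question.
    STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'is', 'it'}

    def clean_word(word):
        # Lowercase a token and strip surrounding punctuation.
        return word.lower().strip(string.punctuation)

    def word_not_in_stopwords(word):
        # Keep only non-empty tokens that are not stopwords.
        return bool(word) and word not in STOPWORDS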

Traceback

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[10], line 2
      1 dist_mapped = dist.map(mapper)
----> 2 dist_reduced = dist_mapped.reduce(reducer)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyspark\rdd.py:1924, in RDD.reduce(self, f)
   1921         return
   1922     yield reduce(f, iterator, initial)
-> 1924 vals = self.mapPartitions(func).collect()
   1925 if vals:
   1926     return reduce(f, vals)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyspark\rdd.py:1833, in RDD.collect(self)
   1831 with SCCallSiteSync(self.context):
   1832     assert self.ctx._jvm is not None
-> 1833     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   1834 return list(_load_from_socket(sock_info, self._jrdd_deserializer))

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2) (Corsair executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    ...

System variables

  • HADOOP_HOME = C:\hadoop
  • SPARK_HOME = %USERPROFILE%\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyspark
  • Path = %HADOOP_HOME%\bin
  • Path = %SPARK_HOME%\bin

User variables

  • JDK_HOME = %JAVA_HOME%
  • JRE_HOME = %JAVA_HOME%\jre
  • Path = %JAVA_HOME%\lib
  • Path = %JAVA_HOME%\jre\lib

i34xakig1#

From the traceback it is clear that Spark cannot find python3.
Check that Python 3 is installed on the machine running the executor (shown as "Corsair executor driver" in the trace), and confirm with python3 --version or which python3 (on Windows, where python).

Make sure the directory containing the python3 executable is included in the PATH environment variable.
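A note on the Windows-specific cause: Windows Python installs normally ship python.exe but no python3.exe, so Spark's default worker command "python3" fails with CreateProcess error=2 even when Python is installed and on the PATH. A common fix is to point Spark at the interpreter explicitly through the PYSPARK_PYTHON environment variable before the SparkContext is created. A minimal sketch, set from within the script itself:

    import os
    import sys

    # Point the Spark workers (and driver) at the interpreter running this
    # script; sys.executable resolves to the actual python.exe path, so no
    # python3.exe needs to exist on the PATH.
    os.environ['PYSPARK_PYTHON'] = sys.executable
    os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName('MapReduce').setMaster('local[4]')
    sc = SparkContext(conf=conf)

Alternatively, set PYSPARK_PYTHON as a system environment variable alongside HADOOP_HOME and SPARK_HOME. Separately, note that Python 3.12 is newer than the versions PySpark 3.5.0 was tested against at release, which can cause its own compatibility issues.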
