pyspark Spark RDD.pipe FileNotFoundError: [WinError 2] The system cannot find the file specified

zf2sa74q · posted 2023-10-15 in Spark

My goal is to call an external (dotnet) process from pyspark via RDD.pipe. Since that failed, I wanted to first test piping to a simple command:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local").appName("test").getOrCreate()
  result_rdd = spark.sparkContext.parallelize(['1', '2', '', '3']).pipe(command).collect()
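
The question leaves command unspecified. For a self-contained test, a hypothetical stand-in (an illustration, not from the original post) could be a built-in Windows filter:

  # Hypothetical test command, not from the original question:
  # "findstr ." echoes every non-empty line it receives on stdin.
  command = "findstr ."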

However, I get the following error message:

  py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "C:\projectpath\.venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 686, in main
    File "C:\projectpath\.venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 676, in process
    File "C:\projectpath\.venv\lib\site-packages\pyspark\rdd.py", line 540, in func
      return f(iterator)
    File "C:\projectpath\.venv\lib\site-packages\pyspark\rdd.py", line 1117, in func
      pipe = Popen(shlex.split(command), env=env, stdin=PIPE, stdout=PIPE)
    File "C:\Users\username\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 951, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "C:\Users\username\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1420, in _execute_child
      hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
  FileNotFoundError: [WinError 2] The system cannot find the file specified
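
The last pyspark frame in the traceback shows what RDD.pipe runs internally: Popen(shlex.split(command), env=env, stdin=PIPE, stdout=PIPE). The failure can be reproduced outside Spark by making the same call with an empty env dict, which is what pyspark ends up passing when no env argument is given (see the answer below). A minimal sketch, assuming some_tool stands in for any executable normally resolved via PATH:

  import shlex
  from subprocess import PIPE, Popen

  command = "some_tool"  # hypothetical; stands in for the real pipe command

  # Same call as in pyspark's rdd.py (line 1117 in the traceback), but with
  # an explicit empty environment. On Windows this raises
  # FileNotFoundError: [WinError 2], since the child process gets no
  # PATH/SystemRoot with which to resolve the executable.
  pipe = Popen(shlex.split(command), env={}, stdin=PIPE, stdout=PIPE)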
kadbb459 · answer 1#

Update: I found a workaround that makes this work for me. I looked at the pyspark implementation of the pipe function: if no env argument is given, it passes an empty dict as the env argument to Popen, and calling Popen directly with an empty env produced the same error for me. Simply adding some entry to the dict fixed the problem:

  pipe(command, env={"1": "2"})
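
Applied to the snippet from the question, the workaround looks like this (the key/value pair is arbitrary; it only needs to make the env dict non-empty):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master("local").appName("test").getOrCreate()

  # Any non-empty env dict avoids the empty-environment Popen call that
  # fails on Windows.
  result_rdd = (
      spark.sparkContext
      .parallelize(['1', '2', '', '3'])
      .pipe(command, env={"1": "2"})
      .collect()
  )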
