I want to run an RDD job as a batch job. To do this, I created an initial Dockerfile that builds a Docker image whose run command executes the job. The job contains some Python logic that needs Spark.
For that, I created a second Dockerfile based on the initial image, adding a layer that overrides the entrypoint with a Spark entrypoint.
When I run the job, the Spark driver runs in the job pod, but the executor pods fail immediately, and the error I see is "No module named module".
The error is:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 22, 172.19.65.149, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/miniconda/envs/test/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/root/miniconda/envs/test/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/root/miniconda/envs/test/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/root/miniconda/envs/test/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
    return pickle.loads(obj)
ImportError: No module named module
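The ImportError at the bottom of the trace comes from the executor unpickling the task function. Pickle serializes module-level functions by reference (module name plus attribute name, not the function body), so the referenced module must be importable on the unpickling side, i.e. inside each executor pod. A minimal stdlib sketch of that by-reference behavior:

```python
import pickle
from math import cos  # a function defined at module level in the math module

# pickle stores module-level functions by reference: the payload embeds
# the module name and the attribute name, not the compiled function body.
payload = pickle.dumps(cos)
assert b"math" in payload
assert b"cos" in payload

# Unpickling re-imports the module and looks the name up. If the module
# is missing on the unpickling side (e.g. a Spark executor container),
# this step raises "ImportError: No module named ...".
restored = pickle.loads(payload)
assert restored is cos
```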
My SPARK_HOME points to the pyspark package:
ENV SPARK_HOME /root/miniconda/envs/test/lib/python2.7/site-packages/pyspark
My pyspark version is 2.4.4, installed via conda rather than pip.
My RDD code looks like this:
def take_time(x):
    from math import cos
    [cos(j) for j in range(100)]
    return cos(x)

def run(*args):
    import time
    start_time = time.time()
    spark_instance = SparkFactory.get_instance_adapter('test-job', num_executors=20)
    sc = spark_instance.get_context()
    rdd1 = sc.parallelize(range(10000000), 1000)
    interim = rdd1.map(lambda x: take_time(x))
    print('output =', interim.reduce(lambda x, y: x + y))
    print('took {}'.format(time.time() - start_time))
    sc.stop()

if __name__ == '__main__':
    Config.init()
    run()
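For reference, the map/reduce pipeline in run() just computes the sum of cos(x) over the input range. A plain-Python sketch of the same reduction (with a small stand-in N instead of the job's 10,000,000) that a working Spark run should reproduce:

```python
from math import cos
from functools import reduce

def take_time(x):
    # burn a little CPU, then return cos(x) -- same body as the Spark version
    [cos(j) for j in range(100)]
    return cos(x)

# rdd1.map(take_time).reduce(lambda x, y: x + y) is equivalent to:
n = 1000  # small stand-in for the 10000000 used in the actual job
result = reduce(lambda x, y: x + y, map(take_time, range(n)))
assert abs(result - sum(cos(x) for x in range(n))) < 1e-9
```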