I get the error
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/home/hadoop/project/backend/pyspark/dist/libs.zip
when I try to submit a PySpark job via spark-submit:
spark-submit \
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42.jar,jars/hadoop-aws-2.6.0.jar,jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
--py-files $HOME/project/backend/pyspark/dist/shared.zip, $HOME/project/backend/pyspark/dist/libs.zip \
$HOME/project/backend/pyspark/dist/main.py
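Worth noting for anyone hitting the same exception: spark-submit splits its arguments on whitespace, so the space after the comma in the --py-files list above likely makes libs.zip a separate positional argument (the primary application resource), and Spark then tries to load it as a JAR with a main class, which matches the exception text. A sketch of the same command with the stray space removed, paths unchanged from above:

spark-submit \
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42.jar,jars/hadoop-aws-2.6.0.jar,jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
--py-files $HOME/project/backend/pyspark/dist/shared.zip,$HOME/project/backend/pyspark/dist/libs.zip \
$HOME/project/backend/pyspark/dist/main.py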
libs.zip contains all the standard Python modules I need, and shared.zip contains my own Python modules used for UDFs.
I create the zip files with the following commands:
cd ./src && zip -x *.sh -x \*libs\* -r ../dist/shared.zip .
cd ./src/libs && zip -x "numpy/*" -x "pandas/*" -r ../../dist/libs.zip .
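Since --py-files places each zip on the executors' PYTHONPATH, packages have to sit at the root of the archive for imports like import modules.udf_networkx to resolve. To double-check what actually ended up in the archives, their contents can be listed without extracting; a quick sanity check with the standard unzip tool, run from the directory containing dist/ (assuming the layout in the paths above):

unzip -l dist/shared.zip
unzip -l dist/libs.zip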
libs.zip contains all the Python modules:
$ ls -1 libs
bin
decorator-4.4.1.dist-info
decorator.py
graphframes
graphframes-0.6.dist-info
man
networkx
networkx-2.4.dist-info
nose
nose-1.3.7.dist-info
numpy-1.17.4.dist-info
pyarrow
pyarrow-0.15.1.dist-info
share
six-1.13.0.dist-info
six.py
and shared.zip contains all the other files that should be distributed to each worker:
tingel$ ls -1 shared
config_environment.py
curl_resul.json
modules
tingel$ ls -1 shared/modules/
__init__.py
udf_networkx.py
So why are the files not distributed correctly, so that they cannot be loaded? They are imported in the main.py file that I submit, like this:
# 1) load system modules
import sys
import os
import logging
from py4j.protocol import Py4JJavaError
# 2) load custom modules
import modules.udf_networkx as my_udf
import config_environment as config
# 3) load pyspark (spark) and graphframe (graphx) modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import Row
import pyspark.sql.functions as f
from graphframes import GraphFrame
from graphframes.lib import *
AM = AggregateMessages
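As an aside, the same archives can also be attached at runtime rather than on the command line, which sidesteps any ambiguity in how spark-submit parses the --py-files list. A minimal sketch using the standard SparkContext.addPyFile API; the app name is arbitrary, the paths are assumed to match the ones above, and the custom modules would then have to be imported after these calls instead of at the top of the file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graph-job").getOrCreate()
sc = spark.sparkContext
# Each zip is shipped to the cluster and added to the Python path
# on the driver and on every executor.
sc.addPyFile("/home/hadoop/project/backend/pyspark/dist/shared.zip")
sc.addPyFile("/home/hadoop/project/backend/pyspark/dist/libs.zip")
import modules.udf_networkx as my_udf  # now resolvable from shared.zip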
Versions:
EMR: emr-5.26.0
Python: 3.6
Spark: 2.4.3