aws emr - Cannot load main class from JAR (pyspark Python modules)

hgncfbus · posted 2021-05-26 · in Spark

I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/home/hadoop/project/backend/pyspark/dist/libs.zip

when I try to submit a pyspark job via spark-submit:

spark-submit \
    --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42.jar,jars/hadoop-aws-2.6.0.jar,jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
    --py-files $HOME/project/backend/pyspark/dist/shared.zip, $HOME/project/backend/pyspark/dist/libs.zip \
    $HOME/project/backend/pyspark/dist/main.py
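One thing worth noting when reading the command above: the shell tokenizes arguments at unquoted whitespace, so a space after a comma in the `--py-files` value splits it into two separate tokens, and spark-submit treats the first stray positional argument as the application file. A small, generic illustration of the tokenization (shortened paths, not the original command):

```python
import shlex

# The shell splits at unquoted whitespace, so a comma-separated
# --py-files value containing a space is cut into two tokens.
cmd = "spark-submit --py-files dist/shared.zip, dist/libs.zip dist/main.py"
tokens = shlex.split(cmd)
print(tokens)
# -> ['spark-submit', '--py-files', 'dist/shared.zip,', 'dist/libs.zip', 'dist/main.py']
```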

The libs.zip archive contains all the standard Python modules I need, and shared.zip contains my own Python modules used for UDFs.
I create the zip files with the following commands:

cd ./src && zip -x *.sh -x \*libs\* -r ../dist/shared.zip .
cd ./src/libs && zip -x "numpy/*" -x "pandas/*" -r ../../dist/libs.zip .
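For archives passed via `--py-files`, the packages must sit at the root of the zip, because Spark adds the archive itself to `sys.path`. A quick way to verify the layout (a generic sketch; the paths in the comment are assumed to match the `dist/` layout above):

```python
import zipfile

def top_level_entries(zip_path):
    """Return the sorted set of root-level entries in a zip archive.

    Packages such as 'modules' must appear here (not nested under an
    extra directory) for 'import modules.udf_networkx' to resolve once
    the zip is placed on sys.path.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})

# Example (paths assumed):
# print(top_level_entries("dist/shared.zip"))
```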

libs.zip contains all the Python modules:

$ ls -1 libs 
bin
decorator-4.4.1.dist-info
decorator.py
graphframes
graphframes-0.6.dist-info
man
networkx
networkx-2.4.dist-info
nose
nose-1.3.7.dist-info
numpy-1.17.4.dist-info
pyarrow
pyarrow-0.15.1.dist-info
share
six-1.13.0.dist-info
six.py

and the shared.zip file contains all the other files that should be distributed to each worker:

tingel$ ls -1 shared
config_environment.py
curl_resul.json
modules

tingel$ ls -1 shared/modules/
__init__.py
udf_networkx.py

So why are the files not distributed correctly and failing to load? They are imported in the main.py file that I submit, like this:


# 1) load system modules

import sys
import os
import logging
from py4j.protocol import Py4JJavaError

# 2) load custom modules

import modules.udf_networkx as my_udf
import config_environment as config

# 3) load pyspark (spark) and graphframe (graphx) modules

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import Row

import pyspark.sql.functions as f
from graphframes import GraphFrame
from graphframes.lib import *
AM = AggregateMessages
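For context, the imports of `modules.udf_networkx` and `config_environment` above only resolve on the executors because `--py-files` places each archive on `sys.path`. The same mechanism can be reproduced locally with plain Python (a hypothetical sketch; the file contents here are stand-ins, not the original modules):

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway shared.zip mirroring the layout listed above.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "shared.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("config_environment.py", "ENV = 'local'")
    zf.writestr("modules/__init__.py", "")
    zf.writestr("modules/udf_networkx.py", "def ping():\n    return 'ok'")

# This is effectively what --py-files does on each executor:
sys.path.insert(0, zip_path)

# Now the same import style used in main.py works via zipimport.
import modules.udf_networkx as my_udf
import config_environment as config

print(my_udf.ping(), config.ENV)  # -> ok local
```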

Versions:
EMR: emr-5.26.0
Python: Python 3.6
Spark: Spark 2.4.3
