Spark job source (sort.py):
from __future__ import print_function
import sys
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonSort")\
        .getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # Enable Arrow-based columnar data transfers
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Generate a Pandas DataFrame
    pdf = pd.DataFrame(np.random.rand(100, 3))

    # Create a Spark DataFrame from a Pandas DataFrame using Arrow
    df = spark.createDataFrame(pdf)
    df.show()

    spark.stop()
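One aside on the Arrow flag: Spark 3.0 renamed the configuration key used above, so a hedged variant for newer versions (assuming Spark 3.x) would be:

# Spark 3.0+ name for the Arrow toggle; the old key still works but is deprecated
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")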
Package the dependencies into a zip archive with the following commands:
pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
ls dependencies
__pycache__ numpy pandas-1.0.5.dist-info python_dateutil-2.8.1.dist-info six-1.15.0.dist-info
bin numpy-1.18.5.dist-info pyarrow pytz six.py
dateutil pandas pyarrow-0.17.1.dist-info pytz-2020.1.dist-info
As you can see, pandas, numpy, dateutil, and pytz are all installed there.
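Since --py-files relies on zipimport, the packages must sit at the root of the archive, which is why the zip is created from inside dependencies/. A quick sanity check, as a sketch assuming dependencies.zip is in the current directory:

import zipfile

# List the top-level entries of the archive; the package directories
# (numpy, pandas, pyarrow, ...) should appear here, not a nested folder.
with zipfile.ZipFile("dependencies.zip") as zf:
    top = sorted({name.split("/")[0] for name in zf.namelist()})
    print(top)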
Submit the Python file with spark-submit:
bin/spark-submit --master local --py-files ../dependencies.zip ~/Desktop/sort.py
Spark reports an error:
Traceback (most recent call last):
File "/Users/xxx/Desktop/sort.py", line 21, in <module>
import pandas as pd
File "/Users/xxx/ttt/dependencies.zip/pandas/__init__.py", line 13
missing_dependencies.append(f"{dependency}: {e}")
It looks like Spark found pandas inside dependencies.zip but could not find numpy or the other packages.
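One plausible cause (an assumption, since the traceback above is truncated): numpy ships compiled extension modules (.so files), and Python's zipimport can only load pure-Python code, so a numpy inside dependencies.zip cannot be imported even though it is present in the archive. A minimal debug sketch to run inside the job:

import sys

# dependencies.zip should appear on sys.path when shipped via --py-files
print([p for p in sys.path if p.endswith(".zip")])

try:
    import numpy
    print("numpy loaded from:", numpy.__file__)
except ImportError as e:
    # zipimport cannot load compiled extension modules, so a zipped
    # numpy typically fails right here
    print("numpy import failed:", e)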
The question is: what is the best practice for shipping pandas and friends with a Spark job?