I was checking this SO question, but none of the solutions helped with my PySpark custom UDF ModuleNotFoundError: No module named error.
On Azure Databricks I have a repo with the following structure:
|-run_pipeline.py
|-__init__.py
|-data_science
|--__init__.py
|--text_cleaning
|---text_cleaning.py
|---__init__.py
In the run_pipeline notebook, I have the following:
import os
import sys
from pyspark.sql import SparkSession

from data_science.text_cleaning import text_cleaning

path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

spark = SparkSession.builder.master("local[*]").appName('workflow').getOrCreate()

# spark_df is created earlier in the notebook (not shown)
df = text_cleaning.basic_clean(spark_df)
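Note that in the snippet above sys.path is only extended after the import has already run, so that append cannot affect the import itself. Purely as a sketch (assuming the same repo layout), the path setup would normally come first:

import os
import sys

# Extend sys.path before importing the local package so the driver can resolve it
sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir))

from data_science.text_cleaning import text_cleaning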
In text_cleaning.py, I have a function called basic_clean that runs something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
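_remove_links itself is not shown in the question. Purely for illustration, a hypothetical helper that this UDF could wrap might look like this:

import re

def _remove_links(text):
    # Hypothetical implementation (not from the question): strip http/https URLs
    if text is None:
        return text
    return re.sub(r"https?://\S+", "", text)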
When I run df.show() in the run_pipeline notebook, I get this error message:
Exception has occurred: PythonException (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'
Shouldn't the import just work? Why is this a problem?
1 Answer
It seems the data_science module is missing on the cluster. Consider installing it on the cluster; see the following link about installing libraries onto a cluster: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries. You can run the pip list command to see which libraries are installed on the cluster, and you can also run a pip install data_science command directly in a notebook cell.
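Note that pip install data_science will only succeed if a package by that name is published to an index the cluster can reach. If data_science is the local package from the repo tree above, a common alternative (not part of this answer) is to ship it to the executors yourself with SparkContext.addPyFile; a minimal sketch, assuming the notebook's working directory is the repo root:

import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Zip the local data_science package and distribute the archive so every
# executor can import it when deserializing the UDF
shutil.make_archive("data_science", "zip", ".", "data_science")
spark.sparkContext.addPyFile("data_science.zip")

After this, re-running df.show() should no longer raise ModuleNotFoundError, since the workers can now import data_science while unpickling the UDF.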