Azure Databricks PySpark custom UDF ModuleNotFoundError: No module named

6bc51xsx  posted 2022-12-11 in Spark
Follow (0) | Answers (1) | Views (136)

I have checked this SO question, but none of the solutions helped with my PySpark custom UDF ModuleNotFoundError: No module named error.
I have the following repo structure in Azure Databricks:

|-run_pipeline.py
|-__init__.py
|-data_science
|--__init__.py
|--text_cleaning
|---text_cleaning.py
|---__init__.py

In the run_pipeline notebook, I have the following:

import os
import sys

# Make the repo root importable before importing from data_science
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

from pyspark.sql import SparkSession
from data_science.text_cleaning import text_cleaning

spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()

df = text_cleaning.basic_clean(spark_df)

In text_cleaning.py, I have a function called basic_clean that runs something like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
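As an aside (not from the original post): the usual reason a UDF like this fails only on the executors is that pickle serializes a module-level function by reference (module path plus name), not by value, so every worker process must be able to import that module itself. A minimal standard-library sketch, with `json.dumps` standing in for the user-defined `_remove_links`:

```python
import pickle

# Plain pickle stores a module-level function as a *reference*
# ("json.dumps"), not as bytecode. Whoever unpickles it must be able
# to import the same module -- exactly what the Spark workers cannot
# do with 'data_science'.
import json  # stand-in for the local data_science package

payload = pickle.dumps(json.dumps)  # serialized as the reference "json.dumps"
fn = pickle.loads(payload)          # succeeds only because json is importable here
assert fn({"a": 1}) == '{"a": 1}'
```

On a worker where the referenced module is not importable, the equivalent `pickle.loads` call raises the `ModuleNotFoundError` seen in the traceback.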

When I run df.show() in the run_pipeline notebook, I get this error:

Exception has occurred: PythonException       (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'

Shouldn't the import just work? Why is this a problem?

ivqmmu1c1#

It seems the data_science module is missing on the cluster. Consider installing it on the cluster; see the following link on installing libraries on a cluster: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
You can run the pip list command to see which libraries are installed on the cluster.
You could also run a pip install data_science command directly in a notebook cell.
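If the package only lives in the repo (i.e. it is not published to PyPI), an alternative to a cluster library is to ship it to the executors yourself. A hedged sketch, assuming the data_science folder sits in the current working directory: zip the package and pass the archive to spark.sparkContext.addPyFile(...) before any UDF is defined, so every worker can import it.

```python
import pathlib
import shutil


def build_package_zip(package_dir: str, out_dir: str) -> str:
    """Zip package_dir so its top-level name stays importable from the archive."""
    pkg = pathlib.Path(package_dir).resolve()
    return shutil.make_archive(
        str(pathlib.Path(out_dir) / pkg.name),  # e.g. /tmp/data_science.zip
        "zip",
        root_dir=pkg.parent,  # the archive root contains data_science/
        base_dir=pkg.name,
    )


# Hypothetical usage on the driver (paths are assumptions):
# archive = build_package_zip("data_science", "/tmp")
# spark.sparkContext.addPyFile(archive)  # workers can now import data_science
```

addPyFile puts the archive on every executor's sys.path, which sidesteps installing anything cluster-wide; zip archives are importable because Python supports zipimport natively.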
