NLTK lookup error for punkt when downloading on a PySpark DataFrame in Databricks

Posted by a6b3iqyw on 2023-10-15 in Spark

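I am trying to find the similarity between two text columns ('title', 'headline') of a PySpark DataFrame in Databricks by applying cosine similarity. My function is called 'cosine_sim_udf', and to be able to use it I first have to convert it to a UDF.
I get a LookupError after applying the function to the df. Does anyone know the reason, or have a solution?
This is my function for computing the cosine similarity: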

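import string

import nltk
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')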

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return float((tfidf @ tfidf.T).toarray()[0, 1])

cosine_sim_udf = udf(cosine_sim, FloatType())

df2 = df.withColumn('cosine_distance', cosine_sim_udf('title', 'headline'))  # 'title' and 'headline' are the text columns to compare

Then I get this error:

PythonException: 'LookupError: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 426.0 failed 4 times, most recent failure: Lost task 0.3 in stage 426.0 (TID 2135) (10.109.245.129 executor 1): org.apache.spark.api.python.PythonException: 'LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/databricks/python/nltk_data'
    - '/databricks/python/share/nltk_data'
    - '/databricks/python/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'

Source: https://stackoverflow.com/questions/71297688/nltk-lookup-error-for-punkt-downloading-on-pyspark-dataframe-in-databricks

2 Answers

xzv2uavs1#

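The problem is that in your example nltk.download('punkt') is executed only on the driver node, while your UDF runs on the worker nodes, where the resource is not installed.
You have the following options:

Install the required resources with a cluster init script, something like this (it installs the resource on all nodes):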

#!/bin/bash

pip install nltk
python -m nltk.downloader punkt

Or something like this (not tested, but it may work; it may also not work on autoscaling clusters):

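import nltk

# getExecutorMemoryStatus() also counts the driver, hence the -1
num_executors = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
# One element per partition, so that each task runs the download on a worker
sc.parallelize(range(num_executors), num_executors) \
  .mapPartitions(lambda p: [nltk.download('punkt')]).collect()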
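
Because the UDF body is what runs on the workers, another option (not part of the original answer) is to make the function check for the resource itself and download it on first use. The following is only a minimal sketch under that assumption: `ensure_nltk_resource` and `tokenize_on_worker` are hypothetical helper names, and the workers are assumed to have `nltk` installed and to be allowed to reach the download server.

```python
import functools


@functools.lru_cache(maxsize=None)
def ensure_nltk_resource(resource_path, package):
    """Download an NLTK package at most once per Python worker process."""
    import nltk  # imported lazily so the import happens on the worker

    try:
        # Raises LookupError if the resource is not on this node
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package, quiet=True)
    return True


def tokenize_on_worker(text):
    """Hypothetical tokenizer meant to be called from inside a UDF."""
    ensure_nltk_resource('tokenizers/punkt', 'punkt')
    import nltk

    return nltk.word_tokenize(text)
```

In the question's code, `normalize` could call `tokenize_on_worker` in place of `nltk.word_tokenize`. The cache only prevents repeated checks within one worker process, so each new worker still performs its own check, which is why this approach may cope better with autoscaling than a one-time download.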

dw1jzc5e2#

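Note: this answer was tested on the Databricks version running in 2023-08. The UI tends to change over time, so always check the documentation:
https://docs.databricks.com/en/libraries/workspace-libraries.html
Quote: "Workspace libraries serve as a local repository from which you create cluster-installed libraries. ... You must install a workspace library on a cluster before it can be used in a notebook or job."
In Databricks you can install a library on a cluster (for your own use), or install it in the workspace to make it available to all clusters in the workspace.
The library can be, for example, a Python package from PyPI (nltk), a Python file you built yourself, and so on.
To recap what the docs say, to install on a cluster do the following: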

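In your workspace, select "Compute" in the left pane.
Select the "Libraries" tab.
Click the "Install new" button.
For your case, choose "Library Source" PyPI and "Package" nltk [I don't know why, but the installation seemed to fail when I specified 'nltk==3.8.1']. Pinning a specific version is recommended.
Restart the cluster.
Try it: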

import nltk
nltk.download('punkt')
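
The check above runs only on the driver. To see whether the workers can find the resource as well, something along these lines might help. It is a sketch, not a tested recipe: `punkt_status` is a hypothetical helper, and `sc` is assumed to be the SparkContext that Databricks notebooks predefine.

```python
import socket


def punkt_status(_partition):
    """Yield (hostname, found) for the node that processes this partition."""
    import nltk  # imported on the worker, not on the driver

    try:
        nltk.data.find('tokenizers/punkt')
        found = True
    except LookupError:
        found = False
    yield (socket.gethostname(), found)


# Usage in a notebook (the partition count 8 is an arbitrary choice):
# sc.parallelize(range(8), 8).mapPartitions(punkt_status).distinct().collect()
```

Spark does not guarantee that the tasks land on every worker, so a result with all entries `True` is a good sign rather than proof.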

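To install in the workspace:

Click "Workspace".
Select the workspace folder you want to install into, for example "Shared".
Right-click, then choose Create, Library.

The library will then be available the next time the cluster restarts.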
