PySpark Word2Vec error when returning an array from findSynonymsArray

jrcvhitl · published 2021-05-29 in Spark

I am using PySpark 2.4.5 and have a trained Word2Vec model. I now need to find synonyms for the words held in one column of a DataFrame. Here is the code:

from pyspark.ml.feature import Word2Vec
from pyspark.sql.types import *
from pyspark.sql import functions as F

word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="word",
                    outputCol="words")
word2Vec_model = word2Vec.fit(df.filter(df.word.isNotNull()))

def get_wordarray(word):
    res = str(word2Vec_model.findSynonymsArray(word, 5))
    return res

get_wordarray_udf = F.udf(get_wordarray, StringType())

Approach 1: using map

df.select("words").rdd.map(lambda x: get_wordarray(x))

Approach 2: using a UDF

df.select(get_wordarray_udf(F.col("words")).alias('res'))

Input:
`words` above is a column containing single words, e.g. `applicable`, `tuned`, etc.
It works fine when I pass a single word:

res=str(word2Vec_model.findSynonymsArray("applicable", 5))

It also works fine when I loop over a list like this:

l=['word','detail']
for a in l:
    get_wordarray(a)

But I need the same result for the whole column, and with both approaches above I get the error below:

Could not serialize object: Py4JError: An error occurred while calling o517.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
    return self(*args)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
    judf = self._judf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
    self._judf_placeholder = self._create_judf()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 600, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o517.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
