I am using PySpark 2.4.5 and have a trained Word2Vec model. I now need to find synonyms for the words held in a column of a DataFrame. Here is the code:
from pyspark.ml.feature import Word2Vec

word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="word", outputCol="words")
word2Vec_model = word2Vec.fit(df.filter(df.word.isNotNull()))
from pyspark.sql.types import *
from pyspark.sql import functions as F
def get_wordarray(word):
    # look up the 5 nearest synonyms for the word and return them as a string
    res = str(word2Vec_model.findSynonymsArray(word, 5))
    return res
get_wordarray_udf = F.udf(get_wordarray, StringType())
Approach 1: using map
df.select("words").rdd.map(lambda get_wordarray(x))
Approach 2: using a UDF
df.select(get_wordarray_udf(F.col("words")).alias('res'))
Input:
In the code above, words is a column containing single words such as "applicable", "tuned", etc.
It works fine when I pass a single word:
res=str(word2Vec_model.findSynonymsArray("applicable", 5))
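(As far as I can tell, findSynonymsArray returns a list of (word, cosine similarity) tuples, which is why I wrap the result in str(); the synonyms and scores below are only made-up illustrations.)

res = word2Vec_model.findSynonymsArray("applicable", 5)
# res looks roughly like [('relevant', 0.83), ('suitable', 0.79), ...]
print(str(res))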
It also works fine when I loop over a list like this:
l = ['word', 'detail']
for a in l:
    get_wordarray(a)
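Scaled up, the driver-side version of that loop also runs for me, roughly as sketched below (this is only a sketch; it assumes the column is named "words" and is small enough to collect() to the driver):

# collect the distinct words to the driver and call the model there
words_on_driver = [row[0] for row in df.select("words").distinct().collect()]
synonyms = {w: word2Vec_model.findSynonymsArray(w, 5) for w in words_on_driver}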
But I need the same thing to work on the whole column as a DataFrame operation. With the two approaches above I get the error below:
Could not serialize object: Py4JError: An error occurred while calling o517.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling
o517.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)