pandas UDF: Method __getstate__([]) does not exist error

esbemjvw, posted on 2021-07-14 in Spark

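I'm using PySpark 2.4.1 and trying to write a simple function with a pandas UDF, as shown below. Essentially, it creates a new column and assigns a value based on whether df.x == 'a' and df.y == 't'. However, I keep getting a Method __getstate__([]) does not exist error. Below are the two ways I have tried to write the pandas UDF; I'm not sure how else it could be written:
Data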

x = pd.Series(['a', 'b', 'c'])
y = pd.Series(['t','t','t'])

df = spark.createDataFrame(pd.DataFrame({"x":x,"y":y}))
df.show()
+---+---+
|  x|  y|
+---+---+
|  a|  t|
|  b|  t|
|  c|  t|
+---+---+

Attempt 1:

from pyspark.sql.functions import col, pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

import pandas as pd

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def test_fun(x: str, y: str) -> pd.Series:
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    if x.values=='a' and y.values=='t':
        return z == 'ok'
    else:
        return z == "None"
    return pd.Series(z)
df.withColumn('test',test_fun(col("x"),col("y"))).show()

Attempt 2:

def test_func(df):
    @pandas_udf(StringType(), PandasUDFType.SCALAR)
    def test(x: str, y: str) -> pd.Series:
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        if x.values=='a' and y.values=='t':
            return z == 'ok'
        else:
            return z == "None"
        return pd.Series(z) 

    return df.withColumn('test', test(col('x'),col('y')))
test_func(df)

Both give me the same error message:

...py4j.protocol.Py4JError: An error occurred while calling t.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I'm fairly new to Spark. I have read many threads about similar problems but can't work out the right fix.
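
For reference, the row-wise logic described above can be sketched in plain pandas. Inside a scalar pandas UDF the arguments arrive as pandas Series rather than str, so a check like `if x.values == 'a' and ...` is ambiguous once there is more than one row, and `z` is never assigned in either attempt (`return z == 'ok'` is a comparison, not an assignment). The sketch below is one possible vectorized version. The name label_rows is hypothetical, and this is not confirmed to resolve the __getstate__ error, whose cause cannot be determined from the post alone.

```python
import numpy as np
import pandas as pd


def label_rows(x: pd.Series, y: pd.Series) -> pd.Series:
    """Return "ok" where x == 'a' and y == 't', otherwise the string "None"."""
    # Element-wise comparisons; & combines the two boolean Series row by row.
    mask = (x == "a") & (y == "t")
    # np.where picks "ok" or "None" per row; wrap the result back into a Series.
    return pd.Series(np.where(mask, "ok", "None"), index=x.index)


# Same sample data as in the question.
x = pd.Series(["a", "b", "c"])
y = pd.Series(["t", "t", "t"])
print(label_rows(x, y).tolist())  # ['ok', 'None', 'None']
```

Assuming a working Spark 2.4 session, a function like this could be wrapped with pandas_udf(label_rows, StringType(), PandasUDFType.SCALAR) and applied via df.withColumn('test', ...); that step is untested here. The same column could likely also be built without a UDF, using when(...).otherwise(...) from pyspark.sql.functions.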
