pyspark 分析异常:尝试将withColumn()替换为select()时出现不明确的引用错误

afdcj2ne  于 2023-01-25  发布在  Spark
关注(0)|答案(1)|浏览(164)

当我在Pyspark中多次使用withColumn()更新列的值时,收到StackOverflowException错误。
我的代码与我得到了StackOverflowException是:

df = df.withColumn("element", when(df["element"] == 1,"first").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 2,"second").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 3,"third").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 4,"fourth").otherwise(df["element"]))

Spark文档建议使用select()函数,所以我尝试了一下:

df = df.select("*", (when(df["element"] == 1,"first")).alias("element"))
df = df.select("*", (when(df["element"] == 2,"second")).alias("element"))
df = df.select("*", (when(df["element"] == 3,"third")).alias("element"))
df = df.select("*", (when(df["element"] == 4,"fourth")).alias("element"))

但是我收到了一个错误,因为列"element"没有更新,另一个同名的列被创建了。错误如下:

Py4JJavaError: An error occurred while calling o3723.apply.
: org.apache.spark.sql.AnalysisException: Reference 'element' is ambiguous, could be: element, element.;
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:229)
    at org.apache.spark.sql.Dataset.col(Dataset.scala:1282)
    at org.apache.spark.sql.Dataset.apply(Dataset.scala:1249)
    at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

我怎么能这么做?
先谢谢你!

zsohkypk

zsohkypk1#

我想你可以多次使用.when,然后再使用.otherwise .另外,你应该给新列起个不同的名字:

df = df.withColumn("element_new", when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]))

使用.select

df = df.select("*",when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]).alias("element_new"))

输出示例:

+-------+-----------+
|element|element_new|
+-------+-----------+
|      1|      first|
|      2|     second|
|      3|      third|
|      4|     fourth|
|      5|          5|
+-------+-----------+

相关问题