How do I perform arithmetic operations on a DataFrame in PySpark?

Posted by 8yparm6h on 2021-07-13 in Spark

I need to verify that the code I wrote is correct. To do so, I have to use the following formula:

(nvl(units_inflow,0)- nvl(units_inflow_can,0)-nvl(units_outflow,0)+nvl(units_outflow_can,0))*nav_value

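This formula is written in Oracle SQL, and I need to perform the same operation in PySpark. So far, wherever nvl is used in the formula above, I have used na.fill() in PySpark to replace null values with 0.
My t3 DataFrame has 5 columns, namely: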

["units_inflow","units_inflow_can","units_outflow","units_outflow_can","nav_value"]

The code I have written so far is:

t3= t3.na.fill(value=0,subset=["units_inflow","units_inflow_can","units_outflow","units_outflow_can"])
z = t3.select("units_inflow").groupby().sum().show()

y = t3.select("units_inflow_can").groupby().sum().show()

x = t3.select("units_outflow").groupby().sum().show()

w = t3.select("units_outflow_can").groupby().sum().show()

u = t3.select("nav_value").groupby().sum().collect()

print(u)

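Even after doing all this, I cannot get the expected output. I think I went wrong somewhere in converting the code. I took the sum that each column printed and did the arithmetic separately on a calculator.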

Answer #1 (0s7z1bwu)

Oracle's nvl function is equivalent to coalesce, so you only need to replace each nvl call with it:

from pyspark.sql import functions as F

t3.select(
    (
        F.coalesce(F.col("units_inflow"), F.lit(0)) -
        F.coalesce(F.col("units_inflow_can"), F.lit(0)) -
        F.coalesce(F.col("units_outflow"), F.lit(0)) +
        F.coalesce(F.col("units_outflow_can"), F.lit(0))
    ) * F.col("nav_value")
).show()

Or use a SQL expression:

t3.select(
    F.expr("""(
            coalesce(units_inflow, 0) - coalesce(units_inflow_can, 0) -
            coalesce(units_outflow, 0) + coalesce(units_outflow_can, 0)
           ) * nav_value
    """)
).show()
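One possible reason the calculator check in the question did not match is that the formula is evaluated row by row, while the question sums each column first and multiplies by the sum of nav_value afterwards. The following plain-Python sketch (no Spark needed) illustrates the difference. The sample rows and the helper names `nvl` and `net_value` are hypothetical, made up only for this illustration, and `None` stands in for SQL NULL.

```python
# Hypothetical sample data; None plays the role of SQL NULL.
# nav_value is assumed to be non-null in this sketch.
rows = [
    {"units_inflow": 10, "units_inflow_can": None,
     "units_outflow": 4, "units_outflow_can": 1, "nav_value": 2.0},
    {"units_inflow": None, "units_inflow_can": 2,
     "units_outflow": 3, "units_outflow_can": None, "nav_value": 3.0},
]

UNIT_COLS = ["units_inflow", "units_inflow_can",
             "units_outflow", "units_outflow_can"]


def nvl(value, default=0):
    """Mimic Oracle nvl / Spark coalesce for a single value."""
    return default if value is None else value


def net_value(row):
    """Apply the formula from the question to one row."""
    net_units = (
        nvl(row["units_inflow"])
        - nvl(row["units_inflow_can"])
        - nvl(row["units_outflow"])
        + nvl(row["units_outflow_can"])
    )
    return net_units * row["nav_value"]


# Row-wise: evaluate the formula per row, then add up the results.
row_wise_total = sum(net_value(r) for r in rows)

# Column-wise: sum each column first, then apply the formula once.
col_sums = {c: sum(nvl(r[c]) for r in rows) for c in UNIT_COLS}
nav_sum = sum(r["nav_value"] for r in rows)
column_wise_total = (
    col_sums["units_inflow"]
    - col_sums["units_inflow_can"]
    - col_sums["units_outflow"]
    + col_sums["units_outflow_can"]
) * nav_sum

print(row_wise_total)     # -1.0
print(column_wise_total)  # 10.0
```

The two totals differ because multiplying by nav_value happens per row in the formula. If a single total is wanted in PySpark, wrapping the row-wise expression in F.sum(...) inside an aggregation should correspond to the first figure, not the second.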
