在pyspark中替换2020年“booking”列的值

xlpyo6sf  于 2021-07-13  发布在  Spark
关注(0)|答案(2)|浏览(394)

我在pyspark中有一个数据框,包含2019年和2020年的数据。

如果2020年日期的任何预订值小于25,则将其替换为2019年相同日期的移动平均值。
例子:

必修的:

我能做到

targetDf = df.withColumn("Booking",when(df["date"] == "2020-01-12", 75).otherwise(df["Booking"]))

但是我有太多的值要替换,所以我尝试了下面的代码

targetDf = df.withColumn("Booking",\
 when(df["Booking"] <= 25, (df["movingAvg"].when(df["date"] == ?)).otherwise(df["Booking"]))

我不知道怎么写(?)去年同期移动平均值。

pvabu6sv

pvabu6sv1#

您可以使用self-left-join,条件是 VenueName = VenueName 以及 date - 1 year = date ,如果预订<25,则获得去年平均值 when :

from pyspark.sql import functions as F

df = df.withColumn("date", F.to_date("date", "dd-MM-yyyy"))

df1 = df.alias("df1").join(
    df.alias("df2"),
    (F.col("df1.VenueName") == F.col("df2.VenueName")) &
    (F.expr("df1.date - INTERVAL 1 year") == F.col("df2.date")),
    "left"
).select(
    "df1.name_id", "df1.VenueName", "df1.date",
    F.when(
        F.col("df1.Booking") < 25, F.coalesce("df2.movingAvg", "df1.Booking")
    ).otherwise(F.col("df1.Booking")).alias("Booking"),
    "df1.movingAvg"
)

# verify changes for dates = 2020-01-12 / 2019-01-12

df1.filter("date in ('2020-01-12', '2019-01-12')").show()

姓名:idvenuenamedatebookingmovingavg1367tortuga2020-01-127537.428571431011tortuga2019-01-12100.075.10286

vdgimpew

vdgimpew2#

对于这种类型的用例,可以使用相关子查询。它类似于自连接,但语法更直观。

df2 = df.withColumn('date', F.to_date('date', 'dd-MM-yyyy'))
df2.createOrReplaceTempView('df2')

result = spark.sql("""
    select
        name_id, VenueName, date,
        case when Booking < 25
             then (
                 select first(t2.movingAvg)
                 from df2 t2
                 where year(t1.date) = year(t2.date) + 1
                   and month(t1.date) = month(t2.date)
                   and dayofmonth(t1.date) = dayofmonth(t2.date)
             )
             else Booking end
        as Booking,
        movingAvg
    from df2 t1
""")

result.show()
+-------+---------+----------+-----------+-----------+
|name_id|VenueName|      date|    Booking|  movingAvg|
+-------+---------+----------+-----------+-----------+
|   1011|  Tortuga|2019-01-12|      100.0|75.14285714|
|   1367|  Tortuga|2020-01-12|75.14285714|37.42857143|
+-------+---------+----------+-----------+-----------+

相关问题