pyspark:如何计算满足最后一个条件(正vs负)之间的天数

vaqhlq81  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(774)

当前df(按单个userid过滤,当丢失大于0时标志为1,当小于等于0时标志为1):

display(df):
+------+----------+---------+----+
|  user|Date      |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01|    -16.5|   1|
|100364|2019-02-02|     73.5|  -1|
|100364|2019-02-03|       31|  -1|
|100364|2019-02-09|     -5.2|   1|
|100364|2019-02-10|    -34.5|   1|
|100364|2019-02-13|     -8.1|   1|
|100364|2019-02-18|     5.68|  -1|
|100364|2019-02-19|     5.76|  -1|
|100364|2019-02-20|     9.12|  -1|
|100364|2019-02-26|      9.4|  -1|
|100364|2019-02-27|    -30.6|   1|
+----------+------+---------+----+

desidered outcome df应该显示自lastwin('recencylastwin')和自lastloss('recencylastloss')以来的天数

display(df):
+------+----------+---------+----+--------------+---------------+
|  user|Date      |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01|    -16.5|   1|          null|           null|
|100364|2019-02-02|     73.5|  -1|             1|           null|
|100364|2019-02-03|       31|  -1|             2|              1|     
|100364|2019-02-09|     -5.2|   1|             8|              6|
|100364|2019-02-10|    -34.5|   1|             1|              7|
|100364|2019-02-13|     -8.1|   1|             1|             10|
|100364|2019-02-18|     5.68|  -1|             5|             15|         
|100364|2019-02-19|     5.76|  -1|             6|              1|
|100364|2019-02-20|     9.12|  -1|             7|              1|
|100364|2019-02-26|      9.4|  -1|            13|              6|
|100364|2019-02-27|    -30.6|   1|            14|              1|
+----------+------+---------+----+--------------+---------------+

我的方法如下:

from pyspark.sql.window import Window

w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")

last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))

finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
                  .join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
                  .withColumn('previous_date_played'         , F.lag('EventDate').over(w))\
                  .withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
                  .withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))

然后我尝试添加(多次尝试…),但未能“完美”返回我想要的。

finalcheck = finalcheck.withColumn('previous_last_pos'           , F.last('last_positive_day_count', True).over(w2))\
                        .withColumn('previous_last_neg'           , F.last('last_negative_day_count', True).over(w2))\
                        .withColumn('previous_last_pos_date'      , F.last('last_positive_day', True).over(w2))\
                        .withColumn('previous_last_neg_date'      , F.last('last_negative_day', True).over(w2))\
                        .withColumn('recency_last_positive'       , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
                        .withColumn('day_since_last_negative_v1'  , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
                        .withColumn('days_off'                    , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
                        .withColumn('recency_last_negative'       , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
                        .withColumn('recency_last_negative_v2'    , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
                        .withColumn('recency_last_positive_v2'    , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))

有什么建议/提示吗(我发现了一个类似的问题,但没有弄清楚如何在我的具体案例中应用):如何计算满足最后一个条件之间的天数?

hs1ihplo

hs1ihplo1#

这是我的尝试。
有两个部分来计算。第一个是当胜负持续时,就要把日期差加起来。为了实现这一点,我将连续的输赢标记为1,并通过累计求和将它们分成分区组,直到标记的当前行为止。然后,我可以计算出从上一次输赢到连续输赢结束后的累计天数。
第二种是当胜负发生变化时,只需得到上一场比赛和本场比赛的日期差。它可以很容易地得到的日期差异,目前和以前的一个。
最后,将这些结果合并到一列中。

from pyspark.sql.functions import lag, col, sum
from pyspark.sql import Window

w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')

df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
  .withColumn('lastDate', lag('Date', 1).over(w1)) \
  .withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
  .withColumn('consecutiveLoss', expr('if(flag =  1 or lastFlag =  1, 0, 1)')) \
  .withColumn('consecutiveWin' , expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
  .withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
  .withColumn('groupWinCheck' , sum('consecutiveWin' ).over(w1)) \
  .withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
  .withColumn('daysLastwin' , sum(when((col('consecutiveWin' ) == 0) & (col('groupWinCheck' ) != 0), col('dateDiff'))).over(w3)) \
  .withColumn('lastLoss', expr('if(lastFlag = -1, datediff, null)')) \
  .withColumn('lastWin' , expr('if(lastFlag =  1, dateDiff, null)')) \
  .withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
  .withColumn('RecencyLastWin',  coalesce('lastWin' , 'daysLastwin' )) \
  .orderBy('Date')

df2.show(11, False)

+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user  |Date      |RealLoss|flag|lastFlag|lastDate  |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5   |1   |null    |null      |null    |0              |1             |0             |1            |null        |null       |null    |null   |null           |null          |
|100364|2019-02-02|73.5    |-1  |1       |2019-02-01|1       |0              |0             |0             |1            |null        |1          |null    |1      |null           |1             |
|100364|2019-02-03|31.0    |-1  |-1      |2019-02-02|1       |1              |0             |1             |1            |null        |2          |1       |null   |1              |2             |
|100364|2019-02-09|-5.2    |1   |-1      |2019-02-03|6       |0              |0             |1             |1            |6           |8          |6       |null   |6              |8             |
|100364|2019-02-10|-34.5   |1   |1       |2019-02-09|1       |0              |1             |1             |2            |7           |null       |null    |1      |7              |1             |
|100364|2019-02-13|-8.1    |1   |1       |2019-02-10|3       |0              |1             |1             |3            |10          |null       |null    |3      |10             |3             |
|100364|2019-02-18|5.68    |-1  |1       |2019-02-13|5       |0              |0             |1             |3            |15          |5          |null    |5      |15             |5             |
|100364|2019-02-19|5.76    |-1  |-1      |2019-02-18|1       |1              |0             |2             |3            |null        |6          |1       |null   |1              |6             |
|100364|2019-02-20|9.12    |-1  |-1      |2019-02-19|1       |1              |0             |3             |3            |null        |7          |1       |null   |1              |7             |
|100364|2019-02-26|9.4     |-1  |-1      |2019-02-20|6       |1              |0             |4             |3            |null        |13         |6       |null   |6              |13            |
|100364|2019-02-27|-30.6   |1   |-1      |2019-02-26|1       |0              |0             |4             |3            |1           |14         |1       |null   |1              |14            |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+

df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)

+------+----------+--------+----+---------------+--------------+
|user  |Date      |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5   |1   |null           |null          |
|100364|2019-02-02|73.5    |-1  |null           |1             |
|100364|2019-02-03|31.0    |-1  |1              |2             |
|100364|2019-02-09|-5.2    |1   |6              |8             |
|100364|2019-02-10|-34.5   |1   |7              |1             |
|100364|2019-02-13|-8.1    |1   |10             |3             |
|100364|2019-02-18|5.68    |-1  |15             |5             |
|100364|2019-02-19|5.76    |-1  |1              |6             |
|100364|2019-02-20|9.12    |-1  |1              |7             |
|100364|2019-02-26|9.4     |-1  |6              |13            |
|100364|2019-02-27|-30.6   |1   |1              |14            |
+------+----------+--------+----+---------------+--------------+

相关问题