Create a column showing the previously earned salary (GroupBy)

0pizxfdo · posted 2022-10-07 in Spark

I'm trying to build an alternative to Python's .shift() function. I'm very close, but I need one final touch to get everything working, and that is getting the GroupBy part right.
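For reference, the pandas behaviour I'm trying to reproduce would look roughly like this (a minimal sketch with illustrative data, not part of my Spark code):

import pandas as pd

pdf = pd.DataFrame({
    "month": [1, 4, 8, 5, 9, 12],
    "userid": ["John Jones"] * 6,
    "amount": [2400, 1200, 2300, 3000, 5000, 6000],
    "counterparty_iban": ["ENG12345"] * 3 + ["AM12345"] * 3,
})

# within each (userid, counterparty_iban) group, shift the amount down one row
pdf = pdf.sort_values("month")
pdf["previous_salary"] = pdf.groupby(["userid", "counterparty_iban"])["amount"].shift(1)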

I have a DataFrame like this:
| month | userid | amount | counterparty_iban |
| --- | --- | --- | --- |
| 1 | John Jones | 2400 | ENG12345 |
| 4 | John Jones | 1200 | ENG12345 |
| 8 | John Jones | 2300 | ENG12345 |
| 5 | John Jones | 3000 | AM12345 |
| 9 | John Jones | 5000 | AM12345 |
| 12 | John Jones | 6000 | AM12345 |
| 1 | Joe Jones | 1200 | AM12345 |
| 2 | Joe Jones | 2400 | AM12345 |
| 3 | Joe Jones | 5000 | AM12345 |

I need to transform the DataFrame above into the following format:

| month | userid | amount | counterparty_iban | previous_salary |
| --- | --- | --- | --- | --- |
| 1 | John Jones | 2400 | ENG12345 | null |
| 4 | John Jones | 1200 | ENG12345 | 2400 |
| 8 | John Jones | 2300 | ENG12345 | 1200 |
| 5 | John Jones | 3000 | AM12345 | null |
| 9 | John Jones | 5000 | AM12345 | 3000 |
| 12 | John Jones | 6000 | AM12345 | 5000 |
| 1 | Joe Jones | 1200 | AM12345 | null |
| 2 | Joe Jones | 2400 | AM12345 | 1200 |
| 3 | Joe Jones | 5000 | AM12345 | 2400 |

Here is the code that creates the input DataFrame:

columns = ["month", "userid", 'exactoriginalamount', 'counterparty_iban']
data = [("1", "John Jones", "2400", 'ENG12345'),
        ("4", "John Jones", "1200", 'ENG12345'),
        ("8", "John Jones", "2300", 'ENG12345'),
        ("5", "John Jones", "3000", 'AM12345'),
        ("9", "John Jones", "5000", 'AM12345'),
        ("12", "John Jones", "6000", 'AM12345'),
        ("1", "Joe Jones", "1200", 'AM12345'),
        ("2", "Joe Jones", "2400", 'AM12345'),
        ("3", "Joe Jones", "5000", 'AM12345')]

df = spark.createDataFrame(data=data, schema=columns)
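
Note that with this setup every column, including month and exactoriginalamount, is created as a string; this can be checked with printSchema() (shown here just for illustration):

df.printSchema()
# root
#  |-- month: string (nullable = true)
#  |-- userid: string (nullable = true)
#  |-- exactoriginalamount: string (nullable = true)
#  |-- counterparty_iban: string (nullable = true)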

I have tried several variations with the following code:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy().orderBy(F.col('userid'))
df = df.withColumn('previous_salary', F.lag('exactoriginalamount', 1).over(w))

However, I need to somehow group by "userid" and "counterparty_iban" so that the "previous_salary" column shows the correct data within each group.


w9apscun · #1

You need to define the partitions correctly. From your example output, I can see that you want partitions (windows) based on "userid" and "counterparty_iban". The lag function will then operate separately within each of those partitions. In the script below I also cast("long") the "month" column, because with your code "month" is initially created as a string type (ordering by a string column gives a different sort order than ordering by a numeric column).

w = Window.partitionBy("userid", "counterparty_iban").orderBy(F.col("month").cast("long"))
df = df.withColumn("previous_salary", F.lag("exactoriginalamount").over(w))

df.show()

# +-----+----------+-------------------+-----------------+---------------+
# |month|    userid|exactoriginalamount|counterparty_iban|previous_salary|
# +-----+----------+-------------------+-----------------+---------------+
# |    1| Joe Jones|               1200|          AM12345|           null|
# |    2| Joe Jones|               2400|          AM12345|           1200|
# |    3| Joe Jones|               5000|          AM12345|           2400|
# |    5|John Jones|               3000|          AM12345|           null|
# |    9|John Jones|               5000|          AM12345|           3000|
# |   12|John Jones|               6000|          AM12345|           5000|
# |    1|John Jones|               2400|         ENG12345|           null|
# |    4|John Jones|               1200|         ENG12345|           2400|
# |    8|John Jones|               2300|         ENG12345|           1200|
# +-----+----------+-------------------+-----------------+---------------+
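
As a side note on the cast: without it, the string months sort lexicographically, e.g. "12" comes before "4", so lag would pick the wrong "previous" row. A quick way to see this, purely for illustration:

w_str = Window.partitionBy("userid", "counterparty_iban").orderBy(F.col("month"))
df.withColumn("previous_salary_wrong", F.lag("exactoriginalamount").over(w_str)).show()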

8ftvxx2r · #2

You can do something like this:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

custom_window = Window.partitionBy("userid").orderBy("month").rowsBetween(-1, -1)

df = df.withColumn("previous_salary", F.max(df.exactoriginalamount).over(custom_window))
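
With the rowsBetween(-1, -1) frame above, max only ever sees the single preceding row, so it behaves like lag. As written it partitions only by "userid" and orders by the string "month"; to reproduce the expected output in the question it would presumably also need "counterparty_iban" in the partition and a numeric ordering, along the lines of this sketch:

custom_window = (
    Window.partitionBy("userid", "counterparty_iban")
    .orderBy(F.col("month").cast("long"))
    .rowsBetween(-1, -1)
)
df = df.withColumn("previous_salary", F.max(df.exactoriginalamount).over(custom_window))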
