groupby function

jdgnovmf posted on 2021-05-27 in Spark

I am new to Spark and need some help applying a groupby function based on a condition. Below is my current output:

+----------+------------------+-----------------+----------+---------+------------+------+----------+--------+----------------+
|account_id|credit_card_Number|credit_card_limit|first_name|last_name|phone_number|amount|      date|    shop|transaction_code|
+----------+------------------+-----------------+----------+---------+------------+------+----------+--------+----------------+
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|  1000|01/06/2020|  amazon|             buy|
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|  1100|02/06/2020|    ebay|             buy|
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|   500|02/06/2020|  amazon|            sell|
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|   200|03/06/2020|flipkart|             buy|
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|  4000|04/06/2020|    ebay|             buy|
|     12345|      123456789123|           100000|       abc|      xyz|  1234567890|   900|05/06/2020|  amazon|             buy|
+----------+------------------+-----------------+----------+---------+------------+------+----------+--------+----------------+

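I need to group by date, and in addition I need to create an extra column holding the balance for that date, based on whether the transaction code is 'buy' or 'sell'.
For example, for the first row the amount is 1000 and the transaction code is 'buy', so I subtract 1000 from the credit limit (100000) and put the new value, 99000, in the new column.
For the second date there are two values, one buy (1100) and one sell (500); here I should subtract 1100 from the previous row's output (i.e. 99000) and then add 500. So the output for 02/06/2020 is 98400.
The expected output is the DataFrame above with this additional column appended: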

Credit_left
99000
98400
98200
94200
93300
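
The arithmetic behind these expected values can be checked without Spark. The following is only a plain-Python sketch of the rule described above (subtract on 'buy', add otherwise, carry the balance forward per date); the helper name `credit_left_by_date` is made up for illustration and the rows are copied from the sample table.

```python
from itertools import groupby

# Sample rows copied from the question: (date, amount, transaction_code).
transactions = [
    ("01/06/2020", 1000, "buy"),
    ("02/06/2020", 1100, "buy"),
    ("02/06/2020", 500, "sell"),
    ("03/06/2020", 200, "buy"),
    ("04/06/2020", 4000, "buy"),
    ("05/06/2020", 900, "buy"),
]
credit_limit = 100000


def credit_left_by_date(rows, limit):
    """Return [(date, credit_left)] with one entry per date, in input order."""
    balance = limit
    result = []
    # groupby only merges adjacent rows, so rows must already be ordered by date.
    for date, group in groupby(rows, key=lambda row: row[0]):
        for _, amount, code in group:
            # 'buy' uses up credit; any other code gives it back.
            balance += -amount if code == "buy" else amount
        result.append((date, balance))
    return result


balances = credit_left_by_date(transactions, credit_limit)
for date, left in balances:
    print(date, left)
```

The printed balances are 99000, 98400, 98200, 94200 and 93300, matching the Credit_left column above.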

Below is the schema of this table:

root
 |-- account_id: long (nullable = true)
 |-- credit_card_Number: long (nullable = true)
 |-- credit_card_limit: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_number: long (nullable = true)
 |-- amount: long (nullable = true)
 |-- date: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- transaction_code: string (nullable = true)

This is quite a complex task for me, and I could not find the answer I need. Please help me solve this problem. Thanks a lot!

hadoop apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/62650136/groupby-function-on-dataframe-using-conditions-in-pyspark

1 answer

sbdsn5lh1#

The solution can be implemented as follows:

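from pyspark.sql import Window
import pyspark.sql.functions as f

# 'date' is a dd/MM/yyyy string, so parse it for chronological ordering;
# partitioning by card keeps each card's running total separate.
w = Window.partitionBy('credit_card_Number').orderBy(f.to_date(f.col('date'), 'dd/MM/yyyy'))

# Net change per row: 'buy' reduces the available credit, anything else adds to it.
signed_amount = f.when(f.col('transaction_code') == 'buy', -f.col('amount')).otherwise(f.col('amount'))

(df.groupBy('date', 'credit_card_limit', 'credit_card_Number')
   .agg(f.sum(signed_amount).alias('expenses'))
   # Running sum of the daily net changes, added to the limit, gives the remaining credit.
   .select('*', (f.col('credit_card_limit') + f.sum('expenses').over(w)).alias('Credit_left'))
   .show())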

+----------+-----------------+------------------+--------+-----------+
|      date|credit_card_limit|credit_card_Number|expenses|Credit_left|
+----------+-----------------+------------------+--------+-----------+
|01/06/2020|           100000|      123456789123| -1000.0|    99000.0|
|02/06/2020|           100000|      123456789123|  -600.0|    98400.0|
|03/06/2020|           100000|      123456789123|  -200.0|    98200.0|
|04/06/2020|           100000|      123456789123| -4000.0|    94200.0|
|05/06/2020|           100000|      123456789123|  -900.0|    93300.0|
+----------+-----------------+------------------+--------+-----------+
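
One thing to watch when a window is ordered by this column: according to the schema, `date` is a string in dd/MM/yyyy form, and such strings sort by their day digits first rather than chronologically. The sample data stays within a single month, so the problem does not show up there. The plain-Python sketch below uses three made-up dates (not from the question) to illustrate the difference.

```python
from datetime import datetime

# Hypothetical dates (not from the question) that span two months.
dates = ["01/07/2020", "15/06/2020", "02/06/2020"]

# A plain string sort compares the day digits first.
as_strings = sorted(dates)

# Parsing with the dd/MM/yyyy pattern gives chronological order.
as_dates = sorted(dates, key=lambda d: datetime.strptime(d, "%d/%m/%Y"))

print(as_strings)
print(as_dates)
```

The string sort puts 01/07/2020 ahead of both June dates, while the parsed sort puts it last. In PySpark the analogous step would be to order the window by a parsed date, for example with `to_date` from `pyspark.sql.functions` and the pattern 'dd/MM/yyyy'.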

Hope it helps :)

2021-05-27
