azure: Delete records with matching positive and negative amounts from a PySpark DataFrame?

fhg3lkii posted on 2023-10-22 in Spark

I have a table with a mix of negative and positive amounts, and I want to delete only those records whose negative amount is matched by an equal positive amount. I am not sure how best to explain the scenario, so here is a simple example.

The table below has 12 records, 5 with negative amounts and 7 with positive amounts. A record should only cancel another record when all three columns match, i.e. (100, 15-09-2023, F) cancels (-100, 15-09-2023, F) but not (-100, 15-09-2023, S).

amount_debited | transaction_date | customer_tan
 100           | 15-09-2023       | F
-100           | 15-09-2023       | F
-100           | 15-09-2023       | F
 100           | 15-09-2023       | S
 100           | 15-09-2023       | F
 500           | 19-09-2023       | F
-500           | 19-09-2023       | F
 500           | 19-09-2023       | F
 500           | 19-09-2023       | S
-600           | 19-09-2023       | F
 400           | 19-09-2023       | F
-400           | 19-09-2023       | F

So I want to delete the records where a negative amount cancels an equal positive amount; the expected output is:

amount_debited | transaction_date | customer_tan
 100           | 15-09-2023       | S
 500           | 19-09-2023       | F
 500           | 19-09-2023       | S
-600           | 19-09-2023       | F
df = df.withColumn("idx", F.monotonically_increasing_id())
    windowSpec = W.orderBy("idx")
    df = df.withColumn("idx", F.row_number().over(windowSpec))

    df=df.withColumn("amount_debited_dummy",F.expr('abs(amount_debited)'))
     df= df.withColumn("positive_key",F.concat_ws("_",'transaction_date','customer_tan','amount_debited_dummy'))

    grouped_count = Window().partitionBy('positive_key')

    df =df.withColumn("sum_amount",F.sum('amount_debited').over(grouped_count))

Stuck here: when a group contains the same number of positive and negative records I can filter it out because sum_amount = 0 (e.g. the 400/-400 pair), but when a group has an odd number of entries I cannot work out which row ids to keep.
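The filtering step described above might look like the following minimal sketch; it relies on the sum_amount column built earlier, and df_filtered is just an illustrative name. It removes the groups that cancel to zero but, as noted, it does not handle odd-sized groups:

from pyspark.sql import functions as F

# Keep only groups whose amounts do not net to zero;
# odd-sized groups such as (500, -500, 500) still pass through whole
df_filtered = df.filter(F.col("sum_amount") != 0)
df_filtered.show(truncate=False)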


pinkon5k (answer #1)

Check the code below.

spark.table("input").show(false)

+--------------+----------------+------------+
|amount_debited|transaction_date|customer_tan|
+--------------+----------------+------------+
|100           |15-09-2023      |F           |
|-100          |15-09-2023      |F           |
|-100          |15-09-2023      |F           |
|100           |15-09-2023      |S           |
|100           |15-09-2023      |F           |
|500           |19-09-2023      |F           |
|-500          |19-09-2023      |F           |
|500           |19-09-2023      |F           |
|500           |19-09-2023      |S           |
|-600          |19-09-2023      |F           |
|400           |19-09-2023      |F           |
|-400          |19-09-2023      |F           |
+--------------+----------------+------------+
spark.sql("""
WITH source AS (
          SELECT
              amount_debited,
              transaction_date,
              customer_tan,
              ROW_NUMBER() OVER (win_spec) AS row_id,
              LAST(ROW_NUMBER() OVER (win_spec)) OVER(win_spec) AS last_id
          FROM input
          WINDOW win_spec AS (PARTITION BY abs(amount_debited), transaction_date, customer_tan ORDER BY customer_tan ASC)
)
SELECT
     amount_debited,
     transaction_date,
     customer_tan,
     last_id,
     row_id
FROM source
""").show(false)

+--------------+----------------+------------+-------+------+
|amount_debited|transaction_date|customer_tan|last_id|row_id|
+--------------+----------------+------------+-------+------+
|100           |15-09-2023      |F           |4      |1     |
|-100          |15-09-2023      |F           |4      |2     |
|-100          |15-09-2023      |F           |4      |3     |
|100           |15-09-2023      |F           |4      |4     |
|100           |15-09-2023      |S           |1      |1     |
|400           |19-09-2023      |F           |2      |1     |
|-400          |19-09-2023      |F           |2      |2     |
|500           |19-09-2023      |F           |3      |1     |
|-500          |19-09-2023      |F           |3      |2     |
|500           |19-09-2023      |F           |3      |3     |
|500           |19-09-2023      |S           |1      |1     |
|-600          |19-09-2023      |F           |1      |1     |
+--------------+----------------+------------+-------+------+

Within each matching group, row_id numbers the rows and last_id holds the group size. An even last_id means the positive and negative amounts cancel completely, so the whole group can be dropped; an odd last_id leaves exactly one row over, and the query below keeps that row (row_id = last_id).
spark.sql("""
WITH source AS (
          SELECT
              amount_debited,
              transaction_date,
              customer_tan,
              ROW_NUMBER() OVER (win_spec) AS row_id,
              LAST(ROW_NUMBER() OVER (win_spec)) OVER(win_spec) AS last_id
          FROM input
          WINDOW win_spec AS (PARTITION BY abs(amount_debited), transaction_date, customer_tan ORDER BY customer_tan ASC)
)
SELECT
     amount_debited,
     transaction_date,
     customer_tan,
     last_id,
     row_id
FROM source
WHERE last_id % 2 != 0 AND row_id = last_id
""").show(false)

+--------------+----------------+------------+-------+------+
|amount_debited|transaction_date|customer_tan|last_id|row_id|
+--------------+----------------+------------+-------+------+
|100           |15-09-2023      |S           |1      |1     |
|500           |19-09-2023      |F           |3      |3     |
|500           |19-09-2023      |S           |1      |1     |
|-600          |19-09-2023      |F           |1      |1     |
+--------------+----------------+------------+-------+------+
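For reference, the same approach can be sketched with the PySpark DataFrame API instead of SQL. This is a rough translation, assuming df holds the input table with the column names from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rows may only cancel within the same absolute amount, date and tan
key = [F.abs(F.col("amount_debited")), F.col("transaction_date"), F.col("customer_tan")]

order_win = Window.partitionBy(*key).orderBy("customer_tan")
count_win = Window.partitionBy(*key)

result = (
    df.withColumn("row_id", F.row_number().over(order_win))
      .withColumn("last_id", F.count("*").over(count_win))  # group size
      # even-sized groups cancel completely; odd-sized groups keep one row
      .filter((F.col("last_id") % 2 != 0) & (F.col("row_id") == F.col("last_id")))
      .drop("row_id", "last_id")
)
result.show(truncate=False)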
