azure: Delete records with matching positive and negative amounts from a PySpark DataFrame?

fhg3lkii posted on 2023-10-22 in Spark

I have a table with a mix of negative and positive amounts, and I want to delete only those records whose negative amount is matched by an equal positive amount. I am not sure how best to explain the scenario, so here is a simple example.

The table below has 12 records, 5 with negative amounts and 7 with positive amounts. A record should only cancel another record when all three columns match, i.e. (100, 15-09-2023, F) cancels (-100, 15-09-2023, F) but not (-100, 15-09-2023, S).

amount_debited | transaction_date | customer_tan
 100           | 15-09-2023       | F
-100           | 15-09-2023       | F
-100           | 15-09-2023       | F
 100           | 15-09-2023       | S
 100           | 15-09-2023       | F
 500           | 19-09-2023       | F
-500           | 19-09-2023       | F
 500           | 19-09-2023       | F
 500           | 19-09-2023       | S
-600           | 19-09-2023       | F
 400           | 19-09-2023       | F
-400           | 19-09-2023       | F

So I want to delete the records where a negative amount cancels an equal positive amount; the expected output is:

amount_debited | transaction_date | customer_tan
 100           | 15-09-2023       | S
 500           | 19-09-2023       | F
 500           | 19-09-2023       | S
-600           | 19-09-2023       | F
df = df.withColumn("idx", F.monotonically_increasing_id())
    windowSpec = W.orderBy("idx")
    df = df.withColumn("idx", F.row_number().over(windowSpec))

    df=df.withColumn("amount_debited_dummy",F.expr('abs(amount_debited)'))
     df= df.withColumn("positive_key",F.concat_ws("_",'transaction_date','customer_tan','amount_debited_dummy'))

    grouped_count = Window().partitionBy('positive_key')

    df =df.withColumn("sum_amount",F.sum('amount_debited').over(grouped_count))

Stuck here: when a group contains the same number of positive and negative records I can filter it out because sum_amount = 0 (e.g. the 400/-400 pair), but when a group has an odd number of entries I cannot work out which row ids to keep.
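The filtering step described above might look like the following minimal sketch; it relies on the sum_amount column built earlier, and df_filtered is just an illustrative name. It removes the groups that cancel to zero but, as noted, it does not handle odd-sized groups:

from pyspark.sql import functions as F

# Keep only groups whose amounts do not net to zero;
# odd-sized groups such as (500, -500, 500) still pass through whole
df_filtered = df.filter(F.col("sum_amount") != 0)
df_filtered.show(truncate=False)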


pinkon5k (answer #1)

Check the code below.

spark.table("input").show(false)

+--------------+----------------+------------+
|amount_debited|transaction_date|customer_tan|
+--------------+----------------+------------+
|100           |15-09-2023      |F           |
|-100          |15-09-2023      |F           |
|-100          |15-09-2023      |F           |
|100           |15-09-2023      |S           |
|100           |15-09-2023      |F           |
|500           |19-09-2023      |F           |
|-500          |19-09-2023      |F           |
|500           |19-09-2023      |F           |
|500           |19-09-2023      |S           |
|-600          |19-09-2023      |F           |
|400           |19-09-2023      |F           |
|-400          |19-09-2023      |F           |
+--------------+----------------+------------+
spark.sql("""
WITH source AS (
          SELECT
              amount_debited,
              transaction_date,
              customer_tan,
              ROW_NUMBER() OVER (win_spec) AS row_id,
              LAST(ROW_NUMBER() OVER (win_spec)) OVER(win_spec) AS last_id
          FROM input
          WINDOW win_spec AS (PARTITION BY abs(amount_debited), transaction_date, customer_tan ORDER BY customer_tan ASC)
)
SELECT
     amount_debited,
     transaction_date,
     customer_tan,
     last_id,
     row_id
FROM source
""").show(false)

+--------------+----------------+------------+-------+------+
|amount_debited|transaction_date|customer_tan|last_id|row_id|
+--------------+----------------+------------+-------+------+
|100           |15-09-2023      |F           |4      |1     |
|-100          |15-09-2023      |F           |4      |2     |
|-100          |15-09-2023      |F           |4      |3     |
|100           |15-09-2023      |F           |4      |4     |
|100           |15-09-2023      |S           |1      |1     |
|400           |19-09-2023      |F           |2      |1     |
|-400          |19-09-2023      |F           |2      |2     |
|500           |19-09-2023      |F           |3      |1     |
|-500          |19-09-2023      |F           |3      |2     |
|500           |19-09-2023      |F           |3      |3     |
|500           |19-09-2023      |S           |1      |1     |
|-600          |19-09-2023      |F           |1      |1     |
+--------------+----------------+------------+-------+------+

Within each matching group, row_id numbers the rows and last_id holds the group size. An even last_id means the positive and negative amounts cancel completely, so the whole group can be dropped; an odd last_id leaves exactly one row over, and the query below keeps that row (row_id = last_id).
spark.sql("""
WITH source AS (
          SELECT
              amount_debited,
              transaction_date,
              customer_tan,
              ROW_NUMBER() OVER (win_spec) AS row_id,
              LAST(ROW_NUMBER() OVER (win_spec)) OVER(win_spec) AS last_id
          FROM input
          WINDOW win_spec AS (PARTITION BY abs(amount_debited), transaction_date, customer_tan ORDER BY customer_tan ASC)
)
SELECT
     amount_debited,
     transaction_date,
     customer_tan,
     last_id,
     row_id
FROM source
WHERE last_id % 2 != 0 AND row_id = last_id
""").show(false)

+--------------+----------------+------------+-------+------+
|amount_debited|transaction_date|customer_tan|last_id|row_id|
+--------------+----------------+------------+-------+------+
|100           |15-09-2023      |S           |1      |1     |
|500           |19-09-2023      |F           |3      |3     |
|500           |19-09-2023      |S           |1      |1     |
|-600          |19-09-2023      |F           |1      |1     |
+--------------+----------------+------------+-------+------+
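For reference, the same approach can be sketched with the PySpark DataFrame API instead of SQL. This is a rough translation, assuming df holds the input table with the column names from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rows may only cancel within the same absolute amount, date and tan
key = [F.abs(F.col("amount_debited")), F.col("transaction_date"), F.col("customer_tan")]

order_win = Window.partitionBy(*key).orderBy("customer_tan")
count_win = Window.partitionBy(*key)

result = (
    df.withColumn("row_id", F.row_number().over(order_win))
      .withColumn("last_id", F.count("*").over(count_win))  # group size
      # even-sized groups cancel completely; odd-sized groups keep one row
      .filter((F.col("last_id") % 2 != 0) & (F.col("row_id") == F.col("last_id")))
      .drop("row_id", "last_id")
)
result.show(truncate=False)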
