PySpark - drop rows based on similar rows

vwkv1x7d posted 2021-05-27 in Spark

I need to drop every row with value 0 whenever a row with value 1 exists for the same id and date.

df=spark.createDataFrame([("A1", "2016-10-01", 1), ("A1", "2016-10-01", 0), ("A1", "2016-10-05", 1), ("A3", "2016-10-06", 1), ("A3", "2016-10-07", 0)], ["id", "date", "value"])

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
| A1|2016-10-01|    1|
| A1|2016-10-01|    0|
| A1|2016-10-05|    1|
| A3|2016-10-06|    1|
| A3|2016-10-07|    0|
+---+----------+-----+

Desired DataFrame: note that id A1 on 2016-10-01 had two rows, with values 1 and 0; now only the value-1 row remains.
Within each (id, date) group, the value-0 row should be dropped if a value-1 row exists for that same date.

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
| A1|2016-10-01|    1|
| A1|2016-10-05|    1|
| A3|2016-10-06|    1|
| A3|2016-10-07|    0|
+---+----------+-----+

rqcrx0a6 1#

Just needs a bit of Window magic ✨

from pyspark.sql import functions as F, Window

df.withColumn("max_value", F.max("value").over(Window.partitionBy("id", "date"))).where(
    "value = max_value"
).drop("max_value").show()

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
| A1|2016-10-05|    1|
| A1|2016-10-01|    1|
| A3|2016-10-07|    0|
| A3|2016-10-06|    1|
+---+----------+-----+
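The max-over-window filter keeps every row that ties for the maximum, so if the data contained duplicate (id, date, value) rows they would all survive. If exactly one row per (id, date) is wanted, a row_number window is a common variant; this is only a sketch of that alternative, not part of the answer above:

from pyspark.sql import functions as F, Window

# Rank rows within each (id, date) group, highest value first,
# then keep only the top-ranked row of each group.
w = Window.partitionBy("id", "date").orderBy(F.col("value").desc())

(df.withColumn("rn", F.row_number().over(w))
   .where("rn = 1")
   .drop("rn")
   .show())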

ulydmbyx 2#

from pyspark.sql import functions as F

df.groupBy('id', 'date').agg(F.max('value').alias('value')).show()

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
| A3|2016-10-06|    1|
| A3|2016-10-07|    0|
| A1|2016-10-05|    1|
| A1|2016-10-01|    1|
+---+----------+-----+
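Note that groupBy/agg returns only the grouping and aggregated columns, while the window answer above keeps any extra columns. As a rough sketch (assuming the only columns are id, date and value), the same filter can also be expressed as a left anti join against the (id, date) pairs that already contain a 1:

from pyspark.sql import functions as F

# (id, date) pairs that have at least one row with value = 1
ones = df.where(F.col("value") == 1).select("id", "date")

# Keep value-0 rows only when no value-1 row exists for the same (id, date)
zeros_to_keep = df.where(F.col("value") == 0).join(ones, ["id", "date"], "left_anti")

df.where(F.col("value") == 1).unionByName(zeros_to_keep).show()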
