I'm trying to combine the Renewal_Mo values in the PySpark DataFrame below, but I can't seem to work it out.
I have this DataFrame:
+--------------+----------+----------+---------+----------+---------+
|First_Purchase|Renewal_Mo|second_buy|third_buy|fourth_buy|fifth_buy|
+--------------+----------+----------+---------+----------+---------+
|6             |1         |1         |0        |0         |0        |
|6             |12        |36        |0        |0         |0        |
|6             |24        |4         |0        |0         |0        |
|6             |18        |2         |0        |0         |0        |
|6             |3         |6         |0        |0         |0        |
|6             |2         |8         |0        |0         |0        |
|6             |36        |1         |0        |0         |0        |
|6             |6         |12        |0        |0         |0        |
|6             |12        |0         |1        |0         |0        |
|6             |3         |0         |1        |0         |0        |
|6             |2         |0         |7        |0         |0        |
|6             |6         |0         |1        |0         |0        |
|6             |1         |0         |0        |1         |0        |
|6             |12        |0         |0        |1         |0        |
+--------------+----------+----------+---------+----------+---------+
and I want to combine the values in Renewal_Mo so that there are no duplicates, producing this DataFrame:
+--------------+----------+----------+---------+----------+---------+
|First_Purchase|Renewal_Mo|second_buy|third_buy|fourth_buy|fifth_buy|
+--------------+----------+----------+---------+----------+---------+
|6             |1         |1         |0        |1         |0        |
|6             |12        |36        |1        |1         |0        |
|6             |24        |4         |0        |0         |0        |
|6             |18        |2         |0        |0         |0        |
|6             |3         |6         |1        |0         |0        |
|6             |2         |8         |7        |0         |0        |
|6             |36        |1         |0        |0         |0        |
|6             |6         |12        |1        |0         |0        |
+--------------+----------+----------+---------+----------+---------+
But groupBy seems like the wrong approach, since it requires passing an aggregate function. Can I use a window partition instead? Is there another way? What am I missing?
If I try
foo.groupby('First_Purchase','Renewal_Mo').count().show(truncate=False)
then of course I lose the buy columns, and it only counts how many renewal rows there are per group. I can't see how to get from this to the desired DataFrame above:
+--------------+----------+-----+
|First_Purchase|Renewal_Mo|count|
+--------------+----------+-----+
|6             |1         |2    |
|6             |12        |3    |
|6             |24        |1    |
|6             |18        |1    |
|6             |3         |2    |
|6             |2         |2    |
|6             |36        |1    |
|6             |6         |2    |
+--------------+----------+-----+
2 Answers
6kkfgxo01#
I don't see why you say groupBy is the wrong approach because it requires an aggregate function. The only way I would do this is to group and aggregate, and there is a built-in sum function that does exactly what you want:
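A minimal sketch of that, assuming the question's DataFrame is named foo:

# Group on the key columns and sum the buy columns, so duplicate
# (First_Purchase, Renewal_Mo) rows collapse into one.
combined = foo.groupBy('First_Purchase', 'Renewal_Mo') \
              .sum('second_buy', 'third_buy', 'fourth_buy', 'fifth_buy')
combined.show(truncate=False)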
The only side effect is that this method changes the column names, but you can easily fix that in several ways.
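For example, one sketch of the rename fix uses agg with alias (the column list is taken from the question):

from pyspark.sql import functions as F

buy_cols = ['second_buy', 'third_buy', 'fourth_buy', 'fifth_buy']

# Alias each sum back to its original column name so the schema
# matches the desired output.
combined = foo.groupBy('First_Purchase', 'Renewal_Mo') \
              .agg(*[F.sum(c).alias(c) for c in buy_cols])
combined.show(truncate=False)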
zbwhf8kr2#
I also think groupBy is reasonable here.
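A minimal sketch of the same idea, using the dictionary form of agg (again assuming the DataFrame is named foo):

# Dictionary form of agg; the summed columns come back named
# sum(second_buy) and so on, and can be renamed afterwards.
combined = foo.groupBy('First_Purchase', 'Renewal_Mo').agg(
    {c: 'sum' for c in ['second_buy', 'third_buy', 'fourth_buy', 'fifth_buy']}
)
combined.show(truncate=False)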