sql: Doing a sum of columns based on some complex logic in PySpark

xu3bshqb posted on 2021-05-27 in Spark


Below is the problem from the attached image.
Table:

Row  Col1  Col2  Col3  Result
1    10    20    100   30
2    20    40    200   60
3    30    60    0     240
4    40    70    0     180
5    30    80    50    110
6    25    35    0     65
7    10    20    60    30

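The Result column is computed according to the following rules:
If col3 > 0, then result = col1 + col2.
If col3 = 0, then result = sum(col2) over the rows from the current row down to the first later row where col3 > 0 (inclusive), plus col1 from that row where col3 > 0.
For example, for row 3, result = 60 + 70 + 80 + 30 (col1 from row 5, because col3 > 0 there) = 240.
For row 4, result = 70 + 80 + 30 (col1 from row 5, because col3 > 0 there) = 180.
The other rows work the same way.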
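To make the rule concrete, here is a minimal plain-Python sketch (not PySpark) that walks the table from the last row to the first while carrying a running total. The function name `compute_result` is made up for illustration, and the sketch assumes the rows are sorted by row number and that the last row has col3 > 0; the question does not say what should happen otherwise.

```python
# Rows as (row, col1, col2, col3), copied from the table in the question.
rows = [
    (1, 10, 20, 100),
    (2, 20, 40, 200),
    (3, 30, 60, 0),
    (4, 40, 70, 0),
    (5, 30, 80, 50),
    (6, 25, 35, 0),
    (7, 10, 20, 60),
]


def compute_result(rows):
    """Hypothetical helper: scan bottom-up, carrying a running total.

    Assumes rows are sorted by row number and the last row has col3 > 0.
    """
    results = {}
    running = 0
    for row, col1, col2, col3 in reversed(rows):
        if col3 > 0:
            # A col3 > 0 row restarts the total at col1 + col2.
            running = col1 + col2
        else:
            # A col3 = 0 row adds its own col2 to the total of the rows below it.
            running += col2
        results[row] = running
    return results


print(compute_result(rows))
```

On the table above this reproduces the Result column, including 240 for row 3 and 180 for row 4.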

sql apache-spark pyspark apache-spark-sql pandas

Source: https://stackoverflow.com/questions/62950700/doing-some-of-columns-based-on-some-complex-logic-in-pyspark

1 answer

gfttwv5a1#

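This answers (correctly, I might add) the original version of the question.
In SQL, you can express this using window functions. Use a cumulative sum (taken in descending row order) to define the groups, and then a second cumulative sum within each group: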

select t.*,
       (case when col3 <> 0 then col1 + col2
             else sum(col2 + case when col3 = 0 then col1 else 0 end) over (partition by grp order by row desc)
        end) as result
from (select t.*,
             sum(case when col3 <> 0 then 1 else 0 end) over (order by row desc) as grp
      from t
     ) t;

Here is a db<>fiddle (using Postgres).
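For anyone who wants to try the query without Postgres or Spark, the sketch below runs it against the table from the question using Python's built-in `sqlite3` module. This is a stand-in engine, not PySpark, and it assumes the bundled SQLite is 3.25 or newer (the first version with window functions). The column `row` is double-quoted to avoid any keyword clash, and the `{op}` placeholder is only there so the inner condition can be run both as written (`=`) and flipped (`<>`).

```python
import sqlite3

# Table from the question; the table name t and column names follow the query above.
rows = [
    (1, 10, 20, 100),
    (2, 20, 40, 200),
    (3, 30, 60, 0),
    (4, 40, 70, 0),
    (5, 30, 80, 50),
    (6, 25, 35, 0),
    (7, 10, 20, 60),
]

# Same shape as the answer's query; {op} is the comparison in the inner case.
QUERY = """
select t."row",
       (case when col3 <> 0 then col1 + col2
             else sum(col2 + case when col3 {op} 0 then col1 else 0 end)
                  over (partition by grp order by "row" desc)
        end) as result
from (select t.*,
             sum(case when col3 <> 0 then 1 else 0 end)
                 over (order by "row" desc) as grp
      from t
     ) t
order by t."row"
"""

conn = sqlite3.connect(":memory:")
conn.execute('create table t ("row" integer, col1 integer, col2 integer, col3 integer)')
conn.executemany("insert into t values (?, ?, ?, ?)", rows)


def run(op):
    """Return the result column, ordered by row, for the given inner comparison."""
    return [r[1] for r in conn.execute(QUERY.format(op=op))]


as_written = run("=")   # inner condition exactly as in the answer
adjusted = run("<>")    # add col1 only from the row where col3 <> 0

print("as written:", as_written)
print("adjusted:  ", adjusted)
```

On the table as it appears in this version of the question, the query as written gives 280 for row 3, while the flipped condition gives 240, 180 and 65 for rows 3, 4 and 6, matching the Result column. That is in line with the answer's remark that it targets the original version of the question.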
Note:
Your description says the else logic should be:

else sum(col2) over (partition by grp order by row desc)

Your example says:

else sum(col2 + col3) over (partition by grp order by row desc)

And this, in my opinion, seems the most logical:

else sum(col1 + col2) over (partition by grp order by row desc)

2021-05-27
