What is the pandas equivalent of a SQL count window function with a filter?

kuuvgm7e  posted on 2023-06-04 in Other

I am trying to find the pandas equivalent of a SQL count window function with a filter. Here is the SQL query:

select t.*,
    count(*) filter(where grp = 'new') over(partition by usr order by id) rn
from mytable t
order by usr, id

After sorting by id, I tried the following:

mytable['rn'] = mytable.groupby('usr')['grp'].transform('count')

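But this is wrong, because I am not filtering on grp as I should, and I don't know how to apply such a filter. So what is the correct pandas equivalent (a vectorized solution)?
Note: for more context you can refer to this question, but it is not required: link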
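
To make the gap concrete, here is a minimal sketch on made-up data (the rows below are illustrative, not the asker's table). It shows that `transform('count')` puts each user's total number of non-null `grp` values on every row, so the result is neither filtered on `'new'` nor a running count:

```python
import pandas as pd

# Hypothetical sample data, already sorted by usr and id.
mytable = pd.DataFrame({
    "usr": [1, 1, 1, 2, 2],
    "id": [1, 2, 3, 4, 5],
    "grp": ["new", "same", "new", "new", "same"],
})

# The attempt from the question: one total per user, repeated on each row.
attempt = mytable.groupby("usr")["grp"].transform("count")
print(attempt.tolist())  # [3, 3, 3, 2, 2]
```

For this data the SQL query would instead return 1, 1, 2 for user 1 and 1, 1 for user 2.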


vxf3dgd4  #1

You can use:

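# Encode the labels so a "same" -> "new" transition shows up as a diff of -1.
d = {"new": 0, "same": 1}

mytable["group_number"] = (
    mytable.groupby("user", group_keys=False)["group"]
        # Count those transitions cumulatively per user; numbering starts at 1.
        .apply(lambda s: s.map(d).diff().eq(-1).cumsum().add(1))
)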

Output:

print(mytable)

    user group  group_number
0      1   new             1
1      1  same             1
2      1  same             1
3      1   new             2
4      1  same             2
5      1   new             3
6      2   new             1
7      2  same             1
8      2   new             2
9      2  same             2
10     3   new             1
11     3  same             1
Input used:

import pandas as pd

mytable = pd.DataFrame(
    {"user": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
     "group": ["new", "same", "same", "new", "same",
               "new", "new", "same", "new", "same", "new", "same"]}
)
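
As an alternative sketch that follows the SQL more literally, a running `count(*) filter(where grp = 'new')` can be written as a per-user cumulative sum over a boolean mask. This assumes the question's column names (`usr`, `id`, `grp`). The sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
mytable = pd.DataFrame({
    "usr": [1, 1, 1, 1, 2, 2, 2],
    "id": [1, 2, 3, 4, 5, 6, 7],
    "grp": ["new", "same", "new", "new", "same", "new", "same"],
})

# ORDER BY usr, id
mytable = mytable.sort_values(["usr", "id"])

# FILTER (WHERE grp = 'new') becomes a 0/1 mask;
# OVER (PARTITION BY usr ORDER BY id) becomes a grouped cumulative sum.
is_new = mytable["grp"].eq("new").astype("int64")
mytable["rn"] = is_new.groupby(mytable["usr"]).cumsum()

print(mytable["rn"].tolist())  # [1, 1, 2, 3, 0, 1, 1]
```

Unlike the diff-based approach, this increments on consecutive 'new' rows and gives 0 before a user's first 'new' row, which matches what the SQL count does.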

jfewjypa  #2

This is similar to the equivalent standard SQL expression:

CASE 
  WHEN grp = 'new'
  THEN ROW_NUMBER() OVER (PARTITION BY usr ORDER BY id) 
END rn

The pandas equivalent is:

import numpy as np

mytable['rn'] = (
    mytable.assign(
        grp_flag = lambda df: (               # CASE
            df["grp"].eq('new').astype('Int64').replace(0, np.nan)
        )
    ).sort_values(['usr', 'grp_flag', 'id'])  # ORDER BY
    .groupby(['usr', 'grp_flag'])             # PARTITION BY
    .cumcount().add(1)                        # ROW_NUMBER
)

mytable['rn'] = mytable['rn'].ffill()         # FILL NANs
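
A hedged variant of the same idea, on made-up data: number only the 'new' rows of each user, then forward-fill within each user and fill leading gaps with 0. Grouping the forward-fill is a design choice of this sketch. An ungrouped `ffill` could carry a value over from the previous user when a user's first row is not 'new'.

```python
import pandas as pd

# Hypothetical sample data; user 2 deliberately starts with a non-'new' row.
df = pd.DataFrame({
    "usr": [1, 1, 1, 2, 2, 2],
    "id": [1, 2, 3, 4, 5, 6],
    "grp": ["new", "same", "new", "same", "new", "same"],
})
df = df.sort_values(["usr", "id"])  # ORDER BY

is_new = df["grp"].eq("new")

# ROW_NUMBER among the 'new' rows of each user (the CASE WHEN branch).
rn_new = is_new[is_new].groupby(df.loc[is_new, "usr"]).cumcount().add(1)

# Rows that are not 'new' become NaN after reindexing (the implicit ELSE NULL).
rn = rn_new.reindex(df.index)

# Forward-fill per user, then treat rows before the first 'new' as 0.
df["rn"] = rn.groupby(df["usr"]).ffill().fillna(0).astype("int64")

print(df["rn"].tolist())  # [1, 1, 2, 0, 1, 1]
```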
