Filter a pandas DataFrame for the first N unique values per group

Posted by 6pp0gazn on 2023-09-29 in Other

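I need to filter a large dataset that contains multiple date entries per variable. In this case, I only want to keep the data entered on the first date of each group.
For example, given the dataset below: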

import pandas as pd

dfex = pd.DataFrame({'names':['jim','jim','jim','jim','jim','jim','jim','jim','jim',
                           'bob','bob','bob','bob','bob','bob',
                           'sara','sara','sara','sara','sara','sara','sara','sara','sara','sara'],
                  'dates':['01-01-19','01-01-19','01-01-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-05-19','01-05-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-02-19','01-02-19','01-02-19','01-02-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19']})

# The strings are month-day-year; an explicit format avoids ambiguous parsing
dfex['dates'] = pd.to_datetime(dfex['dates'], format='%m-%d-%y')
dfex

Jim would keep his first 3 rows, Bob his first 2 rows, and Sara her first 4 rows.

Answer #1 (68bkxrlz)

For N = 1 (the original question), use groupby.transform with boolean indexing:

out = dfex[dfex.groupby('names')['dates'].transform('first').eq(dfex['dates'])]

If the initial date can appear again later in the group, and you do not want to keep those later occurrences, use:

out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: s.ne(s.shift()).cumsum()).eq(1)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
9    bob 2019-01-05
10   bob 2019-01-05
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02
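
The two N = 1 variants above return the same rows for `dfex`, so the difference only shows up when a group comes back to its first date later on. The following is a minimal sketch on made-up data (the frame `df` and its values are assumptions for illustration, not part of the original question):

```python
import pandas as pd

# Hypothetical data: group "a" returns to its first date after a gap.
df = pd.DataFrame({
    "names": ["a", "a", "a", "a", "b", "b"],
    "dates": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-05", "2019-01-01",
         "2019-01-03", "2019-01-04"]
    ),
})

g = df.groupby("names")["dates"]

# Variant 1: every row equal to the group's first date, wherever it appears.
by_value = df[g.transform("first").eq(df["dates"])]

# Variant 2: only the leading run of identical dates in each group.
# s.ne(s.shift()) flags each change of value; cumsum numbers the runs from 1.
run_id = g.transform(lambda s: s.ne(s.shift()).cumsum())
leading_run = df[run_id.eq(1)]

print(by_value.index.tolist())     # [0, 1, 3, 4]
print(leading_run.index.tolist())  # [0, 1, 4]
```

Row 3 repeats the first date of group "a" after a different date, so only the first variant keeps it.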

First N unique dates

Regardless of their values (by position only, that is, in order of first appearance):

N = 5
out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: pd.factorize(s)[0] < N)]

Or, for the N earliest dates:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank(method='dense').le(N)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
3    jim 2019-01-05
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02
19  sara 2019-01-05
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08
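
Both approaches give the same result on `dfex` because its dates are already sorted within each group. As a rough sketch of how they diverge otherwise, here is a small made-up frame with unsorted dates (the data and the names `first_seen` and `earliest` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical single-group data whose dates are not in chronological order.
df = pd.DataFrame({
    "names": ["a"] * 5,
    "dates": pd.to_datetime(
        ["2019-01-09", "2019-01-02", "2019-01-09", "2019-01-05", "2019-01-01"]
    ),
})
N = 2
g = df.groupby("names")["dates"]

# pd.factorize numbers distinct values in order of first appearance.
first_seen = df[g.transform(lambda s: pd.factorize(s)[0] < N)]

# A dense rank numbers distinct values in sorted order; ties share a rank.
earliest = df[g.rank(method="dense").le(N)]

print(first_seen.index.tolist())  # [0, 1, 2]
print(earliest.index.tolist())    # [1, 4]
```

The factorize version keeps the first two dates encountered (2019-01-09 and 2019-01-02), while the dense rank keeps the two smallest dates (2019-01-01 and 2019-01-02).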

Last N unique dates (by position):

N = 5
out = dfex[dfex[::-1].groupby('names')['dates']
           .transform(lambda s: pd.factorize(s)[0] < N)]

N latest dates:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank(method='dense', ascending=False).le(N)]

Output:

   names      dates
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
7    jim 2019-01-09
8    jim 2019-01-10
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08
23  sara 2019-01-09
24  sara 2019-01-10
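
The positional "last N" variant builds its mask on the reversed frame, so the mask's index runs in the opposite order to `dfex` and the final selection relies on pandas aligning the two by label (depending on the pandas version, this may emit a reindexing warning). One possible way to make that alignment explicit is sketched below; `last_n_unique` is a hypothetical helper, not a pandas function, and it assumes the frame has a unique index:

```python
import pandas as pd

def last_n_unique(df, group_col, value_col, n):
    """Keep rows whose value is among the last n distinct values of each group.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # Reverse the rows so factorize numbers distinct values from the end.
    mask = (
        df[::-1]
        .groupby(group_col)[value_col]
        .transform(lambda s: pd.factorize(s)[0] < n)
    )
    # Put the mask back in the original row order before filtering.
    return df[mask.reindex(df.index)]

# Made-up example data.
df = pd.DataFrame({
    "names": ["a", "a", "a", "a", "b", "b"],
    "dates": pd.to_datetime(
        ["2019-01-01", "2019-01-05", "2019-01-05", "2019-01-07",
         "2019-01-03", "2019-01-03"]
    ),
})

out = last_n_unique(df, "names", "dates", n=2)
print(out.index.tolist())  # [1, 2, 3, 4, 5]
```

Only row 0 is dropped here: group "a" has three distinct dates and 2019-01-01 is the one furthest from the end, while group "b" has a single distinct date.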
