numpy: Get the most recent date within each unique group using Pandas [duplicate]

Posted by r7knjye2 on 2023-08-05 in Other

This question already has answers here:

Get the row(s) which have the max value in groups using groupby (15 answers)
Closed 28 days ago.
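I want to group by multiple columns, then keep only the row with the most recent date for each unique name value, along with all of its associated columns.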

Data

ID      name    size    stat    days    month   date        year
db11AA  cc      5       TRUE    10      June    6/1/2023    2023
db11AA  kj      9       FALSE   10      June    6/5/2023    2023
db11AA  cc      7       TRUE    10      June    6/2/2023    2023
db11AA  aa      2       TRUE    60      June    6/2/2023    2023
db22BB  bb      1       TRUE    10      June    6/30/2023   2023
db22BB  vl      2       FALSE   60      June    6/29/2023   2023
db11BB  ss      2       FALSE   10      April   4/2/2023    2023
db11BB  ss      2       FALSE   10      April   4/1/2023    2023
db67CC  la      1       FALSE   60      June    6/3/2024    2024
db67CC  la      0       FALSE   60      June    6/5/2024    2024
db11AA  cc      20      TRUE    10      May     5/1/2023    2024
db11AA  kj      30      FALSE   10      May     5/5/2023    2024


Desired

ID      name    size    stat    days    month   date        year
db11AA  cc      7       TRUE    10      June    6/2/2023    2023
db11AA  kj      9       FALSE   10      June    6/5/2023    2023
db11AA  aa      2       TRUE    60      June    6/2/2023    2023
db22BB  bb      1       TRUE    10      June    6/30/2023   2023
db22BB  vl      2       FALSE   60      June    6/29/2023   2023
db11BB  ss      2       FALSE   10      April   4/2/2023    2023
db67CC  la      0       FALSE   60      June    6/5/2024    2024
db11AA  cc      20      TRUE    10      May     5/1/2023    2024
db11AA  kj      30      FALSE   10      May     5/5/2023    2024


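Logic: IDs may repeat, but within each ID and month every name value must appear only once, showing the row with its most recent date.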

What I'm doing

# Group the DataFrame by 'ID' and 'month' and select the row with the maximum 'date' value
df = df.groupby(['ID', 'month']).apply(lambda x: x.loc[x['date'].idxmax()])


I think I should be using a lambda, but I'm not sure, because the script above still returns repeated rows. Any suggestion is appreciated.


Answer #1 (8ehkhllq)

IIUC, you can try:

df['date'] = pd.to_datetime(df['date'])

out = (
    df.groupby(["ID", "name", "month"], sort=False)
    .apply(lambda x: x.loc[x["date"].idxmax()])
    .reset_index(drop=True)
)
print(out)

Prints:

       ID name  size   stat  days  month       date  year
0  db11AA   cc     7   True    10   June 2023-06-02  2023
1  db11AA   kj     9  False    10   June 2023-06-05  2023
2  db11AA   aa     2   True    60   June 2023-06-02  2023
3  db22BB   bb     1   True    10   June 2023-06-30  2023
4  db22BB   vl     2  False    60   June 2023-06-29  2023
5  db11BB   ss     2  False    10  April 2023-04-02  2023
6  db67CC   la     0  False    60   June 2024-06-05  2024
7  db11AA   cc    20   True    10    May 2023-05-01  2024
8  db11AA   kj    30  False    10    May 2023-05-05  2024
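
A note on why the `pd.to_datetime` step matters: the `date` column in the question holds strings in M/D/YYYY form, and strings compare character by character rather than chronologically. The minimal sketch below uses a hypothetical two-value sample (not the full data) and assumes the `%m/%d/%Y` format seen in the question.

```python
import pandas as pd

# Hypothetical two-value sample in the question's M/D/YYYY format.
dates = pd.Series(["6/5/2023", "6/30/2023"])

# Compared as plain strings, "6/5/2023" ranks above "6/30/2023"
# because the character '5' is greater than '3'.
string_max = max(dates.tolist())

# Parsed as datetimes, June 30 is correctly the later date.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
datetime_max = parsed.max()

print(string_max)    # 6/5/2023
print(datetime_max)  # 2023-06-30 00:00:00
```

So taking the maximum on the raw strings could pick the wrong row for groups whose dates differ in digit count.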


Answer #2 (nwwlzxa7)

You don't need apply here, but you do also have to group by name, as I suggested in your previous (deleted) question:

idx = (df.assign(dt=pd.to_datetime(df['date']))
         .groupby(['name', 'ID', 'month'], sort=False)['dt']
         .idxmax())

filtered_df = df.loc[idx]

Output:

>>> filtered_df
        ID name  size   stat  days  month       date  year
2   db11AA   cc     7   True    10   June   6/2/2023  2023
1   db11AA   kj     9  False    10   June   6/5/2023  2023
3   db11AA   aa     2   True    60   June   6/2/2023  2023
4   db22BB   bb     1   True    10   June  6/30/2023  2023
5   db22BB   vl     2  False    60   June  6/29/2023  2023
6   db11BB   ss     2  False    10  April   4/2/2023  2023
9   db67CC   la     0  False    60   June   6/5/2024  2024
10  db11AA   cc    20   True    10    May   5/1/2023  2024
11  db11AA   kj    30  False    10    May   5/5/2023  2024
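
For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same `idxmax` approach. Loading the question's sample through `io.StringIO` with whitespace-separated parsing is an assumption made for illustration, as is the explicit `%m/%d/%Y` date format.

```python
import io
import pandas as pd

# The question's sample data, inlined for illustration (assumed whitespace-separated).
raw = """ID name size stat days month date year
db11AA cc 5 TRUE 10 June 6/1/2023 2023
db11AA kj 9 FALSE 10 June 6/5/2023 2023
db11AA cc 7 TRUE 10 June 6/2/2023 2023
db11AA aa 2 TRUE 60 June 6/2/2023 2023
db22BB bb 1 TRUE 10 June 6/30/2023 2023
db22BB vl 2 FALSE 60 June 6/29/2023 2023
db11BB ss 2 FALSE 10 April 4/2/2023 2023
db11BB ss 2 FALSE 10 April 4/1/2023 2023
db67CC la 1 FALSE 60 June 6/3/2024 2024
db67CC la 0 FALSE 60 June 6/5/2024 2024
db11AA cc 20 TRUE 10 May 5/1/2023 2024
db11AA kj 30 FALSE 10 May 5/5/2023 2024
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Parse the dates into a temporary column so the original strings stay untouched,
# then find the index label of the latest date in each (name, ID, month) group.
idx = (
    df.assign(dt=pd.to_datetime(df["date"], format="%m/%d/%Y"))
      .groupby(["name", "ID", "month"], sort=False)["dt"]
      .idxmax()
)

# Select the winning rows from the original frame by index label.
filtered_df = df.loc[idx]
print(filtered_df)
```

With `sort=False` the groups come back in order of first appearance, which is why the rows follow the order of the desired output rather than being sorted by key.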
