Pandas groupby and compare rows to find maximum value

Posted by 1szpjjfi on 2022-09-21 in Other

I have a dataframe

a|b|c
-|-|-
one|6|11
one|7|12
two|8|23
two|9|14
three|10|15
three|20|25

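I want to apply groupby on column a and then find the highest value in column c within each group, so that the highest value is flagged, i.e.

a|b|c
-|-|-
one|6|11
one|7|12

Compare the values 11 and 12, then

a|b|c
-|-|-
two|8|23
two|9|14

Compare the values 23 and 14, then

a|b|c
-|-|-
three|10|15
three|20|25

Finally resulting in:

a|b|c|flag
-|-|-|-
one|6|11|no
one|7|12|yes
two|8|23|yes
two|9|14|no
three|10|15|no
three|20|25|yes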

Input DataFrame:

import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "two", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 23, 14, 15, 25],
    # 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']  # desired result
})
df

pandas

Source: https://stackoverflow.com/questions/73783447/pandas-groupby-and-compare-rows-to-find-maximum-value

3 Answers

lokaqttq1#

You can use groupby.transform to get the maximum value of each group, and numpy.where to map True/False to 'yes'/'no':

df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')

Output:

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Intermediates:

df.groupby('a')['c'].transform('max')

0    12
1    12
2    23
3    23
4    25
5    25
Name: c, dtype: int64

df.groupby('a')['c'].transform('max').eq(df['c'])
0    False
1     True
2     True
3    False
4    False
5     True
Name: c, dtype: bool
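
For readers who want to run the transform approach end to end, here is a minimal self-contained sketch. The helper name flag_group_max is hypothetical (it is not part of pandas or of the answer above); it only wraps the same transform('max') and numpy.where calls shown in this answer.

```python
import numpy as np
import pandas as pd


def flag_group_max(df, group_col, value_col):
    """Return a 'yes'/'no' array marking rows that hold their group's maximum."""
    # transform('max') broadcasts each group's maximum back onto every row,
    # so the result is aligned with df and can be compared element-wise.
    group_max = df.groupby(group_col)[value_col].transform("max")
    return np.where(df[value_col].eq(group_max), "yes", "no")


df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "three", "three"],
    "b": [6, 7, 8, 9, 10, 20],
    "c": [11, 12, 23, 14, 15, 25],
})
df["flag"] = flag_group_max(df, "a", "c")
print(df)
```

Because the comparison is vectorized, no Python-level loop over rows is needed. Rows that tie for the maximum within a group are all flagged 'yes'.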

Answered 2022-09-21

e4yzc0pl2#

Use GroupBy.transform with max, compare the result against the same column c, then set yes/no with numpy.where:

df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')

print(df)
       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

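If several rows in one group of a share the maximum value, all of them get yes. If only the first maximum is needed, use the groupby idxmax (here through transform('idxmax')) and compare against df.index: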

df = pd.DataFrame({
    'a':["one","one","one","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,12,14,15,25]
})

df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')

print(df)

       a   b   c flag1 flag2
0    one   6  11    no    no
1    one   7  12   yes   yes
2    one   8  12   yes    no <- difference for match all max or first max
3    two   9  14   yes   yes
4  three  10  15    no    no
5  three  20  25   yes   yes
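
As a possible variation on the first-maximum case, the sketch below calls idxmax on the grouped column directly and tests index membership, instead of going through transform. It assumes the DataFrame index is unique; the column name flag_first is chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "one", "one", "two", "three", "three"],
    "b": [6, 7, 8, 9, 10, 20],
    "c": [11, 12, 12, 14, 15, 25],
})

# idxmax returns, per group, the index label of the first row holding the maximum.
first_max_labels = df.groupby("a")["c"].idxmax()

# Flag only those rows; tied rows after the first one stay 'no'.
# Assumption: the index labels are unique, otherwise isin would match duplicates.
df["flag_first"] = np.where(df.index.isin(first_max_labels), "yes", "no")
print(df)
```

Row 2 ties with row 1 in group one, so only row 1 is flagged.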

Answered 2022-09-21

kt06eoxx3#

One way to do this is as follows:

df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df.groupby('a')['c'].max().values and x['a'] == df.groupby('c')['a'].max().loc[x['c']] else 'no', axis=1)

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

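Breaking down the steps performed above:

1. df['flag'] creates a new column named flag.
2. df.groupby('a')['c'].max() groups by column a with pandas.DataFrame.groupby and finds the maximum value of column c in each group.

df2 = df.groupby('a')['c'].max()

3. Then we check whether the value appears in the Series generated in step 2 and whether it is the maximum of the row's own group. Since df2 is indexed by the values of a, the group maximum is looked up with df2.loc[x['a']].

df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df2.values and x['c'] == df2.loc[x['a']] else 'no', axis=1)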

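Notes:

Checking that the group is the same is key. Without that check the code still works in this particular case, but it fails whenever a non-maximum value of one group equals the maximum of another group (as Mazway mentioned).
As Jezrael indicated in the answer, .apply can be slow, so even though it works it may not be the most convenient way.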
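
The first note can be illustrated with a small made-up example. The data below is hypothetical and chosen so that 12 is the maximum of group one but only a non-maximum value in group two. The sketch also avoids .apply by mapping each row's group label to its maximum, which is one possible vectorized alternative and not the method used in this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 12 is the maximum of group 'one' but not of group 'two'.
df = pd.DataFrame({
    "a": ["one", "one", "two", "two"],
    "c": [11, 12, 12, 23],
})

# Series indexed by group label: one -> 12, two -> 23.
group_max = df.groupby("a")["c"].max()

# Value-only check: is c equal to the maximum of *any* group?
value_only = np.where(df["c"].isin(group_max), "yes", "no")

# Group-aware check: look up the maximum of each row's own group.
group_aware = np.where(df["c"].eq(df["a"].map(group_max)), "yes", "no")

print(value_only.tolist())   # row 2 (two, 12) is flagged by mistake
print(group_aware.tolist())  # row 2 is correctly left as 'no'
```

The value-only check flags the row (two, 12) even though the maximum of group two is 23, which is why the group comparison is needed.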

Answered 2022-09-21
