Python: how to get the cumulative maximum of one column based on another column

zpf6vheq  posted on 2023-04-28 in Python

I have a DataFrame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "realization_id": np.repeat([0, 1], 6),
    "sample_size": np.tile([0, 1, 2], 4),
    "num_obs": np.tile(np.repeat([25, 100], 3), 2),
    "accuracy": [0.8, 0.7, 0.8, 0.6, 0.7, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7],
    "prob": [0.94, 0.96, 0.95, 0.98, 0.93, 0.92, 0.90, 0.92, 0.95, 0.9, 0.91, 0.92]
})

df["accum_max_prob"] = df.groupby(["realization_id", "num_obs"])["prob"].cummax()

I would like to know how to create a column with the following output:

df["desired_accuracy"] = [0.8, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7]

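Each entry of desired_accuracy equals the accuracy value from the row that holds the highest prob seen so far within its group (which is why I created accum_max_prob).
For example, the first value is 0.8 because there is no earlier data, but the next value is 0.7 because the prob in the second row is greater than in the first. The third value stays unchanged because the third prob is lower than the second, so desired_accuracy is not updated. The criterion resets for every (realization_id, num_obs) pair.
How can I achieve this with pandas in a vectorized way?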
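
For reference, the rule described above can be written as a plain loop over each group. This is only a non-vectorized sketch for checking vectorized answers against, not part of the original question: the helper name accuracy_at_running_max is hypothetical, and it assumes that only a strictly higher prob replaces the stored accuracy, which the question does not state for ties.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "realization_id": np.repeat([0, 1], 6),
    "sample_size": np.tile([0, 1, 2], 4),
    "num_obs": np.tile(np.repeat([25, 100], 3), 2),
    "accuracy": [0.8, 0.7, 0.8, 0.6, 0.7, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7],
    "prob": [0.94, 0.96, 0.95, 0.98, 0.93, 0.92, 0.90, 0.92, 0.95, 0.9, 0.91, 0.92],
})


def accuracy_at_running_max(group: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: accuracy of the row with the highest prob so far."""
    best_prob = -np.inf
    best_accuracy = np.nan
    out = []
    for prob, accuracy in zip(group["prob"], group["accuracy"]):
        # Assumption: only a strictly higher prob updates the stored accuracy.
        if prob > best_prob:
            best_prob, best_accuracy = prob, accuracy
        out.append(best_accuracy)
    return pd.Series(out, index=group.index)


# Apply the loop to each (realization_id, num_obs) group and restore row order.
parts = [
    accuracy_at_running_max(group)
    for _, group in df.groupby(["realization_id", "num_obs"])
]
reference = pd.concat(parts).sort_index()
```

On the sample data, reference holds the same values as the desired_accuracy list given in the question.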

3vpjnl9f  #1

It looks like you want:

df['desired_accuracy'] = df['accuracy'].mask(df['prob'].lt(df['accum_max_prob'])).ffill()

Output:

    realization_id  sample_size  num_obs  accuracy  prob  accum_max_prob  desired_accuracy
0                0            0       25       0.8  0.94            0.94               0.8
1                0            1       25       0.7  0.96            0.96               0.7
2                0            2       25       0.8  0.95            0.96               0.7
3                0            0      100       0.6  0.98            0.98               0.6
4                0            1      100       0.7  0.93            0.98               0.6
5                0            2      100       0.5  0.92            0.98               0.6
6                1            0       25       0.6  0.90            0.90               0.6
7                1            1       25       0.7  0.92            0.92               0.7
8                1            2       25       0.8  0.95            0.95               0.8
9                1            0      100       0.7  0.90            0.90               0.7
10               1            1      100       0.9  0.91            0.91               0.9
11               1            2      100       0.7  0.92            0.92               0.7
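
To make the one-liner above easier to follow, here is the same computation split into its three steps. It is a sketch of how mask and ffill combine; the intermediate names below_record and masked are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "realization_id": np.repeat([0, 1], 6),
    "num_obs": np.tile(np.repeat([25, 100], 3), 2),
    "accuracy": [0.8, 0.7, 0.8, 0.6, 0.7, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7],
    "prob": [0.94, 0.96, 0.95, 0.98, 0.93, 0.92, 0.90, 0.92, 0.95, 0.9, 0.91, 0.92],
})
df["accum_max_prob"] = df.groupby(["realization_id", "num_obs"])["prob"].cummax()

# Step 1: rows whose prob is below the running maximum did not set a new record.
below_record = df["prob"].lt(df["accum_max_prob"])

# Step 2: mask() replaces the accuracy on those rows with NaN.
masked = df["accuracy"].mask(below_record)

# Step 3: ffill() carries the last record-setting accuracy forward into the gaps.
df["desired_accuracy"] = masked.ffill()
```

The first row of every group always equals its own running maximum (as long as prob is not missing), so the forward fill never reaches back into the previous group.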

bz4sfanl  #2

Try:

df["desired_accuracy_2"] = (
    df.groupby(["realization_id", "num_obs", "accum_max_prob"])["accuracy"]
    .transform("first")
)

Prints:

    realization_id  sample_size  num_obs  accuracy  prob  accum_max_prob  desired_accuracy  desired_accuracy_2
0                0            0       25       0.8  0.94            0.94               0.8                 0.8
1                0            1       25       0.7  0.96            0.96               0.7                 0.7
2                0            2       25       0.8  0.95            0.96               0.7                 0.7
3                0            0      100       0.6  0.98            0.98               0.6                 0.6
4                0            1      100       0.7  0.93            0.98               0.6                 0.6
5                0            2      100       0.5  0.92            0.98               0.6                 0.6
6                1            0       25       0.6  0.90            0.90               0.6                 0.6
7                1            1       25       0.7  0.92            0.92               0.7                 0.7
8                1            2       25       0.8  0.95            0.95               0.8                 0.8
9                1            0      100       0.7  0.90            0.90               0.7                 0.7
10               1            1      100       0.9  0.91            0.91               0.9                 0.9
11               1            2      100       0.7  0.92            0.92               0.7                 0.7
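
One point worth checking before choosing between this groupby/transform approach and the mask/where plus ffill approach from the other answers is how ties in prob are handled. The sample data has no ties inside a group, so both give the same result there. The sketch below uses made-up data (the frame tie and its group column are hypothetical) in which two rows share the maximum.

```python
import pandas as pd

# Hypothetical single-group data: rows 0 and 1 have the same prob.
tie = pd.DataFrame({
    "group": ["a", "a", "a"],
    "accuracy": [0.5, 0.6, 0.7],
    "prob": [0.9, 0.9, 0.8],
})
tie["accum_max_prob"] = tie.groupby("group")["prob"].cummax()

# transform("first") keeps the accuracy of the first row that reached the maximum.
first_at_max = tie.groupby(["group", "accum_max_prob"])["accuracy"].transform("first")

# where(...).ffill() keeps the accuracy of the most recent row that equals the maximum.
latest_at_max = tie["accuracy"].where(tie["prob"].eq(tie["accum_max_prob"])).ffill()

print(first_at_max.tolist())   # [0.5, 0.5, 0.5]
print(latest_at_max.tolist())  # [0.5, 0.6, 0.6]
```

Which of the two is right depends on whether a row that merely equals the current maximum should update the result; the question does not say.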

91zkwejq  #3

Just keep the accuracy values where prob equals accum_max_prob, then ffill:

df['desired_accuracy'] = df['accuracy'].where(df['accum_max_prob'].eq(df['prob'])).ffill()

Note that you do not need a per-group ffill, **provided that prob has no missing values to begin with.** If it does have missing values, use df['accuracy'].where(df['accum_max_prob'].eq(df['prob'])).groupby([df["realization_id"], df["num_obs"]]).ffill() instead.
Output:

    realization_id  sample_size  num_obs  accuracy  prob  accum_max_prob  desired_accuracy
0                0            0       25       0.8  0.94            0.94               0.8
1                0            1       25       0.7  0.96            0.96               0.7
2                0            2       25       0.8  0.95            0.96               0.7
3                0            0      100       0.6  0.98            0.98               0.6
4                0            1      100       0.7  0.93            0.98               0.6
5                0            2      100       0.5  0.92            0.98               0.6
6                1            0       25       0.6  0.90            0.90               0.6
7                1            1       25       0.7  0.92            0.92               0.7
8                1            2       25       0.8  0.95            0.95               0.8
9                1            0      100       0.7  0.90            0.90               0.7
10               1            1      100       0.9  0.91            0.91               0.9
11               1            2      100       0.7  0.92            0.92               0.7
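
The caveat about missing values can be illustrated with a small made-up frame (the name with_nan and its group column are hypothetical) in which the second group starts with a missing prob. This is a sketch of the failure mode, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "b" begins with a missing prob.
with_nan = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "accuracy": [0.5, 0.6, 0.7, 0.8],
    "prob": [0.9, 0.8, np.nan, 0.7],
})
with_nan["accum_max_prob"] = with_nan.groupby("group")["prob"].cummax()

# NaN never compares equal, so the first row of group "b" is blanked out as well.
kept = with_nan["accuracy"].where(with_nan["accum_max_prob"].eq(with_nan["prob"]))

# A plain ffill pulls the last value of group "a" into group "b".
global_fill = kept.ffill()

# A per-group ffill leaves that row as NaN instead.
per_group_fill = kept.groupby(with_nan["group"]).ffill()

print(global_fill.tolist())     # [0.5, 0.5, 0.5, 0.8]
print(per_group_fill.tolist())  # [0.5, 0.5, nan, 0.8]
```

With the plain ffill, the third row silently receives an accuracy from a different group, which is why the grouped version is the safer choice when prob may contain NaN.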

luaexgnf  #4

Do it like this:
df['desired_accuracy'] = df['accuracy'].where(df['accum_max_prob'].eq(df['prob'])).ffill()
