Calculating a weighted average with a pandas grouped .agg

am46iovg  posted on 2023-02-20 in Other

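I want to use the .agg() function in pandas to compute, per group, the mean of one column and the weighted average of another column. I know of a few solutions, but they are not very concise.
One solution has already been posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still does not seem very flexible, because the weights column is hard-coded in the lambda function definition. Ideally, I would like syntax closer to this: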

# Desired (hypothetical) syntax; this is not valid pandas code
(
df
.groupby(['group'])
.agg(avg_x=('x', 'mean'),
     wt_avg_y=('y', 'weighted_mean', weights='weights'))
)

Here is a complete example, with code that seems overly complicated:

import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>    group   x   y   weights
#>>> 0      a   1   5   0.75
#>>> 1      a   2   6   0.25
#>>> 2      b   3   7   0.75
#>>> 3      b   4   8   0.25

# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})

# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25

Answer #1 by xxls0lw8

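Using the .apply() method on the entire DataFrame is the simplest solution I could come up with that does *not* hard-code the column names in a separate function definition.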

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)
# final output
summary
#>>>    group   avg_x   wt_avg_y
#>>> 0      a   1.50    5.25
#>>> 1      b   3.50    7.25
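
For comparison, here is a minimal sketch of how the .agg() form from the question could be approximated with named aggregation. The helper name weighted_mean and its arguments are hypothetical (they are not part of pandas), and the sketch assumes the DataFrame index is unique, since the weights are looked up by index label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25],
})

def weighted_mean(frame, weights_col):
    """Hypothetical helper: build an aggregator weighted by frame[weights_col]."""
    def _agg(values):
        # `values` is one group's Series; its index labels select the
        # matching rows of the weights column (requires a unique index).
        return np.average(values, weights=frame.loc[values.index, weights_col])
    return _agg

summary = (
    df
    .groupby('group')
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', weighted_mean(df, 'weights')))
    .reset_index()
)
print(summary)
```

The weights column is now an argument rather than a name buried in a lambda, at the cost of closing over the original DataFrame.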

Answer #2 by brqmpdu1

How about this:

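# Assumes df and np from the question above
grouped = df.groupby('group')

def wavg(group):
    # Work on a copy so the caller's data is not modified
    group = group.copy()
    # Broadcast the per-group statistics onto every row of the group
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group['weights'])
    return group

# Returns one row per input row, not one row per group
grouped.apply(wavg)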

Answer #3 by zdwk9cvp

Try this:

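# Normalize the weights so they sum to 1 within each group.
# Note: this overwrites df["weights"] and df["y"] in place.
df["weights"] = df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"] = df["y"].mul(df["weights"])

# Summing the weighted contributions gives the weighted average
res = df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})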

Output:

  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25
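
The snippet in this answer overwrites the weights and y columns of df. If that is not wanted, the same normalize-then-sum idea can be sketched without modifying df, using groupby(...).transform('sum') for the per-group weight totals. The variable name norm_w is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25],
})

# Per-row weight divided by the total weight of the row's group
norm_w = df['weights'] / df.groupby('group')['weights'].transform('sum')

res = (
    df
    .assign(y=df['y'] * norm_w)   # weighted contributions, on a new frame
    .groupby('group', as_index=False)
    .agg({'x': 'mean', 'y': 'sum'})
)
print(res)
```

Here assign() returns a new DataFrame, so the original y and weights columns stay intact for later use.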

Answer #4 by 4bbkushb

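Since the weights sum to 1 within each group, you can assign a new column and group by as usual. The version below also divides by the per-group sum of the weights, so it still works when the weights are not normalized: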

(df.assign(wt_avg_y=df['y']*df['weights'])
  .groupby('group')
  .agg({'x': 'mean', 'wt_avg_y':'sum', 'weights':'sum'})
  .assign(wt_avg_y=lambda x: x['wt_avg_y']/ x['weights'])
)

Output:

         x  wt_avg_y  weights
group                        
a      1.5      5.25      1.0
b      3.5      7.25      1.0
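
To see why the final division matters, here is a small check that uses weights that do not sum to 1. The weights 3 and 1 are made up for illustration; they keep the same 3:1 ratio as the sample data, so the weighted average, sum(w * y) / sum(w), should come out the same.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [3, 1, 3, 1],   # assumed weights; they sum to 4 per group
})

res = (
    df.assign(wt_avg_y=df['y'] * df['weights'])
      .groupby('group')
      .agg({'x': 'mean', 'wt_avg_y': 'sum', 'weights': 'sum'})
      # sum(w * y) / sum(w) within each group
      .assign(wt_avg_y=lambda t: t['wt_avg_y'] / t['weights'])
)
print(res)

# Cross-check group 'a' against np.average directly
check_a = np.average([5, 6], weights=[3, 1])
```

For group a this gives (5*3 + 6*1) / 4 = 5.25, matching the result with the normalized weights 0.75 and 0.25.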

Answer #5 by rm5edbpk

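Steven M. Mortimer's solution is clean and easy to read. Alternatively, you can pass a dict to pd.Series(), which removes the need for the index= argument. In my opinion, this reads better.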

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x' : np.mean(x['x']),
            'wt_avg_y':  np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)
