pandas: normalizing value counts by group

4nkexdtk posted on 2023-01-07 in Other

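I have a data table in which companies have different numbers of locations, and each key has different values.

| Key | Company | Value | Location |
| --- | --- | --- | --- |
| Start | Burgers | Monday | Sydney |
| Start | Burgers | Sunday | Brisbane |
| Start | Donuts | Monday | Sydney |

I want to count the distinct values, but normalized so that companies with multiple locations don't skew the results. The output I'm hoping for looks like this:

| Value | Count |
| --- | --- |
| Monday | 1.5 |
| Sunday | 0.5 |

I keep banging my head against how to approach this, so any help is much appreciated.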

iecba09b (answer #1)

I think you're looking for something like this:

```python
df1 = df.groupby(['Key', 'Value', 'Company'], as_index=False).agg({'Location': 'count'})
df1['Total Locations'] = df1.groupby(['Key', 'Company'])['Location'].transform('sum')
df1['Location Rate'] = df1['Location'] / df1['Total Locations']
final_df = df1.groupby(['Key', 'Value'], as_index=False).agg({'Location Rate': 'sum'})
final_df
```
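**Output:**

| Key | Value | Location Rate |
| --- | --- | --- |
| Start | Monday | 1.5 |
| Start | Sunday | 0.5 |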
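The 1.5 and 0.5 from the question's expected output can be checked by hand. The arithmetic is written out below as a sketch. The symbols $n_{c,v}$ (number of locations of company $c$ that have value $v$) and $n_c$ (total number of locations of company $c$ within the same key) are notation introduced here for illustration and do not appear in the original answer.

```latex
\[
\operatorname{count}(v) \;=\; \sum_{c} \frac{n_{c,v}}{n_{c}}
\]
\[
\operatorname{count}(\text{Monday})
  = \underbrace{\tfrac{1}{2}}_{\text{Burgers}} + \underbrace{\tfrac{1}{1}}_{\text{Donuts}}
  = 1.5,
\qquad
\operatorname{count}(\text{Sunday})
  = \underbrace{\tfrac{1}{2}}_{\text{Burgers}}
  = 0.5
\]
```

Each company contributes a total of exactly 1 across all values, so the counts sum to the number of companies (here 2) rather than the number of rows (3).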

Full example

Below is the full example, including the imports, the sample data from the question, and comments on each step:

```python
# == Necessary Imports ==================================================
import pandas as pd

# == Example Data =======================================================
df = pd.DataFrame(
    [['Start', 'Burgers', 'Monday', 'Sydney'],
     ['Start', 'Burgers', 'Sunday', 'Brisbane'],
     ['Start', 'Donuts', 'Monday', 'Sydney']],
    columns=['Key', 'Company', 'Value', 'Location'],
)

# == Solution ===========================================================

# 1. Count the number of "Locations" that each set of ['Key', 'Value', 'Company'] has.
df1 = df.groupby(['Key', 'Value', 'Company'], as_index=False).agg({'Location': 'count'})
# df1:
#      Key   Value  Company  Location
# 0  Start  Monday  Burgers         1
# 1  Start  Monday   Donuts         1
# 2  Start  Sunday  Burgers         1

# 2. Using the `.transform` method, compute the total number
#    of Locations that each set of ['Key', 'Company'] has.
df1['Total Locations'] = df1.groupby(['Key', 'Company'])['Location'].transform('sum')
# df1:
#      Key   Value  Company  Location  Total Locations
# 0  Start  Monday  Burgers         1                2
# 1  Start  Monday   Donuts         1                1
# 2  Start  Sunday  Burgers         1                2

# 3. Divide the Location count by the total number of locations
#    previously obtained.
df1['Location Rate'] = df1['Location'] / df1['Total Locations']
# df1:
#      Key   Value  Company  Location  Total Locations  Location Rate
# 0  Start  Monday  Burgers         1                2            0.5
# 1  Start  Monday   Donuts         1                1            1.0
# 2  Start  Sunday  Burgers         1                2            0.5

# 4. Group by each set of 'Key' and 'Value' and sum all the 'Location Rate'
#    values obtained from the previous step.
final_df = df1.groupby(['Key', 'Value'], as_index=False).agg({'Location Rate': 'sum'})
# final_df:
#      Key   Value  Location Rate
# 0  Start  Monday            1.5
# 1  Start  Sunday            0.5
```
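The same result can also be reached by weighting rows directly instead of counting and then dividing. This is a different formulation from the steps in the answer, shown only as a minimal sketch. It assumes each row of `df` represents exactly one location (no duplicate rows and no missing `Location` values), and the names `weights`, `Weight`, and `weighted_df` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['Start', 'Burgers', 'Monday', 'Sydney'],
     ['Start', 'Burgers', 'Sunday', 'Brisbane'],
     ['Start', 'Donuts', 'Monday', 'Sydney']],
    columns=['Key', 'Company', 'Value', 'Location'],
)

# Assumption: one row per location. Each row then gets a weight of
# 1 / (number of rows for its Key/Company pair), so every company
# contributes a total weight of exactly 1.
weights = 1 / df.groupby(['Key', 'Company'])['Location'].transform('count')

# Sum the per-row weights for each Key/Value pair.
weighted_df = (
    df.assign(Weight=weights)
    .groupby(['Key', 'Value'], as_index=False)['Weight']
    .sum()
)
print(weighted_df)
```

If the data can contain repeated rows for the same location, this shortcut would overweight them, and the count-then-divide steps are the safer starting point.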

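Solution using method chaining

If you'd rather not create intermediate pandas DataFrames such as df1, you can combine all the steps into a single method chain: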

```python
final_df = (
    df
    .groupby(['Key', 'Value', 'Company'], as_index=False)
    .agg({'Location': 'count'})
    .assign(
        Total_Locations=lambda xdf: xdf.groupby(['Key', 'Company'])['Location'].transform('sum')
    )
    .assign(Location_Rate=lambda xdf: xdf['Location'] / xdf['Total_Locations'])
    .groupby(['Key', 'Value'], as_index=False)
    .agg({'Location_Rate': 'sum'})
)
```
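**Note:** Long method chains make your code harder to debug, and make it harder for others to understand what you are trying to achieve.

**Note 2:** The method pandas.DataFrame.assign lets you reference columns created in an earlier step of the chain. For example, when creating the Location_Rate column, the second .assign references Total_Locations, which was created by the first .assign.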
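One way to ease the debugging concern is to inspect intermediate results without breaking the chain, using `DataFrame.pipe`. The sketch below is only an illustration: `peek` is a hypothetical helper written for this example, not part of pandas, and where to place the calls is a matter of taste.

```python
import pandas as pd


def peek(frame, label):
    """Hypothetical helper: print a labelled snapshot and return the frame unchanged."""
    print(f"--- {label} ---")
    print(frame)
    return frame


df = pd.DataFrame(
    [['Start', 'Burgers', 'Monday', 'Sydney'],
     ['Start', 'Burgers', 'Sunday', 'Brisbane'],
     ['Start', 'Donuts', 'Monday', 'Sydney']],
    columns=['Key', 'Company', 'Value', 'Location'],
)

final_df = (
    df
    .groupby(['Key', 'Value', 'Company'], as_index=False)
    .agg({'Location': 'count'})
    .pipe(peek, 'after counting locations')
    .assign(
        Total_Locations=lambda xdf: xdf.groupby(['Key', 'Company'])['Location'].transform('sum')
    )
    # This second .assign can see Total_Locations because it runs
    # on the frame returned by the first .assign.
    .assign(Location_Rate=lambda xdf: xdf['Location'] / xdf['Total_Locations'])
    .pipe(peek, 'after computing rates')
    .groupby(['Key', 'Value'], as_index=False)
    .agg({'Location_Rate': 'sum'})
)
```

Because `peek` returns its input untouched, the `.pipe(peek, ...)` lines can be removed once the chain behaves as expected, without changing the result.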
