How do I group by multiple columns in a pandas DataFrame?

Posted by bkhjykvo on 2023-09-29 in Other

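I'm working with board game data from BoardGameGeek, and I want to create a DataFrame that groups the board games by minimum player count and by category.
Here are the column names: ['name', 'category', 'playtime', 'playtime_num', 'avg_rating', 'num_ratings', 'min_players'].
I first created a new column called 'support_solo', based on 'min_players', that indicates whether a board game supports solo play: supports solo, does not support solo.
Then I created a groupby object: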

grouped = popular.groupby(['support_solo', 'category'])
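
For readers who want to reproduce the setup, here is a minimal sketch of how the 'support_solo' column could be built. The sample rows, the label strings, and the rule "min_players equals 1 means solo" are assumptions for illustration; the question does not show the original code for this step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real `popular` frame comes from BoardGameGeek data.
popular = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "category": ["Strategy", "Strategy", "Party", "Party"],
    "min_players": [1, 2, 1, 3],
})

# Assumption: a game supports solo play when min_players equals 1.
popular["support_solo"] = np.where(
    popular["min_players"] == 1, "supports solo", "does not support solo"
)

# Group by both columns; size() counts the games in each (support_solo, category) pair.
grouped = popular.groupby(["support_solo", "category"])
group_sizes = grouped.size()
```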

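After that, I called basic aggregation functions to get a breakdown of the number of games in each category and in each solo/not-solo group, along with the averages of other fields such as playtime. However, I'm having trouble getting the game with the most ratings in each category. I used a helper function and a dictionary of all the groupby aggregations: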

def game_with_highest_ratings(group):
    max_ratings_index = group['num_ratings'].idxmax()
    return group.loc[max_ratings_index, 'name']

aggregations = {
    'name': 'count', # total number of games in each category
    'num_ratings': game_with_highest_ratings, # game with the most ratings in each category
    'avg_rating': 'mean', # average rating of games in each category
    'playtime_num': 'mean', # average playtime of games in each category
}

grouped_result = grouped.agg(aggregations)

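I keep getting KeyError: 'num_ratings', and I don't know how to fix it. I have already checked that the column names are correct. How can I fix this, or is there another approach?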

pandas

Source: https://stackoverflow.com/questions/77165640/how-do-i-groupby-multiple-columns-in-a-pandas-dataframe

1 Answer

mwecs4sa1#

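When you give agg a dictionary, it passes each function only a Series (the column named by the key), not the full DataFrame. Indexing with group['num_ratings'] therefore searches the Series index for that label (group is already 1D), which is what raises the KeyError. You cannot access the other columns from inside the function.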
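
A quick way to check this is to record what the function actually receives. The sketch below uses made-up sample data and a hypothetical helper named inspect_group; neither comes from the question.

```python
import pandas as pd

# Hypothetical sample data with the same column names as in the question.
df = pd.DataFrame({
    "category": ["Strategy", "Strategy", "Party"],
    "name": ["Game A", "Game B", "Game C"],
    "num_ratings": [120, 300, 80],
})

received = []  # type name of every object the function is called with

def inspect_group(group):
    # Record the type, then return an ordinary aggregate so agg can finish.
    received.append(type(group).__name__)
    return group.max()

result = df.groupby("category").agg({"num_ratings": inspect_group})
```

Every entry in received is "Series", so the function only ever sees the num_ratings values of one group and has no 'name' column to look up.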
The most efficient workaround is probably to aggregate with idxmax and then post-process the result:

aggregations = {
    'name': 'count', # total number of games in each category
    'num_ratings': 'idxmax', # game with the most ratings in each category
    'avg_rating': 'mean', # average rating of games in each category
    'playtime_num': 'mean', # average playtime of games in each category
}

grouped_result = grouped.agg(aggregations)
grouped_result['num_ratings'] = grouped_result['num_ratings'].map(popular['name'])
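
To see the whole approach run end to end, here is a self-contained sketch on a small made-up frame. The sample rows and the 'solo' / 'not solo' labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data; the real `popular` frame has many more rows.
popular = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D", "Game E"],
    "category": ["Strategy", "Strategy", "Party", "Party", "Party"],
    "support_solo": ["solo", "solo", "not solo", "not solo", "solo"],
    "num_ratings": [120, 300, 80, 50, 10],
    "avg_rating": [7.0, 8.0, 6.0, 6.5, 5.0],
    "playtime_num": [60, 90, 30, 20, 15],
})

grouped = popular.groupby(["support_solo", "category"])

grouped_result = grouped.agg({
    "name": "count",          # number of games in each group
    "num_ratings": "idxmax",  # index label of the most-rated game in each group
    "avg_rating": "mean",
    "playtime_num": "mean",
})

# Translate each index label into the game name stored at that label.
grouped_result["num_ratings"] = grouped_result["num_ratings"].map(popular["name"])
```

Note that the lookup relies on popular having a unique index, since idxmax returns index labels. After the mapping, the 'num_ratings' column holds game names rather than counts, so you may want to rename it.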

A less clean solution would be to use a hard-coded external reference to the DataFrame inside the function:

def game_with_highest_ratings(group):
    return popular.at[group.idxmax(), 'name']
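
As a rough illustration of how that function plugs into agg, here is a sketch on made-up data. The sample rows are assumptions; the function body is the one shown above.

```python
import pandas as pd

# Hypothetical sample data; `popular` must exist at module level for the function to see it.
popular = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "category": ["Strategy", "Strategy", "Party"],
    "num_ratings": [120, 300, 80],
})

def game_with_highest_ratings(group):
    # `group` is the num_ratings Series of one group; idxmax gives a label in `popular`.
    return popular.at[group.idxmax(), "name"]

result = popular.groupby("category").agg({"num_ratings": game_with_highest_ratings})
```

This works because the group Series keeps the original index labels, but the function is tied to the global name popular and breaks if it is reused with a different frame.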

Answered 2023-09-29
