pandas: how to group by the elements of a list

Posted by hfsqlsce on 2023-06-20 in Other

I have a dataframe that looks like this:

81883       2011000011  ...  [South Sturgeon, Creek]
81884       2011000022  ...        [Meadowood]
81885       2011000016  ...   [South, Portage]
81886       2011000011  ...  [North Sturgeon, Creek]

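I want to group rows that share a common word in the last column (named Locations); the words are the comma-separated values in the Locations column. For instance, in the example above I want to group by Creek. When no common word is found, the row should stay as it is (or, better, have its values joined into a string). I tried: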

def get_grp(list_current_row, df, column_location):
    rows_index_to_groupby = []
    for string_element in list_current_row:
        for idx, row in enumerate(df[column_location].values):
            if row != list_current_row and string_element in row:
                rows_index_to_groupby.append(idx)
    return rows_index_to_groupby

grouped_dataframe = resulting_dataframe.groupby(lambda x: [resulting_dataframe[column_location][i] for i in get_grp(x, resulting_dataframe, column_location)])

The expected output would be:

Locations
Creek             0  Creek       81886       2011000011  ...
                  1  Creek       81883       2011000011  ...
South, Portage    2  South, Portage      81885       2011000016  ...
Meadowood         3  Meadowood       81884    2011000022

3phpmpom1#

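Although it is not exactly what was asked, the following may get you what you want. It simply extracts the last element of each location value and assigns it to the index: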

import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886], 
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"], ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

df.index = df.location.str[-1]
print(df)

Output:

           number        date                 location
location                                              
Creek       81883  2011000011  [South Sturgeon, Creek]
Meadowood   81884  2011000022              [Meadowood]
Portage     81885  2011000016         [South, Portage]
Creek       81886  2011000011  [North Sturgeon, Creek]

Now you can simply get all the Creek entries, for example:

df.loc['Creek']
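
One caveat worth knowing when looking up labels in a non-unique index like this one: a label that matches several rows gives back a DataFrame, while a label that matches exactly one row gives back a Series. The sketch below rebuilds the example frame from this answer to show the difference; the variable names `creek`, `single` and `single_df` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886],
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"],
                 ["South", "Portage"], ["North Sturgeon", "Creek"]],
})
df.index = df.location.str[-1]

creek = df.loc['Creek']            # label matches two rows -> DataFrame
single = df.loc['Meadowood']       # label matches one row  -> Series
single_df = df.loc[['Meadowood']]  # list of labels         -> DataFrame

print(type(creek).__name__, type(single).__name__, type(single_df).__name__)
```

If downstream code expects a DataFrame regardless of how many rows match, passing the label inside a list is probably the safer form.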

Since the index has the same name as the column, "location", you may want to rename the index:

df.index.names = ['primary_location']
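
As an alternative sketch, `rename_axis` performs the same rename but returns a new object instead of changing the index in place, which can be handy in method chains. The name `renamed` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886],
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"],
                 ["South", "Portage"], ["North Sturgeon", "Creek"]],
})
df.index = df.location.str[-1]

# rename_axis leaves df untouched and returns a renamed copy.
renamed = df.rename_axis('primary_location')

print(df.index.name)       # still the original index name
print(renamed.index.name)  # the new index name
```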

For group operations, you can then do, for example:

df.groupby('primary_location')['number'].sum()

primary_location
Creek       163769
Meadowood    81884
Portage      81885
Name: number, dtype: int64
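
The approach above keys on the last list element only. For the behaviour the question describes (group by a word that appears in more than one row, otherwise join the list into a string), a minimal sketch could look like the following. It rests on assumptions that are not in the original post: a "word" is one whole list element, a row with several shared words is assigned to the first one, and the helper `group_key` and the column name `group` are hypothetical names chosen for illustration.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886],
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"],
                 ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

# Count in how many rows each word occurs (set() avoids double counting
# a word that is repeated inside a single row).
rows_per_word = Counter(word for words in df['location'] for word in set(words))

def group_key(words):
    """Return the first word shared with another row, else the joined list."""
    shared = [word for word in words if rows_per_word[word] > 1]
    return shared[0] if shared else ", ".join(words)

# Assumption: one key per row; rows linked only indirectly are not merged.
df['group'] = df['location'].apply(group_key)
# group keys, in row order: Creek, Meadowood, "South, Portage", Creek

totals = df.groupby('group', sort=False)['number'].sum()
print(totals)
```

Note that "South" and "South Sturgeon" are treated as different words here because the comparison is on whole list elements, which matches the expected output in the question.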
