pandas: grouping a DataFrame by multiple columns

Posted by kqhtkvqz on 2023-02-07 in Other

I have a DataFrame that looks like this:

page    reference       ids                             subject         word
1       apple           ['aaaa', 'bbbbb', 'cccc']       name            app
1       apple           ['bndv', 'asasa', 'swdsd']      fruit           is
1       apple           ['bsnm', 'dfsd', 'dgdf']        fruit           text
1       bat             ['asas', 'ddfgd', 'ff']         thing           sport
1       cat             ['sds', 'dffd', 'gdg']          fruit           color
1       bat             ['sds', 'fsss', 'ssfd']         thing           was
1       bat             ['fsf', 'sff', 'fss']           place           that
2       dog             ['fffds', 'gd', 'sdg']          name            mud
2       egg             ['dfff', 'sdf', 'vcv']          place           gun
2       dog             ['dsfd', 'fds', 'gfdg']         thing           kit
2       egg             ['ddd', 'fg', 'dfg']            place           hut

I want to group by the reference and subject columns. The output should look like this:

output:
page    reference   ids                                                subject          word
1       apple   [['bndv', 'asasa', 'swdsd'],['bsnm', 'dfsd', 'dgdf']]   fruit           [[is], [text]]
1       apple   ['aaaa', 'bbbbb', 'cccc']                               name            [app]
1       bat     [['asas', 'ddfgd', 'ff'], ['sds', 'fsss', 'ssfd']]      thing           [[sport], [was]]
1       bat     ['fsf', 'sff', 'fss']                                   place           [that]
1       cat     ['sds', 'dffd', 'gdg']                                  fruit           [color]
2       dog     ['fffds', 'gd', 'sdg']                                  name            [mud]
2       dog     ['dsfd', 'fds', 'gfdg']                                 thing           [kit]
2       egg     [['dfff', 'sdf', 'vcv'], ['ddd', 'fg', 'dfg']]          place           [[gun], [hut]]

Answer #1 (fdbelqdn)

First, group and aggregate the required fields:

res = df.groupby(["reference", "subject"]).agg({"page": "min", "ids": list, "word": lambda l: [[ll] for ll in l]}).reset_index()

  reference subject  page                                         ids              word
0     apple   fruit     1  [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]]    [[is], [text]]
1     apple    name     1                       [[aaaa, bbbbb, cccc]]           [[app]]
2       bat   place     1                           [[fsf, sff, fss]]          [[that]]
3       bat   thing     1      [[asas, ddfgd, ff], [sds, fsss, ssfd]]  [[sport], [was]]
4       cat   fruit     1                          [[sds, dffd, gdg]]         [[color]]
5       dog    name     2                          [[fffds, gd, sdg]]           [[mud]]
6       dog   thing     2                         [[dsfd, fds, gfdg]]           [[kit]]
7       egg   place     2          [[dfff, sdf, vcv], [ddd, fg, dfg]]    [[gun], [hut]]

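Note that this also wraps each word value in its own list, as in your desired output. I also simply assumed taking the smallest page value within each group, since you did not state a rule for that column. You can replace "min" in the agg call with whatever aggregation you see fit.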
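As a rough illustration of swapping the page rule, the sketch below uses a small hypothetical frame (not the question's data) in which one group spans two pages. It shows two possible alternatives to "min": keeping the first page seen, or keeping every distinct page as a sorted list. Which rule is appropriate depends on what page is supposed to mean for a group.

```python
import pandas as pd

# Hypothetical rows for illustration only: the apple/fruit group spans pages 3 and 1.
df = pd.DataFrame({
    "page": [3, 1, 2, 2],
    "reference": ["apple", "apple", "egg", "egg"],
    "subject": ["fruit", "fruit", "place", "place"],
    "word": ["is", "text", "gun", "hut"],
})

grouped = df.groupby(["reference", "subject"])

# Option 1: keep the first page encountered in each group.
first_page = grouped.agg({"page": "first"}).reset_index()

# Option 2: keep all distinct pages of the group as a sorted list.
all_pages = grouped.agg({"page": lambda s: sorted(set(s))}).reset_index()

print(first_page)
print(all_pages)
```

Other built-in aggregation names such as "max" or "last" can be passed in the same way.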
Then unwrap any list whose length is 1:

res["word"] = res["word"].apply(lambda l: l[0] if len(l) == 1 else l)
res["ids"] = res["ids"].apply(lambda l: l[0] if len(l) == 1 else l)

  reference subject  page                                         ids              word
0     apple   fruit     1  [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]]    [[is], [text]]
1     apple    name     1                         [aaaa, bbbbb, cccc]             [app]
2       bat   place     1                             [fsf, sff, fss]            [that]
3       bat   thing     1      [[asas, ddfgd, ff], [sds, fsss, ssfd]]  [[sport], [was]]
4       cat   fruit     1                            [sds, dffd, gdg]           [color]
5       dog    name     2                            [fffds, gd, sdg]             [mud]
6       dog   thing     2                           [dsfd, fds, gfdg]             [kit]
7       egg   place     2          [[dfff, sdf, vcv], [ddd, fg, dfg]]    [[gun], [hut]]
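For convenience, here is a self-contained sketch that puts both steps together on the sample rows from the question. It assumes the ids column holds real Python lists rather than their string representations; the helper name unwrap_single is made up for this example.

```python
import pandas as pd

# Sample rows copied from the question; `ids` is ASSUMED to contain real lists.
df = pd.DataFrame({
    "page": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "reference": ["apple", "apple", "apple", "bat", "cat", "bat", "bat",
                  "dog", "egg", "dog", "egg"],
    "ids": [
        ["aaaa", "bbbbb", "cccc"], ["bndv", "asasa", "swdsd"],
        ["bsnm", "dfsd", "dgdf"], ["asas", "ddfgd", "ff"],
        ["sds", "dffd", "gdg"], ["sds", "fsss", "ssfd"],
        ["fsf", "sff", "fss"], ["fffds", "gd", "sdg"],
        ["dfff", "sdf", "vcv"], ["dsfd", "fds", "gfdg"],
        ["ddd", "fg", "dfg"],
    ],
    "subject": ["name", "fruit", "fruit", "thing", "fruit", "thing", "place",
                "name", "place", "thing", "place"],
    "word": ["app", "is", "text", "sport", "color", "was", "that",
             "mud", "gun", "kit", "hut"],
})


def unwrap_single(values):
    """Return the only element of a one-item list, otherwise the list unchanged."""
    return values[0] if len(values) == 1 else values


# Step 1: group by both columns and collect ids and words into lists.
res = (
    df.groupby(["reference", "subject"])
      .agg({"page": "min", "ids": list, "word": lambda s: [[w] for w in s]})
      .reset_index()
)

# Step 2: drop the outer list for groups that contain a single row.
res["ids"] = res["ids"].apply(unwrap_single)
res["word"] = res["word"].apply(unwrap_single)

print(res)
```

Note that unwrapping leaves the columns with mixed nesting depth (some cells are a flat list, others a list of lists), which matches the requested output but can make later processing less uniform.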
