把Pandas的列表组合在一起

yxyvkwin  于 2023-01-24  发布在  其他
关注(0)|答案(2)|浏览(155)

我有一个引用其他专利的专利数据库,看起来像这样:

{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
 'docdb_family_id': {0: 57904406,
  1: 57904406,
  2: 57906556,
  12: 57909419,
  21: 57942222},
 'cited_docdbs': {0: [15057621,
   16359315,
   18731820,
   19198211,
   19198218,
   19198340,
   19550248,
   19700609,
   20418230,
   22144166,
   22513333,
   22800966,
   22925564,
   23335606,
   23891186,
   25344297,
   25345599,
   25414615,
   25495423,
   25588955,
   26530649,
   27563473,
   34277948,
   36626718,
   38801947,
   40454852,
   40885675,
   40957530,
   41249600,
   41377563,
   41378429,
   41444278,
   41797413,
   42153280,
   42340085,
   42340086,
   42678557,
   42709962,
   42709963,
   42737942,
   43648036,
   44691991,
   44947081,
   45352855,
   45815534,
   46254922,
   46382961,
   47830116,
   49676686,
   49912209,
   54191614],
  1: [15057621,
   16359315,
   18731820,
   19198211,
   19198218,
   19198340,
   19550248,
   19700609,
   20418230,
   22144166,
   22513333,
   22800966,
   22925564,
   23335606,
   23891186,
   25344297,
   25345599,
   25414615,
   25495423,
   25588955,
   26530649,
   27563473,
   34277948,
   36626718,
   38801947,
   40454852,
   40885675,
   40957530,
   41249600,
   41377563,
   41378429,
   41444278,
   41797413,
   42153280,
   42340085,
   42340086,
   42678557,
   42709962,
   42709963,
   42737942,
   43648036,
   44691991,
   44947081,
   45352855,
   45815534,
   46254922,
   46382961,
   47830116,
   49676686,
   49912209,
   54191614],
  2: [6078355,
   8173164,
   14235835,
   16940834,
   18152411,
   18704525,
   27343995,
   45467248,
   46172598,
   49878759,
   50995553,
   52668238],
  12: [6293366,
   7856452,
   16980051,
   23177359,
   26477802,
   27453602,
   41135094,
   53004244,
   54332594,
   55018863],
  21: [7913900,
   13287798,
   18834564,
   23971781,
   26904791,
   27304292,
   29720924,
   34622252,
   35197847,
   37766575,
   39873073,
   42075013,
   44508652,
   44530218,
   45571357,
   48222848,
   48747089,
   49111776,
   49754218,
   50024241,
   50474222,
   50545849,
   52580625,
   58800268]},
 'doc_std_name': {0: 'SEEO INC',
  1: 'BOSCH GMBH ROBERT',
  2: 'SAMSUNG SDI CO LTD',
  12: 'NAGAI TAKAYUKI',
  21: 'SAMSUNG SDI CO LTD'}}

现在,我想做的是执行一个groupby公司如下:

df_grouped_byfirm=data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id',"unique")).reset_index()

所以,例如在上面的例子中,对于SAMSUNG SDI CO LTD,最终的引用_docdb列表应该成为一个大列表,其中SAMSUNG SDI CO LTD的两个id的所有引用的docdb被合并在一起:

[6078355,
   8173164,
   14235835,
   16940834,
   18152411,
   18704525,
   27343995,
   45467248,
   46172598,
   49878759,
   50995553,
   52668238,
7913900,
   13287798,
   18834564,
   23971781,
   26904791,
   27304292,
   29720924,
   34622252,
   35197847,
   37766575,
   39873073,
   42075013,
   44508652,
   44530218,
   45571357,
   48222848,
   48747089,
   49111776,
   49754218,
   50024241,
   50474222,
   50545849,
   52580625,
   58800268]

谢谢

oo7oh9g9

oo7oh9g91#

您可以使用dict.fromkeys展开嵌套列表,以便按原始顺序删除重复项:

f = lambda x: list(dict.fromkeys(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))

print (df)
                                                            publn_nrs
doc_std_name                                                         
BOSCH GMBH ROBERT   [15057621, 16359315, 18731820, 19198211, 19198...
NAGAI TAKAYUKI      [6293366, 7856452, 16980051, 23177359, 2647780...
SAMSUNG SDI CO LTD  [6078355, 8173164, 14235835, 16940834, 1815241...
SEEO INC            [15057621, 16359315, 18731820, 19198211, 19198...

如果顺序不重要,则使用set s删除重复项:

f = lambda x: list(set(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))

print (df)
                                                            publn_nrs
doc_std_name                                                         
BOSCH GMBH ROBERT   [19700609, 19198211, 19198340, 44947081, 19198...
NAGAI TAKAYUKI      [27453602, 7856452, 26477802, 23177359, 550188...
SAMSUNG SDI CO LTD  [48222848, 18834564, 42075013, 58800268, 18704...
SEEO INC            [19700609, 19198211, 19198340, 44947081, 19198...
ilmyapht

ilmyapht2#

您可以只在agg中使用sum来连接每个组中的列表。

df.groupby("doc_std_name").agg({"cited_docdbs": sum}).reset_index()

这将给予以下内容:

doc_std_name                                       cited_docdbs
0   BOSCH GMBH ROBERT  [15057621, 16359315, 18731820, 19198211, 19198...
1      NAGAI TAKAYUKI  [6293366, 7856452, 16980051, 23177359, 2647780...
2  SAMSUNG SDI CO LTD  [6078355, 8173164, 14235835, 16940834, 1815241...
3            SEEO INC  [15057621, 16359315, 18731820, 19198211, 19198...

相关问题