I have a database of patents that cite other patents. It looks like this:
{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
'docdb_family_id': {0: 57904406,
1: 57904406,
2: 57906556,
12: 57909419,
21: 57942222},
'cited_docdbs': {0: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
1: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
2: [6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238],
12: [6293366,
7856452,
16980051,
23177359,
26477802,
27453602,
41135094,
53004244,
54332594,
55018863],
21: [7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]},
'doc_std_name': {0: 'SEEO INC',
1: 'BOSCH GMBH ROBERT',
2: 'SAMSUNG SDI CO LTD',
12: 'NAGAI TAKAYUKI',
21: 'SAMSUNG SDI CO LTD'}}
Now, what I want to do is group by company, as follows:
df_grouped_byfirm=data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id',"unique")).reset_index()
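In addition, the cited_docdbs lists should be merged per company. For example, in the data above, the final cited_docdbs entry for SAMSUNG SDI CO LTD should become one big list that combines all cited docdbs from both of SAMSUNG SDI CO LTD's rows: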
[6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238,
7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]
Thanks
2 Answers
Answer 1
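You can flatten the nested lists of each group and pass the result to dict.fromkeys, which removes duplicates while keeping the original order. If order is not important, use a set to remove the duplicates instead.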
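The code from this answer did not survive in the thread, so here is a minimal sketch of the approach it describes. The helper names flatten_unique_ordered and flatten_unique_unordered are made up for illustration, and data_min is a shortened stand-in for the question's DataFrame (the cited lists are truncated to their first few values).

```python
import pandas as pd

# Shortened stand-in for the question's data: each cited list is truncated.
data_min = pd.DataFrame(
    {
        "docdb_family_id": [57904406, 57904406, 57906556, 57909419, 57942222],
        "cited_docdbs": [
            [15057621, 16359315, 18731820],
            [15057621, 16359315, 18731820],
            [6078355, 8173164, 14235835],
            [6293366, 7856452, 16980051],
            [7913900, 13287798, 18834564],
        ],
        "doc_std_name": [
            "SEEO INC",
            "BOSCH GMBH ROBERT",
            "SAMSUNG SDI CO LTD",
            "NAGAI TAKAYUKI",
            "SAMSUNG SDI CO LTD",
        ],
    },
    index=[0, 1, 2, 12, 21],
)


def flatten_unique_ordered(lists):
    """Flatten a group's lists; drop duplicates, keep first-seen order."""
    return list(dict.fromkeys(item for sub in lists for item in sub))


def flatten_unique_unordered(lists):
    """Flatten a group's lists; drop duplicates, order not guaranteed."""
    return list(set(item for sub in lists for item in sub))


df_grouped_byfirm = (
    data_min.groupby("doc_std_name")
    .agg(
        publn_nrs=("docdb_family_id", "unique"),
        cited_docdbs=("cited_docdbs", flatten_unique_ordered),
    )
    .reset_index()
)
print(df_grouped_byfirm)
```

Swap flatten_unique_ordered for flatten_unique_unordered in the agg call if the order of the cited ids does not matter.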
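Answer 2
You can simply use sum inside agg to concatenate the lists within each group. This gives one combined list per company, with any duplicates kept.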
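The original code and output for this answer are missing from the thread. As a rough sketch of the idea, the version below uses Python's built-in sum with an empty list as the start value, which may differ from what the answer actually passed to agg. The helper name concat_lists is hypothetical, and data_min is again a shortened stand-in for the question's data.

```python
import pandas as pd

# Shortened stand-in for the question's data: each cited list is truncated.
data_min = pd.DataFrame(
    {
        "docdb_family_id": [57904406, 57904406, 57906556, 57909419, 57942222],
        "cited_docdbs": [
            [15057621, 16359315, 18731820],
            [15057621, 16359315, 18731820],
            [6078355, 8173164, 14235835],
            [6293366, 7856452, 16980051],
            [7913900, 13287798, 18834564],
        ],
        "doc_std_name": [
            "SEEO INC",
            "BOSCH GMBH ROBERT",
            "SAMSUNG SDI CO LTD",
            "NAGAI TAKAYUKI",
            "SAMSUNG SDI CO LTD",
        ],
    },
    index=[0, 1, 2, 12, 21],
)


def concat_lists(lists):
    """Concatenate a group's lists: [] + list_1 + list_2 + ..."""
    return sum(lists, [])


df_grouped_byfirm = (
    data_min.groupby("doc_std_name")
    .agg(
        publn_nrs=("docdb_family_id", "unique"),
        cited_docdbs=("cited_docdbs", concat_lists),
    )
    .reset_index()
)
print(df_grouped_byfirm)
```

Summing lists this way copies the accumulated list at every step, so for groups with very many rows something like itertools.chain.from_iterable is likely to be faster.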