Sorting a table by an input value inside lists and ranking by frequency (a tricky problem)

f0ofjuux  posted on 2022-09-21  in Other

I have a dataset with two columns: ID and SEQUENCEL_RESULT.

The DataFrame looks like this; the sequence-list column has already been evaluated with literal_eval:

id  list_of_sequencies
2   [(74, [1-1]), (51, [1-1, 0-47]), (23, [1-2]), (18, [1-2, 0-46]), (10, [0-1, 1-1]), (9, [0-1, 1-1, 0-46]), (9, [1-1, 0-46]), (6, [1-3]), (5, [0-2, 1-1]), (5, [1-1, 0-45])]
3   [(61, [1-1]), (24, [1-2]), (18, [0-1, 1-1]), (14, [1-8]), (14, [1-8, 0-40]), (12, [1-3]), (12, [1-6]), (11, [1-1, 0-47]), (10, [0-2, 1-1]), (10, [1-2, 0-46]), (2, [0-1, 1-1, 0-46])]   
4   [(frequency,[pattern-A,pattern-B,pattern-C,...]),(...),...]
...

Each list of sequences looks like the following; every tuple contains a frequency and a list of patterns.

[
    (269, [1 - 5]),
    (260, [1 - 5, 0 - 40]),
    (171, [0 - 3, 1 - 5]),
    (167, [0 - 3, 1 - 5, 0 - 40]),
    (162, [1 - 1]),
    (105, [1 - 1, 0 - 40]),
    (105, [1 - 6]),
    (86, [1 - 1, 1 - 5]),
    (84, [1 - 1, 1 - 5, 0 - 40]),
    (83, [1 - 6, 0 - 39]),
]
or 
[
    (178, ["1-9"]),
    (140, ["1-9", "0-39"]),
    (102, ["1-10"]),
    (87, ["1-10", "0-38"]),
    (75, ["1-1"]),
    (53, ["1-8"]),
    (50, ["0-1", "1-1"]),
    (35, ["1-8", "0-40"]),
    (32, ["1-9", "1-1"]),
    (30, ["1-1", "0-36"]),
]

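How can I write a function that lets me easily rank the ids by the count attached to an inner list? For example, if I input a sequence such as [0-1, 1-1, 0-46], the function should find every row that contains exactly that sequence and rank the rows by its frequency. The result should then look like [2, 3], because [0-1, 1-1, 0-46] occurs 9 times in id=2 and 2 times in id=3.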

As requested by @mozway, here is the raw data:

{'id': ['1', '2', '3', '4', '5'],
 'list_of_sequencies': ["[(8, ['1-1']), (4, ['0-3', '1-1']), (2, ['0-4', '1-1']), (2, ['1-2']), (1, ['1-1', '0-3']), (1, ['1-1', '0-41']), (1, ['1-1', '0-42']), (1, ['1-1', '0-43']), (1, ['1-1', '0-44']), (1, ['1-1', '0-45'])]",
  "[(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2', '1-1']), (4, ['1-1', '1-1']), (3, ['0-4', '1-1']), (3, ['1-1', '0-4']), (3, ['1-1', '0-4', '1-1']), (3, ['1-1', '0-40']), (3, ['1-1', '0-46']), (3, ['1-3'])]",
  "[(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1']), (4, ['1-2', '0-46']), (3, ['1-1', '0-42']), (3, ['1-3']), (2, ['1-1', '0-40']), (2, ['1-1', '0-41']), (2, ['1-1', '0-47']), (2, ['1-1', '1-1'])]",
  "[(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['1-2']), (18, ['1-2', '0-46']), (10, ['0-1', '1-1']), (9, ['0-1', '1-1', '0-46']), (9, ['1-1', '0-46']), (6, ['1-3']), (5, ['0-2', '1-1']), (5, ['1-1', '0-45'])]",
  "[(178, ['1-9']), (140, ['1-9', '0-39']), (102, ['1-10']), (87, ['1-10', '0-38']), (75, ['1-1']), (53, ['1-8']), (50, ['0-1', '1-1']), (35, ['1-8', '0-40']), (32, ['1-9', '1-1']), (30, ['1-1', '0-36'])]"]}

If my input is ['0-1', '1-1'], the result should look like this, in exactly this order:

ID 5 contains: (50, ['0-1', '1-1'])

ID 4: (10, ['0-1', '1-1'])

ID 2: (5, ['0-1', '1-1'])

ID 3: (4, ['0-1', '1-1'])

{'id': ['5', '4', '2', '3'], ...} together with their list_of_sequencies (not copied here)

ubof19bj  #1

You can use a list comprehension to count the entries that match the target, then filter and sort the data:

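from ast import literal_eval

target = ['0-1', '1-1']
# For each row, count the tuples whose pattern list equals the target
# (summing booleans gives the number of matches, not their frequency)
df['count'] = [sum(x[1] == target for x in literal_eval(s))
               for s in df['list_of_sequencies']]

# Keep only rows with at least one match, highest count first
out = df.query('count > 0').sort_values(by='count', ascending=False)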

Output:

id                                 list_of_sequencies  count
1  2  [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2...      1
2  3  [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1...      1
3  4  [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['...      1
4  5  [(178, ['1-9']), (140, ['1-9', '0-39']), (102,...      1
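Every row shows a count of 1 because summing booleans only counts how many tuples match the target; it ignores the frequency stored in each tuple. A minimal sketch of the difference on a single cell, assuming an abbreviated copy of the id 5 entry from the question (the variable names are illustrative only):

```python
from ast import literal_eval

# One cell from the question's raw data (id 5), abbreviated to three tuples.
cell = "[(178, ['1-9']), (50, ['0-1', '1-1']), (32, ['1-9', '1-1'])]"
target = ['0-1', '1-1']
pairs = literal_eval(cell)

# Summing booleans counts how many tuples match the target.
n_matches = sum(pattern == target for _, pattern in pairs)

# Summing the first element of the matching tuples gives the stored frequency.
total_freq = sum(freq for freq, pattern in pairs if pattern == target)

print(n_matches, total_freq)  # 1 50
```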

Taking the frequency into account:

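from ast import literal_eval

target = ['0-1', '1-1']
# For each row, sum the frequency (first tuple element) of every tuple
# whose pattern list equals the target
df['count'] = [sum(x[0] for x in literal_eval(s)
                  if x[1] == target)
               for s in df['list_of_sequencies']]

# Keep only rows with at least one match, highest frequency first
out = df.query('count > 0').sort_values(by='count', ascending=False)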

Output:

id                                 list_of_sequencies  count
4  5  [(178, ['1-9']), (140, ['1-9', '0-39']), (102,...     50
3  4  [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['...     10
1  2  [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2...      5
2  3  [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1...      4
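Since the question asks for a reusable function, the frequency-based approach above could be wrapped roughly as follows. This is only a sketch in plain Python: the names `sequence_frequency` and `rank_ids` are hypothetical, the sample data is an abbreviated copy of the question's raw data, and cells are assumed to be either the raw strings or lists already evaluated with literal_eval.

```python
from ast import literal_eval


def sequence_frequency(cell, target):
    """Return the total frequency of `target` in one list_of_sequencies cell."""
    # Accept both the raw string form and an already-evaluated list.
    pairs = literal_eval(cell) if isinstance(cell, str) else cell
    return sum(freq for freq, pattern in pairs if list(pattern) == list(target))


def rank_ids(ids, cells, target):
    """Return (id, frequency) pairs for rows containing `target`, most frequent first."""
    scored = [(i, sequence_frequency(c, target)) for i, c in zip(ids, cells)]
    hits = [(i, n) for i, n in scored if n > 0]
    return sorted(hits, key=lambda t: t[1], reverse=True)


# Abbreviated sample taken from the question's raw data (two tuples per id).
ids = ['1', '2', '3', '4', '5']
cells = [
    "[(8, ['1-1']), (4, ['0-3', '1-1'])]",
    "[(15, ['1-1']), (5, ['0-1', '1-1'])]",
    "[(16, ['1-1']), (4, ['0-1', '1-1'])]",
    "[(74, ['1-1']), (10, ['0-1', '1-1'])]",
    "[(178, ['1-9']), (50, ['0-1', '1-1'])]",
]

ranking = rank_ids(ids, cells, ['0-1', '1-1'])
print(ranking)  # [('5', 50), ('4', 10), ('2', 5), ('3', 4)]
```

With a DataFrame, the same helper could presumably be applied per row, for example `df['count'] = df['list_of_sequencies'].map(lambda s: sequence_frequency(s, target))`, followed by the same filter and sort as above.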
