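How do I cut a scipy hierarchical clustering built from pairwise distances at a specific distance and get the clusters, with a list of the members of each cluster?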

Posted by xhv8bpkk on 2022-11-10 in Python

I have pairwise distance data like this:

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

So I have 4 objects, and I clustered them with the following lines of code:

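from scipy.cluster.hierarchy import dendrogram, linkage

# Sort the names inside each pair, then sort the pairs, so the values
# end up in condensed distance matrix order.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

# Recover the sorted object names to label the dendrogram leaves.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)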

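This gives me a dendrogram. What code would give me the clusters and the names of the objects in each cluster (for example, if I cut the dendrogram at distance 2)?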


7vhp5slm1#

You can use the scipy function cut_tree. Here is an example with your data:

from scipy.cluster.hierarchy import cut_tree, dendrogram, linkage

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)

members = dendro['ivl']
clusters = cut_tree(Z, height=2)
cluster_ids = [c[0] for c in clusters]

for k in range(max(cluster_ids) + 1):
    print(f"Cluster {k}")
    for i, c in enumerate(cluster_ids):
        if c == k:
            print(f"{members[i]}")

    print('\n')

Cutting the tree at height 2 gives this output:

Cluster 0
DN10172_i1

Cluster 1
DN1357_i1

Cluster 2
DN1357_i2
DN1357_i5
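
The key-sorting step in the code above works because the sorted pairs happen to land in condensed distance matrix order. As a more explicit alternative, here is a minimal sketch that fills a square matrix and lets scipy.spatial.distance.squareform produce the condensed vector. The names index, square, and condensed are introduced here for illustration only and are not part of the original answer.

```python
import numpy as np
from scipy.spatial.distance import squareform

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

# Every name that appears in any pair; the sorted order fixes each observation's index.
labels = sorted({name for pair in obj_distances for name in pair})
index = {name: i for i, name in enumerate(labels)}

# Fill a symmetric square matrix with a zero diagonal.
square = np.zeros((len(labels), len(labels)))
for (a, b), d in obj_distances.items():
    square[index[a], index[b]] = d
    square[index[b], index[a]] = d

# squareform converts the square matrix to the condensed vector that linkage
# expects; by default it also checks symmetry and the zero diagonal.
condensed = squareform(square)
print(labels)
print(condensed)
```

With this layout, position i in labels is observation i in the linkage, which makes it easier to see which name each cluster assignment belongs to.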

raogr8fs2#

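@Leonardo Sirino's answer gave me the correct dendrogram but the wrong clustering result (I haven't fully figured out why).
To reproduce my claim, replace the entity names in obj_distances (DN1357_i2 becomes A, DN1357_i5 becomes B, DN10172_i1 becomes C, DN1357_i1 becomes D).
That is: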

obj_distances = {
    ("A", "B"): 1.0,
    ("A", "C"): 28.0,
    ("A", "D"): 8.0,
    ("B", "D"): 2.0,
    ("B", "C"): 34.0,
    ("D", "C"): 38.0,
}

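This is essentially the same obj_distances as in the question, with each entity replaced by A, B, C, or D. It scrambles the clustering result:
Cluster 0

  • C
  • D

Cluster 1

  • A

Cluster 2

  • B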

But according to the dendrogram, A and B should be in the same cluster.
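
A likely cause of the mismatch: cut_tree returns one row per original observation, in the order used to build the linkage (here, the sorted labels), while dendro['ivl'] lists the labels in the left-to-right leaf order of the plot. The two orders only coincide by chance. The sketch below, which assumes the A/B/C/D data above written directly as a condensed vector, prints both orders and pairs the cut_tree rows with labels instead. The height of 1.5 is my own choice so the cut does not sit exactly on a merge distance.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, dendrogram, linkage

labels = ["A", "B", "C", "D"]
# Condensed distances in the order AB, AC, AD, BC, BD, CD.
condensed = np.array([1.0, 28.0, 8.0, 34.0, 2.0, 38.0])

Z = linkage(condensed)  # default method is single linkage
dendro = dendrogram(Z, labels=labels, no_plot=True)

# These two lists hold the same names but not necessarily in the same order.
print("leaf order:       ", dendro["ivl"])
print("observation order:", labels)

# cut_tree rows follow observation order, so pair them with `labels`.
cluster_ids = cut_tree(Z, height=1.5).ravel()
members = {}
for name, cid in zip(labels, cluster_ids):
    members.setdefault(int(cid), []).append(name)
print(members)
```

Pairing with labels rather than dendro['ivl'] keeps A and B in the same cluster, in line with the dendrogram.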

The following change gives the correct clustering result, consistent with the dendrogram.

Replace:

members = dendro['ivl']
clusters = cut_tree(Z, height=2)
cluster_ids = [c[0] for c in clusters]

for k in range(max(cluster_ids) + 1):
    print(f"Cluster {k}")
    for i, c in enumerate(cluster_ids):
        if c == k:
            print(f"{members[i]}")

    print('\n')

with this:

import pandas as pd
from scipy.cluster.hierarchy import fcluster

cluster_result = list(zip(labels, fcluster(Z, t=1, criterion="distance")))
dict(pd.DataFrame(cluster_result, columns=["user", "cluster_num"]).groupby("cluster_num").user.apply(list))
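
If pandas is not otherwise needed, the same grouping can be sketched with only the standard library. The helper clusters_at is a hypothetical name introduced here, and the data is the A/B/C/D example written as a condensed vector. According to the SciPy documentation, the "distance" criterion keeps observations together when their cophenetic distance is no greater than t, so a threshold equal to a merge distance includes that merge. That is presumably why t=1 is used above: with t=2, D (which joins at distance 2.0) would be pulled into the A/B cluster. The thresholds 1.5 and 2.5 below are chosen to stay clear of such ties.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

labels = ["A", "B", "C", "D"]
# Condensed distances in the order AB, AC, AD, BC, BD, CD.
condensed = np.array([1.0, 28.0, 8.0, 34.0, 2.0, 38.0])
Z = linkage(condensed)  # default method is single linkage


def clusters_at(Z, labels, t):
    """Group labels by flat cluster id for distance threshold t (hypothetical helper)."""
    groups = defaultdict(list)
    # fcluster returns one cluster id per observation, in observation order.
    for name, cid in zip(labels, fcluster(Z, t=t, criterion="distance")):
        groups[int(cid)].append(name)
    return dict(groups)


low = clusters_at(Z, labels, t=1.5)   # only the A-B merge (distance 1.0) is below the cut
high = clusters_at(Z, labels, t=2.5)  # the merge at distance 2.0 is now included as well
print(low)
print(high)
```

The cluster ids themselves are arbitrary labels; only the grouping of names matters.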

Thank you @Leonardo Sirino for your answer, which got me this far!
