I built a KMeans model. Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import pandas as pd
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
vectorizer = TfidfVectorizer(stop_words='english',max_features= 10)
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
The vocabulary list is:
vectorizer.get_feature_names_out()
array(['computer', 'eps', 'graph', 'human', 'interface', 'minors',
'survey', 'time', 'trees', 'user'], dtype=object)
And the coordinates of the cluster centroids are:
print (model.cluster_centers_)
[[0.21866736 0.26189498 0. 0.25689141 0.23594367 0.
0.10319731 0.25412544 0. 0.32570307]
[0. 0. 0.4448599 0. 0. 0.30832905
0.15059203 0. 0.56392448 0. ]]
To get the top n words in each cluster:
print (model.cluster_centers_.argsort()[:, ::-1])
[[9 1 3 7 4 0 6 8 5 2]
[8 2 5 6 9 7 4 3 1 0]]
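My question is: why do we use .argsort()[:, ::-1] rather than just .argsort(), given that we want the indices of the words with the shortest distance to the centroid and argsort sorts in ascending order by default?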
For example, in the array [9 1 3 7 4 0 6 8 5 2], index 9 (word: user, coordinate value: 0.32570307) is farthest from the centroid, while index 2 (word: graph, coordinate value: 0) is closest. So the correct order should be [2 5 8 6 0 4 7 3 1 9].
1 Answer
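You are misreading model.cluster_centers_. These values are not distances; they are tf-idf values. After clustering, each centroid holds the mean tf-idf value of every word over the documents assigned to that cluster. The higher the tf-idf value, the more important the word is to that cluster.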
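To make the "mean tf-idf value" point concrete, here is a minimal sketch with NumPy only. The matrix X and the labels array are made-up illustrative values, not the output of the model in the question; the sketch only shows that a centroid is the per-word average of the rows in its cluster.

```python
import numpy as np

# Hypothetical tf-idf rows (3 documents x 4 words); values are invented for illustration.
X = np.array([
    [0.0, 0.6, 0.8, 0.0],
    [0.0, 0.8, 0.6, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])

# Assumed cluster assignment: documents 0 and 1 in cluster 0, document 2 in cluster 1.
labels = np.array([0, 0, 1])

# Each centroid is the column-wise mean of the rows assigned to that cluster.
centroids = np.vstack([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)
```

Each entry of a centroid is therefore a weight for one word, so a larger entry means the word is more prominent in that cluster, not that it is farther away.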
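argsort() on its own returns the indices in ascending order of value; adding [:, ::-1] reverses each row, so the indices run from the largest tf-idf value to the smallest and the most important words come first.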
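As a rough sketch of the difference, the snippet below reuses the vocabulary and centroid values printed in the question (copied by hand, nothing is refit) and compares the two orderings. The kind="stable" argument and the top_n variable are my additions, the former only to keep the order of tied zero values deterministic.

```python
import numpy as np

# Vocabulary and centroids copied from the output shown in the question.
vocab = np.array(['computer', 'eps', 'graph', 'human', 'interface', 'minors',
                  'survey', 'time', 'trees', 'user'])
centers = np.array([
    [0.21866736, 0.26189498, 0., 0.25689141, 0.23594367, 0.,
     0.10319731, 0.25412544, 0., 0.32570307],
    [0., 0., 0.4448599, 0., 0., 0.30832905,
     0.15059203, 0., 0.56392448, 0.],
])

# Ascending order: indices of the smallest tf-idf values come first.
ascending = centers.argsort(axis=1, kind="stable")
# Reverse every row: indices of the largest tf-idf values come first.
descending = ascending[:, ::-1]

top_n = 3  # illustrative choice
for k in range(centers.shape[0]):
    print("cluster", k, "least important:", vocab[ascending[k, :top_n]].tolist())
    print("cluster", k, "most important: ", vocab[descending[k, :top_n]].tolist())
```

For cluster 0, the plain ascending order starts with graph, minors and trees, which all have weight 0 in that centroid, while the reversed order starts with user, eps and human.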