I want to run k-means on my DataFrame `base` (6.7 million rows and 22 variables). `base.dtypes` returns:
```
('anonimisation2', 'double'),
('anonimisation3', 'double'),
('anonimisation4', 'double'),
('anonimisation5', 'double'),
('anonimisation6', 'double'),
('anonimisation7', 'double'),
('anonimisation8', 'double'),
('anonimisation9', 'double'),
('anonimisation10', 'double'),
('anonimisation11', 'double'),
('anonimisation12', 'double'),
('anonimisation13', 'double'),
('anonimisation14', 'double'),
('anonimisation15', 'double'),
('anonimisation16', 'double'),
('anonimisation17', 'double'),
('anonimisation18', 'double'),
('anonimisation19', 'double'),
('anonimisation20', 'double'),
('anonimisation21', 'double'),
('anonimisation22', 'double')]
```
I read that I should use this code:
```
def transData(base):
    return base.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

transformed = transData(base)
transformed.show(5, False)
```
Then I wrote this:
```
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(transformed)
```
I get this error:
```
IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
```
I don't know what to do. If you need more information, just ask. Thanks.
I also tried staying in plain Python, but I ran into problems there as well.
1 answer
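Use `from pyspark.ml.linalg import Vectors` instead of `from pyspark.mllib.linalg import Vectors`.
The `KMeans` estimator in `pyspark.ml` expects the DataFrame-based vector type from `pyspark.ml.linalg`. The older `pyspark.mllib.linalg` vectors have the same struct layout but are a different type, which is why the expected and actual types in the error message look identical.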