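I have been trying to merge the two RDDs below, averagePoints1 and kpoints2. It keeps throwing this error:
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
I have tried many things, but I cannot confirm that the two RDDs are identical in shape and have the same number of partitions. My next step is to apply a Euclidean distance function to the two lists to measure the difference, so if anyone knows how to fix this error or has a different approach, I would really appreciate it.
Thanks in advance.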
averagePoints1 = averagePoints.map(lambda x: x[1])
averagePoints1.collect()
Out[15]:
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints,4)
In [17]:
kpoints2.collect()
Out[17]:
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
2 Answers
piok6c0g1#
For future searchers, this is the solution I followed in the end.
gr8qqesn2#
Check this answer: merging two RDDs in PySpark.
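As background, this ValueError usually comes from RDD.zip, which assumes that both RDDs have the same number of partitions and the same number of elements in each partition. Two RDDs with five elements each can still fail if those elements are spread across the partitions differently. Below is a minimal sketch of one workaround, assuming that pairing the points by their position in collect() order is what is wanted: key every element by its index with zipWithIndex, then join on that index. The helper names euclidean, index_by_position and pairwise_distances are made up for illustration and are not taken from either answer.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def index_by_position(rdd):
    """Turn each element into (index, element) so it can be joined by position."""
    # zipWithIndex yields (element, index); swap so the index becomes the key.
    return rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))


def pairwise_distances(rdd_a, rdd_b):
    """Pair two RDDs by element position and return (index, distance) records.

    Unlike RDD.zip, the join does not require matching partition layouts.
    """
    joined = index_by_position(rdd_a).join(index_by_position(rdd_b))
    # After the join each value is a tuple (point_from_a, point_from_b).
    return joined.mapValues(lambda pq: euclidean(pq[0], pq[1])).sortByKey()


# Usage with the RDDs from the question (needs a live SparkContext):
# pairwise_distances(averagePoints1, kpoints2).collect()
```

The join triggers a shuffle, so for only five points it is probably simpler to collect() both RDDs to the driver and pair the resulting lists with Python's built-in zip.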