I have a large sequence file that stores the tf-idf values of my documents. Each line represents one document and the columns hold the tf-idf value of each term (so each row is a sparse vector). I want to use Hadoop to pick the top-k words for each document. The simplest solution is to loop over all the columns of each row in the mapper and select the top k, but as the file grows larger I don't think that is a good solution. Is there a better way to do this in Hadoop?
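For reference, the per-document case the question describes can stay entirely inside the mapper, since each row is independent. Below is a minimal sketch of such a mapper, assuming each input line looks like `docId<TAB>termId:tfidf termId:tfidf ...`; the question doesn't specify the actual record layout, so the class name, the value of `K`, and the parsing are all illustrative assumptions. A bounded min-heap keeps the cost at O(T log k) per row instead of sorting all T non-zero columns.

```java
import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits (docId, top-k "termId:tfidf" pairs) per document.
public class PerDocTopKMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int K = 10; // illustrative; could come from the job Configuration

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    String docId = fields[0];

    // Bounded min-heap over (termId, tfidf) pairs: the pair with the
    // smallest tf-idf sits on top and is evicted first.
    PriorityQueue<String[]> heap = new PriorityQueue<>(
        (a, b) -> Double.compare(Double.parseDouble(a[1]), Double.parseDouble(b[1])));

    for (String col : fields[1].split(" ")) {
      String[] pair = col.split(":"); // pair[0] = termId, pair[1] = tfidf
      heap.offer(pair);
      if (heap.size() > K) heap.poll(); // drop the current smallest
    }

    // Drain the k survivors (in ascending tf-idf order) into one output value.
    StringBuilder topK = new StringBuilder();
    while (!heap.isEmpty()) {
      String[] p = heap.poll();
      topK.append(p[0]).append(':').append(p[1]).append(' ');
    }
    ctx.write(new Text(docId), new Text(topK.toString().trim()));
  }
}
```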
1 Answer

lnxxn5zx1#
1. In every map task, compute a local top K (the top K of that mapper's input split).
2. Spawn a single reducer: the local top-K lists from all mappers flow to this one reducer, which then evaluates the global top K (a sketch of this two-step job follows the analogy below).
Think of the problem as:
1. You have been given the results of X horse races.
2. You need to find the N fastest horses overall.
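Here is a minimal sketch of the two-step pattern above as a single Hadoop job. It assumes an input of plain `term<TAB>tfidf` lines; the class names, the value of `K`, and the record format are illustrative assumptions, not something given in the original post.

```java
import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlobalTopK {
  static final int K = 10; // illustrative; could come from the job Configuration

  // Tiny holder for a (score, term) pair.
  static class Scored {
    final double score;
    final String term;
    Scored(double score, String term) { this.score = score; this.term = term; }
  }

  // Step 1: each mapper keeps a local top K in a bounded min-heap.
  public static class TopKMapper
      extends Mapper<LongWritable, Text, DoubleWritable, Text> {
    private final PriorityQueue<Scored> heap =
        new PriorityQueue<>((a, b) -> Double.compare(a.score, b.score));

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
      String[] parts = value.toString().split("\t"); // assumed "term<TAB>tfidf"
      heap.offer(new Scored(Double.parseDouble(parts[1]), parts[0]));
      if (heap.size() > K) heap.poll(); // evict the current local minimum
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      // Emit only this mapper's local top K: at most K records per map task
      // cross the network to the single reducer.
      for (Scored s : heap) {
        ctx.write(new DoubleWritable(s.score), new Text(s.term));
      }
    }
  }

  // Step 2: one reducer merges all local top-K lists into the global top K.
  public static class TopKReducer
      extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
    private final PriorityQueue<Scored> heap =
        new PriorityQueue<>((a, b) -> Double.compare(a.score, b.score));

    @Override
    protected void reduce(DoubleWritable score, Iterable<Text> terms, Context ctx) {
      for (Text t : terms) {
        heap.offer(new Scored(score.get(), t.toString()));
        if (heap.size() > K) heap.poll();
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      // The K survivors are the global top K; drain in ascending score order.
      while (!heap.isEmpty()) {
        Scored s = heap.poll();
        ctx.write(new Text(s.term), new DoubleWritable(s.score));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "global top-k");
    job.setJarByClass(GlobalTopK.class);
    job.setMapperClass(TopKMapper.class);
    job.setReducerClass(TopKReducer.class);
    job.setNumReduceTasks(1); // the single reducer that sees every local top K
    job.setMapOutputKeyClass(DoubleWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each mapper forwards at most K records, the single reducer sees only (number of mappers) × K inputs, so it does not become a bottleneck even for very large files.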