Mahout canopy clustering, k-means clustering: Java heap space - out of memory

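I am running Mahout 0.7 on a cluster of 30 nodes (each node has 8 cores and 16 GB of RAM), trying to cluster 250,000 sparse vectors (each of cardinality 300,000).
If I tune the canopy parameters (T1, T2) so that only a small number of canopy centers are found, everything works well.
Once the number of canopy centers exceeds a certain point, the job fails repeatedly, showing an "Error: Java heap space" message at 67% of the reduce phase.
K-means clustering runs into the same heap space problem when the value of k is increased.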
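For readers unfamiliar with why T2 drives the number of centers, here is a minimal single-pass sketch of canopy center selection. It is a simplified illustration, not Mahout's implementation: the class `CanopySketch`, the helper `linePoints`, and the 1-D toy data are all hypothetical, and T1 is left out because in this simplified view it only decides which points contribute to a canopy, not how many canopies are created.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    // Manhattan (L1) distance between two dense points of equal length.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // One pass over the points: a point starts a new canopy unless it lies
    // within t2 of an already chosen center (simplified assumption).
    public static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<double[]>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                if (manhattan(c, p) < t2) {
                    stronglyBound = true;
                    break;
                }
            }
            if (!stronglyBound) {
                centers.add(p);
            }
        }
        return centers;
    }

    // Hypothetical toy data: the 1-D points 0, 1, ..., n-1.
    public static List<double[]> linePoints(int n) {
        List<double[]> points = new ArrayList<double[]>();
        for (int i = 0; i < n; i++) {
            points.add(new double[] {i});
        }
        return points;
    }

    public static void main(String[] args) {
        List<double[]> points = linePoints(10);
        for (double t2 : new double[] {0.5, 3.0, 100.0}) {
            System.out.println("T2=" + t2 + " -> " + canopyCenters(points, t2).size() + " centers");
        }
    }
}
```

On the toy data, shrinking T2 from 100 to 3 to 0.5 raises the number of centers from 1 to 4 to 10, which is the same direction of change that eventually triggers the heap error described above.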
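I have heard that the canopy center vectors and the k-means center vectors are held in memory in every mapper and reducer. That would be (number of canopy centers, or k) x (one sparse vector of cardinality 300,000), which should fit in 4 GB of memory, so it does not seem too bad.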
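A rough way to sanity-check that estimate is sketched below. The numbers in it are assumptions, not measurements: 13 bytes per hash slot (4-byte int key, 8-byte double value, 1-byte slot state) and a load factor of 0.5 are guesses at how a hash-backed sparse vector such as the `RandomAccessSparseVector` in the stack trace might be laid out, and `vectorsPerCenter` is a hypothetical knob for clusters that keep more than one vector each. The class and method names are made up for this illustration.

```java
public class CenterMemoryEstimate {

    // Assumed cost of one slot in an open-addressing int -> double table:
    // 4-byte key + 8-byte value + 1-byte state.
    static final long BYTES_PER_SLOT = 13;

    // Estimated bytes for one sparse vector holding nonZeros entries
    // when the table is kept at the given load factor.
    public static long bytesPerVector(long nonZeros, double loadFactor) {
        long slots = (long) Math.ceil(nonZeros / loadFactor);
        return slots * BYTES_PER_SLOT;
    }

    // How many centers fit in heapBytes if each center keeps
    // vectorsPerCenter vectors of the same density.
    public static long maxCenters(long heapBytes, long nonZeros,
                                  double loadFactor, int vectorsPerCenter) {
        return heapBytes / (bytesPerVector(nonZeros, loadFactor) * vectorsPerCenter);
    }

    public static void main(String[] args) {
        long heap = 4000L * 1024 * 1024; // matches -Xmx4000m
        for (long nnz : new long[] {1000L, 30000L, 300000L}) {
            System.out.println(nnz + " non-zeros per center -> at most "
                    + maxCenters(heap, nnz, 0.5, 1) + " centers");
        }
    }
}
```

Under these assumptions a center with 1,000 non-zero entries costs about 26 KB, but a center that has become fully dense (300,000 entries) costs about 7.8 MB, so a 4000 MB heap would hold only a few hundred such centers. Whether centers actually densify that much depends on the data; the sketch only shows that the answer is sensitive to density, not just to the number of centers.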
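Based on previous questions here and elsewhere, I have cranked up every memory knob I could find:
hadoop-env.sh: set all heap sizes to 16 GB on the namenode, and even to 8 GB on the datanodes.
mapred-site.xml: added the mapred.{map,reduce}.child.java.opts properties and set their value to -Xmx4000m.
mapred-site.xml: changed the mapred.tasktracker.{map,reduce}.tasks.maximum properties and lowered their value from 8 to 4.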
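One thing worth checking with these settings is whether the task heaps can all fit on a node at once. The sketch below does that arithmetic; it assumes the map and reduce slot limits are independent (so up to 4 + 4 task JVMs per node) and ignores the heap used by the Hadoop daemons themselves. The class name `NodeMemoryBudget` and its method are hypothetical.

```java
public class NodeMemoryBudget {

    // Worst-case heap, in MB, that all task JVMs on one node may request
    // if every map and reduce slot is busy at the same time.
    public static long worstCaseTaskHeapMb(int mapSlots, int reduceSlots, long childHeapMb) {
        return (long) (mapSlots + reduceSlots) * childHeapMb;
    }

    public static void main(String[] args) {
        long nodeRamMb = 16L * 1024;                      // 16 GB per node
        long needed = worstCaseTaskHeapMb(4, 4, 4000L);   // 4 map + 4 reduce slots, -Xmx4000m
        System.out.println("Worst-case task heap: " + needed + " MB");
        System.out.println("Physical RAM:         " + nodeRamMb + " MB");
        System.out.println("Oversubscribed:       " + (needed > nodeRamMb));
    }
}
```

With the values from the list above, the worst case is 32,000 MB of task heap against 16,384 MB of physical RAM. A heap limit is a ceiling rather than memory that is committed up front, so this does not by itself explain the error, but it suggests that raising -Xmx further would need fewer slots per node.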
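The problem persists. I have been banging my head against this for a long time. Does anyone have any suggestions?
The full command and output are shown below: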

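import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;

public class MrBicClusteringDriver {

    public static void main(String[] args) throws Exception {
        String ratingsPath = args[0];   // input vectors
        String outputPath = args[1];    // output directory (deleted first)
        double t1 = Double.parseDouble(args[2]);
        double t2 = Double.parseDouble(args[3]);

        Configuration conf = new Configuration();

        // Remove any previous output so the job can write to outputPath.
        HadoopUtil.delete(conf, new Path(outputPath));

        // runClustering = true, clusterClassificationThreshold = 0.0, runSequential = false
        CanopyDriver.run(conf, new Path(ratingsPath), new Path(outputPath),
                new ManhattanDistanceMeasure(), t1, t2, true, 0.0, false);
    }
}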

The error message I am getting is:

Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing /MrBic/Output/SeedGeneration_predSample
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:363)
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:248)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:155)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:170)
at MrBicClusteringDriver.main(MrBicClusteringDriver.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

2013-06-12 10:56:00,825 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:560)
at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:275)
at org.apache.mahout.clustering.canopy.Canopy.<init>(Canopy.java:43)
at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:47)
at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:30)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
