如何优化apriori算法？

jgwigjjp 于 2021-05-29 发布在 Hadoop

关注(0)|答案(2)|浏览(522)

利用hadoop中的map-reduce框架实现了apriori算法。
谁能指导我如何优化apriori算法（在hadoop map reduce中）？
我会非常感激的。
谢谢！
编辑代码：

//MAPPER 
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Utils.count++;
    String line = value.toString();
    String[] items = line.split(" ");
    Arrays.sort( items );
    LinkedHashSet myPowerSet = powerset(items);
    for (Iterator iterator = myPowerSet.iterator(); iterator.hasNext();) {
        Object i = iterator.next();
        String _key = i.toString().replaceAll("\\[|\\]| +", "");
        context.write(new Text(_key), new IntWritable(1));
    }
}
//COMBINER
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int localSum = 0;
    for (IntWritable value : values) {
        localSum += value.get();
    }
    context.write(key, new IntWritable(localSum));
}
//REDUCER
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
    int minSupportCount = 3;
    int supportCount = 0;
    for(IntWritable value : values) {
        supportCount += value.get();
    }
    if (supportCount >= minSupportCount) {
        context.write(key, new IntWritable(supportCount));  
    }
}

hadoop mapreduce Algorithm data-mining apriori

来源：https://stackoverflow.com/questions/41365650/how-to-optimize-apriori-algorithm

2条答案

按热度按时间

vptzau2j1#

实际上，apriori算法是挖掘频繁项集最慢的算法之一。此后，许多算法被提出，如eclat、fpgrowth、h-mine和lcm。其中一些比apriori快1000多。因此，优化apriori并不是真正有用的，因为它有一些基本的问题。最好是简单地从apriori转换到另一种更快的算法，比如lcm或fpgrowth。
但在您的代码中，似乎它并不是真正的先验知识。如果你想看到用java实现apriori的优化版本和更快的算法，比如hmine和fpgrowth，你可以查看用java实现的spmf软件（我是它的创始人）。它是提供最多项目集和模式挖掘算法实现（超过100个）的软件，并且是开源的。

赞(0）回复(0）举报 2021-05-30

4c8rllxm2#

首先：

你发布的代码不是先验的

它缺乏先验的所有重要思想。与其做这些聪明的优化，不如做一个非常昂贵的物化，这将使数据消耗成倍增加。别这样。
避免： LinkedHashSet （非常慢）
powerset（使用real apriori算法，避免powerset！）
非类型化迭代器（使用泛型）
正则表达式（慢，特别是在未预编译时）
不必要的物化（洗牌成本）
再造 IntWritable （垃圾收集费）
作为初学者，请尝试分析您的应用程序。还将其与elki和spmf中已知的良好实现进行比较。在这些工具中，您能处理的最大数据集是什么（在单个核心上；也可以尝试fpgrowth）与您的代码（在集群上）进行比较。如果这些工具可以在单个cpu上处理比代码大10000倍的数据，我也不会感到惊讶。

赞(0）回复(0）举报 2021-05-29

我来回答

如何优化apriori算法？

2条答案

你发布的代码不是先验的

相关问题

热门标签

最新问答