Secondary sort in Hadoop (Java)

Posted by cczfrluj on 2021-05-29 in Hadoop


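I am working on a Hadoop project, and after reading various blogs and the documentation I realized that I need the secondary sort feature provided by the Hadoop framework.
My input has the form: DESC(String) Price(Integer) and some other Text. I want the values in the reducer to arrive in descending order of price. In addition, when comparing DESC, I have a method that takes two strings and a percentage; if the similarity between the two strings is equal to or greater than that percentage, I should treat them as equal.
The problem is that after the reduce job finishes, I can see some descriptions that are similar to another string and yet end up in different groups.
Here is the compareTo method of my composite key: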

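public int compareTo(VendorKey o) {
    // 0 when the two tokens are similar enough, 1 otherwise
    int result = compare(token, o.token, ":") >= percentage ? 0 : 1;
    if (result == 0) {
        // tokens count as equal: order by pid, descending
        return pid > o.pid ? -1 : pid < o.pid ? 1 : 0;
    }
    return result;
}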

And the compare method of the grouping comparator:

public int compare(WritableComparable a, WritableComparable b) {
    VendorKey one = (VendorKey) a;
    VendorKey two = (VendorKey) b;
    int result = ClusterUtil.compare(one.getToken(), two.getToken(), ":") >= one.getPercentage() ? 0 : 1;
    // if (result != 0)
    // return two.getToken().compareTo(one.getToken());
    return result;
}
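
A note on the two comparators above: both return 0 for "similar" and 1 for everything else, so they never return a negative value, and a threshold-based similarity is generally not transitive. The Hadoop-free sketch below checks both properties. The `similarity` function is a hypothetical stand-in (share of matching character positions, in percent), not the asker's `ClusterUtil.compare`, and the class and method names are made up for illustration.

```java
import java.util.Comparator;

class SimilarityComparatorCheck {
    // Hypothetical similarity score: percentage of positions (relative to the
    // longer string) at which both strings hold the same character.
    static int similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 100;
        }
        int shorter = Math.min(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return same * 100 / longer;
    }

    // Same shape as the comparators in the question: 0 when similar enough, else 1.
    static Comparator<String> similarOrOne(int percentage) {
        return (x, y) -> similarity(x, y) >= percentage ? 0 : 1;
    }

    // Symmetry: sgn(compare(x, y)) must equal -sgn(compare(y, x)).
    static boolean isSymmetric(Comparator<String> c, String x, String y) {
        return Integer.signum(c.compare(x, y)) == -Integer.signum(c.compare(y, x));
    }

    // Transitivity of "equal": x~y and y~z must imply x~z.
    static boolean isTransitiveOn(Comparator<String> c, String x, String y, String z) {
        boolean premises = c.compare(x, y) == 0 && c.compare(y, z) == 0;
        return !premises || c.compare(x, z) == 0;
    }

    public static void main(String[] args) {
        Comparator<String> c = similarOrOne(75);
        // "aaaa" vs "bbbb" yields 1 in both directions, so symmetry fails.
        System.out.println(isSymmetric(c, "aaaa", "bbbb"));             // prints false
        // "aaaa"~"aaab" and "aaab"~"aabb" at 75%, but "aaaa" vs "aabb" is only 50%.
        System.out.println(isTransitiveOn(c, "aaaa", "aaab", "aabb"));  // prints false
    }
}
```

A sort driven by a comparator like this has no well-defined order, so similar keys are not guaranteed to end up adjacent, and a grouping comparator only merges keys that are adjacent in the sorted stream.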

Java hadoop mapreduce hadoop2 hadoop-partitioning

Source: https://stackoverflow.com/questions/38773248/secondary-sort-in-hadoop

3 Answers


waxmsbnn1#

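The shuffle has three phases: partitioning, sorting, and grouping. My guess is that you have multiple reducers and your similar results are being handled by different reducers because they fall into different partitions.
You can either set the number of reducers to 1, or give your job a custom partitioner that extends org.apache.hadoop.mapreduce.Partitioner.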
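
To illustrate the point, here is a minimal, Hadoop-free sketch of partitioning followed by per-partition grouping. It assumes a hash-style partitioner and uses `String.CASE_INSENSITIVE_ORDER` as a stand-in for a grouping comparator that treats "A" and "a" as the same group; `ShuffleSketch`, `partitionOf`, and `totalGroups` are hypothetical names, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ShuffleSketch {
    // Partitioning: pick a reducer index from the key's hash.
    static int partitionOf(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sorting and grouping happen inside each partition only, so every
    // partition gets its own sorted map. The case-insensitive comparator
    // plays the role of the grouping comparator.
    static int totalGroups(int numReduceTasks, String... keys) {
        List<TreeMap<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>(String.CASE_INSENSITIVE_ORDER));
        }
        for (String key : keys) {
            partitions.get(partitionOf(key, numReduceTasks))
                      .computeIfAbsent(key, k -> new ArrayList<>())
                      .add(key);
        }
        int groups = 0;
        for (TreeMap<String, List<String>> partition : partitions) {
            groups += partition.size();
        }
        return groups;
    }

    public static void main(String[] args) {
        // "A" hashes to 65 and "a" to 97, so with 3 reducers they land in
        // partitions 2 and 1 and can never be grouped together.
        System.out.println(totalGroups(3, "A", "a")); // prints 2
        // With a single reducer they meet in partition 0 and form one group.
        System.out.println(totalGroups(1, "A", "a")); // prints 1
    }
}
```

The grouping rule is identical in both runs; only the partitioning changes how many groups come out.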

2021-05-30

uemypmqf2#

After defining your custom Writable, add a basic partitioner that takes the composite key and a NullWritable value. For example:

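import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortBasicPartitioner extends
    Partitioner<CompositeKeyWritable, NullWritable> {
    @Override
    public int getPartition(CompositeKeyWritable key, NullWritable value,
            int numReduceTasks) {
        // Partition on the natural key only; mask the sign bit so the index is never negative.
        return (key.DEPT().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}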
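
One detail worth keeping in mind with any hash-based partitioner: `hashCode()` can be negative, and Java's `%` keeps the sign of the dividend, so a partition index should be computed from the hash with the sign bit masked off, which is the expression Hadoop's built-in HashPartitioner uses. A small standalone sketch of the difference (the class and method names are hypothetical):

```java
class PartitionIndexSketch {
    // Naive form: the result takes the sign of the hash and can be negative.
    static int naivePartition(int hash, int numReduceTasks) {
        return hash % numReduceTasks;
    }

    // Masked form: clearing the sign bit keeps the result in [0, numReduceTasks).
    static int safePartition(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(naivePartition(-7, 4)); // prints -3
        System.out.println(safePartition(-7, 4));  // prints 1
    }
}
```

A negative index is not a valid reducer number, so the masked form is the safer default.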