java - Best way to get and distribute a small lookup file using the distributed cache

Posted by f8rj6qna on 2021-05-30 in Hadoop


What is the best way to read data from the distributed cache?

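import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Filled once per map task in setup(), then read by every map() call.
    ArrayList<String> globalFreq = new ArrayList<String>();
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        // try-with-resources closes the reader once the file has been parsed.
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access "globalFreq" here and do further processing.
    }
}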

Or:

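import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    URI[] cacheFiles;
    FileSystem fs;  // kept as a field so that map() can open the file
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The lookup file is re-read and re-parsed for every input record.
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
}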

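So if we do it as in code 2, what does that mean? Say we have 5 map tasks, and every map task reads the same copy of the data. Written this way, each map reads the data again, so it is read multiple times (5 times). Is that right?
Code 1: because the read is written in setup, the file is read once and the global data is then accessed in map.
Which is the right way to write distributed cache code?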

Java hadoop mapreduce Caching distributed-cache

Source: https://stackoverflow.com/questions/25760810/best-way-to-get-distribute-a-small-lookup-file-using-distributed-cache

1 answer


vnjpjtjt1#

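Do as much work as you can in the setup method: it is called once by each mapper, and whatever it loads stays cached for every record passed to that mapper. Parsing the data for each record is overhead you can avoid, because nothing in the parsing depends on the key, value, or context variables that you receive in the map method.
The setup method is called once per map task, but map is called for every record that task processes (which can obviously be a very high number).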
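To make the difference concrete, here is a minimal, Hadoop-free sketch that only counts how often the lookup data gets parsed under each placement. The class name `LookupLoadDemo`, the in-memory `LOOKUP_FILE` contents, and the task and record counts are all made up for illustration. The loops merely imitate the task lifecycle (one setup call per task, one map call per record) and are not Hadoop's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free simulation: counts how often the lookup data is parsed
// when loading happens in setup() versus inside map().
class LookupLoadDemo {

    // Stand-in for the cached lookup file (hypothetical contents).
    static final String LOOKUP_FILE = "apple 10\nbanana 7\ncherry 3\n";

    // Incremented every time the lookup data is parsed.
    static int loadCount = 0;

    // Keeps the first space-separated token of each line, as in the question.
    static List<String> loadLookup() {
        loadCount++;
        List<String> keys = new ArrayList<>();
        for (String line : LOOKUP_FILE.split("\n")) {
            keys.add(line.split(" ")[0]);
        }
        return keys;
    }

    // Variant 1: load once per task (the setup() placement), reuse per record.
    static int loadsWhenReadInSetup(int tasks, int recordsPerTask) {
        loadCount = 0;
        for (int t = 0; t < tasks; t++) {
            List<String> globalFreq = loadLookup();      // setup()
            for (int r = 0; r < recordsPerTask; r++) {
                globalFreq.contains("banana");           // map() only reads the list
            }
        }
        return loadCount;
    }

    // Variant 2: reload for every record (loading placed inside map()).
    static int loadsWhenReadInMap(int tasks, int recordsPerTask) {
        loadCount = 0;
        for (int t = 0; t < tasks; t++) {
            for (int r = 0; r < recordsPerTask; r++) {
                List<String> globalFreq = loadLookup();  // map()
                globalFreq.contains("banana");
            }
        }
        return loadCount;
    }

    public static void main(String[] args) {
        System.out.println("loads with setup(): " + loadsWhenReadInSetup(5, 1000)); // 5
        System.out.println("loads with map():   " + loadsWhenReadInMap(5, 1000));   // 5000
    }
}
```

With 5 tasks of 1000 records each, loading in setup parses the data 5 times in total (once per task), while loading in map parses it 5000 times (once per record). So the "5 times" in the question describes code 1, not code 2.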

Answered 2021-05-30
